* Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Simon Horman @ 2026-06-19 9:05 UTC
To: Dexuan Cui
Cc: kys, haiyangz, wei.liu, longli, andrew+netdev, davem, edumazet,
kuba, pabeni, kotaranov, ernis, dipayanroy, kees, jacob.e.keller,
ssengar, linux-hyperv, netdev, linux-kernel, linux-rdma, stable
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>
On Wed, Jun 17, 2026 at 08:50:29PM -0700, Dexuan Cui wrote:
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
>
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
>
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
>
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.
I think this last part warrants being split out into a separate patch.
>
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
...
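For readers following the thread, the length validation discussed above can be pictured with a small user-space sketch. This is not the code from the patch: the struct `rx_buf_info`, the function `rx_len_valid()` and the choice of `datasize` as the upper bound are assumptions made for illustration, and the driver may well check against a different limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-buffer bookkeeping; not the driver's actual struct. */
struct rx_buf_info {
	uint32_t datasize;	/* bytes the device is allowed to write */
};

/*
 * Return true only if the device-reported packet length fits inside the
 * buffer, so it is safe to use as a DMA sync length or to hand to skb
 * processing. Anything larger is treated as untrusted input and rejected.
 */
static bool rx_len_valid(const struct rx_buf_info *buf, uint32_t pkt_len)
{
	return pkt_len <= buf->datasize;
}
```

A caller would run this check on the CQE length first and drop the completion if it fails, before any sync or skb work is done with that length.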
* [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-19 7:33 UTC
To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
Shradha Gupta, Saurabh Singh Sengar, stable
In the mana driver, the number of IRQs allocated is capped at
min(num_cpu + 1, queue count). In cases where the IRQ count is greater
than the vCPU count, we want to utilize all the vCPUs, irrespective of
their NUMA/core bindings.
This is especially important in environments where the number of vCPUs is
so small that the softIRQ handling overhead of two IRQs on the same vCPU is
much higher than it would be if they were spread across sibling vCPUs.
This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU, while some vCPUs have none.
In such cases, when many parallel TCP connections are tested, the
throughput drops significantly.
We also studied the results of setting the affinity and hint to
NULL in these cases, and observed that, if there are pre-existing IRQs
allocated on the VM (apart from MANA) when the MANA IRQs are allocated,
this again leads to clustering of the MANA queue IRQs.
These results can be seen in case 3 in the following data.
Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
        TYPE       effective vCPU aff
=======================================================
IRQ0:   HWC        0
IRQ1:   mana_q1    0
IRQ2:   mana_q2    2
IRQ3:   mana_q3    0
IRQ4:   mana_q4    3
%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU       0        1        2        3
=======================================================
pass 1:    38.85    0.03     24.89    24.65
pass 2:    39.15    0.03     24.57    25.28
pass 3:    40.36    0.03     23.20    23.17
=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
        TYPE       effective vCPU aff
=======================================================
IRQ0:   HWC        0
IRQ1:   mana_q1    0
IRQ2:   mana_q2    1
IRQ3:   mana_q3    2
IRQ4:   mana_q4    3
%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU       0        1        2        3
=======================================================
pass 1:    15.42    15.85    14.99    14.51
pass 2:    15.53    15.94    15.81    15.93
pass 3:    16.41    16.35    16.40    16.36
=======================================================
Case 3: with affinity set to NULL
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
        TYPE       effective vCPU aff
=======================================================
IRQ0:   HWC        0
IRQ1:   mana_q1    2
IRQ2:   mana_q2    3
IRQ3:   mana_q3    2
IRQ4:   mana_q4    3
=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn    with patch    w/o patch    aff NULL
20480       15.65         7.73         5.25
10240       15.63         8.93         5.77
8192        15.64         9.69         7.16
6144        15.64         13.16        9.33
4096        15.69         15.75        13.50
2048        15.69         15.83        13.61
1024        15.71         15.28        13.60
Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---
Changes in v4
* Add mana prefix on irq_affinity_*() in mana driver
* Corrected grammar, comment for mana_irq_setup_linear()
* added new line as per guidelines
* added case 3 in commit message for when affinity is NULL
---
Changes in v3
* Optimize the comments in mana_gd_setup_dyn_irqs()
* add more details in the dev_dbg for extra IRQs
---
Changes in v2
* Removed the unused skip_first_cpu variable
* fixed exit condition in irq_setup_linear() with len == 0
* changed return type of irq_setup_linear() as it will always be 0
* removed the unnecessary rcu_read_lock() in irq_setup_linear()
* added appropriate comments to indicate expected behaviour when
IRQs are more than or equal to num_online_cpus()
---
.../net/ethernet/microsoft/mana/gdma_main.c | 78 +++++++++++++++----
1 file changed, 64 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..e8b7ffb47eb9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
} else {
/* If dynamic allocation is enabled we have already allocated
* hwc msi
+ * Also, we make sure in this case the following is always true
+ * (num_msix_usable - 1 HWC) <= num_online_cpus()
*/
gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
}
@@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
* do the same thing.
*/
-static int irq_setup(unsigned int *irqs, unsigned int len, int node,
- bool skip_first_cpu)
+static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
+ int node, bool skip_first_cpu)
{
const struct cpumask *next, *prev = cpu_none_mask;
cpumask_var_t cpus __free(free_cpumask_var);
@@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
return 0;
}
+/* must be called with cpus_read_lock() held */
+static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ if (len == 0)
+ break;
+
+ irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+ len--;
+ }
+}
+
static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
- bool skip_first_cpu = false;
int *irqs, err, i, msi;
irqs = kmalloc_objs(int, nvec);
@@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
return -ENOMEM;
/*
+ * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
+ * nvec is only Queue IRQ (HWC already setup).
* While processing the next pci irq vector, we start with index 1,
* as IRQ vector at index 0 is already processed for HWC.
* However, the population of irqs array starts with index 0, to be
- * further used in irq_setup()
+ * further used in mana_irq_setup_numa_aware()
*/
for (i = 1; i <= nvec; i++) {
msi = i;
@@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
}
/*
- * When calling irq_setup() for dynamically added IRQs, if number of
- * CPUs is more than or equal to allocated MSI-X, we need to skip the
- * first CPU sibling group since they are already affinitized to HWC IRQ
+ * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
+ * if number of CPUs is more than or equal to allocated MSI-X, we need to
+ * skip the first CPU sibling group since they are already affinitized to
+ * HWC IRQ
*/
cpus_read_lock();
- if (gc->num_msix_usable <= num_online_cpus())
- skip_first_cpu = true;
+ if (gc->num_msix_usable <= num_online_cpus()) {
+ err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
+ true);
+ if (err) {
+ cpus_read_unlock();
+ goto free_irq;
+ }
+ } else {
+ /*
+ * When num_msix_usable are more than num_online_cpus, our
+ * queue IRQs should be equal to num of online vCPUs.
+ * We try to make sure queue IRQs spread across all vCPUs.
+ * In such a case NUMA or CPU core affinity does not matter.
+ * Note: in this case the total mana IRQ should always be
+ * num_online_cpus + 1. The first HWC IRQ is already handled
+ * in HWC setup calls
+ * However, if CPUs went offline since num_msix_usable was
+ * computed, queue IRQs will be more than num_online_cpus().
+ * In such cases remaining extra IRQs will retain their default
+ * affinity.
+ */
+ int first_unassigned = num_online_cpus();
- err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
- if (err) {
- cpus_read_unlock();
- goto free_irq;
+ if (nvec > first_unassigned) {
+ char buf[32];
+
+ if (first_unassigned == nvec - 1)
+ snprintf(buf, sizeof(buf), "%d",
+ first_unassigned);
+ else
+ snprintf(buf, sizeof(buf), "%d-%d",
+ first_unassigned, nvec - 1);
+
+ dev_dbg(&pdev->dev,
+ "MANA IRQ indices #%s will retain the default CPU affinity\n",
+ buf);
+ }
+
+ mana_irq_setup_linear(irqs, nvec);
}
cpus_read_unlock();
@@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
nvec -= 1;
}
- err = irq_setup(irqs, nvec, gc->numa_node, false);
+ err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
if (err) {
cpus_read_unlock();
goto free_irq;
base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
--
2.34.1
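As a rough illustration of the linear spread that mana_irq_setup_linear() performs, here is a user-space model that can be compiled outside the kernel. It is only a sketch: `spread_linear()`, `NO_CPU` and the plain arrays are made-up stand-ins for the online CPU mask and irq_set_affinity_and_hint(), not driver code.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CPU (-1)	/* stands in for "retains the default affinity" */

/*
 * User-space model of a linear spread: the i-th IRQ is pinned to the i-th
 * online CPU. IRQs beyond the number of online CPUs keep NO_CPU.
 * Returns the number of IRQs that were assigned a CPU.
 */
static size_t spread_linear(const int *online_cpus, size_t nr_online,
			    int *irq_cpu, size_t nr_irqs)
{
	size_t i, assigned = 0;

	assert(online_cpus && irq_cpu);

	for (i = 0; i < nr_irqs; i++)
		irq_cpu[i] = NO_CPU;

	/* Walk the online CPUs in order and stop once every IRQ is placed. */
	for (i = 0; i < nr_online && assigned < nr_irqs; i++)
		irq_cpu[assigned++] = online_cpus[i];

	return assigned;
}
```

With four online CPUs and four queue IRQs, the model places one IRQ on each of CPUs 0 to 3, which matches the queue layout shown in case 2 above. If a CPU has gone offline in between, the last IRQ is left at NO_CPU, mirroring the "retain the default affinity" path in the patch.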
* [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Erni Sri Satya Vennela @ 2026-06-19 5:53 UTC
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, dipayanroy, ssengar, jacob.e.keller,
ernis, horms, gargaditya, kees, linux-hyperv, netdev,
linux-kernel, bpf
Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().
Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.
MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.
Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.
Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_bpf.c | 3 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 16 ++++++++++++++--
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index b5e9bb184a1d..53308e139cbe 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
bpf_prog_put(old_prog);
if (prog)
- ndev->max_mtu = MANA_XDP_MTU_MAX;
+ ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
+ gc->adapter_mtu - ETH_HLEN);
else
ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 87862b0434c7..7438ea6b3f26 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1233,12 +1233,24 @@ int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
*max_num_vports = resp.max_num_vports;
if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
- if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+ if (resp.adapter_mtu == 0) {
+ /*
+ * Some older PF firmware versions report an
+ * adapter_mtu of 0. MANA hardware always supports the
+ * standard Ethernet MTU, so fall back to ETH_FRAME_LEN.
+ * Jumbo frames will not be available in this case.
+ */
+ dev_info(dev,
+ "PF reported adapter_mtu of 0, falling back to %u (jumbo frames disabled)\n",
+ ETH_FRAME_LEN);
+ gc->adapter_mtu = ETH_FRAME_LEN;
+ } else if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
dev_err(dev, "Adapter MTU too small: %u\n",
resp.adapter_mtu);
return -EPROTO;
+ } else {
+ gc->adapter_mtu = resp.adapter_mtu;
}
- gc->adapter_mtu = resp.adapter_mtu;
} else {
gc->adapter_mtu = ETH_FRAME_LEN;
}
--
2.34.1
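The three-way decision made in mana_gd_query_device_cfg() can be modelled in user space as below. This is a sketch rather than the driver code: `pick_adapter_mtu()` is a hypothetical helper, the 32-bit type of `reported` is an assumption, and the three constants are copied from the kernel's if_ether.h so the example builds on its own.

```c
#include <errno.h>
#include <stdint.h>

/* Values as defined in the kernel's <linux/if_ether.h>. */
#define ETH_HLEN	14	/* Ethernet header length */
#define ETH_MIN_MTU	68	/* minimum IPv4 MTU */
#define ETH_FRAME_LEN	1514	/* max frame length without FCS */

/*
 * Model of the validation: 0 means "PF did not advertise a value" and
 * falls back to the standard frame length; other values too small to
 * leave room for ETH_MIN_MTU after the header are rejected; everything
 * else is accepted as reported.
 */
static int pick_adapter_mtu(uint32_t reported, uint32_t *adapter_mtu)
{
	if (reported == 0) {
		*adapter_mtu = ETH_FRAME_LEN;
		return 0;
	}

	if (reported < ETH_MIN_MTU + ETH_HLEN)
		return -EPROTO;

	*adapter_mtu = reported;
	return 0;
}
```

In the fallback case the resulting max_mtu is 1514 - 14 = 1500, the standard Ethernet MTU. A small nonzero value such as 10 is still rejected, since subtracting ETH_HLEN from it would go negative and turn into a very large number once stored in an unsigned field.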
* RE: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Haiyang Zhang @ 2026-06-18 18:38 UTC
To: Dexuan Cui, KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
Long Li, andrew+netdev@lunn.ch, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
Konstantin Taranov, horms@kernel.org, ernis@linux.microsoft.com,
dipayanroy@linux.microsoft.com, kees@kernel.org,
jacob.e.keller@intel.com, ssengar@linux.microsoft.com,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: stable@vger.kernel.org
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>
> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Wednesday, June 17, 2026 11:50 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Konstantin Taranov
> <kotaranov@microsoft.com>; horms@kernel.org; ernis@linux.microsoft.com;
> dipayanroy@linux.microsoft.com; kees@kernel.org; jacob.e.keller@intel.com;
> ssengar@linux.microsoft.com; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org
> Cc: stable@vger.kernel.org
> Subject: [PATCH net] net: mana: Sync page pool RX frags for CPU
>
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
>
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
>
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
>
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.
>
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers
> instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
> include/net/mana/mana.h | 8 +++
> 2 files changed, 57 insertions(+), 12 deletions(-)
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
* RE: [RFC PATCH 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: Michael Kelley @ 2026-06-18 17:46 UTC
To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
thuth@redhat.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
Michael Kelley
In-Reply-To: <20260609181030.2378391-6-kameroncarr@linux.microsoft.com>
From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
>
> Modify the five hypercall wrapper functions to check is_realm_world()
> and use the per-CPU rsi_host_call structure when inside a Realm.
>
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
> arch/arm64/hyperv/hv_core.c | 175 +++++++++++++++++++++++++++++-------
> 1 file changed, 141 insertions(+), 34 deletions(-)
>
> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> index e33a9e3c366a1..1759998ef2667 100644
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
> @@ -16,6 +16,7 @@
> #include <asm-generic/bug.h>
> #include <hyperv/hvhdk.h>
> #include <asm/mshyperv.h>
> +#include <asm/rsi.h>
>
> /*
> * hv_do_hypercall- Invoke the specified hypercall
> @@ -25,12 +26,32 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
> struct arm_smccc_res res;
> u64 input_address;
> u64 output_address;
> + struct rsi_host_call *hostcall;
> + unsigned long flags;
> + u64 ret;
>
> input_address = input ? virt_to_phys(input) : 0;
> output_address = output ? virt_to_phys(output) : 0;
>
> - arm_smccc_1_1_hvc(HV_FUNC_ID, control,
> - input_address, output_address, &res);
> + if (is_realm_world()) {
> + local_irq_save(flags);
> + hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + memset(hostcall, 0, sizeof(*hostcall));
> + hostcall->gprs[0] = HV_FUNC_ID;
> + hostcall->gprs[1] = control;
> + hostcall->gprs[2] = input_address;
> + hostcall->gprs[3] = output_address;
> +
> + if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> + ret = hostcall->gprs[0];
> + else
> + ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> + local_irq_restore(flags);
> + return ret;
This code sequence for handling the realm case is almost exactly
duplicated for the three hypercall variants. The only difference is
how gprs[2] and gprs[3] are populated. So I think the code
sequence could go into a helper routine with the appropriate
values for gprs[2] and gprs[3] passed in.
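Something along these lines, shown here as a user-space mock so it can be compiled on its own. Everything prefixed with `mock_` or `MOCK_`, and the helper name `realm_hypercall()`, are placeholders invented for this sketch and not the kernel's definitions; the stub simply echoes the control word back so the data flow is visible.

```c
#include <stdint.h>
#include <string.h>

#define MOCK_HV_FUNC_ID		0x1ULL	/* placeholder, not the real ID */
#define MOCK_RSI_SUCCESS	0UL	/* placeholder success code */
#define MOCK_HV_STATUS_INVALID	3ULL	/* placeholder error status */

struct mock_host_call {
	uint64_t gprs[31];
};

/* Stands in for the per-CPU buffer used in the patch. */
static struct mock_host_call mock_buf;

/* Stub for rsi_host_call(): echoes gprs[1] back as the "result". */
static unsigned long mock_rsi_host_call(struct mock_host_call *hc)
{
	hc->gprs[0] = hc->gprs[1];
	return MOCK_RSI_SUCCESS;
}

/*
 * Shared realm path: the callers differ only in what goes into gprs[2]
 * and gprs[3]. In the kernel this would run with IRQs disabled around
 * the per-CPU buffer access, which the mock leaves out.
 */
static uint64_t realm_hypercall(uint64_t control, uint64_t arg2, uint64_t arg3)
{
	struct mock_host_call *hc = &mock_buf;

	memset(hc, 0, sizeof(*hc));
	hc->gprs[0] = MOCK_HV_FUNC_ID;
	hc->gprs[1] = control;
	hc->gprs[2] = arg2;
	hc->gprs[3] = arg3;

	if (mock_rsi_host_call(hc) != MOCK_RSI_SUCCESS)
		return MOCK_HV_STATUS_INVALID;

	return hc->gprs[0];
}
```

With a helper of this shape, hv_do_hypercall() would pass the two physical addresses, hv_do_fast_hypercall8() would pass input and 0, and hv_do_fast_hypercall16() would pass input1 and input2.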
> + }
> +
> + arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
> + output_address, &res);
> return res.a0;
> }
> EXPORT_SYMBOL_GPL(hv_do_hypercall);
> @@ -45,9 +66,28 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
> {
> struct arm_smccc_res res;
> u64 control;
> + struct rsi_host_call *hostcall;
> + unsigned long flags;
> + u64 ret;
>
> control = (u64)code | HV_HYPERCALL_FAST_BIT;
>
> + if (is_realm_world()) {
> + local_irq_save(flags);
> + hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + memset(hostcall, 0, sizeof(*hostcall));
> + hostcall->gprs[0] = HV_FUNC_ID;
> + hostcall->gprs[1] = control;
> + hostcall->gprs[2] = input;
> +
> + if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> + ret = hostcall->gprs[0];
> + else
> + ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> + local_irq_restore(flags);
> + return ret;
> + }
> +
> arm_smccc_1_1_hvc(HV_FUNC_ID, control, input, &res);
> return res.a0;
> }
> @@ -62,9 +102,29 @@ u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
> {
> struct arm_smccc_res res;
> u64 control;
> + struct rsi_host_call *hostcall;
> + unsigned long flags;
> + u64 ret;
>
> control = (u64)code | HV_HYPERCALL_FAST_BIT;
>
> + if (is_realm_world()) {
> + local_irq_save(flags);
> + hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + memset(hostcall, 0, sizeof(*hostcall));
> + hostcall->gprs[0] = HV_FUNC_ID;
> + hostcall->gprs[1] = control;
> + hostcall->gprs[2] = input1;
> + hostcall->gprs[3] = input2;
> +
> + if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> + ret = hostcall->gprs[0];
> + else
> + ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> + local_irq_restore(flags);
> + return ret;
> + }
> +
> arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
> return res.a0;
> }
> @@ -76,24 +136,44 @@ EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
> void hv_set_vpreg(u32 msr, u64 value)
> {
> struct arm_smccc_res res;
> + struct rsi_host_call *hostcall;
> + unsigned long flags;
> + u64 status;
> +
> + if (is_realm_world()) {
> + local_irq_save(flags);
> + hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + memset(hostcall, 0, sizeof(*hostcall));
> + hostcall->gprs[0] = HV_FUNC_ID;
> + hostcall->gprs[1] = HVCALL_SET_VP_REGISTERS |
> + HV_HYPERCALL_FAST_BIT |
> + HV_HYPERCALL_REP_COMP_1;
> + hostcall->gprs[2] = HV_PARTITION_ID_SELF;
> + hostcall->gprs[3] = HV_VP_INDEX_SELF;
> + hostcall->gprs[4] = msr;
> + hostcall->gprs[6] = value;
>
> - arm_smccc_1_1_hvc(HV_FUNC_ID,
> - HVCALL_SET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> - HV_HYPERCALL_REP_COMP_1,
> - HV_PARTITION_ID_SELF,
> - HV_VP_INDEX_SELF,
> - msr,
> - 0,
> - value,
> - 0,
> - &res);
> + if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
> + status = hostcall->gprs[0];
> + else
> + status = HV_STATUS_INVALID_HYPERCALL_INPUT;
> + local_irq_restore(flags);
> + } else {
> + arm_smccc_1_1_hvc(HV_FUNC_ID,
> + HVCALL_SET_VP_REGISTERS |
> + HV_HYPERCALL_FAST_BIT |
> + HV_HYPERCALL_REP_COMP_1,
> + HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF, msr,
> + 0, value, 0, &res);
> + status = res.a0;
> + }
>
> /*
> - * Something is fundamentally broken in the hypervisor if
> - * setting a VP register fails. There's really no way to
> - * continue as a guest VM, so panic.
> + * Something is fundamentally broken in the hypervisor (or, in a
> + * Realm, the RMM denied the host call) if setting a VP register
> + * fails. There's really no way to continue as a guest VM, so panic.
> */
> - BUG_ON(!hv_result_success(res.a0));
> + BUG_ON(!hv_result_success(status));
> }
> EXPORT_SYMBOL_GPL(hv_set_vpreg);
>
> @@ -108,29 +188,56 @@ void hv_get_vpreg_128(u32 msr, struct
> hv_get_vp_registers_output *result)
> {
> struct arm_smccc_1_2_regs args;
> struct arm_smccc_1_2_regs res;
> + struct rsi_host_call *hostcall;
> + u64 status;
>
> - args.a0 = HV_FUNC_ID;
> - args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> - HV_HYPERCALL_REP_COMP_1;
> - args.a2 = HV_PARTITION_ID_SELF;
> - args.a3 = HV_VP_INDEX_SELF;
> - args.a4 = msr;
> + if (is_realm_world()) {
> + unsigned long flags;
>
> - /*
> - * Use the SMCCC 1.2 interface because the results are in registers
> - * beyond X0-X3.
> - */
> - arm_smccc_1_2_hvc(&args, &res);
> + local_irq_save(flags);
> + hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + memset(hostcall, 0, sizeof(*hostcall));
> +
> + hostcall->gprs[0] = HV_FUNC_ID;
> + hostcall->gprs[1] = HVCALL_GET_VP_REGISTERS |
> + HV_HYPERCALL_FAST_BIT |
> + HV_HYPERCALL_REP_COMP_1;
> + hostcall->gprs[2] = HV_PARTITION_ID_SELF;
> + hostcall->gprs[3] = HV_VP_INDEX_SELF;
> + hostcall->gprs[4] = msr;
> +
> + if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS) {
> + status = hostcall->gprs[0];
> + result->as64.low = hostcall->gprs[6];
> + result->as64.high = hostcall->gprs[7];
> + } else {
> + status = HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> + local_irq_restore(flags);
> + } else {
> + args.a0 = HV_FUNC_ID;
> + args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
> + HV_HYPERCALL_REP_COMP_1;
> + args.a2 = HV_PARTITION_ID_SELF;
> + args.a3 = HV_VP_INDEX_SELF;
> + args.a4 = msr;
> +
> + /*
> + * Use the SMCCC 1.2 interface because the results are in
> + * registers beyond X0-X3.
> + */
> + arm_smccc_1_2_hvc(&args, &res);
> + status = res.a0;
> + result->as64.low = res.a6;
> + result->as64.high = res.a7;
> + }
>
> /*
> - * Something is fundamentally broken in the hypervisor if
> - * getting a VP register fails. There's really no way to
> - * continue as a guest VM, so panic.
> + * Something is fundamentally broken in the hypervisor (or, in a
> + * Realm, the RMM denied the host call) if getting a VP register
> + * fails. There's really no way to continue as a guest VM, so panic.
> */
> - BUG_ON(!hv_result_success(res.a0));
> -
> - result->as64.low = res.a6;
> - result->as64.high = res.a7;
> + BUG_ON(!hv_result_success(status));
> }
> EXPORT_SYMBOL_GPL(hv_get_vpreg_128);
>
> --
> 2.45.4
>
* RE: [RFC PATCH 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Michael Kelley @ 2026-06-18 17:46 UTC
To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
thuth@redhat.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
Michael Kelley
In-Reply-To: <20260609181030.2378391-4-kameroncarr@linux.microsoft.com>
From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
>
> Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
> route them through the RSI_HOST_CALL interface, which takes the IPA of a
> per-CPU rsi_host_call structure as its argument.
>
> Add hyperv_pcpu_hostcall_struct as a per-CPU pointer to that buffer and
> allocate it for the boot CPU during hyperv_init() and for each secondary
> CPU in hv_cpu_init(). The allocation is gated on is_realm_world() so
> non-Realm arm64 Hyper-V guests pay no memory cost.
I wonder if there's a simpler approach here. What about calculating the
total size of struct rsi_host_call needed for all CPUs, then doing a single
dynamic allocation to effectively create an array of entries? Each CPU
would just index into the array with its processor ID. You could still have
a per-cpu pointer that points to the correct array entry to avoid the need
to get the processor ID, but I wonder if even that is worth the trouble. Since
struct rsi_host_call size is a power of 2, the indexing is just a simple shift.
The hyperv_pcpu_input_page is allocated the way it is because it's much
bigger. But 16 instances of struct rsi_host_call fit into a single 4 KiB
page, so there's no danger of hitting a memory allocation limit at boot
time. Even with 8192 CPUs the allocation is only 2 MiB.
Michael
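A rough user-space sketch of the single-allocation idea follows. It is not kernel code: `struct mock_host_call` is a stand-in sized at 256 bytes (the size implied by 16 entries per 4 KiB), calloc() takes the place of the kernel allocator, and `hostcall_array_alloc()` and `hostcall_for_cpu()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mock with the 256-byte size implied by "16 fit into 4 KiB". */
struct mock_host_call {
	uint16_t imm;
	uint8_t  pad[6];
	uint64_t gprs[31];
};

static_assert(sizeof(struct mock_host_call) == 256,
	      "indexing below relies on a power-of-2 entry size");

static struct mock_host_call *hostcall_array;

/* One zeroed allocation covering every possible CPU; 0 on success. */
static int hostcall_array_alloc(unsigned int nr_cpus)
{
	hostcall_array = calloc(nr_cpus, sizeof(*hostcall_array));
	return hostcall_array ? 0 : -1;
}

/*
 * Each CPU finds its entry by index. With a 256-byte entry the byte
 * offset is simply cpu << 8.
 */
static struct mock_host_call *hostcall_for_cpu(unsigned int cpu)
{
	return &hostcall_array[cpu];
}
```

For 8192 CPUs the array is 8192 * 256 bytes = 2 MiB, and there is no per-CPU allocation left to do in the CPU online path.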
>
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
> arch/arm64/hyperv/mshyperv.c | 78 ++++++++++++++++++++++++++++++-
> arch/arm64/include/asm/mshyperv.h | 3 ++
> 2 files changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 4fdc26ade1d74..08fec82691683 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -15,10 +15,16 @@
> #include <linux/errno.h>
> #include <linux/version.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/slab.h>
> +#include <linux/percpu.h>
> #include <asm/mshyperv.h>
> +#include <asm/rsi.h>
>
> static bool hyperv_initialized;
>
> +void * __percpu *hyperv_pcpu_hostcall_struct;
> +EXPORT_SYMBOL_GPL(hyperv_pcpu_hostcall_struct);
> +
> int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> {
> hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
> @@ -60,6 +66,46 @@ static bool __init hyperv_detect_via_acpi(void)
>
> #endif
>
> +static void hv_hostcall_free(void)
> +{
> + int cpu;
> +
> + if (!hyperv_pcpu_hostcall_struct)
> + return;
> +
> + for_each_possible_cpu(cpu)
> + kfree(*per_cpu_ptr(hyperv_pcpu_hostcall_struct, cpu));
> + free_percpu(hyperv_pcpu_hostcall_struct);
> + hyperv_pcpu_hostcall_struct = NULL;
> +}
> +
> +static int hv_cpu_init(unsigned int cpu)
> +{
> + void **hostcall_struct;
> + gfp_t flags;
> + void *mem;
> +
> + if (hyperv_pcpu_hostcall_struct) {
> + /* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
> + flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
> +
> + hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + /*
> + * The hostcall_struct memory is not freed when the CPU
> + * goes offline. If a previously offlined CPU is brought
> + * back online, the memory is reused here.
> + */
> + if (!*hostcall_struct) {
> + mem = kzalloc_obj(struct rsi_host_call, flags);
> + if (!mem)
> + return -ENOMEM;
> + *hostcall_struct = mem;
> + }
> + }
> +
> + return hv_common_cpu_init(cpu);
> +}
> +
> static bool __init hyperv_detect_via_smccc(void)
> {
> uuid_t hyperv_uuid = UUID_INIT(
> @@ -73,6 +119,8 @@ static bool __init hyperv_detect_via_smccc(void)
> static int __init hyperv_init(void)
> {
> struct hv_get_vp_registers_output result;
> + void **hostcall_struct;
> + void *mem;
> u64 guest_id;
> int ret;
>
> @@ -85,6 +133,27 @@ static int __init hyperv_init(void)
> if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
> return 0;
>
> + /*
> + * The RSI host-call buffer is only ever used when
> + * is_realm_world() is true. Skip the per-CPU allocation on
> + * non-Realm guests.
> + */
> + if (is_realm_world()) {
> + hyperv_pcpu_hostcall_struct = alloc_percpu(void *);
> + if (!hyperv_pcpu_hostcall_struct)
> + return -ENOMEM;
> +
> + hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> + if (!*hostcall_struct) {
> + mem = kzalloc_obj(struct rsi_host_call);
> + if (!mem) {
> + ret = -ENOMEM;
> + goto free_hostcall_mem;
> + }
> + *hostcall_struct = mem;
> + }
> + }
> +
> /* Setup the guest ID */
> guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
> hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
> @@ -106,12 +175,13 @@ static int __init hyperv_init(void)
>
> ret = hv_common_init();
> if (ret)
> - return ret;
> + goto free_hostcall_mem;
>
> ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE,
> "arm64/hyperv_init:online",
> - hv_common_cpu_init, hv_common_cpu_die);
> + hv_cpu_init, hv_common_cpu_die);
> if (ret < 0) {
> hv_common_free();
> + hv_hostcall_free();
> return ret;
> }
>
> @@ -125,6 +195,10 @@ static int __init hyperv_init(void)
>
> hyperv_initialized = true;
> return 0;
> +
> +free_hostcall_mem:
> + hv_hostcall_free();
> + return ret;
> }
>
> early_initcall(hyperv_init);
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab66..65a00bd14c6cb 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -63,4 +63,7 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>
> #include <asm-generic/mshyperv.h>
>
> +/* Per-CPU RSI host call structure for CCA Realms */
> +extern void *__percpu *hyperv_pcpu_hostcall_struct;
> +
> #endif
> --
> 2.45.4
>
* RE: [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Michael Kelley @ 2026-06-18 17:45 UTC (permalink / raw)
To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
thuth@redhat.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
Michael Kelley
In-Reply-To: <20260609181030.2378391-3-kameroncarr@linux.microsoft.com>
From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
>
> Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and
> use rsi_host_call() to query the hypervisor vendor UUID when inside a
> Realm. The realm path is factored into a helper,
> arm_smccc_realm_get_hypervisor_uuid(), that owns a file-static
> rsi_host_call buffer (uuid_hc) serialized by a spinlock.
>
> The RSI-specific includes, file-static state and helper are guarded
> with CONFIG_ARM64 because <asm/rsi.h> does not exist on 32-bit ARM.
>
> For non-Realm environments, the existing arm_smccc_1_1_invoke() path
> is unchanged.
>
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
> drivers/firmware/smccc/smccc.c | 41 +++++++++++++++++++++++++++++++++-
> 1 file changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
> index bdee057db2fd3..6b465e65472b0 100644
> --- a/drivers/firmware/smccc/smccc.c
> +++ b/drivers/firmware/smccc/smccc.c
> @@ -12,6 +12,12 @@
> #include <linux/platform_device.h>
> #include <asm/archrandom.h>
>
> +#ifdef CONFIG_ARM64
> +#include <linux/cleanup.h>
> +#include <linux/spinlock.h>
> +#include <asm/rsi.h>
> +#endif
> +
> static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
> static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
>
> @@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
> }
> EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
>
> +#ifdef CONFIG_ARM64
> +static struct rsi_host_call uuid_hc;
> +static DEFINE_SPINLOCK(uuid_hc_lock);
So evidently Sashiko is wrong in saying that struct rsi_host_call must be
in decrypted memory?
>
> +/*
> + * Helper function to get the hypervisor UUID via an RsiHostCall.
> + */
> +static bool arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
> +{
> + guard(spinlock_irqsave)(&uuid_hc_lock);
> +
> + memset(&uuid_hc, 0, sizeof(uuid_hc));
> + uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
> +
> + if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS)
> + return false;
Rather than having this function return a boolean upon failure,
couldn't it just set res->a0 to SMCCC_RET_NOT_SUPPORTED like
arm_smccc_1_1_invoke()? Then arm_smccc_hypervisor_has_uuid()
could process both paths exactly the same way.
> +
> + res->a0 = uuid_hc.gprs[0];
> + res->a1 = uuid_hc.gprs[1];
> + res->a2 = uuid_hc.gprs[2];
> + res->a3 = uuid_hc.gprs[3];
> + return true;
> +}
> +#endif
> +
> bool arm_smccc_hypervisor_has_uuid(const uuid_t *hyp_uuid)
> {
> struct arm_smccc_res res = {};
> uuid_t uuid;
>
> - arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
> +#ifdef CONFIG_ARM64
> + if (is_realm_world()) {
> + if (!arm_smccc_realm_get_hypervisor_uuid(&res))
> + return false;
> + } else
> +#endif
> +
> arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID,
> + &res);
> +
> if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
> return false;
>
> --
> 2.45.4
>
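The suggestion above (report failure through res->a0 instead of a bool) could
look roughly like the sketch below. It is a standalone userspace illustration,
not the patch's code: every sketch_* name, the SKETCH_RET_NOT_SUPPORTED value
and the placeholder register contents are assumptions, and the UUID comparison
that the real function performs afterwards is left out.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for SMCCC_RET_NOT_SUPPORTED; not the kernel header. */
#define SKETCH_RET_NOT_SUPPORTED ((unsigned long)-1L)

struct sketch_res {
	unsigned long a0, a1, a2, a3;
};

/* Realm path: on failure, store "not supported" in a0 instead of returning a bool. */
static void sketch_realm_get_uuid(struct sketch_res *res, bool host_call_ok)
{
	if (!host_call_ok) {
		res->a0 = SKETCH_RET_NOT_SUPPORTED;
		return;
	}
	/* Placeholder values standing in for the registers returned by the host. */
	res->a0 = 0x11111111UL;
	res->a1 = 0x22222222UL;
	res->a2 = 0x33333333UL;
	res->a3 = 0x44444444UL;
}

/* Non-realm path: stands in for the ordinary SMCCC invocation. */
static void sketch_invoke(struct sketch_res *res)
{
	res->a0 = 0x11111111UL;
	res->a1 = 0x22222222UL;
	res->a2 = 0x33333333UL;
	res->a3 = 0x44444444UL;
}

/* Both paths fill res, so a single a0 check covers either one. */
static bool sketch_has_result(bool realm, bool host_call_ok)
{
	struct sketch_res res;

	memset(&res, 0, sizeof(res));
	if (realm)
		sketch_realm_get_uuid(&res, host_call_ok);
	else
		sketch_invoke(&res);

	return res.a0 != SKETCH_RET_NOT_SUPPORTED;
}
```

With this shape the caller needs no realm-specific early return; whether that
is preferable in the real function is for the patch author to judge.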
^ permalink raw reply
* RE: [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Michael Kelley @ 2026-06-18 17:45 UTC (permalink / raw)
To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
thuth@redhat.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
Michael Kelley
In-Reply-To: <20260609181030.2378391-2-kameroncarr@linux.microsoft.com>
From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June 9, 2026 11:10 AM
>
> Add struct rsi_host_call to rsi_smc.h, which represents the host call
> data structure used by the Realm Management Monitor (RMM) for the
> RSI_HOST_CALL interface. The structure contains a 16-bit immediate field
> and 31 general-purpose register values, aligned to 256 bytes as required
> by the CCA RMM specification.
>
> Add rsi_host_call() static inline wrapper in rsi_cmds.h that invokes
> SMC_RSI_HOST_CALL with the physical address of the host call structure.
> This will be used by Hyper-V guest code to route hypercalls through the
> RSI interface when running inside an Arm CCA Realm.
>
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
> arch/arm64/include/asm/rsi_cmds.h | 9 +++++++++
> arch/arm64/include/asm/rsi_smc.h | 6 ++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
> index 2c8763876dfb7..83b4b1f598454 100644
> --- a/arch/arm64/include/asm/rsi_cmds.h
> +++ b/arch/arm64/include/asm/rsi_cmds.h
> @@ -159,4 +159,13 @@ static inline unsigned long
> rsi_attestation_token_continue(phys_addr_t granule,
> return res.a0;
> }
>
> +static inline long rsi_host_call(phys_addr_t host_call_struct)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_smc(SMC_RSI_HOST_CALL, host_call_struct, 0, 0, 0, 0, 0, 0,
> + &res);
> + return res.a0;
> +}
For consistent grouping, it seems like this inline function should
be placed after rsi_set_addr_range_state() since it follows the
same pattern. It's a bit different from the token functions.
> +
> #endif /* __ASM_RSI_CMDS_H */
> diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
> index e19253f96c940..ffea93340ed7f 100644
> --- a/arch/arm64/include/asm/rsi_smc.h
> +++ b/arch/arm64/include/asm/rsi_smc.h
> @@ -142,6 +142,12 @@ struct realm_config {
> */
> } __aligned(0x1000);
>
> +struct rsi_host_call {
> + u16 immediate;
I don't see the "immediate" used anywhere in this patch set.
Is it always zero for the Hyper-V use cases? Just curious ...
> + u64 gprs[31];
> +} __aligned(256);
> +static_assert(sizeof(struct rsi_host_call) == 256);
This struct defines an ABI with the RMM layer, so I'd suggest
adding explicit padding of 6 bytes after the immediate so there's
no implicit dependency on the compiler adding the padding.
Sashiko had the same comment ....
Michael
> +
> #endif /* __ASSEMBLER__ */
>
> /*
> --
> 2.45.4
>
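The explicit-padding suggestion above can be sketched in standalone C as
follows. The struct name and the "reserved" field name are assumptions made
for illustration; only the 16-bit immediate, the 31 registers and the 256-byte
size and alignment come from the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only; field and type names are hypothetical. */
struct rsi_host_call_sketch {
	uint16_t immediate;
	uint8_t  reserved[6];	/* explicit padding up to the 8-byte boundary */
	uint64_t gprs[31];	/* 31 * 8 = 248 bytes */
} __attribute__((aligned(256)));

/* With the padding spelled out, the layout no longer relies on the compiler. */
static_assert(offsetof(struct rsi_host_call_sketch, gprs) == 8,
	      "gprs must start at byte offset 8");
static_assert(sizeof(struct rsi_host_call_sketch) == 256,
	      "structure must be exactly 256 bytes");
```

The offsetof() assertion is the part the size check alone does not cover: it
pins the position of gprs as well as the total size.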
^ permalink raw reply
* Re: [PATCH v2 1/2] RDMA/erdma: initialize ret for empty receive WR lists
From: sashiko-bot @ 2026-06-18 4:30 UTC (permalink / raw)
To: Ruoyu Wang; +Cc: linux-hyperv
In-Reply-To: <20260618041752.481193-1-ruoyuw560@gmail.com>
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
Pre-existing issues:
- [High] The receive work request queue (RQ) can silently overflow because `erdma_post_recv` and `erdma_post_recv_one` lack a bounds check against the queue capacity, and `rq_ci` is not properly tracked during completion polling.
- [High] In `erdma_post_send`, when the send queue is full, `*bad_send_wr` is incorrectly assigned the head of the list (`send_wr`) instead of the failing work request (`wr`).
- [High] Missing `dma_wmb()` before updating the doorbell records (`sq_dbrec` and `rq_dbrec`) in host memory.
--
commit 7fd711bd13dff4553304f89e7b55436f427b3b9d
Author: Ruoyu Wang <ruoyuw560@gmail.com>
RDMA/erdma: initialize ret for empty receive WR lists
erdma_post_recv() returns ret after walking the receive work request list.
If the caller passes an empty list, the loop is skipped and ret is not
assigned. Initialize ret to 0 so an empty receive work request list returns
success instead of stack data.
> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
> index 25f6c49aec779..e002343832f74 100644
> --- a/drivers/infiniband/hw/erdma/erdma_qp.c
> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
> @@ -734,7 +734,7 @@ int erdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
> const struct ib_recv_wr *wr = recv_wr;
> struct erdma_qp *qp = to_eqp(ibqp);
> unsigned long flags;
> - int ret;
> + int ret = 0;
>
> spin_lock_irqsave(&qp->lock, flags);
>
[Severity: High]
This is a pre-existing issue, but does the receive work request queue (RQ)
have a risk of silent overflow?
While erdma_post_send() tracks send queue capacity, erdma_post_recv()
unconditionally posts receive work requests and advances rq_pi without a
bounds check against the queue capacity:
drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv() {
...
while (wr) {
ret = erdma_post_recv_one(qp, wr);
...
}
drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv_one() {
...
qp->kern_qp.rq_pi++;
...
}
Additionally, receive queue completions don't seem to update kern_qp->rq_ci:
drivers/infiniband/hw/erdma/erdma_cq.c:erdma_poll_one_cqe() {
...
} else {
id_table = kern_qp->rwr_tbl;
depth = qp->attrs.rq_size;
}
...
}
If a kernel consumer posts more receives than the queue capacity, could the
ring buffer wrap around and overwrite active, unpolled work requests, leading
to data corruption and invalid WR IDs during polling?
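For reference, the kind of capacity check the question above has in mind can
be sketched in standalone C. This is not erdma code: the ring_sketch type and
its helpers are hypothetical, and the check simply mirrors the 16-bit
producer/consumer comparison quoted from erdma_post_send() in this review.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical ring with free-running 16-bit producer/consumer indices. */
struct ring_sketch {
	uint16_t pi;	/* producer index, advanced on post */
	uint16_t ci;	/* consumer index, advanced on completion */
	uint16_t size;	/* number of slots */
};

/* Refuse to post when every slot is still owned by an unpolled request. */
static int ring_sketch_post(struct ring_sketch *r)
{
	/* The u16 cast keeps the difference correct across index wrap-around. */
	if ((uint16_t)(r->pi - r->ci) >= r->size)
		return -ENOMEM;
	r->pi++;
	return 0;
}

/* Completion polling must advance ci, otherwise the check above never frees up. */
static void ring_sketch_complete(struct ring_sketch *r)
{
	r->ci++;
}
```

The sketch only works if both halves exist: a bounds check in the post path
and a consumer index that is actually advanced when completions are polled.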
[Severity: High]
This is a pre-existing issue, but is the failing work request assigned
correctly when the send queue is full?
In erdma_post_send(), if the queue becomes full after some work requests have
already been successfully posted, *bad_send_wr is assigned the head of the
list instead of the failing request:
drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_send() {
...
if ((u16)(sq_pi - qp->kern_qp.sq_ci) >= qp->attrs.sq_size) {
ret = -ENOMEM;
*bad_send_wr = send_wr;
break;
}
...
}
Does this signal to the caller that the entire batch failed from the start,
prompting a retry of the entire list and potentially leading to duplicate
execution?
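A minimal standalone sketch of the convention being asked about: the bad-WR
pointer names the first request that was not posted, and the return value
starts at 0 so an empty list reports success. The wr_sketch type and the
free_slots parameter are assumptions for illustration, not driver code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical work request list node. */
struct wr_sketch {
	int id;
	const struct wr_sketch *next;
};

/*
 * Post requests until the (pretend) queue is full. On failure, *bad_wr
 * points at the request that could not be posted, not at the list head.
 */
static int post_list_sketch(const struct wr_sketch *head, int free_slots,
			    const struct wr_sketch **bad_wr)
{
	const struct wr_sketch *wr;
	int ret = 0;	/* an empty list returns success */

	for (wr = head; wr; wr = wr->next) {
		if (free_slots == 0) {
			ret = -ENOMEM;
			*bad_wr = wr;	/* the failing request itself */
			break;
		}
		free_slots--;
	}
	return ret;
}
```

A caller that retries from *bad_wr then resubmits only the requests that were
never posted.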
[Severity: High]
This is a pre-existing issue, but should there be a memory barrier before
updating the doorbell records in host memory?
The driver updates doorbell records after constructing work queue elements
(WQEs/RQEs):
drivers/infiniband/hw/erdma/erdma_qp.c:erdma_post_recv_one() {
...
*(u64 *)qp->kern_qp.rq_dbrec = *(u64 *)rqe;
...
}
drivers/infiniband/hw/erdma/erdma_qp.c:kick_sq_db() {
...
*(u64 *)qp->kern_qp.sq_dbrec = db_data;
...
}
While the subsequent hardware doorbell ring via writeq() includes a barrier
before the MMIO operation, could the CPU reorder the memory stores of the WQE
with the memory store of the dbrec?
If the device polls the updated dbrec before the WQE stores are visible,
might it fetch a stale or partially written WQE via DMA?
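The ordering concern above can be illustrated with a userspace analogy. The
kernel primitive named in this review is dma_wmb(); the sketch below swaps in
a C11 release fence as a stand-in, which only models CPU-to-CPU ordering and
says nothing definitive about device DMA. All sketch_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue: entries plus a record that a consumer polls. */
struct sketch_queue {
	uint64_t wqe[4];		/* work queue entries */
	_Atomic uint64_t dbrec;		/* doorbell record */
};

/* Write the entry first, then publish the record, with a fence in between. */
static void sketch_post(struct sketch_queue *q, unsigned int idx,
			uint64_t wqe, uint64_t db)
{
	q->wqe[idx % 4] = wqe;

	/* Release fence: the entry store is ordered before the record store. */
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&q->dbrec, db, memory_order_relaxed);
}
```

Without the fence, nothing in the language prevents the record store from
becoming visible before the entry store, which is the reordering the review
asks about.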
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260618041752.481193-1-ruoyuw560@gmail.com?part=1
^ permalink raw reply
* [PATCH v2 2/2] RDMA/mana_ib: initialize err for empty send WR lists
From: Ruoyu Wang @ 2026-06-18 4:17 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky
Cc: Cheng Xu, Kai Shen, Long Li, Konstantin Taranov, linux-rdma,
linux-hyperv, linux-kernel, Ruoyu Wang
In-Reply-To: <20260618041752.481193-1-ruoyuw560@gmail.com>
mana_ib_post_send() returns err after walking the send work request list.
If the caller passes an empty list, the loop is skipped and err is not
assigned.
Initialize err to 0 so an empty send work request list returns success
instead of stack data.
Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
v2:
- Split the erdma and mana_ib changes into separate patches.
- Add a driver-specific Fixes tag.
drivers/infiniband/hw/mana/wr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mana/wr.c b/drivers/infiniband/hw/mana/wr.c
index 1813567d3b16c..36a1d506f08f6 100644
--- a/drivers/infiniband/hw/mana/wr.c
+++ b/drivers/infiniband/hw/mana/wr.c
@@ -144,7 +144,7 @@ static int mana_ib_post_send_ud(struct mana_ib_qp *qp, const struct ib_ud_wr *wr
int mana_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
const struct ib_send_wr **bad_wr)
{
- int err;
+ int err = 0;
struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
for (; wr; wr = wr->next) {
--
2.51.0
^ permalink raw reply related
* [PATCH v2 1/2] RDMA/erdma: initialize ret for empty receive WR lists
From: Ruoyu Wang @ 2026-06-18 4:17 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky
Cc: Cheng Xu, Kai Shen, Long Li, Konstantin Taranov, linux-rdma,
linux-hyperv, linux-kernel, Ruoyu Wang
erdma_post_recv() returns ret after walking the receive work request list.
If the caller passes an empty list, the loop is skipped and ret is not
assigned.
Initialize ret to 0 so an empty receive work request list returns success
instead of stack data.
Fixes: 155055771704 ("RDMA/erdma: Add verbs implementation")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
v2:
- Split the erdma and mana_ib changes into separate patches.
- Add a driver-specific Fixes tag.
drivers/infiniband/hw/erdma/erdma_qp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
index 25f6c49aec779..e002343832f74 100644
--- a/drivers/infiniband/hw/erdma/erdma_qp.c
+++ b/drivers/infiniband/hw/erdma/erdma_qp.c
@@ -734,7 +734,7 @@ int erdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
const struct ib_recv_wr *wr = recv_wr;
struct erdma_qp *qp = to_eqp(ibqp);
unsigned long flags;
- int ret;
+ int ret = 0;
spin_lock_irqsave(&qp->lock, flags);
--
2.51.0
^ permalink raw reply related
* [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-18 3:50 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.
This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.
Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
Also validate the packet length reported in the RX CQE before using it as
a DMA sync length or passing it to skb processing. The CQE is supplied
by the device and should not be blindly trusted by Confidential VMs.
Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
include/net/mana/mana.h | 8 +++
2 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..d8906169666d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,15 +2044,19 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
- dma_addr_t *da, bool *from_pool)
+ dma_addr_t *da, bool *from_pool,
+ struct page **pp_page, u32 *dma_sync_offset)
{
struct page *page;
u32 offset;
void *va;
+
*from_pool = false;
+ *pp_page = NULL;
+ *dma_sync_offset = 0;
/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
- * per page.
+ * per page. These buffers are mapped with dma_map_single().
*/
if (rxq->frag_count == 1) {
/* Reuse XDP dropped page if available */
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
va = page_to_virt(page) + offset;
*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
*from_pool = true;
+ *pp_page = page;
+ *dma_sync_offset = offset + rxq->headroom;
return va;
}
/* Allocate frag for rx buffer, and save the old buf */
static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
- struct mana_recv_buf_oob *rxoob, void **old_buf,
- bool *old_fp)
+ struct mana_recv_buf_oob *rxoob, u32 pktlen,
+ void **old_buf, bool *old_fp)
{
+ u32 dma_sync_offset;
+ struct page *pp_page;
bool from_pool;
dma_addr_t da;
void *va;
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return;
- if (!rxoob->from_pool || rxq->frag_count == 1)
+ if (!rxoob->from_pool || rxq->frag_count == 1) {
dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
DMA_FROM_DEVICE);
+ } else {
+ /* The page pool maps the whole page and only syncs for device
+ * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+ * for the CPU before they are read: this is required if DMA
+ * is incoherent or bounce buffers are used.
+ */
+ page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+ rxoob->dma_sync_offset, pktlen);
+ }
*old_buf = rxoob->buf_va;
*old_fp = rxoob->from_pool;
rxoob->buf_va = va;
rxoob->sgl[0].address = da;
rxoob->from_pool = from_pool;
+ rxoob->pp_page = pp_page;
+ rxoob->dma_sync_offset = dma_sync_offset;
}
static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,12 +2190,24 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ if (unlikely(pktlen > rxq->datasize)) {
+ /* Increase it even if mana_rx_skb() isn't called. */
+ rxq->rx_cq.work_done++;
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ ++ndev->stats.rx_dropped;
+ netdev_warn_once(ndev,
+ "Dropped oversized RX packet: len=%u, datasize=%u\n",
+ pktlen, rxq->datasize);
+
+ /* Reuse the RX buffer since rxbuf_oob is unchanged. */
+ } else {
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ }
mana_move_wq_tail(rxq->gdma_rq,
rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -2566,6 +2598,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
struct mana_rxq *rxq, struct device *dev)
{
struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+ struct page *pp_page = NULL;
+ u32 dma_sync_offset = 0;
bool from_pool = false;
dma_addr_t da;
void *va;
@@ -2573,13 +2607,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
if (mpc->rxbufs_pre)
va = mana_get_rxbuf_pre(rxq, &da);
else
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return -ENOMEM;
rx_oob->buf_va = va;
rx_oob->from_pool = from_pool;
+ rx_oob->pp_page = pp_page;
+ rx_oob->dma_sync_offset = dma_sync_offset;
rx_oob->sgl[0].address = da;
rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
void *buf_va;
bool from_pool; /* allocated from a page pool */
+ /* head page of the page_pool fragment; valid only when
+ * from_pool && frag_count > 1.
+ */
+ struct page *pp_page;
+ /* Fragment offset plus rxq->headroom, passed to
+ * page_pool_dma_sync_for_cpu().
+ */
+ u32 dma_sync_offset;
/* SGL of the buffer going to be sent as part of the work request. */
u32 num_sge;
--
2.34.1
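As a reading aid, the length validation added in mana_process_rx_cqe() can be
reduced to the standalone sketch below. It is not the driver code: the
sketch_rxq type and its counters are hypothetical, and only the comparison of
the reported length against the buffer size comes from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receive queue state. */
struct sketch_rxq {
	uint32_t datasize;	/* size of each posted RX buffer */
	uint64_t rx_packets;
	uint64_t rx_dropped;
};

/*
 * Treat the device-reported length as untrusted input: accept it only if
 * it fits in the buffer, otherwise count a drop and leave the buffer posted.
 */
static bool sketch_rx_accept(struct sketch_rxq *rxq, uint32_t pktlen)
{
	if (pktlen > rxq->datasize) {
		rxq->rx_dropped++;
		return false;
	}
	rxq->rx_packets++;
	return true;
}
```

The check has to run before the length is used for anything else, including
as the size of the sync-for-CPU range.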
^ permalink raw reply related
* RE: [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: Michael Kelley @ 2026-06-17 16:02 UTC (permalink / raw)
To: Hamza Mahfooz, linux-hyperv@vger.kernel.org
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas,
linux-pci@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260612174010.2598695-1-hamzamahfooz@linux.microsoft.com>
From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Friday, June 12, 2026 10:40 AM
>
The addressees for this patch include the linux-hyperv and linux-pci
mailing lists, but not the broad linux-kernel mailing list. Any reason not
to include the latter? In Documentation/process/submitting-patches.rst,
in the section "Select the recipients for your patch", it says
linux-kernel@vger.kernel.org should be used by default for all patches.
I subscribe to the broad linux-kernel list and do my own filtering, instead
of subscribing to individual subsystem lists. So I didn't see this patch
until I glanced at https://lore.kernel.org/linux-hyperv/. Following the
guidance to always include linux-kernel would help me, for one. :-)
Thx,
Michael
> It is possible that we never receive a rescind event, in which case we
> will wait indefinitely for a device that will never show up. So, assume
> a device is gone if we have been polling for more than 5 seconds.
>
> Cc: stable@vger.kernel.org
> Fixes: c3635da2a336 ("PCI: hv: Do not wait forever on a device that has disappeared")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
> drivers/pci/controller/pci-hyperv.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..bd63efc4a210 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -52,6 +52,7 @@
> #include <linux/acpi.h>
> #include <linux/sizes.h>
> #include <linux/of_irq.h>
> +#include <linux/jiffies.h>
> #include <asm/mshyperv.h>
>
> /*
> @@ -1038,6 +1039,8 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
> kfree(hpdev);
> }
>
> +#define TIMEOUT_MS 5000
> +
> /*
> * There is no good way to get notified from vmbus_onoffer_rescind(),
> * so let's use polling here, since this is not a hot path.
> @@ -1045,8 +1048,13 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
> static int wait_for_response(struct hv_device *hdev,
> struct completion *comp)
> {
> + unsigned long timeout = get_jiffies_64() + msecs_to_jiffies(TIMEOUT_MS);
> + unsigned long now;
> +
> while (true) {
> - if (hdev->channel->rescind) {
> + now = get_jiffies_64();
> + if (hdev->channel->rescind ||
> + time_after(now, timeout)) {
> dev_warn_once(&hdev->device, "The device is gone.\n");
> return -ENODEV;
> }
> --
> 2.54.0
>
^ permalink raw reply
* Re: [RFC PATCH v1] PCI: Remove pcie_flr() and convert all callers to use pcie_reset_flr()
From: sashiko-bot @ 2026-06-16 19:22 UTC (permalink / raw)
To: Farhan Ali; +Cc: imx, Frank.Li, linux-hyperv, linux-pci
In-Reply-To: <20260615192200.3489-1-alifm@linux.ibm.com>
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
New issues:
- [Critical] Unchecked VF FLR failure introduces a reliable DMA Use-After-Free (UAF) in the MANA driver during queue teardown.
- [High] Replacing `pcie_flr()` with `pcie_reset_flr()` silently breaks Function Level Reset (FLR) for all SR-IOV Virtual Functions (VFs) across multiple drivers.
- [High] Silent FLR failure in PCI reset quirks causes the kernel to block for 20 seconds during device teardown or initialization.
- [Low] The patch breaks the kernel build because it missed converting a caller of `pcie_flr()` in the `ixgbe` driver.
Pre-existing issues:
- [Critical] An unbounded MMIO read loop causes an infinite loop and hard lockup if the PCIe device is hot-unplugged or encounters a fatal hardware error.
--
commit e494ba565a5af359d72294da06691da5e6507322
Author: Farhan Ali <alifm@linux.ibm.com>
Subject: PCI: Remove pcie_flr() and convert all callers to use pcie_reset_flr()
This patch removes the unconditional pcie_flr() function and converts its
remaining callers to use pcie_reset_flr(). This ensures validation checks,
such as verifying FLR support in the device capabilities, are applied
consistently across the kernel.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
[ ... ]
> int pcie_reset_flr(struct pci_dev *dev, bool probe)
> {
> int ret;
>
> if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
> return -ENOTTY;
>
> if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
> return -ENOTTY;
[Severity: High]
According to the PCIe Base Specification, SR-IOV Virtual Functions do not
report FLR capability in their Device Capabilities register, so this check
will fail for VFs.
Is this a regression that inadvertently breaks FLR for VFs that previously
relied on pcie_flr() bypassing this check? Callers that ignore the return
value will silently skip resetting the hardware.
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
[ ... ]
> @@ -8319,7 +8319,7 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
> if (status_reg != IXGBE_FAILED_READ_CFG_WORD &&
> status_reg & PCI_STATUS_REC_MASTER_ABORT) {
> ixgbe_bad_vf_abort(adapter, vf);
> - pcie_flr(vfdev);
> + pcie_reset_flr(vfdev, PCI_RESET_DO_RESET);
> }
[Severity: Low]
Does this patch introduce a build regression by missing another caller in
ixgbe_main.c?
Searching the tree might reveal that ixgbe_io_error_detected() still calls
the now-removed pcie_flr(), which would cause a build failure when compiling
with CONFIG_IXGBE.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -3306,7 +3306,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
> }
> if (atomic_read(&txq->pending_sends)) {
> err =
> - pcie_flr(to_pci_dev(gd->gdma_context->dev));
> + pcie_reset_flr(to_pci_dev(gd->gdma_context->dev),
> + PCI_RESET_DO_RESET);
> if (err) {
> netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> err,
[Severity: Critical]
If pcie_reset_flr() returns -ENOTTY for VFs, the FLR fails and this loop
breaks early.
Will the driver then proceed to unmap and free active SKB buffers in the
subsequent loop:
mana_dealloc_queues() {
...
for (i = 0; i < apc->num_queues; i++) {
txq = &apc->tx_qp[i].txq;
while ((skb = skb_dequeue(&txq->pending_skbs))) {
mana_unmap_skb(skb, apc);
dev_kfree_skb_any(skb);
}
...
}
Since the hardware wasn't reset and remains active, can this lead to a DMA
use-after-free regression as the hardware continues to process the freed
memory?
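The hazard described above can be made concrete with a small standalone model.
Everything here is hypothetical (the sketch_dev type, its fields and both
functions); it only illustrates the general rule that buffers are released
after the device is known to be quiesced, and does not claim what the MANA
driver should do when the reset is refused.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model. */
struct sketch_dev {
	bool supports_flr;	/* reset is refused when false */
	bool quiesced;		/* set once the device has been reset */
	int pending;		/* buffers the device may still be using */
	int freed;		/* buffers released back to the system */
};

/* Stand-in for a reset that can be refused, as with -ENOTTY above. */
static int sketch_reset(struct sketch_dev *d)
{
	if (!d->supports_flr)
		return -ENOTTY;
	d->quiesced = true;
	return 0;
}

/* Release in-flight buffers only once the reset is known to have happened. */
static int sketch_teardown(struct sketch_dev *d)
{
	if (d->pending) {
		int err = sketch_reset(d);

		if (err)
			return err;	/* device may still own the buffers */
	}
	d->freed += d->pending;
	d->pending = 0;
	return 0;
}
```

Returning the error is only one option; the point of the model is that the
free path must not run while the device is still active.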
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
[ ... ]
> @@ -4207,7 +4209,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
> val = val | HINIC_VF_FLR_PROC_BIT;
> iowrite32be(val, bar + HINIC_VF_OP);
>
> - pcie_flr(pdev);
> + pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
>
> /*
> * The device must recapture its Bus and Device Numbers after FLR
[Severity: High]
If pcie_reset_flr() returns -ENOTTY without actually issuing a reset to the
hardware, will the firmware ever clear the HINIC_VF_FLR_PROC_BIT?
Looking at the rest of reset_hinic_vf_dev(), it appears the code will
subsequently spin in a busy-wait loop for the full 20-second timeout
waiting for that bit to clear:
reset_hinic_vf_dev() {
...
do {
val = ioread32be(bar + HINIC_VF_OP);
if (!(val & HINIC_VF_FLR_PROC_BIT))
goto reset_complete;
msleep(20);
} while (time_before(jiffies, timeout));
...
}
Could this regression silently block the thread for 20 seconds during
device teardown?
> diff --git a/drivers/ptp/ptp_netc.c b/drivers/ptp/ptp_netc.c
> --- a/drivers/ptp/ptp_netc.c
> +++ b/drivers/ptp/ptp_netc.c
[ ... ]
> @@ -802,7 +802,7 @@ static int netc_timer_pci_probe(struct pci_dev *pdev)
> if (!priv)
> return -ENOMEM;
>
> - pcie_flr(pdev);
> + pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
> err = pci_enable_device_mem(pdev);
[Severity: Critical]
This is a pre-existing issue, but while reviewing this driver's PCIe
handling, does the interrupt handler have an unbounded loop on PCIe
disconnect?
In netc_timer_handle_etts_event(), the code continuously reads a
memory-mapped register:
netc_timer_handle_etts_event() {
...
while (netc_timer_rd(priv, NETC_TMR_STAT) & TMR_STAT_ETS_VLD(index)) {
...
}
...
}
If the device is removed or experiences a fatal error, the MMIO read will
return all 1s (0xFFFFFFFF). Because 0xFFFFFFFF masked with the bit is
always true, would this loop spin infinitely and cause a hard lockup?
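One common way to bound such a loop is sketched below in standalone C: cap the
number of iterations and treat an all-ones read as "device gone". The
register reader callback, the event bit, the iteration cap and the choice of
-ENODEV are all assumptions for illustration, not the ptp_netc driver's API.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_EVENT_VALID	0x1u	/* hypothetical "event pending" bit */
#define SKETCH_MAX_EVENTS	16	/* hypothetical per-call budget */

typedef uint32_t (*sketch_read_fn)(void *ctx);

/* Drain pending events, but never spin forever on a dead device. */
static int sketch_drain_events(sketch_read_fn rd, void *ctx)
{
	int handled = 0;

	while (handled < SKETCH_MAX_EVENTS) {
		uint32_t stat = rd(ctx);

		if (stat == UINT32_MAX)		/* all ones: device likely gone */
			return -ENODEV;
		if (!(stat & SKETCH_EVENT_VALID))
			break;			/* nothing left to handle */
		handled++;
	}
	return handled;
}

/* Example reader: a removed device returns all ones on every read. */
static uint32_t sketch_read_gone(void *ctx)
{
	(void)ctx;
	return UINT32_MAX;
}

/* Example reader: reports a pending event until the counter reaches zero. */
static uint32_t sketch_read_countdown(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		(*left)--;
		return SKETCH_EVENT_VALID;
	}
	return 0;
}
```

For example, sketch_drain_events(sketch_read_gone, NULL) returns -ENODEV after
a single read instead of looping.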
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260615192200.3489-1-alifm@linux.ibm.com?part=1
^ permalink raw reply
* Re: [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Jakub Kicinski @ 2026-06-16 1:54 UTC (permalink / raw)
To: Haiyang Zhang
Cc: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Paolo Abeni, Konstantin Taranov, Simon Horman, Shradha Gupta,
Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Breno Leitao,
linux-kernel, linux-rdma, paulros
In-Reply-To: <20260613205812.2659945-1-haiyangz@linux.microsoft.com>
On Sat, 13 Jun 2026 13:57:54 -0700 Haiyang Zhang wrote:
> Add Static and Dynamic Interrupt Moderation (DIM) support for
> Rx and Tx.
> Update queue creation procedure with new data struct with the related
> settings.
> Add functions to collect stat for DIM, and workers to update DIM data
> and settings.
> Update ethtool handler to get/set the moderation settings from a user.
> To avoid detach/re-attach ops, ring DIM doorbell to change settings
> at run time.
> By default, adaptive-rx/tx (DIM) are enabled if supported by HW.
The merge window has started and we need to start working on our PRs.
This will need to be reposted after 7.2-rc1 is tagged, sorry
--
pw-bot: defer
^ permalink raw reply
* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-06-16 0:33 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <ajCXIpDVaVcUcQwd@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, 15 Jun 2026 17:21:54 -0700 Dipayaan Roy wrote:
> On Mon, Jun 15, 2026 at 01:42:47PM -0700, Jakub Kicinski wrote:
> > On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> > > Just a gentle ping on this series. The approach was agreed upon, and it
> > > has picked up a few Reviewed-by tags as well.
> > >
> > > Please let me know if you need anything else from me, or if I should
> > > resend it to collect the tags.
> >
> > Don't recall now what the exact sequence was but pretty sure this
> > no longer applied after some other mana series was merged.
>
> I see, the net-next is closed now, I will rebase and resend this
> once it opens on June 29th.
Sorry for not flagging this sooner, IDK how it escaped the reply.
Maybe some mix of Jake's comments plus it not being applicable
later.
Not to deflect blame but y'all should coordinate better, the "no longer
applies" situation happens in mana a lot more often than with other
drivers :(
^ permalink raw reply
* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-16 0:21 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <20260615134247.0bd7e16e@kernel.org>
On Mon, Jun 15, 2026 at 01:42:47PM -0700, Jakub Kicinski wrote:
> On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> > Just a gentle ping on this series. The approach was agreed upon, and it
> > has picked up a few Reviewed-by tags as well.
> >
> > Please let me know if you need anything else from me, or if I should
> > resend it to collect the tags.
>
> Don't recall now what the exact sequence was but pretty sure this
> no longer applied after some other mana series was merged.
I see, the net-next is closed now, I will rebase and resend this
once it opens on June 29th.
Regards
Dipayaan Roy
^ permalink raw reply
* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-06-15 20:42 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <ajBRwYftEol8IE49@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> Just a gentle ping on this series. The approach was agreed upon, and it
> has picked up a few Reviewed-by tags as well.
>
> Please let me know if you need anything else from me, or if I should
> resend it to collect the tags.
Don't recall now what the exact sequence was but pretty sure this
no longer applied after some other mana series was merged.
^ permalink raw reply
* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-15 19:25 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
Cc: kuba
In-Reply-To: <20260602202801.1873742-1-dipayanroy@linux.microsoft.com>
On Tue, Jun 02, 2026 at 01:24:37PM -0700, Dipayaan Roy wrote:
> On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
> fragments for allocation in the RX refill path (~2kB buffer per fragment)
> causes 15-20% throughput regression under high connection counts
> (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> platforms shows no regression and restores line-rate performance.
>
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.
>
> This series adds an ethtool private flag "full-page-rx" to let the
> user opt in to one RX buffer per page:
>
> ethtool --set-priv-flags eth0 full-page-rx on
>
> There is no behavioral change by default. The flag can be persisted
> via udev rule for affected platforms.
>
> This series depends on the following fixes now merged in net-next:
> 17bfe0a8c014 ("net: mana: add NULL guards in teardown path to prevent panic")
> 5b05aa36ee24 ("net: mana: skip redundant detach on already-detached port")
>
> Changes in v10:
> - Rebased on net-next which now includes the prerequisite fixes.
> - Recovery logic in mana_set_priv_flags() leverages the idempotent
> mana_detach() from the merged fixes.
> Changes in v9:
> - Added correct tree.
> Changes in v8:
> - Fixed queue_reset_work recovery by restoring port_is_up before
> scheduling reset so the handler can properly re-attach.
> - Simplified "err && schedule_port_reset" to "schedule_port_reset".
> Changes in v7:
> - Rebased onto net-next.
> - Retained private flag approach after David Wei's testing on
> Grace (ARM64) confirmed that fragment mode outperforms
> full-page mode on other platforms, validating this is a
> single-platform workaround rather than a generic issue.
> Changes in v6:
> - Added missed maintainers.
> Changes in v5:
> - Split prep refactor into separate patch (patch 1/2)
> Changes in v4:
> > - Dropped the smbios string parsing and added an ethtool priv flag
> > to reconfigure the queues with full-page RX buffers.
> Changes in v3:
> - changed u8* to char*
> Changes in v2:
> - separate reading string index and the string, remove inline.
>
> Dipayaan Roy (2):
> net: mana: refactor mana_get_strings() and mana_get_sset_count() to
> use switch
> net: mana: force full-page RX buffers via ethtool private flag
>
> drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++-
> .../ethernet/microsoft/mana/mana_ethtool.c | 178 +++++++++++++++---
> include/net/mana/mana.h | 8 +
> 3 files changed, 177 insertions(+), 31 deletions(-)
>
> --
> 2.43.0
>
Hi Jakub,
Just a gentle ping on this series. The approach was agreed upon, and it
has picked up a few Reviewed-by tags as well.
Please let me know if you need anything else from me, or if I should
resend it to collect the tags.
Thanks,
Dipayaan Roy
^ permalink raw reply
* [RFC PATCH v1] PCI: Remove pcie_flr() and convert all callers to use pcie_reset_flr()
From: Farhan Ali @ 2026-06-15 19:21 UTC (permalink / raw)
To: linux-kernel, linux-pci
Cc: helgaas, giovanni.cabiddu, alifm, herbert, davem,
dennis.dalessandro, jgg, leon, vikas.gupta, edumazet, kuba,
pabeni, michael.chan, pavan.chebbi, claudiu.manoil,
vladimir.oltean, wei.fang, xiaoning.wang, anthony.l.nguyen,
przemyslaw.kitszel, kys, haiyangz, wei.liu, decui, longli,
richardcochran, Andrew Lunn, Bjorn Helgaas, open list:QAT DRIVER,
open list:CRYPTO API, open list:HFI1 DRIVER,
open list:BROADCOM BNG_EN 800 GIGABIT ETHERNET DRIVER,
open list:FREESCALE ENETC ETHERNET DRIVERS,
moderated list:INTEL ETHERNET DRIVERS,
open list:Hyper-V/Azure CORE AND DRIVERS
The pcie_reset_flr() function includes validation checks to verify FLR
support before performing the reset, while pcie_flr() performs the reset
unconditionally. Having both functions creates unnecessary complexity.
Commit 56f107d7813f ("PCI: Add pcie_reset_flr() with 'probe' argument")
introduced pcie_reset_flr() and removed pcie_has_flr(), converting callers
that previously used the pcie_has_flr() + pcie_flr() pair to use
pcie_reset_flr() instead. However, it did not convert all pcie_flr()
callers, leaving two different FLR mechanisms in the kernel.
One of the callers of pcie_flr() is the reset quirk for the Intel 82599
Virtual Function, which has a defect: FLR works even though the device does
not advertise FLR support in its PCIe Device Capabilities register. Rather
than using pcie_flr() to work around this, set the FLR capability bit in
devcap via an early quirk. This allows the device to use the standard
pcie_reset_flr() path instead of requiring a device-specific reset method.
Remove pcie_flr() entirely and convert all remaining callers to
pcie_reset_flr(), ensuring consistent validation across the kernel.
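For illustration only, here is a small userspace sketch of the behaviour callers can expect after the conversion. It mirrors the order of checks in the pci.c hunk of this patch, but struct mock_pci_dev, the MOCK_* constants and both mock_* functions are stand-ins invented for this example, not kernel definitions, and the actual reset is reduced to a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flags; values are illustrative. */
#define MOCK_DEV_FLAGS_NO_FLR_RESET (1u << 0)
#define MOCK_DEVCAP_FLR             (1u << 28)
#define MOCK_RESET_PROBE            true
#define MOCK_RESET_DO_RESET         false

/* Minimal model of the fields the FLR path looks at. */
struct mock_pci_dev {
	unsigned int dev_flags;
	unsigned int devcap;
	int resets_done;	/* counts resets instead of touching hardware */
};

/* Same order of checks as pcie_reset_flr() in the patch. */
static int mock_pcie_reset_flr(struct mock_pci_dev *dev, bool probe)
{
	assert(dev != NULL);

	if (dev->dev_flags & MOCK_DEV_FLAGS_NO_FLR_RESET)
		return -ENOTTY;

	if (!(dev->devcap & MOCK_DEVCAP_FLR))
		return -ENOTTY;

	if (probe)
		return 0;	/* reset is possible, but not performed */

	dev->resets_done++;	/* stands in for the real FLR sequence */
	return 0;
}

/* Models the early quirk: advertise FLR in the cached devcap. */
static void mock_quirk_force_flr_cap(struct mock_pci_dev *dev)
{
	assert(dev != NULL);
	dev->devcap |= MOCK_DEVCAP_FLR;
}
```

In this model a device that does not advertise FLR gets -ENOTTY for both probe and reset until the quirk has set the capability bit. That is the reason the 82599 VF needs the early fixup once the unconditional pcie_flr() is gone.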
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/crypto/intel/qat/qat_common/adf_aer.c | 2 +-
drivers/infiniband/hw/hfi1/chip.c | 4 +-
.../net/ethernet/broadcom/bnge/bnge_core.c | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 +-
.../ethernet/cavium/liquidio/lio_vf_main.c | 2 +-
.../ethernet/cavium/liquidio/octeon_mailbox.c | 3 +-
drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
.../ethernet/freescale/enetc/enetc_pci_mdio.c | 2 +-
drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 +-
drivers/pci/pci.c | 38 ++++++------------
drivers/pci/quirks.c | 40 +++++++++----------
drivers/ptp/ptp_netc.c | 2 +-
include/linux/pci.h | 1 -
15 files changed, 49 insertions(+), 60 deletions(-)
diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index ed01fb9ad74e..a2364a59bc7f 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -89,7 +89,7 @@ EXPORT_SYMBOL_GPL(adf_reset_sbr);
void adf_reset_flr(struct adf_accel_dev *accel_dev)
{
- pcie_flr(accel_to_pci_dev(accel_dev));
+ pcie_reset_flr(accel_to_pci_dev(accel_dev), PCI_RESET_DO_RESET);
}
EXPORT_SYMBOL_GPL(adf_reset_flr);
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 44c524e45396..9f53d73e5e76 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -14042,7 +14042,7 @@ static int init_chip(struct hfi1_devdata *dd)
dd_dev_info(dd, "Resetting CSRs with FLR\n");
/* do the FLR, the DC reset will remain */
- pcie_flr(dd->pcidev);
+ pcie_reset_flr(dd->pcidev, PCI_RESET_DO_RESET);
/* restore command and BARs */
ret = restore_pci_variables(dd);
@@ -14054,7 +14054,7 @@ static int init_chip(struct hfi1_devdata *dd)
if (is_ax(dd)) {
dd_dev_info(dd, "Resetting CSRs with FLR\n");
- pcie_flr(dd->pcidev);
+ pcie_reset_flr(dd->pcidev, PCI_RESET_DO_RESET);
ret = restore_pci_variables(dd);
if (ret) {
dd_dev_err(dd, "%s: Could not restore PCI variables\n",
diff --git a/drivers/net/ethernet/broadcom/bnge/bnge_core.c b/drivers/net/ethernet/broadcom/bnge/bnge_core.c
index 68b74eb2c3a2..4aec01f53e54 100644
--- a/drivers/net/ethernet/broadcom/bnge/bnge_core.c
+++ b/drivers/net/ethernet/broadcom/bnge/bnge_core.c
@@ -274,7 +274,7 @@ static int bnge_probe_one(struct pci_dev *pdev, const struct pci_device_id *ent)
if (is_kdump_kernel()) {
pci_clear_master(pdev);
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
}
rc = bnge_pci_enable(pdev);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 35e1f8f663c7..21f8dcbe671e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16918,7 +16918,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
*/
if (is_kdump_kernel()) {
pci_clear_master(pdev);
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
}
max_irqs = bnxt_get_max_irq(pdev);
@@ -17203,7 +17203,7 @@ static void bnxt_shutdown(struct pci_dev *pdev)
netif_close(dev);
if (bnxt_hwrm_func_drv_unrgtr(bp)) {
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
goto shutdown_exit;
}
bnxt_ptp_clear(bp);
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 43c595f3b84e..7f3557d36341 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -429,7 +429,7 @@ static void octeon_pci_flr(struct octeon_device *oct)
pci_write_config_word(oct->pci_dev, PCI_COMMAND,
PCI_COMMAND_INTX_DISABLE);
- pcie_flr(oct->pci_dev);
+ pcie_reset_flr(oct->pci_dev, PCI_RESET_DO_RESET);
pci_cfg_access_unlock(oct->pci_dev);
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
index ad685f5d0a13..be08e213aa9a 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -260,7 +260,8 @@ static int octeon_mbox_process_cmd(struct octeon_mbox *mbox,
dev_info(&oct->pci_dev->dev,
"got a request for FLR from VF that owns DPI ring %u\n",
mbox->q_no);
- pcie_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no]);
+ pcie_reset_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no],
+ PCI_RESET_DO_RESET);
break;
case OCTEON_PF_CHANGED_VF_MACADDR:
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index aa8a87124b10..c1c1b523abb5 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -3635,7 +3635,7 @@ int enetc_pci_probe(struct pci_dev *pdev, const char *name, int sizeof_priv)
size_t alloc_size;
int err, len;
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
err = pci_enable_device_mem(pdev);
if (err)
return dev_err_probe(&pdev->dev, err, "device enable failed\n");
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c b/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c
index e108cac8288d..cfccfca1981d 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c
@@ -73,7 +73,7 @@ static int enetc_pci_mdio_probe(struct pci_dev *pdev,
mdio_priv->mdio_base = ENETC_EMDIO_BASE;
snprintf(bus->id, MII_BUS_ID_SIZE, "%s", dev_name(dev));
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
err = pci_enable_device_mem(pdev);
if (err) {
dev_err(dev, "device enable failed\n");
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index e2fbe111f849..14b8a90625a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -5180,7 +5180,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
if (is_kdump_kernel()) {
pci_save_state(pdev);
pci_clear_master(pdev);
- err = pcie_flr(pdev);
+ err = pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
if (err)
return err;
pci_restore_state(pdev);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2646ee6f295f..d8796a68094f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8318,7 +8318,7 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
if (status_reg != IXGBE_FAILED_READ_CFG_WORD &&
status_reg & PCI_STATUS_REC_MASTER_ABORT) {
ixgbe_bad_vf_abort(adapter, vf);
- pcie_flr(vfdev);
+ pcie_reset_flr(vfdev, PCI_RESET_DO_RESET);
}
}
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..e51c1170aba7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3305,7 +3305,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
}
if (atomic_read(&txq->pending_sends)) {
err =
- pcie_flr(to_pci_dev(gd->gdma_context->dev));
+ pcie_reset_flr(to_pci_dev(gd->gdma_context->dev),
+ PCI_RESET_DO_RESET);
if (err) {
netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
err,
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651ad0..878556ea50de 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4321,16 +4321,25 @@ int pci_wait_for_pending_transaction(struct pci_dev *dev)
EXPORT_SYMBOL(pci_wait_for_pending_transaction);
/**
- * pcie_flr - initiate a PCIe function level reset
+ * pcie_reset_flr - initiate a PCIe function level reset
* @dev: device to reset
+ * @probe: if true, return 0 if device can be reset this way
*
- * Initiate a function level reset unconditionally on @dev without
- * checking any flags and DEVCAP
+ * Initiate a function level reset on @dev.
*/
-int pcie_flr(struct pci_dev *dev)
+int pcie_reset_flr(struct pci_dev *dev, bool probe)
{
int ret;
+ if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
+ return -ENOTTY;
+
+ if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
+ return -ENOTTY;
+
+ if (probe)
+ return 0;
+
if (!pci_wait_for_pending_transaction(dev))
pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
@@ -4357,28 +4366,7 @@ int pcie_flr(struct pci_dev *dev)
done:
pci_dev_reset_iommu_done(dev);
return ret;
-}
-EXPORT_SYMBOL_GPL(pcie_flr);
-
-/**
- * pcie_reset_flr - initiate a PCIe function level reset
- * @dev: device to reset
- * @probe: if true, return 0 if device can be reset this way
- *
- * Initiate a function level reset on @dev.
- */
-int pcie_reset_flr(struct pci_dev *dev, bool probe)
-{
- if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
- return -ENOTTY;
-
- if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
- return -ENOTTY;
-
- if (probe)
- return 0;
- return pcie_flr(dev);
}
EXPORT_SYMBOL_GPL(pcie_reset_flr);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..564f581599b8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2019,6 +2019,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PXH_0, quirk_pc
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PXH_1, quirk_pcie_pxh);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PXHV, quirk_pcie_pxh);
+#define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed
+static void quirk_intel_82599_sfp_virtfn(struct pci_dev *dev)
+{
+ /*
+ * http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf
+ *
+ * The 82599 supports FLR on VFs, but FLR support is reported only
+ * in the PF DEVCAP (sec 9.3.10.4), not in the VF DEVCAP (sec 9.5).
+ * So enable PCI_EXP_DEVCAP_FLR directly without first checking if it is
+ * supported.
+ */
+
+ dev->devcap |= PCI_EXP_DEVCAP_FLR;
+}
+
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, quirk_intel_82599_sfp_virtfn);
+
/*
* Some Intel PCI Express chipsets have trouble with downstream device
* power management.
@@ -3944,20 +3961,6 @@ DECLARE_PCI_FIXUP_SUSPEND_LATE(PCI_VENDOR_ID_INTEL,
* reset a single function if other methods (e.g. FLR, PM D0->D3) are
* not available.
*/
-static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
-{
- /*
- * http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf
- *
- * The 82599 supports FLR on VFs, but FLR support is reported only
- * in the PF DEVCAP (sec 9.3.10.4), not in the VF DEVCAP (sec 9.5).
- * Thus we must call pcie_flr() directly without first checking if it is
- * supported.
- */
- if (!probe)
- pcie_flr(dev);
- return 0;
-}
#define SOUTH_CHICKEN2 0xc2004
#define PCH_PP_STATUS 0xc7200
@@ -4058,7 +4061,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
PCI_MSIX_FLAGS_ENABLE |
PCI_MSIX_FLAGS_MASKALL);
- pcie_flr(dev);
+ pcie_reset_flr(dev, PCI_RESET_DO_RESET);
/*
* Restore the configuration information (BAR values, etc.) including
@@ -4070,7 +4073,6 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
return 0;
}
-#define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed
#define PCI_DEVICE_ID_INTEL_IVB_M_VGA 0x0156
#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
@@ -4150,7 +4152,7 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
pci_iounmap(dev, bar);
- pcie_flr(dev);
+ pcie_reset_flr(dev, PCI_RESET_DO_RESET);
return 0;
}
@@ -4207,7 +4209,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
val = val | HINIC_VF_FLR_PROC_BIT;
iowrite32be(val, bar + HINIC_VF_OP);
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
/*
* The device must recapture its Bus and Device Numbers after FLR
@@ -4238,8 +4240,6 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
}
static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
- { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
- reset_intel_82599_sfp_virtfn },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA,
reset_ivb_igd },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
diff --git a/drivers/ptp/ptp_netc.c b/drivers/ptp/ptp_netc.c
index 94e952ee6990..24bae237926a 100644
--- a/drivers/ptp/ptp_netc.c
+++ b/drivers/ptp/ptp_netc.c
@@ -802,7 +802,7 @@ static int netc_timer_pci_probe(struct pci_dev *pdev)
if (!priv)
return -ENOMEM;
- pcie_flr(pdev);
+ pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
err = pci_enable_device_mem(pdev);
if (err)
return dev_err_probe(dev, err, "Failed to enable device\n");
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..345f0821471a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1468,7 +1468,6 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
int pcie_link_speed_mbps(struct pci_dev *pdev);
void pcie_print_link_status(struct pci_dev *dev);
int pcie_reset_flr(struct pci_dev *dev, bool probe);
-int pcie_flr(struct pci_dev *dev);
int __pci_reset_function_locked(struct pci_dev *dev);
int pci_reset_function(struct pci_dev *dev);
int pci_reset_function_locked(struct pci_dev *dev);
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Simon Horman @ 2026-06-15 12:51 UTC (permalink / raw)
To: Haiyang Zhang
Cc: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Shradha Gupta,
Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Breno Leitao,
linux-kernel, linux-rdma, paulros
In-Reply-To: <20260613205812.2659945-1-haiyangz@linux.microsoft.com>
On Sat, Jun 13, 2026 at 01:57:54PM -0700, Haiyang Zhang wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
>
> Add Static and Dynamic Interrupt Moderation (DIM) support for
> Rx and Tx.
> Update queue creation procedure with new data struct with the related
> settings.
> Add functions to collect stat for DIM, and workers to update DIM data
> and settings.
> Update ethtool handler to get/set the moderation settings from a user.
> To avoid detach/re-attach ops, ring DIM doorbell to change settings
> at run time.
> By default, adaptive-rx/tx (DIM) are enabled if supported by HW.
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> v4:
> Fixed tx stat, concurrency, and mb issues from Simon's review.
...
Thanks for your comprehensive reply to the AI-generated review of v3
that I forwarded. And for fixing the issues present in v3.
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-13 20:57 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
Breno Leitao, linux-kernel, linux-rdma
Cc: paulros
From: Haiyang Zhang <haiyangz@microsoft.com>
Add Static and Dynamic Interrupt Moderation (DIM) support for
Rx and Tx.
Update the queue creation procedure to use a new data struct carrying the
related settings.
Add functions to collect stats for DIM, and workers to update DIM data
and settings.
Update ethtool handler to get/set the moderation settings from a user.
To avoid detach/re-attach ops, ring DIM doorbell to change settings
at run time.
By default, adaptive-rx/tx (DIM) are enabled if supported by HW.
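As a rough sketch of the run-time update idea (clamp the moderation values, pack them into one doorbell word, mark them valid), the following userspace model may help. The field layout, the SK_* names and the sk_* helpers are assumptions made up for this example; the real masks live in the MANA headers and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field layout, chosen only for this sketch. */
#define SK_USEC_SHIFT 0
#define SK_USEC_MASK  0x000003FFu	/* bits 9:0   */
#define SK_COMP_SHIFT 10
#define SK_COMP_MASK  0x000FFC00u	/* bits 19:10 */
#define SK_USEC_VLD   (1u << 31)

static uint32_t sk_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Clamp both values to their field width, then pack them into one word. */
static uint32_t sk_pack_dim(uint32_t usec, bool usec_vld, uint32_t comps)
{
	uint32_t val;

	usec = sk_min_u32(usec, SK_USEC_MASK >> SK_USEC_SHIFT);
	comps = sk_min_u32(comps, SK_COMP_MASK >> SK_COMP_SHIFT);

	val = ((usec << SK_USEC_SHIFT) & SK_USEC_MASK) |
	      ((comps << SK_COMP_SHIFT) & SK_COMP_MASK);
	if (usec_vld)
		val |= SK_USEC_VLD;

	return val;
}

static uint32_t sk_unpack_usec(uint32_t val)
{
	return (val & SK_USEC_MASK) >> SK_USEC_SHIFT;
}

static uint32_t sk_unpack_comps(uint32_t val)
{
	return (val & SK_COMP_MASK) >> SK_COMP_SHIFT;
}

/* Round-trip check: what goes in below the field limits comes back out. */
static void sk_roundtrip_check(void)
{
	uint32_t val = sk_pack_dim(64, true, 32);

	assert(sk_unpack_usec(val) == 64);
	assert(sk_unpack_comps(val) == 32);
	assert(val & SK_USEC_VLD);
}
```

Clamping before packing keeps an oversized profile value from spilling into the neighbouring field, which is the same concern the min_t() calls in the DIM workers address.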
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v4:
Fixed tx stat, concurrency, and mb issues from Simon's review.
v3:
Updated to avoid detach/re-attach ops as suggested by Paolo.
v2:
Updated with comments from Jedrzej.
---
drivers/net/ethernet/microsoft/Kconfig | 1 +
.../net/ethernet/microsoft/mana/gdma_main.c | 29 +++
drivers/net/ethernet/microsoft/mana/mana_en.c | 171 ++++++++++++++++++
.../ethernet/microsoft/mana/mana_ethtool.c | 167 ++++++++++++++++-
include/net/mana/gdma.h | 24 ++-
include/net/mana/mana.h | 54 ++++++
6 files changed, 437 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 3f36ee6a8ece..e9be18c92ca5 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -21,6 +21,7 @@ config MICROSOFT_MANA
depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
depends on PCI_HYPERV
select AUXILIARY_BUS
+ select DIMLIB
select PAGE_POOL
select NET_SHAPER
help
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index c9ec80a1dd6f..7a012b1e5751 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2021, Microsoft Corporation. */
+#include <linux/bitfield.h>
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/pci.h>
@@ -464,6 +465,7 @@ static int mana_gd_disable_queue(struct gdma_queue *queue)
#define DOORBELL_OFFSET_RQ 0x400
#define DOORBELL_OFFSET_CQ 0x800
#define DOORBELL_OFFSET_EQ 0xFF8
+#define DOORBELL_OFFSET_DIM 0x820
static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
enum gdma_queue_type q_type, u32 qid,
@@ -504,6 +506,16 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
addr += DOORBELL_OFFSET_SQ;
break;
+ case GDMA_DIM:
+ e.dim.id = qid;
+ e.dim.mod_usec = FIELD_GET(MANA_INTR_MODR_USEC_MAX, tail_ptr);
+ e.dim.mod_usec_vld = !!(tail_ptr & MANA_INTR_MODR_USEC_VLD);
+ e.dim.mod_comps = FIELD_GET(MANA_INTR_MODR_COMP_MASK, tail_ptr);
+ e.dim.mod_comps_vld = num_req;
+
+ addr += DOORBELL_OFFSET_DIM;
+ break;
+
default:
WARN_ON(1);
return;
@@ -538,6 +550,23 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
}
EXPORT_SYMBOL_NS(mana_gd_ring_cq, "NET_MANA");
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+ u32 mod_comps, bool mod_comps_vld)
+{
+ struct gdma_context *gc = cq->gdma_dev->gdma_context;
+ u32 dim_val;
+
+ /* Convert the DIM values to doorbell parameters */
+ dim_val = FIELD_PREP(MANA_INTR_MODR_USEC_MAX, mod_usec) |
+ FIELD_PREP(MANA_INTR_MODR_COMP_MASK, mod_comps);
+ if (mod_usec_vld)
+ dim_val |= MANA_INTR_MODR_USEC_VLD;
+
+ mana_gd_ring_doorbell(gc, cq->gdma_dev->doorbell, GDMA_DIM, cq->id,
+ dim_val, mod_comps_vld);
+}
+EXPORT_SYMBOL_NS(mana_gd_ring_dim, "NET_MANA");
+
#define MANA_SERVICE_PERIOD 10
static void mana_serv_rescan(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 26aef21c6c2c..d36850084f2e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
sizeof(req), sizeof(resp));
+
+ req.hdr.req.msg_version = GDMA_MESSAGE_V3;
+ req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
req.vport = vport;
req.wq_type = wq_type;
req.wq_gdma_region = wq_spec->gdma_region;
@@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
req.cq_size = cq_spec->queue_size;
req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
req.cq_parent_qid = cq_spec->attached_eq;
+ req.req_cq_moderation = cq_spec->req_cq_moderation;
+ req.cq_moderation_comp = cq_spec->cq_moderation_comp;
+ req.cq_moderation_usec = cq_spec->cq_moderation_usec;
err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
sizeof(resp));
@@ -1844,6 +1850,7 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
struct gdma_posted_wqe_info *wqe_info;
unsigned int pkt_transmitted = 0;
unsigned int wqe_unit_cnt = 0;
+ unsigned int tx_bytes = 0;
struct mana_txq *txq = cq->txq;
struct mana_port_context *apc;
struct netdev_queue *net_txq;
@@ -1925,6 +1932,8 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
mana_unmap_skb(skb, apc);
+ tx_bytes += skb->len;
+
napi_consume_skb(skb, cq->budget);
pkt_transmitted++;
@@ -1955,6 +1964,10 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
WARN_ON_ONCE(1);
+ /* Feed DIM with the completion rate observed here, in NAPI context. */
+ cq->tx_dim_pkts += pkt_transmitted;
+ cq->tx_dim_bytes += tx_bytes;
+
cq->work_done = pkt_transmitted;
}
@@ -2306,6 +2319,119 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
xdp_do_flush();
}
+static void mana_rx_dim_work(struct work_struct *work)
+{
+ struct dim *dim = container_of(work, struct dim, work);
+ struct dim_cq_moder cur_moder;
+ struct mana_cq *cq;
+
+ cur_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
+ cq = container_of(dim, struct mana_cq, dim);
+
+ cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+ cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+ mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+ cur_moder.pkts, true);
+
+ dim->state = DIM_START_MEASURE;
+}
+
+static void mana_tx_dim_work(struct work_struct *work)
+{
+ struct dim *dim = container_of(work, struct dim, work);
+ struct dim_cq_moder cur_moder;
+ struct mana_cq *cq;
+
+ cur_moder = net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+ cq = container_of(dim, struct mana_cq, dim);
+
+ cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+ cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+ mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+ cur_moder.pkts, true);
+
+ dim->state = DIM_START_MEASURE;
+}
+
+/* The caller must update apc->rx/tx_dim_enabled before disabling and
+ * after enabling, and must call synchronize_net() before draining the
+ * DIM work, so that NAPI cannot observe a stale flag.
+ */
+int mana_dim_change(struct mana_cq *cq, bool enable)
+{
+ bool is_rx = cq->type == MANA_CQ_TYPE_RX;
+ struct mana_port_context *apc;
+ work_func_t work_func;
+ u32 usec, comp;
+
+ if (is_rx) {
+ apc = netdev_priv(cq->rxq->ndev);
+ usec = apc->intr_modr_rx_usec;
+ comp = apc->intr_modr_rx_comp;
+ work_func = mana_rx_dim_work;
+ } else {
+ apc = netdev_priv(cq->txq->ndev);
+ usec = apc->intr_modr_tx_usec;
+ comp = apc->intr_modr_tx_comp;
+ work_func = mana_tx_dim_work;
+ }
+
+ /* On enable, zero the DIM state so net_dim() starts measuring from
+ * scratch.
+ * On disable, drain any pending DIM work and restore the static
+ * moderation values.
+ */
+ if (enable) {
+ memset(&cq->dim, 0, sizeof(cq->dim));
+ cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+ INIT_WORK(&cq->dim.work, work_func);
+ } else {
+ cancel_work_sync(&cq->dim.work);
+ mana_gd_ring_dim(cq->gdma_cq, usec, true, comp, true);
+ }
+
+ return 0;
+}
+
+static void mana_update_rx_dim(struct mana_cq *cq)
+{
+ struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
+ struct dim_sample dim_sample = {};
+ struct mana_rxq *rxq = cq->rxq;
+
+ /* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+ * enable flag set guarantees the DIM (re)initialization is visible.
+ */
+ if (!smp_load_acquire(&apc->rx_dim_enabled))
+ return;
+
+ dim_update_sample(READ_ONCE(cq->dim_event_ctr), rxq->stats.packets,
+ rxq->stats.bytes, &dim_sample);
+ net_dim(&cq->dim, &dim_sample);
+}
+
+static void mana_update_tx_dim(struct mana_cq *cq)
+{
+ struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
+ struct dim_sample dim_sample = {};
+
+ /* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+ * enable flag set guarantees the DIM (re)initialization is visible.
+ */
+ if (!smp_load_acquire(&apc->tx_dim_enabled))
+ return;
+
+ /* cq->tx_dim_pkts/bytes are accumulated in mana_poll_tx_cq(), in the
+ * same NAPI context as this read, so they track the hardware
+ * completion rate and need no u64_stats_sync protection.
+ */
+ dim_update_sample(READ_ONCE(cq->dim_event_ctr), cq->tx_dim_pkts,
+ cq->tx_dim_bytes, &dim_sample);
+ net_dim(&cq->dim, &dim_sample);
+}
+
static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
{
struct mana_cq *cq = context;
@@ -2324,6 +2450,15 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
if (w < cq->budget) {
mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
cq->work_done_since_doorbell = 0;
+
+ /* Update DIM before napi_complete_done() to prevent running
+ * net_dim() concurrently.
+ */
+ if (cq->type == MANA_CQ_TYPE_RX)
+ mana_update_rx_dim(cq);
+ else
+ mana_update_tx_dim(cq);
+
napi_complete_done(&cq->napi, w);
} else if (cq->work_done_since_doorbell >=
(cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
@@ -2356,6 +2491,7 @@ static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
{
struct mana_cq *cq = context;
+ WRITE_ONCE(cq->dim_event_ctr, cq->dim_event_ctr + 1);
napi_schedule_irqoff(&cq->napi);
}
@@ -2398,6 +2534,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
if (apc->tx_qp[i]->txq.napi_initialized) {
napi_synchronize(napi);
napi_disable_locked(napi);
+ cancel_work_sync(&apc->tx_qp[i]->tx_cq.dim.work);
netif_napi_del_locked(napi);
apc->tx_qp[i]->txq.napi_initialized = false;
}
@@ -2529,6 +2666,11 @@ static int mana_create_txq(struct mana_port_context *apc,
cq_spec.modr_ctx_id = 0;
cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
+ /* DIM setting can be changed at runtime */
+ cq_spec.req_cq_moderation = true;
+ cq_spec.cq_moderation_usec = apc->intr_modr_tx_usec;
+ cq_spec.cq_moderation_comp = apc->intr_modr_tx_comp;
+
err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
&wq_spec, &cq_spec,
&apc->tx_qp[i]->tx_object);
@@ -2559,6 +2701,13 @@ static int mana_create_txq(struct mana_port_context *apc,
set_bit(NAPI_STATE_NO_BUSY_POLL, &cq->napi.state);
netif_napi_add_locked(net, &cq->napi, mana_poll);
+
+ /* Initialize the DIM work before enabling NAPI, so that a poll
+ * cannot reach net_dim() with an uninitialized cq->dim.work.
+ */
+ INIT_WORK(&cq->dim.work, mana_tx_dim_work);
+ cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
napi_enable_locked(&cq->napi);
txq->napi_initialized = true;
@@ -2596,6 +2745,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
napi_synchronize(napi);
napi_disable_locked(napi);
+ cancel_work_sync(&rxq->rx_cq.dim.work);
netif_napi_del_locked(napi);
}
@@ -2834,6 +2984,11 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
cq_spec.modr_ctx_id = 0;
cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
+ /* DIM setting can be changed at runtime */
+ cq_spec.req_cq_moderation = true;
+ cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
+ cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;
+
err = mana_create_wq_obj(apc, apc->port_handle, GDMA_RQ,
&wq_spec, &cq_spec, &rxq->rxobj);
if (err)
@@ -2866,6 +3021,12 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
rxq->page_pool));
+ /* Initialize the DIM work before enabling NAPI, so that a poll
+ * cannot reach net_dim() with an uninitialized cq->dim.work.
+ */
+ INIT_WORK(&cq->dim.work, mana_rx_dim_work);
+ cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
napi_enable_locked(&cq->napi);
mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
@@ -3532,6 +3693,16 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc->link_cfg_error = 1;
apc->cqe_coalescing_enable = 0;
+ /* Initialize interrupt moderation settings if supported by HW */
+ if (gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION) {
+ apc->intr_modr_rx_usec = MANA_INTR_MODR_USEC_DEF;
+ apc->intr_modr_rx_comp = MANA_INTR_MODR_COMP_DEF;
+ apc->intr_modr_tx_usec = MANA_INTR_MODR_USEC_DEF;
+ apc->intr_modr_tx_comp = MANA_INTR_MODR_COMP_DEF;
+ apc->rx_dim_enabled = MANA_ADAPTIVE_RX_DEF;
+ apc->tx_dim_enabled = MANA_ADAPTIVE_TX_DEF;
+ }
+
mutex_init(&apc->vport_mutex);
apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..5e5fb5b18bbf 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -419,6 +419,15 @@ static int mana_get_coalesce(struct net_device *ndev,
!kernel_coal->rx_cqe_nsecs)
kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
+ ec->rx_coalesce_usecs = apc->intr_modr_rx_usec;
+ ec->rx_max_coalesced_frames = apc->intr_modr_rx_comp;
+
+ ec->tx_coalesce_usecs = apc->intr_modr_tx_usec;
+ ec->tx_max_coalesced_frames = apc->intr_modr_tx_comp;
+
+ ec->use_adaptive_rx_coalesce = apc->rx_dim_enabled;
+ ec->use_adaptive_tx_coalesce = apc->tx_dim_enabled;
+
return 0;
}
@@ -428,9 +437,34 @@ static int mana_set_coalesce(struct net_device *ndev,
struct netlink_ext_ack *extack)
{
struct mana_port_context *apc = netdev_priv(ndev);
- u8 saved_cqe_coalescing_enable;
+ struct {
+ u16 intr_modr_rx_usec;
+ u16 intr_modr_rx_comp;
+ u16 intr_modr_tx_usec;
+ u16 intr_modr_tx_comp;
+ u8 cqe_coalescing_enable;
+ bool rx_dim_enabled;
+ bool tx_dim_enabled;
+ } saved;
+ bool modr_changed = false;
+ bool dim_changed = false;
+ struct gdma_context *gc;
int err;
+ gc = apc->ac->gdma_dev->gdma_context;
+
+ /* Both static and dynamic interrupt moderation (DIM) rely on the
+ * same HW capability advertised by the PF.
+ */
+ if ((ec->use_adaptive_rx_coalesce || ec->use_adaptive_tx_coalesce ||
+ ec->rx_coalesce_usecs || ec->tx_coalesce_usecs ||
+ ec->rx_max_coalesced_frames || ec->tx_max_coalesced_frames) &&
+ !(gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)) {
+ NL_SET_ERR_MSG(extack,
+ "Interrupt Moderation is not supported by HW");
+ return -EOPNOTSUPP;
+ }
+
if (kernel_coal->rx_cqe_frames != 1 &&
kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
NL_SET_ERR_MSG_FMT(extack,
@@ -440,18 +474,129 @@ static int mana_set_coalesce(struct net_device *ndev,
return -EINVAL;
}
- saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+ if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
+ ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "coalesce usecs must be <= %lu",
+ MANA_INTR_MODR_USEC_MAX);
+ return -EINVAL;
+ }
+
+ if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
+ ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "coalesce frames must be <= %lu",
+ MANA_INTR_MODR_COMP_MAX);
+ return -EINVAL;
+ }
+
+ if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
+ ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
+ ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
+ ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
+ modr_changed = true;
+
+ saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
+ saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
+ saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
+ saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
+
+ apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
+ apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
+ apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
+ apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
+
+ if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
+ !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
+ dim_changed = true;
+
+ saved.rx_dim_enabled = apc->rx_dim_enabled;
+ saved.tx_dim_enabled = apc->tx_dim_enabled;
+
+ saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
apc->cqe_coalescing_enable =
kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
- if (!apc->port_is_up)
+ if (!apc->port_is_up) {
+ WRITE_ONCE(apc->rx_dim_enabled, !!ec->use_adaptive_rx_coalesce);
+ WRITE_ONCE(apc->tx_dim_enabled, !!ec->use_adaptive_tx_coalesce);
return 0;
+ }
- err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
- if (err)
- apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+ if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable) {
+ /* CQE coalescing setting is applied via RSS configuration. */
+ err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+ if (err) {
+ netdev_err(ndev, "Change CQE coalescing failed: %d\n",
+ err);
+ apc->cqe_coalescing_enable =
+ saved.cqe_coalescing_enable;
+ apc->intr_modr_rx_usec = saved.intr_modr_rx_usec;
+ apc->intr_modr_rx_comp = saved.intr_modr_rx_comp;
+ apc->intr_modr_tx_usec = saved.intr_modr_tx_usec;
+ apc->intr_modr_tx_comp = saved.intr_modr_tx_comp;
+ return err;
+ }
+ }
- return err;
+ if (modr_changed || dim_changed) {
+ bool new_rx_dim = !!ec->use_adaptive_rx_coalesce;
+ bool new_tx_dim = !!ec->use_adaptive_tx_coalesce;
+ bool disable_rx_dim = saved.rx_dim_enabled && !new_rx_dim;
+ bool disable_tx_dim = saved.tx_dim_enabled && !new_tx_dim;
+ bool enable_rx_dim = !saved.rx_dim_enabled && new_rx_dim;
+ bool enable_tx_dim = !saved.tx_dim_enabled && new_tx_dim;
+ int q;
+
+ /* On disable: clear the per-port flag first and
+ * synchronize_net() so any in-flight NAPI poll observes
+ * the new value and will not schedule further DIM work;
+ * then drain pending work and restore the static
+ * moderation values.
+ */
+ if (disable_rx_dim)
+ WRITE_ONCE(apc->rx_dim_enabled, false);
+ if (disable_tx_dim)
+ WRITE_ONCE(apc->tx_dim_enabled, false);
+ if (disable_rx_dim || disable_tx_dim)
+ synchronize_net();
+
+ for (q = 0; q < apc->num_queues; q++) {
+ struct mana_cq *rx_cq = &apc->rxqs[q]->rx_cq;
+ struct mana_cq *tx_cq = &apc->tx_qp[q]->tx_cq;
+
+ if (disable_rx_dim)
+ mana_dim_change(rx_cq, false);
+ else if (enable_rx_dim)
+ mana_dim_change(rx_cq, true);
+ else if (!new_rx_dim && modr_changed)
+ mana_gd_ring_dim(rx_cq->gdma_cq,
+ apc->intr_modr_rx_usec, true,
+ apc->intr_modr_rx_comp, true);
+
+ if (disable_tx_dim)
+ mana_dim_change(tx_cq, false);
+ else if (enable_tx_dim)
+ mana_dim_change(tx_cq, true);
+ else if (!new_tx_dim && modr_changed)
+ mana_gd_ring_dim(tx_cq->gdma_cq,
+ apc->intr_modr_tx_usec, true,
+ apc->intr_modr_tx_comp, true);
+ }
+
+ /* Publish the enable flag with release semantics so a
+ * concurrent NAPI poll that observes it set also sees the DIM
+ * (re)init done by mana_dim_change() above.
+ */
+ if (enable_rx_dim)
+ /* pairs with smp_load_acquire() in mana_update_rx_dim() */
+ smp_store_release(&apc->rx_dim_enabled, true);
+ if (enable_tx_dim)
+ /* pairs with smp_load_acquire() in mana_update_tx_dim() */
+ smp_store_release(&apc->tx_dim_enabled, true);
+ }
+
+ return 0;
}
/* mana_set_channels - change the number of queues on a port
@@ -595,7 +740,13 @@ static int mana_get_link_ksettings(struct net_device *ndev,
}
const struct ethtool_ops mana_ethtool_ops = {
- .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
+ .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES |
+ ETHTOOL_COALESCE_RX_USECS |
+ ETHTOOL_COALESCE_RX_MAX_FRAMES |
+ ETHTOOL_COALESCE_TX_USECS |
+ ETHTOOL_COALESCE_TX_MAX_FRAMES |
+ ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
+ ETHTOOL_COALESCE_USE_ADAPTIVE_TX,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
.get_ethtool_stats = mana_get_ethtool_stats,
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 0c395917b214..8529cef0d7c4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -47,6 +47,7 @@ enum gdma_queue_type {
GDMA_RQ,
GDMA_CQ,
GDMA_EQ,
+ GDMA_DIM,
};
enum gdma_work_request_flags {
@@ -126,6 +127,17 @@ union gdma_doorbell_entry {
u64 tail_ptr : 31;
u64 arm : 1;
} eq;
+
+ struct {
+ u64 id : 24;
+ u64 reserved : 8;
+ u64 mod_usec : 10;
+ u64 reserve1 : 5;
+ u64 mod_usec_vld : 1;
+ u64 mod_comps : 8;
+ u64 reserve2 : 7;
+ u64 mod_comps_vld: 1;
+ } dim;
}; /* HW DATA */
struct gdma_msg_hdr {
@@ -502,6 +514,9 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+ u32 mod_comps, bool mod_comps_vld);
+
struct gdma_wqe {
u32 reserved :24;
u32 last_vbytes :8;
@@ -650,6 +665,9 @@ enum {
/* Driver supports self recovery on Hardware Channel timeouts */
#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
+/* Driver supports dynamic interrupt moderation - DIM */
+#define GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(28)
+
#define GDMA_DRV_CAP_FLAGS1 \
(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -665,7 +683,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
- GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
+ GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT | \
+ GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)
#define GDMA_DRV_CAP_FLAGS2 0
@@ -701,6 +720,9 @@ struct gdma_verify_ver_req {
u8 os_ver_str4[128];
}; /* HW DATA */
+/* HW supports dynamic interrupt moderation - DIM */
+#define GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(15)
+
struct gdma_verify_ver_resp {
struct gdma_resp_hdr hdr;
u64 gdma_protocol_ver;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 13c87baf018e..df4c4a3f68fa 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -4,6 +4,7 @@
#ifndef _MANA_H
#define _MANA_H
+#include <linux/dim.h>
#include <net/xdp.h>
#include <net/net_shaper.h>
@@ -64,6 +65,19 @@ enum TRI_STATE {
/* Maximum number of packets per coalesced CQE */
#define MANA_RXCOMP_OOB_NUM_PPI 4
+/* Default/max interrupt moderation settings */
+#define MANA_INTR_MODR_USEC_DEF 0
+#define MANA_INTR_MODR_COMP_DEF 0
+
+#define MANA_ADAPTIVE_RX_DEF true
+#define MANA_ADAPTIVE_TX_DEF true
+
+/* DIM doorbell value field layout */
+#define MANA_INTR_MODR_USEC_MAX GENMASK(9, 0)
+#define MANA_INTR_MODR_USEC_VLD BIT(15)
+#define MANA_INTR_MODR_COMP_MAX GENMASK(7, 0)
+#define MANA_INTR_MODR_COMP_MASK GENMASK(23, 16)
+
/* Update this count whenever the respective structures are changed */
#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
#define MANA_STATS_TX_COUNT 11
@@ -297,6 +311,17 @@ struct mana_cq {
int work_done;
int work_done_since_doorbell;
int budget;
+
+ /* DIM - Dynamic Interrupt Moderation */
+ struct dim dim;
+ u16 dim_event_ctr;
+
+ /* Cumulative TX completions fed to DIM. Updated and read only in
+ * NAPI context (mana_poll_tx_cq() / mana_update_tx_dim()), so they
+ * measure the hardware completion rate and need no u64_stats_sync.
+ */
+ u64 tx_dim_pkts;
+ u64 tx_dim_bytes;
};
struct mana_recv_buf_oob {
@@ -573,6 +598,15 @@ struct mana_port_context {
u8 cqe_coalescing_enable;
u32 cqe_coalescing_timeout_ns;
+ /* Interrupt moderation settings */
+ u16 intr_modr_rx_usec;
+ u16 intr_modr_rx_comp;
+ u16 intr_modr_tx_usec;
+ u16 intr_modr_tx_comp;
+
+ bool rx_dim_enabled;
+ bool tx_dim_enabled;
+
struct mana_ethtool_stats eth_stats;
struct mana_ethtool_phy_stats phy_stats;
@@ -598,6 +632,8 @@ int mana_alloc_queues(struct net_device *ndev);
int mana_attach(struct net_device *ndev);
int mana_detach(struct net_device *ndev, bool from_close);
+int mana_dim_change(struct mana_cq *cq, bool enable);
+
int mana_probe(struct gdma_dev *gd, bool resuming);
void mana_remove(struct gdma_dev *gd, bool suspending);
@@ -633,6 +669,9 @@ struct mana_obj_spec {
u32 queue_size;
u32 attached_eq;
u32 modr_ctx_id;
+ u8 req_cq_moderation;
+ u16 cq_moderation_comp;
+ u16 cq_moderation_usec;
};
enum mana_command_code {
@@ -764,6 +803,15 @@ struct mana_create_wqobj_req {
u32 cq_size;
u32 cq_moderation_ctx_id;
u32 cq_parent_qid;
+
+ /* V2 */
+ u8 allow_rqwqe_chain;
+
+ /* V3 */
+ u8 req_cq_moderation;
+ u16 cq_moderation_comp;
+ u16 cq_moderation_usec;
+ u8 reserved2[2];
}; /* HW DATA */
struct mana_create_wqobj_resp {
@@ -771,6 +819,12 @@ struct mana_create_wqobj_resp {
u32 wq_id;
u32 cq_id;
mana_handle_t wq_obj;
+
+ /* V2 */
+ u16 cq_moderation_comp;
+ u16 cq_moderation_usec;
+ u8 cq_moderation_enabled;
+ u8 reserved1[3];
}; /* HW DATA */
/* Destroy WQ Object */
--
2.34.1
^ permalink raw reply related
* RE: [EXTERNAL] Re: [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-13 20:48 UTC (permalink / raw)
To: Simon Horman
Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
KY Srinivasan, wei.liu@kernel.org, Dexuan Cui, Long Li,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
shradhagupta@linux.microsoft.com, ernis@linux.microsoft.com,
dipayanroy@linux.microsoft.com, gargaditya@linux.microsoft.com,
kees@kernel.org, leitao@debian.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260613082014.715350-1-horms@kernel.org>
> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Saturday, June 13, 2026 4:20 AM
> To: haiyangz@linux.microsoft.com
> Cc: Simon Horman <horms@kernel.org>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Konstantin Taranov
> <kotaranov@microsoft.com>; shradhagupta@linux.microsoft.com;
> ernis@linux.microsoft.com; dipayanroy@linux.microsoft.com;
> gargaditya@linux.microsoft.com; kees@kernel.org; leitao@debian.org; linux-
> kernel@vger.kernel.org; linux-rdma@vger.kernel.org; Paul Rosswurm
> <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next v3] net: mana: Add Interrupt
> Moderation support
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> Date: Thu Jun 11 19:02:22 2026 +0000
>
> net: mana: Add Interrupt Moderation support
>
> This commit adds Static and Dynamic Interrupt Moderation (DIM) support for
> Rx and Tx. It updates the queue creation procedure, adds functions to
> collect stats for DIM, and provides ethtool handler support to get/set
> the moderation settings.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 26aef21c6c2c8..d92e022d8533c 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> >
> > mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> > sizeof(req), sizeof(resp));
> > +
> > + req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> > + req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> > req.vport = vport;
> > req.wq_type = wq_type;
> > req.wq_gdma_region = wq_spec->gdma_region;
> > @@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> > req.cq_size = cq_spec->queue_size;
> > req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
> > req.cq_parent_qid = cq_spec->attached_eq;
> > + req.req_cq_moderation = cq_spec->req_cq_moderation;
> > + req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> > + req.cq_moderation_usec = cq_spec->cq_moderation_usec;
>
> [Severity: High]
> Does this break queue creation on older Azure host firmwares?
>
> The code unconditionally bumps the WQ object creation request to use
> GDMA_MESSAGE_V3 and passes the req_cq_moderation flag. If the firmware does
> not support GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION, could the firmware
> reject the queue creation request and break device initialization entirely?
Our firmware is forward compatible with newer message versions, so older
firmware still handles this message properly: it ignores the new feature
fields, and queue creation succeeds.
If the firmware reports the DIM capability bit as zero, the driver keeps
DIM off and does not allow it to be changed.
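
As a rough illustration of that gating, here is a minimal userspace sketch of
the checks mana_set_coalesce() applies in the patch. The demo_* names and the
struct are hypothetical stand-ins, not driver code; the capability bit and the
10-bit/8-bit limits are copied from the gdma.h and mana.h hunks of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirrored from the patch: PF capability bit and field widths. */
#define DEMO_PF_CAP_DIM    (UINT64_C(1) << 15) /* ..._DYN_INTERRUPT_MODERATION */
#define DEMO_MODR_USEC_MAX 0x3ffu              /* GENMASK(9, 0) */
#define DEMO_MODR_COMP_MAX 0xffu               /* GENMASK(7, 0) */

/* Hypothetical stand-in for the ethtool_coalesce fields the patch reads. */
struct demo_coalesce {
	uint32_t rx_usecs, tx_usecs;
	uint32_t rx_frames, tx_frames;
	bool adaptive_rx, adaptive_tx;
};

/* Returns 0 if the request is acceptable, or a negative errno. */
int demo_check_coalesce(uint64_t pf_cap_flags1, const struct demo_coalesce *ec)
{
	bool wants_modr;

	assert(ec);

	/* Any static or adaptive moderation request needs the HW capability. */
	wants_modr = ec->adaptive_rx || ec->adaptive_tx ||
		     ec->rx_usecs || ec->tx_usecs ||
		     ec->rx_frames || ec->tx_frames;
	if (wants_modr && !(pf_cap_flags1 & DEMO_PF_CAP_DIM))
		return -EOPNOTSUPP;

	/* Values must fit in the doorbell fields. */
	if (ec->rx_usecs > DEMO_MODR_USEC_MAX ||
	    ec->tx_usecs > DEMO_MODR_USEC_MAX)
		return -EINVAL;
	if (ec->rx_frames > DEMO_MODR_COMP_MAX ||
	    ec->tx_frames > DEMO_MODR_COMP_MAX)
		return -EINVAL;

	return 0;
}
```

With the capability bit clear, an all-zero request still passes, which matches
the "off and unchangeable" behaviour described above.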
>
> [ ... ]
>
> > +static void mana_update_tx_dim(struct mana_cq *cq)
> > +{
> > + struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
> > + struct dim_sample dim_sample = {};
> > + struct mana_txq *txq = cq->txq;
> > +
> > + if (!apc->tx_dim_enabled)
> > + return;
> > +
> > + dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> > + txq->stats.bytes, &dim_sample);
> > + net_dim(&cq->dim, &dim_sample);
> > +}
>
> [Severity: High]
> Does this measure the CPU enqueue rate rather than the hardware completion
> rate?
>
> It appears that txq->stats.packets is incremented in mana_start_xmit()
> upon enqueue, not during completion in mana_poll_tx_cq(). Dynamic interrupt
> moderation relies on measuring the actual hardware completion rate to
> function correctly.
>
> Additionally, is it safe to read txq->stats.packets here without the
> u64_stats_fetch_begin() retry loop? Reading a 64-bit statistic locklessly
> without the synchronization primitives could result in torn reads on 32-bit
> systems.
In the next patch version, I added counters to the cq that track the HW
completion rate. The reader runs in the same NAPI context as the writer,
which prevents torn reads.
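
To illustrate the idea only (this is not the driver code), here is a small
userspace sketch of completion-side counters that are written and sampled in
the same context. All demo_* names are hypothetical, and the per-event average
is just one example of what a consumer might derive from two samples.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CQ DIM inputs. */
struct demo_cq {
	uint16_t event_ctr; /* bumped once per interrupt */
	uint64_t tx_pkts;   /* bumped per completed packet */
	uint64_t tx_bytes;  /* bumped by the completed packet length */
};

struct demo_sample {
	uint16_t events;
	uint64_t pkts;
	uint64_t bytes;
};

/* Called from the poll loop for each TX completion, not at enqueue time. */
void demo_tx_complete(struct demo_cq *cq, uint32_t len)
{
	assert(cq);
	cq->tx_pkts++;
	cq->tx_bytes += len;
}

/* Snapshot taken in the same context as the writer, so no retry loop. */
struct demo_sample demo_take_sample(const struct demo_cq *cq)
{
	struct demo_sample s = { cq->event_ctr, cq->tx_pkts, cq->tx_bytes };

	return s;
}

/* Packets completed per interrupt between two samples; 0 if no interrupts. */
uint64_t demo_pkts_per_event(struct demo_sample start, struct demo_sample end)
{
	/* Unsigned 16-bit subtraction stays correct across counter wrap. */
	uint16_t events = (uint16_t)(end.events - start.events);

	return events ? (end.pkts - start.pkts) / events : 0;
}
```

Because the writer and the reader never run concurrently in this model, plain
64-bit fields are enough even where 64-bit loads are not atomic.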
>
> [ ... ]
>
> > @@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
> > if (w < cq->budget) {
> > mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
> > cq->work_done_since_doorbell = 0;
> > - napi_complete_done(&cq->napi, w);
> > +
> > + if (napi_complete_done(&cq->napi, w)) {
> > + if (cq->type == MANA_CQ_TYPE_RX)
> > + mana_update_rx_dim(cq);
> > + else
> > + mana_update_tx_dim(cq);
> > + }
>
> [Severity: High]
> Is it safe to call mana_update_rx_dim() and mana_update_tx_dim() after
> napi_complete_done()?
>
> Since napi_complete_done() ends the NAPI polling cycle and relinquishes
> mutual exclusion, and the hardware interrupt was already re-armed by
> mana_gd_ring_cq() right before this, couldn't a new packet trigger an
> interrupt immediately? This could allow NAPI to be scheduled on another CPU
> concurrently while this CPU is still running the DIM update, potentially
> corrupting the non-thread-safe net_dim() state machine.
Thanks, I moved the DIM update before napi_complete_done() to avoid
running it concurrently.
>
> [ ... ]
>
> > @@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
> > napi_enable_locked(&cq->napi);
> > txq->napi_initialized = true;
> >
> > + INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> > + cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
>
> [Severity: High]
> Could this allow NAPI to schedule an uninitialized work_struct?
>
> napi_enable_locked() is called right before INIT_WORK(). Once NAPI is
> enabled, it can be scheduled to run immediately (for example, via netpoll).
> If NAPI executes before INIT_WORK completes, mana_update_tx_dim() would
> pass the uninitialized cq->dim to net_dim(), which could cause a kernel
> panic when it attempts to call schedule_work().
>
> This pattern also occurs in mana_create_rxq().
I have updated the init order in the new version.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > index 94e658d07a27e..67a2b282ff4da 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
>
> [ ... ]
>
> > @@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
> > + if (enable_rx_dim)
> > + apc->rx_dim_enabled = true;
> > + if (enable_tx_dim)
> > + apc->tx_dim_enabled = true;
>
> [Severity: High]
> Could the lack of memory barriers here expose uninitialized DIM structures
> to NAPI?
>
> The driver initializes DIM via mana_dim_change() and then sets
> apc->rx_dim_enabled to true. Without an smp_store_release() here and a
> corresponding smp_load_acquire() in mana_update_rx_dim(), weakly-ordered
> CPUs like ARM64 might reorder the stores. Concurrently, NAPI polling might
> observe the flag as true before the initialization is fully visible in memory,
> potentially invoking net_dim() on garbage memory.
I added smp_store_release() and smp_load_acquire() to fix this in the next
version, which I will submit soon.
Thanks,
- Haiyang
^ permalink raw reply
* Re: [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: Simon Horman @ 2026-06-13 8:20 UTC (permalink / raw)
To: haiyangz
Cc: Simon Horman, linux-hyperv, netdev, kys, haiyangz, wei.liu, decui,
longli, andrew+netdev, davem, edumazet, kuba, pabeni, kotaranov,
shradhagupta, ernis, dipayanroy, gargaditya, kees, leitao,
linux-kernel, linux-rdma, paulros
In-Reply-To: <20260611190239.2532429-1-haiyangz@linux.microsoft.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
Date: Thu Jun 11 19:02:22 2026 +0000
net: mana: Add Interrupt Moderation support
This commit adds Static and Dynamic Interrupt Moderation (DIM) support for
Rx and Tx. It updates the queue creation procedure, adds functions to
collect stats for DIM, and provides ethtool handler support to get/set
the moderation settings.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 26aef21c6c2c8..d92e022d8533c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>
> mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> sizeof(req), sizeof(resp));
> +
> + req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> + req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> req.vport = vport;
> req.wq_type = wq_type;
> req.wq_gdma_region = wq_spec->gdma_region;
> @@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> req.cq_size = cq_spec->queue_size;
> req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
> req.cq_parent_qid = cq_spec->attached_eq;
> + req.req_cq_moderation = cq_spec->req_cq_moderation;
> + req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> + req.cq_moderation_usec = cq_spec->cq_moderation_usec;
[Severity: High]
Does this break queue creation on older Azure host firmwares?
The code unconditionally bumps the WQ object creation request to use
GDMA_MESSAGE_V3 and passes the req_cq_moderation flag. If the firmware does
not support GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION, could the firmware
reject the queue creation request and break device initialization entirely?
[ ... ]
> +static void mana_update_tx_dim(struct mana_cq *cq)
> +{
> + struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
> + struct dim_sample dim_sample = {};
> + struct mana_txq *txq = cq->txq;
> +
> + if (!apc->tx_dim_enabled)
> + return;
> +
> + dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> + txq->stats.bytes, &dim_sample);
> + net_dim(&cq->dim, &dim_sample);
> +}
[Severity: High]
Does this measure the CPU enqueue rate rather than the hardware completion
rate?
It appears that txq->stats.packets is incremented in mana_start_xmit()
upon enqueue, not during completion in mana_poll_tx_cq(). Dynamic interrupt
moderation relies on measuring the actual hardware completion rate to
function correctly.
Additionally, is it safe to read txq->stats.packets here without the
u64_stats_fetch_begin() retry loop? Reading a 64-bit statistic locklessly
without the synchronization primitives could result in torn reads on 32-bit
systems.
[ ... ]
> @@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
> if (w < cq->budget) {
> mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
> cq->work_done_since_doorbell = 0;
> - napi_complete_done(&cq->napi, w);
> +
> + if (napi_complete_done(&cq->napi, w)) {
> + if (cq->type == MANA_CQ_TYPE_RX)
> + mana_update_rx_dim(cq);
> + else
> + mana_update_tx_dim(cq);
> + }
[Severity: High]
Is it safe to call mana_update_rx_dim() and mana_update_tx_dim() after
napi_complete_done()?
Since napi_complete_done() ends the NAPI polling cycle and relinquishes
mutual exclusion, and the hardware interrupt was already re-armed by
mana_gd_ring_cq() right before this, couldn't a new packet trigger an
interrupt immediately? This could allow NAPI to be scheduled on another CPU
concurrently while this CPU is still running the DIM update, potentially
corrupting the non-thread-safe net_dim() state machine.
[ ... ]
> @@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
> napi_enable_locked(&cq->napi);
> txq->napi_initialized = true;
>
> + INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> + cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
[Severity: High]
Could this allow NAPI to schedule an uninitialized work_struct?
napi_enable_locked() is called right before INIT_WORK(). Once NAPI is
enabled, it can be scheduled to run immediately (for example, via netpoll).
If NAPI executes before INIT_WORK completes, mana_update_tx_dim() would
pass the uninitialized cq->dim to net_dim(), which could cause a kernel
panic when it attempts to call schedule_work().
This pattern also occurs in mana_create_rxq().
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 94e658d07a27e..67a2b282ff4da 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
[ ... ]
> @@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
> + if (enable_rx_dim)
> + apc->rx_dim_enabled = true;
> + if (enable_tx_dim)
> + apc->tx_dim_enabled = true;
[Severity: High]
Could the lack of memory barriers here expose uninitialized DIM structures
to NAPI?
The driver initializes DIM via mana_dim_change() and then sets
apc->rx_dim_enabled to true. Without an smp_store_release() here and a
corresponding smp_load_acquire() in mana_update_rx_dim(), weakly-ordered
CPUs like ARM64 might reorder the stores. Concurrently, NAPI polling might
observe the flag as true before the initialization is fully visible in memory,
potentially invoking net_dim() on garbage memory.
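
To make the pairing concrete, here is a minimal userspace sketch that uses C11
atomics as a stand-in for smp_store_release() and smp_load_acquire(). The
demo_* names are hypothetical and this is not driver code; it only illustrates
the ordering being asked for, under the assumption of one writer on the
control path and one reader on the poll path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for DIM state guarded by an enable flag. */
struct demo_dim {
	int mode;            /* plain field, written before publication */
	atomic_bool enabled; /* publication flag */
};

/* Writer (control path): initialize first, then publish with release. */
void demo_dim_enable(struct demo_dim *d, int mode)
{
	assert(d);
	d->mode = mode;
	atomic_store_explicit(&d->enabled, true, memory_order_release);
}

/* Reader (poll path): the acquire load orders the later read of ->mode. */
int demo_dim_poll(struct demo_dim *d)
{
	if (!atomic_load_explicit(&d->enabled, memory_order_acquire))
		return -1; /* not enabled: do not touch ->mode */
	return d->mode;
}
```

A reader that sees the flag set is then guaranteed to see the initialized
mode as well; with plain stores and loads there is no such guarantee on
weakly-ordered CPUs.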
^ permalink raw reply
* Re: [PATCH net v2 0/2] net: mana: fix error-path issues in queue setup
From: patchwork-bot+netdevbpf @ 2026-06-13 1:00 UTC (permalink / raw)
To: Aditya Garg
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
kees, shacharr, stephen, gargaditya, ssengar, linux-hyperv,
netdev, linux-kernel
In-Reply-To: <20260608101345.2267320-1-gargaditya@linux.microsoft.com>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 8 Jun 2026 03:13:39 -0700 you wrote:
> Two error-path fixes in MANA queue setup, both surfaced during Sashiko
> AI review of a recently upstreamed patch series.
>
> Patch 1 initializes queue->id to INVALID_QUEUE_ID in
> mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
> firmware id is assigned does not NULL gc->cq_table[0] and silently
> break whichever real CQ owns that slot. This mirrors the existing
> pattern in mana_gd_create_eq().
>
> [...]
Here is the summary with links:
- [net,v2,1/2] net: mana: initialize gdma queue id to INVALID_QUEUE_ID
https://git.kernel.org/netdev/net/c/5985474e1cb4
- [net,v2,2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
https://git.kernel.org/netdev/net/c/f8fd56977eee
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply