* Re: [PATCH v8 09/10] x86/hyperv/vtl: Mark the wakeup mailbox page as private
From: Wei Liu @ 2026-03-09 17:57 UTC (permalink / raw)
To: Ricardo Neri
Cc: x86, Krzysztof Kozlowski, Conor Dooley, Rob Herring,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Michael Kelley, Rafael J. Wysocki, Saurabh Sengar, Chris Oo,
Kirill A. Shutemov, linux-hyperv, devicetree, linux-acpi,
linux-kernel, Ricardo Neri, Yunhong Jiang
In-Reply-To: <20260107-rneri-wakeup-mailbox-v8-9-2f5b6785f2f5@linux.intel.com>
Dexuan, are you happy with the patch? You can also delegate to Saurabh
if you think it's more appropriate. Thanks!
On Wed, Jan 07, 2026 at 01:44:45PM -0800, Ricardo Neri wrote:
> From: Yunhong Jiang <yunhong.jiang@linux.intel.com>
>
> The current code maps MMIO devices as shared (decrypted) by default in a
> confidential computing VM.
>
> In a TDX environment, secondary CPUs are booted using the Multiprocessor
> Wakeup Structure defined in the ACPI specification. The virtual firmware
> and the operating system function in the guest context, without
> intervention from the VMM. Map the physical memory of the mailbox as
> private. Use the is_private_mmio() callback.
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@linux.intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v8:
> - Included linux/acpi.h to add missing definitions that caused build
> breaks (kernel test robot)
>
> Changes in v7:
> - Dropped check for !CONFIG_X86_MAILBOX_WAKEUP. The symbol is no longer
> valid and now we have a stub for !CONFIG_ACPI.
> - Dropped Reviewed-by tags from Dexuan and Michael as this patch
> changed.
>
> Changes in v6:
> - Fixed a compile error with !CONFIG_X86_MAILBOX_WAKEUP.
> - Added Reviewed-by tag from Dexuan. Thanks!
>
> Changes in v5:
> - None
>
> Changes in v4:
> - Updated to use the renamed function acpi_get_mp_wakeup_mailbox_paddr().
> - Added Reviewed-by tag from Michael. Thanks!
>
> Changes in v3:
> - Use the new helper function get_mp_wakeup_mailbox_paddr().
> - Edited the commit message for clarity.
>
> Changes in v2:
> - Added the helper function within_page() to improve readability
> - Override the is_private_mmio() callback when detecting a TDX
> environment. The address of the mailbox is checked in
> hv_is_private_mmio_tdx().
> ---
> arch/x86/hyperv/hv_vtl.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> index 752101544663..2af825f7a447 100644
> --- a/arch/x86/hyperv/hv_vtl.c
> +++ b/arch/x86/hyperv/hv_vtl.c
> @@ -6,6 +6,9 @@
> * Saurabh Sengar <ssengar@microsoft.com>
> */
>
> +#include <linux/acpi.h>
> +
> +#include <asm/acpi.h>
> #include <asm/apic.h>
> #include <asm/boot.h>
> #include <asm/desc.h>
> @@ -59,6 +62,18 @@ static void __noreturn hv_vtl_restart(char __maybe_unused *cmd)
> hv_vtl_emergency_restart();
> }
>
> +static inline bool within_page(u64 addr, u64 start)
> +{
> + return addr >= start && addr < (start + PAGE_SIZE);
> +}
> +
> +static bool hv_vtl_is_private_mmio_tdx(u64 addr)
> +{
> + u64 mb_addr = acpi_get_mp_wakeup_mailbox_paddr();
> +
> + return mb_addr && within_page(addr, mb_addr);
> +}
> +
> void __init hv_vtl_init_platform(void)
> {
> /*
> @@ -71,6 +86,8 @@ void __init hv_vtl_init_platform(void)
> /* There is no paravisor present if we are here. */
> if (hv_isolation_type_tdx()) {
> x86_init.resources.realmode_limit = SZ_4G;
> + x86_platform.hyper.is_private_mmio = hv_vtl_is_private_mmio_tdx;
> +
> } else {
> x86_platform.realmode_reserve = x86_init_noop;
> x86_platform.realmode_init = x86_init_noop;
>
> --
> 2.43.0
>
^ permalink raw reply
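A note on the address check in the patch above: the test is a half-open range of PAGE_SIZE bytes starting at the mailbox address, and a mailbox address of zero means no page is treated as private. The following is a minimal user-space sketch of that logic only, not the kernel code. SKETCH_PAGE_SIZE (4 KiB) and the mb_addr parameter, which stands in for the value returned by acpi_get_mp_wakeup_mailbox_paddr(), are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/* True when addr lies in [start, start + SKETCH_PAGE_SIZE). */
static bool sketch_within_page(uint64_t addr, uint64_t start)
{
	return addr >= start && addr < start + SKETCH_PAGE_SIZE;
}

/*
 * mb_addr models the mailbox physical address; 0 models
 * "no mailbox reported", in which case nothing is private.
 */
static bool sketch_is_private_mmio(uint64_t addr, uint64_t mb_addr)
{
	return mb_addr && sketch_within_page(addr, mb_addr);
}
```

With mb_addr = 0x1000, addresses 0x1000 through 0x1fff are reported as private and 0x2000 is not. Like the patch, the sketch does not check that the start address is page aligned.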
* [PATCH net-next v2] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-09 14:38 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
dipayanroy, yury.norov, kees, ernis, shirazsaleem, linux-hyperv,
netdev, linux-kernel, linux-rdma
Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.
Device-level entries (under /sys/kernel/debug/mana/<slot>/):
- num_msix_usable, max_num_queues: Max resources from hardware
- gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
- num_vports, bm_hostmode: Device configuration
Per-vPort entries (under /sys/kernel/debug/mana/<slot>/vportN/):
- port_handle: Hardware vPort handle
- max_sq, max_rq: Max queues from vPort config
- indir_table_sz: Indirection table size
- steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
Last applied steering configuration parameters
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Add debugfs_remove_recursive for gc->mana_pci_debugfs in
mana_gd_suspend to avoid creating duplicate entries in the
mana_gd_setup and mana_gd_resume paths.
* Move debugfs creation for num_vports and bm_hostmode out of the
if (!resuming) condition, since these entries must be created again
on resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
.../net/ethernet/microsoft/mana/gdma_main.c | 21 +++++++++++++
drivers/net/ethernet/microsoft/mana/mana_en.c | 31 +++++++++++++++++++
include/net/mana/gdma.h | 1 +
include/net/mana/mana.h | 8 +++++
4 files changed, 61 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index aef8612b73cb..43fb366dc183 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -152,6 +152,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues > gc->num_msix_usable - 1)
gc->max_num_queues = gc->num_msix_usable - 1;
+ debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+ &gc->num_msix_usable);
+ debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+ &gc->max_num_queues);
+
return 0;
}
@@ -1222,6 +1227,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
return err ? err : -EPROTO;
}
gc->pf_cap_flags1 = resp.pf_cap_flags1;
+ gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+ debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+ &gc->gdma_protocol_ver);
+ debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+ &gc->pf_cap_flags1);
+
if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
if (err) {
@@ -2128,6 +2140,9 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
mana_gd_cleanup(pdev);
+ debugfs_remove_recursive(gc->mana_pci_debugfs);
+ gc->mana_pci_debugfs = NULL;
+
return 0;
}
@@ -2140,6 +2155,12 @@ int mana_gd_resume(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
int err;
+ if (gc->is_pf)
+ gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
+ else
+ gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
+ mana_debugfs_root);
+
err = mana_gd_setup(pdev);
if (err)
return err;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..1117ae16b065 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1263,6 +1263,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
apc->port_handle = resp.vport;
ether_addr_copy(apc->mac_addr, resp.mac_addr);
+ apc->vport_max_sq = *max_sq;
+ apc->vport_max_rq = *max_rq;
+
return 0;
}
@@ -1409,6 +1412,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
apc->port_handle, apc->indir_table_sz);
+
+ apc->steer_rx = rx;
+ apc->steer_rss = apc->rss_state;
+ apc->steer_update_tab = update_tab;
+ apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
out:
kfree(req);
return err;
@@ -3110,6 +3118,24 @@ static int mana_init_port(struct net_device *ndev)
eth_hw_addr_set(ndev, apc->mac_addr);
sprintf(vport, "vport%d", port_idx);
apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+ debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+ &apc->port_handle);
+ debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+ &apc->vport_max_sq);
+ debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+ &apc->vport_max_rq);
+ debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+ &apc->indir_table_sz);
+ debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+ &apc->steer_rx);
+ debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+ &apc->steer_rss);
+ debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+ &apc->steer_update_tab);
+ debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+ &apc->steer_cqe_coalescing);
+
return 0;
reset_apc:
@@ -3598,6 +3624,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+ &ac->num_ports);
+ debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+ &ac->bm_hostmode);
+
if (!resuming) {
ac->num_ports = num_ports;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index ec17004b10c0..917945f0e3dc 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -440,6 +440,7 @@ struct gdma_context {
struct gdma_dev mana_ib;
u64 pf_cap_flags1;
+ u64 gdma_protocol_ver;
struct workqueue_struct *service_wq;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..83f6de67c0cc 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -563,6 +563,14 @@ struct mana_port_context {
/* Debugfs */
struct dentry *mana_port_debugfs;
+
+ /* Cached vport/steering config for debugfs */
+ u32 vport_max_sq;
+ u32 vport_max_rq;
+ u32 steer_rx;
+ u32 steer_rss;
+ u32 steer_update_tab;
+ u32 steer_cqe_coalescing;
};
netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
--
2.34.1
^ permalink raw reply related
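For readers who want to consume these entries from a test tool, here is a small user-space sketch of parsing the text read from one of the files. It assumes, based on the usual behaviour of the debugfs helpers, that debugfs_create_u32() files read back as a decimal number followed by a newline and that debugfs_create_x64() files read back as a 0x-prefixed hex number followed by a newline. The function name sketch_parse_debugfs_value() is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse the text of a numeric debugfs attribute into *out.
 * Returns 0 on success and -1 on malformed input.
 */
static int sketch_parse_debugfs_value(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	/* Base 0 accepts both "42" and "0x2a". */
	v = strtoull(text, &end, 0);
	if (errno || end == text)
		return -1;
	/* Allow only an optional trailing newline after the number. */
	if (*end != '\n' && *end != '\0')
		return -1;
	*out = v;
	return 0;
}
```

One caveat of using base 0: a decimal value with a leading zero would be read as octal, so the sketch relies on the decimal files not being zero padded.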
* Re: [net-next] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-09 11:49 UTC (permalink / raw)
To: Jakub Kicinski
Cc: pabeni, linux-kernel, yury.norov, kys, decui, kees, longli,
dipayanroy, davem, netdev, linux-rdma, kotaranov, andrew+netdev,
linux-hyperv, edumazet, haiyangz, ssengar, shradhagupta, horms,
shirazsaleem, wei.liu
In-Reply-To: <20260307032228.1379456-1-kuba@kernel.org>
On Fri, Mar 06, 2026 at 07:22:28PM -0800, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
>
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Expose hardware diagnostic info via debugfs
>
> This commit adds debugfs entries to expose hardware configuration and
> diagnostic information for the MANA driver. It creates device-level and
> port-level debugfs entries to aid in debugging initialization and runtime
> operations.
> Do these debugfs entries get recreated on every suspend/resume cycle?
No. Thank you for pointing this out, Jakub.
I'll make this change in the next version.
>
> Both mana_gd_query_max_resources() and mana_gd_verify_vf_version() are
> called from mana_gd_setup() during device initialization. On suspend,
> mana_gd_suspend() calls mana_gd_cleanup() which doesn't remove these
> debugfs entries. When mana_gd_resume() calls mana_gd_setup() again, it
> attempts to create the same debugfs entries a second time.
>
> This means debugfs_start_creating() in fs/debugfs/inode.c will print an
> error for each duplicate file:
>
> "'num_msix_usable' already exists in 'mana/<slot>/'"
> "'max_num_queues' already exists in 'mana/<slot>/'"
> "'gdma_protocol_ver' already exists in 'mana/<slot>/'"
> "'pf_cap_flags1' already exists in 'mana/<slot>/'"
>
> This produces 4 error messages in dmesg on every suspend/resume cycle,
> contradicting the commit message's goal of operating "without adding
> noise to dmesg."
>
> Should these device-level entries be removed during suspend (in
> mana_gd_cleanup or mana_gd_suspend), or should their creation be moved
> to mana_gd_probe where they're only created once?
Right. I will remove the entries during suspend and re-create them in
resume.
>
> [ ... ]
> --
> pw-bot: cr
^ permalink raw reply
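The duplicate-entry problem discussed in this exchange, and the agreed fix, can be illustrated with a tiny user-space model. This is not debugfs code: struct model_dir, model_create() and the other model_* names are hypothetical stand-ins, and the dup_errors counter stands in for the "already exists" messages in dmesg. The four entry names are the ones listed in the review.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_ENTRIES 8

struct model_dir {
	const char *names[MODEL_MAX_ENTRIES];
	int count;
	int dup_errors;	/* models "already exists" lines in dmesg */
};

/* Create an entry; a duplicate name is counted as an error and skipped. */
static void model_create(struct model_dir *d, const char *name)
{
	for (int i = 0; i < d->count; i++) {
		if (strcmp(d->names[i], name) == 0) {
			d->dup_errors++;
			return;
		}
	}
	if (d->count < MODEL_MAX_ENTRIES)
		d->names[d->count++] = name;
}

/* Models removing the directory and everything below it. */
static void model_remove_recursive(struct model_dir *d)
{
	d->count = 0;
}

/* Models the setup path, which runs on both probe and resume. */
static void model_setup(struct model_dir *d)
{
	model_create(d, "num_msix_usable");
	model_create(d, "max_num_queues");
	model_create(d, "gdma_protocol_ver");
	model_create(d, "pf_cap_flags1");
}

/* Returns the number of duplicate errors after one suspend/resume cycle. */
static int model_cycle(bool remove_on_suspend)
{
	struct model_dir d = { 0 };

	model_setup(&d);			/* probe */
	if (remove_on_suspend)
		model_remove_recursive(&d);	/* suspend */
	model_setup(&d);			/* resume */
	return d.dup_errors;
}
```

In this model, a cycle without removal yields four duplicate errors, matching the four messages quoted in the review, while removing the entries on suspend yields none.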
* RE: [PATCH net-next,V3, 2/3] net: mana: Add support for RX CQE Coalescing
From: Haiyang Zhang @ 2026-03-08 16:30 UTC (permalink / raw)
To: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Shradha Gupta, Dipayaan Roy,
Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Breno Leitao,
Aditya Garg, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org
Cc: Paul Rosswurm
In-Reply-To: <20260306231936.549499-3-haiyangz@linux.microsoft.com>
> -----Original Message-----
> From: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Sent: Friday, March 6, 2026 6:19 PM
> To: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; Kees Cook <kees@kernel.org>; Subbaraya
> Sundeep <sbhatta@marvell.com>; Breno Leitao <leitao@debian.org>; Aditya
> Garg <gargaditya@linux.microsoft.com>; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org
> Cc: Paul Rosswurm <paulros@microsoft.com>
> Subject: [PATCH net-next,V3, 2/3] net: mana: Add support for RX CQE
> Coalescing
>
> From: Haiyang Zhang <haiyangz@microsoft.com>
> @@ -2112,13 +2122,16 @@ static void mana_process_rx_cqe(struct mana_rxq
> *rxq, struct mana_cq *cq,
> ++ndev->stats.rx_dropped;
> rxbuf_oob = &rxq->rx_oobs[rxq->buf_index];
> netdev_warn_once(ndev, "Dropped a truncated packet\n");
> - goto drop;
>
> - case CQE_RX_COALESCED_4:
> - netdev_err(ndev, "RX coalescing is unsupported\n");
> - apc->eth_stats.rx_coalesced_err++;
> + mana_move_wq_tail(rxq->gdma_rq,
> + rxbuf_oob->wqe_inf.wqe_size_in_bu);
> + mana_post_pkt_rxq(rxq);
> return;
>
> + case CQE_RX_COALESCED_4:
> + coalesced = true;
> + break;
> +
> case CQE_RX_OBJECT_FENCE:
> complete(&rxq->fence_event);
> return;
> @@ -2130,30 +2143,36 @@ static void mana_process_rx_cqe(struct mana_rxq
> *rxq, struct mana_cq *cq,
> return;
> }
>
> - pktlen = oob->ppi[0].pkt_len;
> + for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
> + pktlen = oob->ppi[i].pkt_len;
> + if (pktlen == 0) {
> + if (i == 0)
> + netdev_err_once(
> + ndev,
> + "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
> + rxq->gdma_id, cq->gdma_id, rxq->rxobj);
> + break;
> + }
>
> - if (pktlen == 0) {
> - /* data packets should never have packetlength of zero */
> - netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
> - rxq->gdma_id, cq->gdma_id, rxq->rxobj);
> - return;
> - }
> + curr = rxq->buf_index;
> + rxbuf_oob = &rxq->rx_oobs[curr];
> + WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
>
> - curr = rxq->buf_index;
> - rxbuf_oob = &rxq->rx_oobs[curr];
> - WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
> + mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
>
> - mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
> + /* Unsuccessful refill will have old_buf == NULL.
> + * In this case, mana_rx_skb() will drop the packet.
> + */
> + mana_rx_skb(old_buf, old_fp, oob, rxq, i);
>
> - /* Unsuccessful refill will have old_buf == NULL.
> - * In this case, mana_rx_skb() will drop the packet.
> - */
> - mana_rx_skb(old_buf, old_fp, oob, rxq);
> + mana_move_wq_tail(rxq->gdma_rq,
> + rxbuf_oob->wqe_inf.wqe_size_in_bu);
I will fix this issue pointed out by the AI review:
> The comment says "Unsuccessful refill will have old_buf == NULL" but this is
> only true for the first iteration.
> Should old_buf be set to NULL at the top of the loop, before calling
> mana_refill_rx_oob()?
^ permalink raw reply
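The stale-pointer concern raised in the review can be shown with a small user-space model. It assumes, as the review comment implies, that the refill helper writes its output pointer only on success and leaves it untouched on failure. All model_* names are hypothetical; model_refill() stands in for mana_refill_rx_oob(), and counting a non-NULL old_buf stands in for mana_rx_skb() accepting the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_PPI 4

/* Writes *old_buf only when the refill succeeds. */
static void model_refill(bool ok, int *slot, int **old_buf)
{
	if (ok)
		*old_buf = slot;
}

/*
 * Bit i of ok_mask says whether the refill for packet i succeeds.
 * Returns how many packets were handed on with a non-NULL buffer.
 */
static int model_process(unsigned int ok_mask, bool reset_each_iter)
{
	int slots[MODEL_NUM_PPI] = { 0 };
	int *old_buf = NULL;
	int delivered = 0;

	for (int i = 0; i < MODEL_NUM_PPI; i++) {
		if (reset_each_iter)
			old_buf = NULL;
		model_refill((ok_mask >> i) & 1, &slots[i], &old_buf);
		if (old_buf)
			delivered++;
	}
	return delivered;
}
```

With refills succeeding only for packets 0 and 2 (ok_mask 0x5), the version without the per-iteration reset hands on four buffers, two of them stale, while the version that resets old_buf at the top of the loop hands on two.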
* Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Leon Romanovsky @ 2026-03-07 17:38 UTC (permalink / raw)
To: Long Li
Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Haiyang Zhang,
K . Y . Srinivasan, Wei Liu, Dexuan Cui, Simon Horman, netdev,
linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> When the MANA hardware undergoes a service reset, the ETH auxiliary device
> (mana.eth) used by DPDK persists across the reset cycle — it is not removed
> and re-added like RC/UD/GSI QPs. This means userspace RDMA consumers such
> as DPDK have no way of knowing that firmware handles for their PD, CQ, WQ,
> QP and MR resources have become stale.
NAK to any of this.
In case of hardware reset, mana_ib AUX device needs to be destroyed and
recreated later.
The same is applicable for mana.eth as well.
Thanks
^ permalink raw reply
* Re: [net-next] net: mana: Expose hardware diagnostic info via debugfs
From: Jakub Kicinski @ 2026-03-07 3:22 UTC (permalink / raw)
To: ernis
Cc: Jakub Kicinski, pabeni, linux-kernel, yury.norov, kys, decui,
kees, longli, dipayanroy, davem, netdev, linux-rdma, kotaranov,
andrew+netdev, linux-hyperv, edumazet, haiyangz, ssengar,
shradhagupta, horms, shirazsaleem, wei.liu
In-Reply-To: <20260305205252.470089-1-ernis@linux.microsoft.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: mana: Expose hardware diagnostic info via debugfs
This commit adds debugfs entries to expose hardware configuration and
diagnostic information for the MANA driver. It creates device-level and
port-level debugfs entries to aid in debugging initialization and runtime
operations.
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index aef8612b73cb..fb513d7802cb 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -152,6 +152,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> if (gc->max_num_queues > gc->num_msix_usable - 1)
> gc->max_num_queues = gc->num_msix_usable - 1;
>
> + debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
> + &gc->num_msix_usable);
> + debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
> + &gc->max_num_queues);
> +
> return 0;
> }
>
> @@ -1222,6 +1227,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
> return err ? err : -EPROTO;
> }
> gc->pf_cap_flags1 = resp.pf_cap_flags1;
> + gc->gdma_protocol_ver = resp.gdma_protocol_ver;
> +
> + debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
> + &gc->gdma_protocol_ver);
> + debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
> + &gc->pf_cap_flags1);
> +
Do these debugfs entries get recreated on every suspend/resume cycle?
Both mana_gd_query_max_resources() and mana_gd_verify_vf_version() are
called from mana_gd_setup() during device initialization. On suspend,
mana_gd_suspend() calls mana_gd_cleanup() which doesn't remove these
debugfs entries. When mana_gd_resume() calls mana_gd_setup() again, it
attempts to create the same debugfs entries a second time.
This means debugfs_start_creating() in fs/debugfs/inode.c will print an
error for each duplicate file:
"'num_msix_usable' already exists in 'mana/<slot>/'"
"'max_num_queues' already exists in 'mana/<slot>/'"
"'gdma_protocol_ver' already exists in 'mana/<slot>/'"
"'pf_cap_flags1' already exists in 'mana/<slot>/'"
This produces 4 error messages in dmesg on every suspend/resume cycle,
contradicting the commit message's goal of operating "without adding
noise to dmesg."
Should these device-level entries be removed during suspend (in
mana_gd_cleanup or mana_gd_suspend), or should their creation be moved
to mana_gd_probe where they're only created once?
[ ... ]
--
pw-bot: cr
^ permalink raw reply
* [PATCH rdma-next 8/8] RDMA/mana_ib: Skip firmware commands for invalidated handles
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
After a service reset, firmware handles for PD, CQ, WQ, QP, and MR
are set to INVALID_MANA_HANDLE by the reset notification path.
Check for INVALID_MANA_HANDLE in each destroy callback before issuing
firmware destroy commands. When a handle is invalid, skip the firmware
call and proceed directly to kernel resource cleanup (umem, queues,
memory). This avoids sending stale handles to firmware after reset.
Affected callbacks:
- mana_ib_dealloc_pd: skip mana_ib_gd_destroy_pd
- mana_ib_destroy_cq: skip mana_ib_gd_destroy_cq and queue destroy
- mana_ib_destroy_wq: skip mana_ib_destroy_queue
- mana_ib_destroy_qp_rss: skip mana_destroy_wq_obj per WQ
- mana_ib_destroy_qp_raw: skip mana_destroy_wq_obj
- mana_ib_dereg_mr: skip mana_ib_gd_destroy_mr
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/cq.c | 10 ++++++----
drivers/infiniband/hw/mana/main.c | 12 +++++++++---
drivers/infiniband/hw/mana/mr.c | 8 +++++---
drivers/infiniband/hw/mana/qp.c | 9 ++++++---
4 files changed, 26 insertions(+), 13 deletions(-)
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index b054684b8de7..315301bccb97 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -143,10 +143,12 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
mana_ib_remove_cq_cb(mdev, cq);
- /* Ignore return code as there is not much we can do about it.
- * The error message is printed inside.
- */
- mana_ib_gd_destroy_cq(mdev, cq);
+ if (cq->cq_handle != INVALID_MANA_HANDLE) {
+ /* Ignore return code as there is not much we can do about it.
+ * The error message is printed inside.
+ */
+ mana_ib_gd_destroy_cq(mdev, cq);
+ }
mana_ib_destroy_queue(mdev, &cq->queue);
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 61ce30aa9cb2..d60205184dba 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -147,6 +147,9 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
mutex_unlock(&mana_ucontext->lock);
}
+ if (pd->pd_handle == INVALID_MANA_HANDLE)
+ return 0;
+
mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_PD, sizeof(req),
sizeof(resp));
@@ -280,9 +283,12 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
list_del_init(&mana_ucontext->dev_list);
mutex_unlock(&mdev->ucontext_lock);
- ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
- if (ret)
- ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
+ if (mana_ucontext->doorbell != INVALID_DOORBELL) {
+ ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
+ if (ret)
+ ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n",
+ ret);
+ }
}
int mana_ib_create_kernel_queue(struct mana_ib_dev *mdev, u32 size, enum gdma_queue_type type,
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
index 7189ccd41576..75bc2a9c366a 100644
--- a/drivers/infiniband/hw/mana/mr.c
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -336,9 +336,11 @@ int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
mutex_unlock(&mana_ucontext->lock);
}
- err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
- if (err)
- return err;
+ if (mr->mr_handle != INVALID_MANA_HANDLE) {
+ err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
+ if (err)
+ return err;
+ }
if (mr->umem)
ib_umem_release(mr->umem);
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index d590aca9b93a..76d59addb645 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -846,9 +846,11 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
ibwq = ind_tbl->ind_tbl[i];
wq = container_of(ibwq, struct mana_ib_wq, ibwq);
- ibdev_dbg(&mdev->ib_dev, "destroying wq->rx_object %llu\n",
+ ibdev_dbg(&mdev->ib_dev,
+ "destroying wq->rx_object %llu\n",
wq->rx_object);
- mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+ if (wq->rx_object != INVALID_MANA_HANDLE)
+ mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
}
return 0;
@@ -867,7 +869,8 @@ static int mana_ib_destroy_qp_raw(struct mana_ib_qp *qp, struct ib_udata *udata)
mpc = netdev_priv(ndev);
pd = container_of(ibpd, struct mana_ib_pd, ibpd);
- mana_destroy_wq_obj(mpc, GDMA_SQ, qp->qp_handle);
+ if (qp->qp_handle != INVALID_MANA_HANDLE)
+ mana_destroy_wq_obj(mpc, GDMA_SQ, qp->qp_handle);
mana_ib_destroy_queue(mdev, &qp->raw_sq);
--
2.43.0
^ permalink raw reply related
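The pattern this patch applies in each destroy callback (skip the firmware command when the handle was invalidated, but still release kernel-side resources) can be sketched in user space as follows. This is an illustrative model, not driver code: MODEL_INVALID_HANDLE is assumed to be an all-ones 64-bit value, and struct model_cq, model_fw_destroy_cq() and model_destroy_cq() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the invalid handle is all ones. */
#define MODEL_INVALID_HANDLE UINT64_MAX

struct model_cq {
	uint64_t handle;	/* firmware handle */
	bool queue_allocated;	/* kernel-side queue memory */
};

/* Counts how many destroy commands would be sent to firmware. */
static int model_fw_destroy_calls;

static void model_fw_destroy_cq(uint64_t handle)
{
	(void)handle;
	model_fw_destroy_calls++;
}

static void model_destroy_cq(struct model_cq *cq)
{
	/* Skip the firmware command if the handle was invalidated by reset. */
	if (cq->handle != MODEL_INVALID_HANDLE)
		model_fw_destroy_cq(cq->handle);

	/* Kernel-side cleanup happens in both cases. */
	cq->queue_allocated = false;
}
```

Destroying a CQ with a live handle issues one firmware call, destroying one with an invalidated handle issues none, and the queue memory is released either way.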
* [PATCH rdma-next 7/8] RDMA/mana_ib: Notify service reset events to RDMA devices
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Register reset_notify and resume_notify callbacks so the RDMA driver
is informed when the MANA service undergoes a reset cycle.
On reset notification:
- Acquire reset_rwsem write lock to serialize with resource creation
- Walk every tracked ucontext and invalidate firmware handles for
all PD, CQ, WQ, QP, and MR resources (set to INVALID_MANA_HANDLE)
- Dispatch IB_EVENT_PORT_ERR to each affected ucontext so userspace
(e.g. DPDK) learns about the reset
On resume notification:
- Release reset_rwsem write lock, unblocking new resource creation
Resource creation paths (alloc_pd, create_cq, create_wq, create_qp for
RAW_PACKET, reg_user_mr) acquire reset_rwsem read lock to ensure handles
are not invalidated while being set up.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/cq.c | 15 ++-
drivers/infiniband/hw/mana/device.c | 103 ++++++++++++++++++
drivers/infiniband/hw/mana/main.c | 9 ++
drivers/infiniband/hw/mana/mana_ib.h | 2 +
drivers/infiniband/hw/mana/mr.c | 4 +
drivers/infiniband/hw/mana/qp.c | 5 +
drivers/infiniband/hw/mana/wq.c | 4 +
drivers/net/ethernet/microsoft/mana/mana_en.c | 14 ++-
include/net/mana/gdma.h | 6 +
9 files changed, 155 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index 89cf60987ff5..b054684b8de7 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -41,13 +41,17 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
ibdev_dbg(ibdev, "CQE %d exceeding limit\n", attr->cqe);
return -EINVAL;
}
+ }
+
+ down_read(&mdev->reset_rwsem);
+ if (udata) {
cq->cqe = attr->cqe;
err = mana_ib_create_queue(mdev, ucmd.buf_addr, cq->cqe * COMP_ENTRY_SIZE,
&cq->queue);
if (err) {
ibdev_dbg(ibdev, "Failed to create queue for create cq, %d\n", err);
- return err;
+ goto err_unlock;
}
mana_ucontext = rdma_udata_to_drv_context(udata, struct mana_ib_ucontext,
@@ -56,14 +60,15 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
} else {
if (attr->cqe > U32_MAX / COMP_ENTRY_SIZE / 2 + 1) {
ibdev_dbg(ibdev, "CQE %d exceeding limit\n", attr->cqe);
- return -EINVAL;
+ err = -EINVAL;
+ goto err_unlock;
}
buf_size = MANA_PAGE_ALIGN(roundup_pow_of_two(attr->cqe * COMP_ENTRY_SIZE));
cq->cqe = buf_size / COMP_ENTRY_SIZE;
err = mana_ib_create_kernel_queue(mdev, buf_size, GDMA_CQ, &cq->queue);
if (err) {
ibdev_dbg(ibdev, "Failed to create kernel queue for create cq, %d\n", err);
- return err;
+ goto err_unlock;
}
doorbell = mdev->gdma_dev->doorbell;
}
@@ -105,6 +110,7 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
mutex_unlock(&mana_ucontext->lock);
}
+ up_read(&mdev->reset_rwsem);
return 0;
err_remove_cq_cb:
@@ -113,7 +119,8 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
mana_ib_gd_destroy_cq(mdev, cq);
err_destroy_queue:
mana_ib_destroy_queue(mdev, &cq->queue);
-
+err_unlock:
+ up_read(&mdev->reset_rwsem);
return err;
}
diff --git a/drivers/infiniband/hw/mana/device.c b/drivers/infiniband/hw/mana/device.c
index 149e8d4d5b8e..081be31563ca 100644
--- a/drivers/infiniband/hw/mana/device.c
+++ b/drivers/infiniband/hw/mana/device.c
@@ -103,6 +103,7 @@ static int mana_ib_netdev_event(struct notifier_block *this,
netdev_put(ndev, &dev->dev_tracker);
return NOTIFY_OK;
+
default:
return NOTIFY_DONE;
}
@@ -110,6 +111,93 @@ static int mana_ib_netdev_event(struct notifier_block *this,
return NOTIFY_DONE;
}
+/*
+ * Reset cleanup: invalidate firmware handles for all tracked user objects.
+ *
+ * Called during service reset BEFORE dispatching IB_EVENT_PORT_ERR to
+ * user-mode.
+ *
+ * Only invalidates FW handles — does NOT free kernel resources (umem, queues)
+ * or remove objects from lists. The IB core's destroy callbacks handle full
+ * resource teardown when user-space closes the uverbs FD or ib_unregister_device
+ * is called. The destroy callbacks skip FW commands when the handle is already
+ * INVALID_MANA_HANDLE.
+ *
+ * For CQs, also removes the CQ callback to prevent stale completions.
+ */
+static void mana_ib_reset_notify(void *ctx)
+{
+ struct mana_ib_dev *mdev = ctx;
+ struct mana_ib_ucontext *uctx;
+ struct mana_ib_qp *qp;
+ struct mana_ib_wq *wq;
+ struct mana_ib_cq *cq;
+ struct mana_ib_mr *mr;
+ struct mana_ib_pd *pd;
+ struct ib_event ibev;
+ int i;
+
+ down_write(&mdev->reset_rwsem);
+
+ ibdev_dbg(&mdev->ib_dev, "reset cleanup starting\n");
+
+ mutex_lock(&mdev->ucontext_lock);
+ list_for_each_entry(uctx, &mdev->ucontext_list, dev_list) {
+ mutex_lock(&uctx->lock);
+
+ list_for_each_entry(qp, &uctx->qp_list, ucontext_list)
+ qp->qp_handle = INVALID_MANA_HANDLE;
+
+ list_for_each_entry(wq, &uctx->wq_list, ucontext_list)
+ wq->rx_object = INVALID_MANA_HANDLE;
+
+ list_for_each_entry(cq, &uctx->cq_list, ucontext_list) {
+ mana_ib_remove_cq_cb(mdev, cq);
+ cq->cq_handle = INVALID_MANA_HANDLE;
+ }
+
+ list_for_each_entry(mr, &uctx->mr_list, ucontext_list)
+ mr->mr_handle = INVALID_MANA_HANDLE;
+
+ list_for_each_entry(pd, &uctx->pd_list, ucontext_list)
+ pd->pd_handle = INVALID_MANA_HANDLE;
+
+ uctx->doorbell = INVALID_DOORBELL;
+
+ mutex_unlock(&uctx->lock);
+ }
+ mutex_unlock(&mdev->ucontext_lock);
+
+ up_write(&mdev->reset_rwsem);
+
+ /* Revoke user doorbell mappings so userspace cannot ring
+ * stale doorbells after firmware handles are invalidated.
+ */
+ rdma_user_mmap_disassociate(&mdev->ib_dev);
+
+ /* Notify userspace (e.g. DPDK) that the port is down */
+ for (i = 0; i < mdev->ib_dev.phys_port_cnt; i++) {
+ ibev.device = &mdev->ib_dev;
+ ibev.element.port_num = i + 1;
+ ibev.event = IB_EVENT_PORT_ERR;
+ ib_dispatch_event(&ibev);
+ }
+}
+
+static void mana_ib_resume_notify(void *ctx)
+{
+ struct mana_ib_dev *dev = ctx;
+ struct ib_event ibev;
+ int i;
+
+ for (i = 0; i < dev->ib_dev.phys_port_cnt; i++) {
+ ibev.device = &dev->ib_dev;
+ ibev.element.port_num = i + 1;
+ ibev.event = IB_EVENT_PORT_ACTIVE;
+ ib_dispatch_event(&ibev);
+ }
+}
+
static int mana_ib_probe(struct auxiliary_device *adev,
const struct auxiliary_device_id *id)
{
@@ -134,6 +222,7 @@ static int mana_ib_probe(struct auxiliary_device *adev,
xa_init_flags(&dev->qp_table_wq, XA_FLAGS_LOCK_IRQ);
mutex_init(&dev->ucontext_lock);
INIT_LIST_HEAD(&dev->ucontext_list);
+ init_rwsem(&dev->reset_rwsem);
if (mana_ib_is_rnic(dev)) {
dev->ib_dev.phys_port_cnt = 1;
@@ -216,6 +305,15 @@ static int mana_ib_probe(struct auxiliary_device *adev,
dev_set_drvdata(&adev->dev, dev);
+ /* ETH device persists across reset — use callback for cleanup.
+ * RNIC device is removed/re-added, so its cleanup happens in remove.
+ */
+ if (!mana_ib_is_rnic(dev)) {
+ mdev->reset_notify = mana_ib_reset_notify;
+ mdev->resume_notify = mana_ib_resume_notify;
+ mdev->reset_notify_ctx = dev;
+ }
+
return 0;
deallocate_pool:
@@ -242,6 +340,11 @@ static void mana_ib_remove(struct auxiliary_device *adev)
if (mana_ib_is_rnic(dev))
mana_drain_gsi_sqs(dev);
+ if (!mana_ib_is_rnic(dev)) {
+ dev->gdma_dev->reset_notify = NULL;
+ dev->gdma_dev->resume_notify = NULL;
+ dev->gdma_dev->reset_notify_ctx = NULL;
+ }
ib_unregister_device(&dev->ib_dev);
dma_pool_destroy(dev->av_pool);
if (mana_ib_is_rnic(dev)) {
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index f739e6da5435..61ce30aa9cb2 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -81,6 +81,8 @@ int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
gc = mdev_to_gc(dev);
+ down_read(&dev->reset_rwsem);
+
mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_PD, sizeof(req),
sizeof(resp));
@@ -98,6 +100,7 @@ int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
if (!err)
err = -EPROTO;
+ up_read(&dev->reset_rwsem);
return err;
}
@@ -118,6 +121,7 @@ int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
mutex_unlock(&mana_ucontext->lock);
}
+ up_read(&dev->reset_rwsem);
return 0;
}
@@ -230,10 +234,13 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
gc = mdev_to_gc(mdev);
+ down_read(&mdev->reset_rwsem);
+
/* Allocate a doorbell page index */
ret = mana_gd_allocate_doorbell_page(gc, &doorbell_page);
if (ret) {
ibdev_dbg(ibdev, "Failed to allocate doorbell page %d\n", ret);
+ up_read(&mdev->reset_rwsem);
return ret;
}
@@ -252,6 +259,8 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
list_add_tail(&ucontext->dev_list, &mdev->ucontext_list);
mutex_unlock(&mdev->ucontext_lock);
+ up_read(&mdev->reset_rwsem);
+
return 0;
}
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index ce5c6c030fb2..29201cf3274c 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -86,6 +86,8 @@ struct mana_ib_dev {
/* Protects ucontext_list */
struct mutex ucontext_lock;
struct list_head ucontext_list;
+ /* Serializes resource create callbacks vs reset cleanup */
+ struct rw_semaphore reset_rwsem;
};
struct mana_ib_wq {
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
index 559bb4f7c31d..7189ccd41576 100644
--- a/drivers/infiniband/hw/mana/mr.c
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -141,6 +141,8 @@ struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
if (!mr)
return ERR_PTR(-ENOMEM);
+ down_read(&dev->reset_rwsem);
+
mr->umem = ib_umem_get(ibdev, start, length, access_flags);
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
@@ -195,6 +197,7 @@ struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
mutex_unlock(&mana_ucontext->lock);
}
+ up_read(&dev->reset_rwsem);
return &mr->ibmr;
err_dma_region:
@@ -204,6 +207,7 @@ struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
ib_umem_release(mr->umem);
err_free:
+ up_read(&dev->reset_rwsem);
kfree(mr);
return ERR_PTR(err);
}
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 315bc54d8ae6..d590aca9b93a 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -701,12 +701,16 @@ int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
struct ib_udata *udata)
{
struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ struct mana_ib_dev *mdev =
+ container_of(ibqp->device, struct mana_ib_dev, ib_dev);
int err;
INIT_LIST_HEAD(&qp->ucontext_list);
switch (attr->qp_type) {
case IB_QPT_RAW_PACKET:
+ down_read(&mdev->reset_rwsem);
+
/* When rwq_ind_tbl is used, it's for creating WQs for RSS */
if (attr->rwq_ind_tbl)
err = mana_ib_create_qp_rss(ibqp, ibqp->pd, attr,
@@ -724,6 +728,7 @@ int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
mutex_unlock(&mana_ucontext->lock);
}
+ up_read(&mdev->reset_rwsem);
return err;
case IB_QPT_RC:
return mana_ib_create_rc_qp(ibqp, ibqp->pd, attr, udata);
diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
index 1af9869933aa..67b757cf30f9 100644
--- a/drivers/infiniband/hw/mana/wq.c
+++ b/drivers/infiniband/hw/mana/wq.c
@@ -31,6 +31,8 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
ibdev_dbg(&mdev->ib_dev, "ucmd wq_buf_addr 0x%llx\n", ucmd.wq_buf_addr);
+ down_read(&mdev->reset_rwsem);
+
err = mana_ib_create_queue(mdev, ucmd.wq_buf_addr, ucmd.wq_buf_size, &wq->queue);
if (err) {
ibdev_dbg(&mdev->ib_dev,
@@ -52,9 +54,11 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
mutex_unlock(&mana_ucontext->lock);
}
+ up_read(&mdev->reset_rwsem);
return &wq->ibwq;
err_free_wq:
+ up_read(&mdev->reset_rwsem);
kfree(wq);
return ERR_PTR(err);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..3493b36426f7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3659,15 +3659,19 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
}
}
- err = add_adev(gd, "eth");
+ if (!resuming)
+ err = add_adev(gd, "eth");
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
-
out:
if (err) {
mana_remove(gd, false);
} else {
+ /* Notify IB layer that ports are back up after reset */
+ if (resuming && gd->resume_notify)
+ gd->resume_notify(gd->reset_notify_ctx);
+
dev_dbg(dev, "gd=%p, id=%u, num_ports=%d, type=%u, instance=%u\n",
gd, gd->dev_id.as_uint32, ac->num_ports,
gd->dev_id.type, gd->dev_id.instance);
@@ -3691,9 +3695,13 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
cancel_delayed_work_sync(&ac->gf_stats_work);
/* adev currently doesn't support suspending, always remove it */
- if (gd->adev)
+ if (gd->adev && !suspending)
remove_adev(gd);
+ /* Notify IB layer before tearing down net devices during reset */
+ if (suspending && gd->reset_notify)
+ gd->reset_notify(gd->reset_notify_ctx);
+
for (i = 0; i < ac->num_ports; i++) {
ndev = ac->ports[i];
if (!ndev) {
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index ec17004b10c0..9187c5b4d0d1 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -249,6 +249,12 @@ struct gdma_dev {
struct auxiliary_device *adev;
bool is_suspended;
bool rdma_teardown;
+
+ /* Called by mana_remove() during reset to notify IB layer */
+ void (*reset_notify)(void *ctx);
+ /* Called by mana_probe() during resume to notify IB layer */
+ void (*resume_notify)(void *ctx);
+ void *reset_notify_ctx;
};
/* MANA_PAGE_SIZE is the DMA unit */
--
2.43.0
* [PATCH rdma-next 6/8] RDMA/mana_ib: Track MR per ucontext
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-ucontext list tracking for MR objects. Each MR is added to
the ucontext's mr_list on creation and removed on destruction. This
enables iterating over all MRs belonging to a ucontext for service
reset cleanup.
Also export mana_ib_gd_destroy_mr() for use by reset cleanup code.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 1 +
drivers/infiniband/hw/mana/mana_ib.h | 3 +++
drivers/infiniband/hw/mana/mr.c | 21 ++++++++++++++++++++-
3 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index c6a859628ba3..f739e6da5435 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -243,6 +243,7 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
mutex_init(&ucontext->lock);
INIT_LIST_HEAD(&ucontext->pd_list);
+ INIT_LIST_HEAD(&ucontext->mr_list);
INIT_LIST_HEAD(&ucontext->cq_list);
INIT_LIST_HEAD(&ucontext->qp_list);
INIT_LIST_HEAD(&ucontext->wq_list);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 9d90fda2c830..ce5c6c030fb2 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -134,6 +134,7 @@ struct mana_ib_mr {
struct ib_mr ibmr;
struct ib_umem *umem;
mana_handle_t mr_handle;
+ struct list_head ucontext_list;
};
struct mana_ib_dm {
@@ -208,6 +209,7 @@ struct mana_ib_ucontext {
/* Protects resource lists below */
struct mutex lock;
struct list_head pd_list;
+ struct list_head mr_list;
struct list_head cq_list;
struct list_head qp_list;
struct list_head wq_list;
@@ -665,6 +667,7 @@ struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
struct ib_udata *udata);
int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, u64 mr_handle);
int mana_ib_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *qp_init_attr,
struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
index 9613b225dad4..559bb4f7c31d 100644
--- a/drivers/infiniband/hw/mana/mr.c
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -87,7 +87,7 @@ static int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
return 0;
}
-static int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, u64 mr_handle)
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, u64 mr_handle)
{
struct gdma_destroy_mr_response resp = {};
struct gdma_destroy_mr_request req = {};
@@ -185,6 +185,16 @@ struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
* as part of the lifecycle of this MR.
*/
+ INIT_LIST_HEAD(&mr->ucontext_list);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_add_tail(&mr->ucontext_list, &mana_ucontext->mr_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
return &mr->ibmr;
err_dma_region:
@@ -313,6 +323,15 @@ int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_del_init(&mr->ucontext_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
if (err)
return err;
--
2.43.0
* [PATCH rdma-next 5/8] RDMA/mana_ib: Track QP per ucontext
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-ucontext list tracking for QP objects. Only RAW_PACKET QPs
are tracked since they persist across reset events. RC, UD and GSI
QPs are removed and re-added during reset by IB core and do not
need tracking.
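A minimal userspace sketch of the selective tracking described above,
assuming a hand-rolled intrusive list in place of the kernel's list.h and
leaving out the ucontext mutex for brevity. All names (struct node,
struct qp, qp_create, qp_destroy) are hypothetical and not part of the
driver. Unlike the patch, which unlinks only in the RAW_PACKET branch,
qp_destroy() here unlinks unconditionally to show why initialising the
entry for every QP type keeps that safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

static void node_init(struct node *n)
{
	n->next = n;
	n->prev = n;
}

static bool node_empty(const struct node *n)
{
	return n->next == n;
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialise; a no-op on an entry that was never linked. */
static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Hypothetical QP types, mirroring the cases in mana_ib_create_qp(). */
enum qp_type { QP_RAW_PACKET, QP_RC, QP_UD, QP_GSI };

struct qp {
	enum qp_type type;
	struct node ucontext_list;
};

/* Initialise the list entry for every QP, but link only RAW_PACKET ones. */
static void qp_create(struct qp *qp, enum qp_type type, struct node *qp_list)
{
	qp->type = type;
	node_init(&qp->ucontext_list);
	if (type == QP_RAW_PACKET)
		node_add_tail(&qp->ucontext_list, qp_list);
}

/* Safe for any type: unlinking a self-linked entry changes nothing. */
static void qp_destroy(struct qp *qp)
{
	node_del_init(&qp->ucontext_list);
}
```

With this shape, a reset handler walking qp_list only ever sees
RAW_PACKET QPs, while destroying an RC, UD or GSI QP leaves the list
untouched.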
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 1 +
drivers/infiniband/hw/mana/mana_ib.h | 2 ++
drivers/infiniband/hw/mana/qp.c | 47 ++++++++++++++++++++++------
3 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index e6da5c8400f4..c6a859628ba3 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -244,6 +244,7 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
mutex_init(&ucontext->lock);
INIT_LIST_HEAD(&ucontext->pd_list);
INIT_LIST_HEAD(&ucontext->cq_list);
+ INIT_LIST_HEAD(&ucontext->qp_list);
INIT_LIST_HEAD(&ucontext->wq_list);
mutex_lock(&mdev->ucontext_lock);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 96b5a13470ae..9d90fda2c830 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -198,6 +198,7 @@ struct mana_ib_qp {
refcount_t refcount;
struct completion free;
+ struct list_head ucontext_list;
};
struct mana_ib_ucontext {
@@ -208,6 +209,7 @@ struct mana_ib_ucontext {
struct mutex lock;
struct list_head pd_list;
struct list_head cq_list;
+ struct list_head qp_list;
struct list_head wq_list;
};
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 82f84f7ad37a..315bc54d8ae6 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -700,14 +700,31 @@ static int mana_ib_create_ud_qp(struct ib_qp *ibqp, struct ib_pd *ibpd,
int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
struct ib_udata *udata)
{
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ int err;
+
+ INIT_LIST_HEAD(&qp->ucontext_list);
+
switch (attr->qp_type) {
case IB_QPT_RAW_PACKET:
/* When rwq_ind_tbl is used, it's for creating WQs for RSS */
if (attr->rwq_ind_tbl)
- return mana_ib_create_qp_rss(ibqp, ibqp->pd, attr,
- udata);
+ err = mana_ib_create_qp_rss(ibqp, ibqp->pd, attr,
+ udata);
+ else
+ err = mana_ib_create_qp_raw(ibqp, ibqp->pd, attr,
+ udata);
+
+ if (!err && udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_add_tail(&qp->ucontext_list, &mana_ucontext->qp_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
- return mana_ib_create_qp_raw(ibqp, ibqp->pd, attr, udata);
+ return err;
case IB_QPT_RC:
return mana_ib_create_rc_qp(ibqp, ibqp->pd, attr, udata);
case IB_QPT_UD:
@@ -716,9 +733,8 @@ int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
default:
ibdev_dbg(ibqp->device, "Creating QP type %u not supported\n",
attr->qp_type);
+ return -EINVAL;
}
-
- return -EINVAL;
}
static int mana_ib_gd_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
@@ -898,14 +914,26 @@ static int mana_ib_destroy_ud_qp(struct mana_ib_qp *qp, struct ib_udata *udata)
int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
{
struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ int ret = -ENOENT;
switch (ibqp->qp_type) {
case IB_QPT_RAW_PACKET:
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_del_init(&qp->ucontext_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
if (ibqp->rwq_ind_tbl)
- return mana_ib_destroy_qp_rss(qp, ibqp->rwq_ind_tbl,
- udata);
+ ret = mana_ib_destroy_qp_rss(qp, ibqp->rwq_ind_tbl,
+ udata);
+ else
+ ret = mana_ib_destroy_qp_raw(qp, udata);
- return mana_ib_destroy_qp_raw(qp, udata);
+ return ret;
case IB_QPT_RC:
return mana_ib_destroy_rc_qp(qp, udata);
case IB_QPT_UD:
@@ -914,7 +942,6 @@ int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
default:
ibdev_dbg(ibqp->device, "Unexpected QP type %u\n",
ibqp->qp_type);
+ return ret;
}
-
- return -ENOENT;
}
--
2.43.0
* [PATCH rdma-next 4/8] RDMA/mana_ib: Track WQ per ucontext
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-ucontext list tracking for WQ objects. Each WQ is added to
the ucontext's wq_list on creation and removed on destruction. This
enables iterating over all WQs belonging to a ucontext for service
reset cleanup.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 1 +
drivers/infiniband/hw/mana/mana_ib.h | 2 ++
drivers/infiniband/hw/mana/wq.c | 20 ++++++++++++++++++++
3 files changed, 23 insertions(+)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 214c1d4e1548..e6da5c8400f4 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -244,6 +244,7 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
mutex_init(&ucontext->lock);
INIT_LIST_HEAD(&ucontext->pd_list);
INIT_LIST_HEAD(&ucontext->cq_list);
+ INIT_LIST_HEAD(&ucontext->wq_list);
mutex_lock(&mdev->ucontext_lock);
list_add_tail(&ucontext->dev_list, &mdev->ucontext_list);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 8d3edf7ba335..96b5a13470ae 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -94,6 +94,7 @@ struct mana_ib_wq {
int wqe;
u32 wq_buf_size;
mana_handle_t rx_object;
+ struct list_head ucontext_list;
};
struct mana_ib_pd {
@@ -207,6 +208,7 @@ struct mana_ib_ucontext {
struct mutex lock;
struct list_head pd_list;
struct list_head cq_list;
+ struct list_head wq_list;
};
struct mana_ib_rwq_ind_table {
diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
index 6206244f762e..1af9869933aa 100644
--- a/drivers/infiniband/hw/mana/wq.c
+++ b/drivers/infiniband/hw/mana/wq.c
@@ -41,6 +41,17 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
wq->wqe = init_attr->max_wr;
wq->wq_buf_size = ucmd.wq_buf_size;
wq->rx_object = INVALID_MANA_HANDLE;
+
+ INIT_LIST_HEAD(&wq->ucontext_list);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_add_tail(&wq->ucontext_list, &mana_ucontext->wq_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
return &wq->ibwq;
err_free_wq:
@@ -64,6 +75,15 @@ int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata)
mdev = container_of(ib_dev, struct mana_ib_dev, ib_dev);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_del_init(&wq->ucontext_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
mana_ib_destroy_queue(mdev, &wq->queue);
kfree(wq);
--
2.43.0
* [PATCH rdma-next 3/8] RDMA/mana_ib: Track CQ per ucontext
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-ucontext list tracking for CQ objects. Each CQ is added to
the ucontext's cq_list on creation and removed on destruction. This
enables iterating over all CQs belonging to a ucontext for service
reset cleanup.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/cq.c | 19 +++++++++++++++++++
drivers/infiniband/hw/mana/main.c | 1 +
drivers/infiniband/hw/mana/mana_ib.h | 2 ++
3 files changed, 22 insertions(+)
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index b2749f971cd0..89cf60987ff5 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -95,6 +95,16 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
INIT_LIST_HEAD(&cq->list_send_qp);
INIT_LIST_HEAD(&cq->list_recv_qp);
+ INIT_LIST_HEAD(&cq->ucontext_list);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_add_tail(&cq->ucontext_list, &mana_ucontext->cq_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
return 0;
err_remove_cq_cb:
@@ -115,6 +125,15 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_del_init(&cq->ucontext_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
mana_ib_remove_cq_cb(mdev, cq);
/* Ignore return code as there is not much we can do about it.
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 62d89ca06ba1..214c1d4e1548 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -243,6 +243,7 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
mutex_init(&ucontext->lock);
INIT_LIST_HEAD(&ucontext->pd_list);
+ INIT_LIST_HEAD(&ucontext->cq_list);
mutex_lock(&mdev->ucontext_lock);
list_add_tail(&ucontext->dev_list, &mdev->ucontext_list);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 6dba08bccc18..8d3edf7ba335 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -150,6 +150,7 @@ struct mana_ib_cq {
int cqe;
u32 comp_vector;
mana_handle_t cq_handle;
+ struct list_head ucontext_list;
};
enum mana_rc_queue_type {
@@ -205,6 +206,7 @@ struct mana_ib_ucontext {
/* Protects resource lists below */
struct mutex lock;
struct list_head pd_list;
+ struct list_head cq_list;
};
struct mana_ib_rwq_ind_table {
--
2.43.0
* [PATCH rdma-next 2/8] RDMA/mana_ib: Track PD per ucontext
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-ucontext list tracking for PD objects. Each PD is added to
the ucontext's pd_list on creation and removed on destruction. This
enables iterating over all PDs belonging to a ucontext, which will
be needed for service reset cleanup.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 21 +++++++++++++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 2 ++
2 files changed, 23 insertions(+)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index fc28bdafcfd6..62d89ca06ba1 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -72,6 +72,7 @@ int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
struct ib_device *ibdev = ibpd->device;
struct gdma_create_pd_resp resp = {};
struct gdma_create_pd_req req = {};
+ struct mana_ib_ucontext *mana_ucontext;
enum gdma_pd_flags flags = 0;
struct mana_ib_dev *dev;
struct gdma_context *gc;
@@ -107,6 +108,16 @@ int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
mutex_init(&pd->vport_mutex);
pd->vport_use_count = 0;
+
+ INIT_LIST_HEAD(&pd->ucontext_list);
+ if (udata) {
+ mana_ucontext = rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_add_tail(&pd->ucontext_list, &mana_ucontext->pd_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
return 0;
}
@@ -123,6 +134,15 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
gc = mdev_to_gc(dev);
+ if (udata) {
+ struct mana_ib_ucontext *mana_ucontext =
+ rdma_udata_to_drv_context(udata,
+ struct mana_ib_ucontext, ibucontext);
+ mutex_lock(&mana_ucontext->lock);
+ list_del_init(&pd->ucontext_list);
+ mutex_unlock(&mana_ucontext->lock);
+ }
+
mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_PD, sizeof(req),
sizeof(resp));
@@ -222,6 +242,7 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
ucontext->doorbell = doorbell_page;
mutex_init(&ucontext->lock);
+ INIT_LIST_HEAD(&ucontext->pd_list);
mutex_lock(&mdev->ucontext_lock);
list_add_tail(&ucontext->dev_list, &mdev->ucontext_list);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index c7e333d3e9d8..6dba08bccc18 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -107,6 +107,7 @@ struct mana_ib_pd {
bool tx_shortform_allowed;
u32 tx_vp_offset;
+ struct list_head ucontext_list;
};
struct mana_ib_av {
@@ -203,6 +204,7 @@ struct mana_ib_ucontext {
struct list_head dev_list;
/* Protects resource lists below */
struct mutex lock;
+ struct list_head pd_list;
};
struct mana_ib_rwq_ind_table {
--
2.43.0
* [PATCH rdma-next 1/8] RDMA/mana_ib: Track ucontext per device
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260307014723.556523-1-longli@microsoft.com>
Add per-device tracking of ucontext objects. Each ucontext is added
to the device's ucontext_list on allocation and removed on deallocation.
A mutex protects the list and a per-ucontext lock protects resource
lists that will be added in subsequent patches.
This enables iterating over all active ucontexts during service reset
cleanup.
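The two-level scheme above (a device-wide list of ucontexts, each with
its own lock for the resource lists added later) can be sketched in
userspace roughly as follows. This is an illustration only, not the
driver code: pthread mutexes and a hand-rolled intrusive list stand in
for struct mutex and list.h, and every name (struct node, struct dev,
struct uctx, uctx_alloc, uctx_dealloc, dev_count_uctx) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

static void node_init(struct node *n)
{
	n->next = n;
	n->prev = n;
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialise, so a second removal is harmless. */
static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Hypothetical stand-ins for mana_ib_dev and mana_ib_ucontext. */
struct dev {
	pthread_mutex_t ucontext_lock;	/* protects ucontext_list */
	struct node ucontext_list;
};

struct uctx {
	struct node dev_list;		/* entry in dev->ucontext_list */
	pthread_mutex_t lock;		/* protects per-context resource lists */
};

static void dev_init(struct dev *d)
{
	pthread_mutex_init(&d->ucontext_lock, NULL);
	node_init(&d->ucontext_list);
}

/* Allocation: set up the per-context lock, then publish on the device list. */
static void uctx_alloc(struct dev *d, struct uctx *u)
{
	pthread_mutex_init(&u->lock, NULL);
	pthread_mutex_lock(&d->ucontext_lock);
	node_add_tail(&u->dev_list, &d->ucontext_list);
	pthread_mutex_unlock(&d->ucontext_lock);
}

/* Deallocation: unpublish first, so list walkers can no longer reach it. */
static void uctx_dealloc(struct dev *d, struct uctx *u)
{
	pthread_mutex_lock(&d->ucontext_lock);
	node_del_init(&u->dev_list);
	pthread_mutex_unlock(&d->ucontext_lock);
	pthread_mutex_destroy(&u->lock);
}

/* Count live contexts; a reset handler would walk the list the same way. */
static int dev_count_uctx(struct dev *d)
{
	struct node *n;
	int count = 0;

	pthread_mutex_lock(&d->ucontext_lock);
	for (n = d->ucontext_list.next; n != &d->ucontext_list; n = n->next)
		count++;
	pthread_mutex_unlock(&d->ucontext_lock);
	return count;
}
```

The walker holds the device lock for the whole traversal, which is the
same nesting order the reset path needs: device list lock outside,
per-context lock inside.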
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/device.c | 2 ++
drivers/infiniband/hw/mana/main.c | 10 ++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 6 ++++++
3 files changed, 18 insertions(+)
diff --git a/drivers/infiniband/hw/mana/device.c b/drivers/infiniband/hw/mana/device.c
index ccc2279ca63c..149e8d4d5b8e 100644
--- a/drivers/infiniband/hw/mana/device.c
+++ b/drivers/infiniband/hw/mana/device.c
@@ -132,6 +132,8 @@ static int mana_ib_probe(struct auxiliary_device *adev,
dev->ib_dev.dev.parent = gc->dev;
dev->gdma_dev = mdev;
xa_init_flags(&dev->qp_table_wq, XA_FLAGS_LOCK_IRQ);
+ mutex_init(&dev->ucontext_lock);
+ INIT_LIST_HEAD(&dev->ucontext_list);
if (mana_ib_is_rnic(dev)) {
dev->ib_dev.phys_port_cnt = 1;
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 8d99cd00f002..fc28bdafcfd6 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -221,6 +221,12 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
ucontext->doorbell = doorbell_page;
+ mutex_init(&ucontext->lock);
+
+ mutex_lock(&mdev->ucontext_lock);
+ list_add_tail(&ucontext->dev_list, &mdev->ucontext_list);
+ mutex_unlock(&mdev->ucontext_lock);
+
return 0;
}
@@ -236,6 +242,10 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
gc = mdev_to_gc(mdev);
+ mutex_lock(&mdev->ucontext_lock);
+ list_del_init(&mana_ucontext->dev_list);
+ mutex_unlock(&mdev->ucontext_lock);
+
ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
if (ret)
ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index a7c8c0fd7019..c7e333d3e9d8 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -83,6 +83,9 @@ struct mana_ib_dev {
struct dma_pool *av_pool;
netdevice_tracker dev_tracker;
struct notifier_block nb;
+ /* Protects ucontext_list */
+ struct mutex ucontext_lock;
+ struct list_head ucontext_list;
};
struct mana_ib_wq {
@@ -197,6 +200,9 @@ struct mana_ib_qp {
struct mana_ib_ucontext {
struct ib_ucontext ibucontext;
u32 doorbell;
+ struct list_head dev_list;
+ /* Protects resource lists below */
+ struct mutex lock;
};
struct mana_ib_rwq_ind_table {
--
2.43.0
* [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Long Li @ 2026-03-07 1:47 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
When the MANA hardware undergoes a service reset, the ETH auxiliary device
(mana.eth) used by DPDK persists across the reset cycle — it is not removed
and re-added like RC/UD/GSI QPs. This means userspace RDMA consumers such
as DPDK have no way of knowing that firmware handles for their PD, CQ, WQ,
QP and MR resources have become stale.
This series adds per-ucontext resource tracking and a reset notification
mechanism so that:
1. The RDMA driver is informed of service reset events via direct callbacks
from the ETH driver (reset_notify / resume_notify).
2. On reset, all tracked firmware handles are invalidated (set to
INVALID_MANA_HANDLE), user doorbell mappings are revoked via
rdma_user_mmap_disassociate(), and IB_EVENT_PORT_ERR is dispatched for
each port so userspace can detect the reset.
3. Destroy callbacks check for INVALID_MANA_HANDLE and skip firmware
commands for resources already invalidated by the reset path,
preventing stale handles from being sent to firmware.
4. A reset_rwsem serializes handle invalidation against resource creation
to avoid races between the reset path and new resource allocation.
Patches 1-6 introduce per-ucontext tracking lists for each resource type.
Patch 7 implements the reset/resume notification mechanism with rwsem
serialization, mmap revocation, and IB event dispatch.
Patch 8 adds INVALID_MANA_HANDLE checks in destroy callbacks.
Tested with DPDK testpmd on Azure VM (linux-next-20260306) — confirmed
IB_EVENT_PORT_ERR (type=10) and IB_EVENT_PORT_ACTIVE (type=9) are delivered
to userspace during service reset, and testpmd tears down cleanly afterwards.
Long Li (8):
RDMA/mana_ib: Track ucontext per device
RDMA/mana_ib: Track PD per ucontext
RDMA/mana_ib: Track CQ per ucontext
RDMA/mana_ib: Track WQ per ucontext
RDMA/mana_ib: Track QP per ucontext
RDMA/mana_ib: Track MR per ucontext
RDMA/mana_ib: Notify service reset events to RDMA devices
RDMA/mana_ib: Skip firmware commands for invalidated handles
drivers/infiniband/hw/mana/cq.c | 44 +++++--
drivers/infiniband/hw/mana/device.c | 105 ++++++++++++++++++
drivers/infiniband/hw/mana/main.c | 56 +++++++++-
drivers/infiniband/hw/mana/mana_ib.h | 19 ++++
drivers/infiniband/hw/mana/mr.c | 33 +++++-
drivers/infiniband/hw/mana/qp.c | 61 +++++++---
drivers/infiniband/hw/mana/wq.c | 24 ++++
drivers/net/ethernet/microsoft/mana/mana_en.c | 14 ++-
include/net/mana/gdma.h | 6 +
9 files changed, 331 insertions(+), 31 deletions(-)
--
2.43.0
* [PATCH net-next,V3, 3/3] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-03-06 23:19 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Dipayaan Roy, Shradha Gupta,
Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Aditya Garg,
Breno Leitao, linux-kernel, linux-rdma
Cc: paulros
In-Reply-To: <20260306231936.549499-1-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
For RX CQEs of type CQE_RX_COALESCED_4, add counters for how many CQEs
contain 2, 3, or 4 packets, so that coalescing efficiency can be
measured.
Also add a counter for the error case where the first packet has length 0.
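The counting rule can be sketched in user space as below. This is a
simplified model of the per-CQE loop, assuming the hypothetical names
sketch_rx_stats and sketch_count_cqe; buffer refill, skb delivery and the
u64_stats synchronization are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PPI 4	/* mirrors MANA_RXCOMP_OOB_NUM_PPI */

struct sketch_rx_stats {
	uint64_t packets;
	uint64_t pkt_len0_err;
	uint64_t coalesced_cqe[NUM_PPI - 1];	/* [0]: 2 pkts, [1]: 3, [2]: 4 */
};

/* Walk the per-packet lengths of one CQE and update the counters. */
static void sketch_count_cqe(struct sketch_rx_stats *st,
			     const uint32_t pkt_len[NUM_PPI], bool coalesced)
{
	int i;

	for (i = 0; i < NUM_PPI; i++) {
		if (pkt_len[i] == 0) {
			if (i > 1)
				/* i packets were seen; slot index is i - 2 */
				st->coalesced_cqe[i - 2]++;
			else if (i == 0)
				/* first packet must never be empty */
				st->pkt_len0_err++;
			break;
		}

		st->packets++;

		if (!coalesced)
			break;
	}

	/* The loop ran to completion: all NUM_PPI slots held a packet. */
	if (coalesced && i == NUM_PPI)
		st->coalesced_cqe[NUM_PPI - 2]++;
}
```

Note that a coalesced CQE carrying a single packet updates none of the
coalesced_cqe slots, since the smallest counted bucket is 2 packets.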
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 21 ++++++++++++++++++-
.../ethernet/microsoft/mana/mana_ethtool.c | 15 +++++++++++--
include/net/mana/mana.h | 9 +++++---
3 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c06fec50e51f..11ea2b17502d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2146,11 +2146,23 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
pktlen = oob->ppi[i].pkt_len;
if (pktlen == 0) {
- if (i == 0)
+ /* Collect coalesced CQE count based on packets processed.
+ * Coalesced CQEs have at least 2 packets, so index is i - 2.
+ */
+ if (i > 1) {
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.coalesced_cqe[i - 2]++;
+ u64_stats_update_end(&rxq->stats.syncp);
+ } else if (i == 0) {
+ /* Error case stat */
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.pkt_len0_err++;
+ u64_stats_update_end(&rxq->stats.syncp);
netdev_err_once(
ndev,
"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+ }
break;
}
@@ -2173,6 +2185,13 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
if (!coalesced)
break;
}
+
+ /* Coalesced CQE with all 4 packets */
+ if (coalesced && i == MANA_RXCOMP_OOB_NUM_PPI) {
+ u64_stats_update_begin(&rxq->stats.syncp);
+ rxq->stats.coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 2]++;
+ u64_stats_update_end(&rxq->stats.syncp);
+ }
}
static void mana_poll_rx_cq(struct mana_cq *cq)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 4b234b16e57a..6a4b42fe0944 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -149,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
{
struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
- int i;
+ int i, j;
if (stringset != ETH_SS_STATS)
return;
@@ -168,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
+ ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
}
for (i = 0; i < num_queues; i++) {
@@ -201,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
u64 xdp_xmit;
u64 xdp_drop;
u64 xdp_tx;
+ u64 pkt_len0_err;
+ u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
u64 tso_packets;
u64 tso_bytes;
u64 tso_inner_packets;
@@ -209,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
u64 short_pkt_fmt;
u64 csum_partial;
u64 mana_map_err;
- int q, i = 0;
+ int q, i = 0, j;
if (!apc->port_is_up)
return;
@@ -239,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
xdp_drop = rx_stats->xdp_drop;
xdp_tx = rx_stats->xdp_tx;
xdp_redirect = rx_stats->xdp_redirect;
+ pkt_len0_err = rx_stats->pkt_len0_err;
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
} while (u64_stats_fetch_retry(&rx_stats->syncp, start));
data[i++] = packets;
@@ -246,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
data[i++] = xdp_drop;
data[i++] = xdp_tx;
data[i++] = xdp_redirect;
+ data[i++] = pkt_len0_err;
+ for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+ data[i++] = coalesced_cqe[j];
}
for (q = 0; q < num_queues; q++) {
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a7f89e7ddc56..3336688fed5e 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -61,8 +61,11 @@ enum TRI_STATE {
#define MAX_PORTS_IN_MANA_DEV 256
+/* Maximum number of packets per coalesced CQE */
+#define MANA_RXCOMP_OOB_NUM_PPI 4
+
/* Update this count whenever the respective structures are changed */
-#define MANA_STATS_RX_COUNT 5
+#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
#define MANA_STATS_TX_COUNT 11
#define MANA_RX_FRAG_ALIGNMENT 64
@@ -73,6 +76,8 @@ struct mana_stats_rx {
u64 xdp_drop;
u64 xdp_tx;
u64 xdp_redirect;
+ u64 pkt_len0_err;
+ u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
struct u64_stats_sync syncp;
};
@@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
u32 pkt_hash;
}; /* HW DATA */
-#define MANA_RXCOMP_OOB_NUM_PPI 4
-
/* Receive completion OOB */
struct mana_rxcomp_oob {
struct mana_cqe_header cqe_hdr;
--
2.34.1
^ permalink raw reply related
* [PATCH net-next,V3, 2/3] net: mana: Add support for RX CQE Coalescing
From: Haiyang Zhang @ 2026-03-06 23:19 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Shradha Gupta, Dipayaan Roy,
Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Breno Leitao,
Aditya Garg, linux-kernel, linux-rdma
Cc: paulros
In-Reply-To: <20260306231936.549499-1-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
Our NIC can place up to 4 RX packets in one CQE. To support this feature,
check for and process the CQE type CQE_RX_COALESCED_4. The feature is
disabled by default, to avoid a possible latency regression.
Also add an ethtool handler to switch this feature. To turn it on, run:
ethtool -C <nic> rx-cqe-frames 4
To turn it off:
ethtool -C <nic> rx-cqe-frames 1
rx-cqe-nsecs is the timeout in nanoseconds, measured from the arrival of
the first packet, after which a coalesced CQE is sent. It is read-only for
this NIC.
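The get/set behaviour described above can be modelled in user space
roughly as follows. The names sketch_port, sketch_apply and the apply_err
field are hypothetical and exist only for this example; in the patch the
setting is pushed to hardware through the RSS configuration path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PPI 4		/* mirrors MANA_RXCOMP_OOB_NUM_PPI */
#define RX_CQE_NSEC_DEF 2048	/* mirrors MANA_RX_CQE_NSEC_DEF */

struct sketch_port {
	bool port_is_up;
	uint8_t cqe_coalescing_enable;
	uint32_t cqe_coalescing_timeout_ns;	/* 0 if firmware reported none */
	int apply_err;	/* hypothetical: error the next hardware apply returns */
};

/* Hypothetical stand-in for pushing the setting to hardware. */
static int sketch_apply(struct sketch_port *p)
{
	return p->apply_err;
}

/* Only 1 (off) and NUM_PPI (on) are accepted; roll back if apply fails. */
static int sketch_set_rx_cqe_frames(struct sketch_port *p, uint32_t frames)
{
	uint8_t saved = p->cqe_coalescing_enable;
	int err;

	if (frames != 1 && frames != NUM_PPI)
		return -EINVAL;

	p->cqe_coalescing_enable = (frames == NUM_PPI);

	/* With the port down, the value is just stored for the next bring-up. */
	if (!p->port_is_up)
		return 0;

	err = sketch_apply(p);
	if (err)
		p->cqe_coalescing_enable = saved;

	return err;
}

/* Report frames and timeout; fall back to the default for old firmware. */
static void sketch_get(const struct sketch_port *p, uint32_t *frames,
		       uint32_t *nsecs)
{
	*frames = p->cqe_coalescing_enable ? NUM_PPI : 1;
	*nsecs = p->cqe_coalescing_timeout_ns;

	if (p->port_is_up && p->cqe_coalescing_enable && !*nsecs)
		*nsecs = RX_CQE_NSEC_DEF;
}
```

Saving the old value before the update keeps the reported state in line
with what the hardware is actually running when the apply step fails.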
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 72 ++++++++++++-------
.../ethernet/microsoft/mana/mana_ethtool.c | 60 +++++++++++++++-
include/net/mana/mana.h | 8 ++-
3 files changed, 111 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..c06fec50e51f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1365,6 +1365,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
sizeof(resp));
req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+ req->hdr.resp.msg_version = GDMA_MESSAGE_V2;
req->vport = apc->port_handle;
req->num_indir_entries = apc->indir_table_sz;
@@ -1376,7 +1377,9 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
req->update_hashkey = update_key;
req->update_indir_tab = update_tab;
req->default_rxobj = apc->default_rxobj;
- req->cqe_coalescing_enable = 0;
+
+ if (rx != TRI_STATE_FALSE)
+ req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
if (update_key)
memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
@@ -1407,6 +1410,10 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
err = -EPROTO;
}
+ if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+ apc->cqe_coalescing_timeout_ns =
+ resp.cqe_coalescing_timeout_ns;
+
netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
apc->port_handle, apc->indir_table_sz);
out:
@@ -1915,11 +1922,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
}
static void mana_rx_skb(void *buf_va, bool from_pool,
- struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
+ struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
+ int i)
{
struct mana_stats_rx *rx_stats = &rxq->stats;
struct net_device *ndev = rxq->ndev;
- uint pkt_len = cqe->ppi[0].pkt_len;
+ uint pkt_len = cqe->ppi[i].pkt_len;
u16 rxq_idx = rxq->rxq_idx;
struct napi_struct *napi;
struct xdp_buff xdp = {};
@@ -1963,7 +1971,7 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
- hash_value = cqe->ppi[0].pkt_hash;
+ hash_value = cqe->ppi[i].pkt_hash;
if (cqe->rx_hashtype & MANA_HASH_L4)
skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
@@ -2098,9 +2106,11 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
struct mana_recv_buf_oob *rxbuf_oob;
struct mana_port_context *apc;
struct device *dev = gc->dev;
+ bool coalesced = false;
void *old_buf = NULL;
u32 curr, pktlen;
bool old_fp;
+ int i;
apc = netdev_priv(ndev);
@@ -2112,13 +2122,16 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
++ndev->stats.rx_dropped;
rxbuf_oob = &rxq->rx_oobs[rxq->buf_index];
netdev_warn_once(ndev, "Dropped a truncated packet\n");
- goto drop;
- case CQE_RX_COALESCED_4:
- netdev_err(ndev, "RX coalescing is unsupported\n");
- apc->eth_stats.rx_coalesced_err++;
+ mana_move_wq_tail(rxq->gdma_rq,
+ rxbuf_oob->wqe_inf.wqe_size_in_bu);
+ mana_post_pkt_rxq(rxq);
return;
+ case CQE_RX_COALESCED_4:
+ coalesced = true;
+ break;
+
case CQE_RX_OBJECT_FENCE:
complete(&rxq->fence_event);
return;
@@ -2130,30 +2143,36 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
return;
}
- pktlen = oob->ppi[0].pkt_len;
+ for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
+ pktlen = oob->ppi[i].pkt_len;
+ if (pktlen == 0) {
+ if (i == 0)
+ netdev_err_once(
+ ndev,
+ "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+ rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+ break;
+ }
- if (pktlen == 0) {
- /* data packets should never have packetlength of zero */
- netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
- rxq->gdma_id, cq->gdma_id, rxq->rxobj);
- return;
- }
+ curr = rxq->buf_index;
+ rxbuf_oob = &rxq->rx_oobs[curr];
+ WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- curr = rxq->buf_index;
- rxbuf_oob = &rxq->rx_oobs[curr];
- WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq);
+ mana_move_wq_tail(rxq->gdma_rq,
+ rxbuf_oob->wqe_inf.wqe_size_in_bu);
-drop:
- mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
+ mana_post_pkt_rxq(rxq);
- mana_post_pkt_rxq(rxq);
+ if (!coalesced)
+ break;
+ }
}
static void mana_poll_rx_cq(struct mana_cq *cq)
@@ -3332,6 +3351,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc->port_handle = INVALID_MANA_HANDLE;
apc->pf_filter_handle = INVALID_MANA_HANDLE;
apc->port_idx = port_idx;
+ apc->cqe_coalescing_enable = 0;
mutex_init(&apc->vport_mutex);
apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..4b234b16e57a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
tx_cqe_unknown_type)},
{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
tx_linear_pkt_cnt)},
- {"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
- rx_coalesced_err)},
{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
rx_cqe_unknown_type)},
};
@@ -390,6 +388,61 @@ static void mana_get_channels(struct net_device *ndev,
channel->combined_count = apc->num_queues;
}
+#define MANA_RX_CQE_NSEC_DEF 2048
+static int mana_get_coalesce(struct net_device *ndev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ kernel_coal->rx_cqe_frames =
+ apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
+
+ kernel_coal->rx_cqe_nsecs = apc->cqe_coalescing_timeout_ns;
+
+ /* Return the default timeout value for old FW not providing
+ * this value.
+ */
+ if (apc->port_is_up && apc->cqe_coalescing_enable &&
+ !kernel_coal->rx_cqe_nsecs)
+ kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
+
+ return 0;
+}
+
+static int mana_set_coalesce(struct net_device *ndev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+ u8 saved_cqe_coalescing_enable;
+ int err;
+
+ if (kernel_coal->rx_cqe_frames != 1 &&
+ kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "rx-frames must be 1 or %u, got %u",
+ MANA_RXCOMP_OOB_NUM_PPI,
+ kernel_coal->rx_cqe_frames);
+ return -EINVAL;
+ }
+
+ saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+ apc->cqe_coalescing_enable =
+ kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
+
+ if (!apc->port_is_up)
+ return 0;
+
+ err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+ if (err)
+ apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+
+ return err;
+}
+
static int mana_set_channels(struct net_device *ndev,
struct ethtool_channels *channels)
{
@@ -510,6 +563,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
}
const struct ethtool_ops mana_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.get_ethtool_stats = mana_get_ethtool_stats,
.get_sset_count = mana_get_sset_count,
.get_strings = mana_get_strings,
@@ -520,6 +574,8 @@ const struct ethtool_ops mana_ethtool_ops = {
.set_rxfh = mana_set_rxfh,
.get_channels = mana_get_channels,
.set_channels = mana_set_channels,
+ .get_coalesce = mana_get_coalesce,
+ .set_coalesce = mana_set_coalesce,
.get_ringparam = mana_get_ringparam,
.set_ringparam = mana_set_ringparam,
.get_link_ksettings = mana_get_link_ksettings,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..a7f89e7ddc56 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -378,7 +378,6 @@ struct mana_ethtool_stats {
u64 tx_cqe_err;
u64 tx_cqe_unknown_type;
u64 tx_linear_pkt_cnt;
- u64 rx_coalesced_err;
u64 rx_cqe_unknown_type;
};
@@ -557,6 +556,9 @@ struct mana_port_context {
bool port_is_up;
bool port_st_save; /* Saved port state */
+ u8 cqe_coalescing_enable;
+ u32 cqe_coalescing_timeout_ns;
+
struct mana_ethtool_stats eth_stats;
struct mana_ethtool_phy_stats phy_stats;
@@ -902,6 +904,10 @@ struct mana_cfg_rx_steer_req_v2 {
struct mana_cfg_rx_steer_resp {
struct gdma_resp_hdr hdr;
+
+ /* V2 */
+ u32 cqe_coalescing_timeout_ns;
+ u32 reserved1;
}; /* HW DATA */
/* Register HW vPort */
--
2.34.1
^ permalink raw reply related
* [PATCH net-next,V3, 1/3] net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
From: Haiyang Zhang @ 2026-03-06 23:19 UTC (permalink / raw)
To: linux-hyperv, netdev, Andrew Lunn, Jakub Kicinski,
David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Donald Hunter, Jonathan Corbet, Shuah Khan,
Kory Maincent (Dent Project), Gal Pressman, Oleksij Rempel,
Vadim Fedorenko, linux-kernel, linux-doc
Cc: haiyangz, paulros
In-Reply-To: <20260306231936.549499-1-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
Add two parameters for drivers supporting Rx CQE Coalescing.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Timeout in nanoseconds, measured from the arrival of the first packet,
after which a coalesced CQE is sent.
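As a rough illustration of how the two new parameter bits fit into a
driver's supported-parameter mask, here is a user-space sketch. The SK_*
macros and sketch_params_supported() are assumptions written for this
example; they mimic the BIT()/GENMASK() arithmetic in the patch but are
not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's BIT() and GENMASK() on 32 bits. */
#define SK_BIT(n)		(UINT32_C(1) << (n))
#define SK_GENMASK(h, l)	((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Bit positions taken from the patch: 29 and 30, mask widened to 30..0. */
#define SK_COALESCE_RX_CQE_FRAMES	SK_BIT(29)
#define SK_COALESCE_RX_CQE_NSECS	SK_BIT(30)
#define SK_COALESCE_ALL_PARAMS		SK_GENMASK(30, 0)

/* A request is acceptable only if every parameter it sets is supported. */
static bool sketch_params_supported(uint32_t supported, uint32_t requested)
{
	return (requested & ~supported) == 0;
}
```

A driver that advertises only the frames bit accepts a request that sets
frames and rejects one that sets nsecs, which is one way to make the
timeout effectively read-only.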
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
Documentation/netlink/specs/ethtool.yaml | 8 ++++++++
Documentation/networking/ethtool-netlink.rst | 10 ++++++++++
include/linux/ethtool.h | 6 +++++-
include/uapi/linux/ethtool_netlink_generated.h | 2 ++
net/ethtool/coalesce.c | 14 +++++++++++++-
5 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 4707063af3b4..d254e26c014c 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -861,6 +861,12 @@ attribute-sets:
name: tx-profile
type: nest
nested-attributes: profile
+ -
+ name: rx-cqe-frames
+ type: u32
+ -
+ name: rx-cqe-nsecs
+ type: u32
-
name: pause-stat
@@ -2257,6 +2263,8 @@ operations:
- tx-aggr-time-usecs
- rx-profile
- tx-profile
+ - rx-cqe-frames
+ - rx-cqe-nsecs
dump: *coalesce-get-op
-
name: coalesce-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 32179168eb73..a9fbb16891fa 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1076,6 +1076,8 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
+ ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` u32 max packets, Rx CQE
+ ``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` u32 delay (ns), Rx CQE
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@@ -1109,6 +1111,12 @@ well with frequent small-sized URBs transmissions.
to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
+Rx CQE coalescing allows multiple received packets to be coalesced into a single
+Completion Queue Entry (CQE). ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` describes the
+maximum number of frames that can be coalesced into a CQE.
+``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds after the
+first packet arrival in a coalesced CQE to be sent.
+
COALESCE_SET
============
@@ -1147,6 +1155,8 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
+ ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` u32 max packets, Rx CQE
+ ``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` u32 delay (ns), Rx CQE
=========================================== ====== =======================
Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 83c375840835..656d465bcd06 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -332,6 +332,8 @@ struct kernel_ethtool_coalesce {
u32 tx_aggr_max_bytes;
u32 tx_aggr_max_frames;
u32 tx_aggr_time_usecs;
+ u32 rx_cqe_frames;
+ u32 rx_cqe_nsecs;
};
/**
@@ -380,7 +382,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
#define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS BIT(26)
#define ETHTOOL_COALESCE_RX_PROFILE BIT(27)
#define ETHTOOL_COALESCE_TX_PROFILE BIT(28)
-#define ETHTOOL_COALESCE_ALL_PARAMS GENMASK(28, 0)
+#define ETHTOOL_COALESCE_RX_CQE_FRAMES BIT(29)
+#define ETHTOOL_COALESCE_RX_CQE_NSECS BIT(30)
+#define ETHTOOL_COALESCE_ALL_PARAMS GENMASK(30, 0)
#define ETHTOOL_COALESCE_USECS \
(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index 114b83017297..8134baf7860f 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -371,6 +371,8 @@ enum {
ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
ETHTOOL_A_COALESCE_RX_PROFILE,
ETHTOOL_A_COALESCE_TX_PROFILE,
+ ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+ ETHTOOL_A_COALESCE_RX_CQE_NSECS,
__ETHTOOL_A_COALESCE_CNT,
ETHTOOL_A_COALESCE_MAX = (__ETHTOOL_A_COALESCE_CNT - 1)
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 3e18ca1ccc5e..349bb02c517a 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -118,6 +118,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
nla_total_size(sizeof(u32)) + /* _TX_AGGR_MAX_BYTES */
nla_total_size(sizeof(u32)) + /* _TX_AGGR_MAX_FRAMES */
nla_total_size(sizeof(u32)) + /* _TX_AGGR_TIME_USECS */
+ nla_total_size(sizeof(u32)) + /* _RX_CQE_FRAMES */
+ nla_total_size(sizeof(u32)) + /* _RX_CQE_NSECS */
total_modersz * 2; /* _{R,T}X_PROFILE */
}
@@ -269,7 +271,11 @@ static int coalesce_fill_reply(struct sk_buff *skb,
coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,
kcoal->tx_aggr_max_frames, supported) ||
coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
- kcoal->tx_aggr_time_usecs, supported))
+ kcoal->tx_aggr_time_usecs, supported) ||
+ coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+ kcoal->rx_cqe_frames, supported) ||
+ coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_NSECS,
+ kcoal->rx_cqe_nsecs, supported))
return -EMSGSIZE;
if (!req_base->dev || !req_base->dev->irq_moder)
@@ -338,6 +344,8 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
+ [ETHTOOL_A_COALESCE_RX_CQE_FRAMES] = { .type = NLA_U32 },
+ [ETHTOOL_A_COALESCE_RX_CQE_NSECS] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_RX_PROFILE] =
NLA_POLICY_NESTED(coalesce_profile_policy),
[ETHTOOL_A_COALESCE_TX_PROFILE] =
@@ -570,6 +578,10 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES], &mod);
ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
+ ethnl_update_u32(&kernel_coalesce.rx_cqe_frames,
+ tb[ETHTOOL_A_COALESCE_RX_CQE_FRAMES], &mod);
+ ethnl_update_u32(&kernel_coalesce.rx_cqe_nsecs,
+ tb[ETHTOOL_A_COALESCE_RX_CQE_NSECS], &mod);
if (dev->irq_moder && dev->irq_moder->profile_flags & DIM_PROFILE_RX) {
ret = ethnl_update_profile(dev, &dev->irq_moder->rx_profile,
--
2.34.1
^ permalink raw reply related
* [PATCH net-next,V3, 0/3] add ethtool COALESCE_RX_CQE_FRAMES/NSECS and use it in MANA driver
From: Haiyang Zhang @ 2026-03-06 23:19 UTC (permalink / raw)
To: linux-hyperv, netdev; +Cc: haiyangz, paulros
From: Haiyang Zhang <haiyangz@microsoft.com>
Add two parameters for drivers supporting Rx CQE Coalescing.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Timeout in nanoseconds, measured from the arrival of the first packet,
after which a coalesced CQE is sent.
Also implement the new parameter and counters in the MANA driver.
Haiyang Zhang (3):
net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
net: mana: Add support for RX CQE Coalescing
net: mana: Add ethtool counters for RX CQEs in coalesced type
Documentation/netlink/specs/ethtool.yaml | 8 ++
Documentation/networking/ethtool-netlink.rst | 10 ++
drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++------
.../ethernet/microsoft/mana/mana_ethtool.c | 75 ++++++++++++++-
include/linux/ethtool.h | 6 +-
include/net/mana/mana.h | 17 +++-
.../uapi/linux/ethtool_netlink_generated.h | 2 +
net/ethtool/coalesce.c | 14 ++-
8 files changed, 187 insertions(+), 36 deletions(-)
--
2.34.1
^ permalink raw reply
* [PATCH net-next v3 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
From: Long Li @ 2026-03-06 21:33 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Shradha Gupta, Simon Horman, Konstantin Taranov,
Souradeep Chakrabarti, Erick Archer, linux-hyperv, netdev,
linux-kernel, linux-rdma
Cc: Long Li
In-Reply-To: <20260306213302.544681-1-longli@microsoft.com>
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
The driver now supports allocating a dedicated MSI-X vector for each EQ.
Indicate this capability through the driver capability bits.
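The get/put pairing and the error unwinding can be sketched in user space
as below. This is a simplified model, not the driver code: sketch_gc,
sk_get, sk_put and the fail_at field are hypothetical, and plain
per-vector reference counts stand in for the shared interrupt contexts.
It assumes 0 < num_msix_usable <= SK_MAX_MSIX.

```c
#include <assert.h>
#include <errno.h>

#define SK_MAX_MSIX 8

struct sketch_gc {
	unsigned int num_msix_usable;
	int refs[SK_MAX_MSIX];	/* reference count per MSI-X index */
	int fail_at;		/* hypothetical: EQ index that fails, -1 for none */
};

static void sk_get(struct sketch_gc *gc, unsigned int msix)
{
	gc->refs[msix]++;
}

static void sk_put(struct sketch_gc *gc, unsigned int msix)
{
	assert(gc->refs[msix] > 0);
	gc->refs[msix]--;
}

/* Index 0 serves the fatal-error EQ; EQ i uses (i + 1) % num_msix_usable,
 * so indices are shared once there are more EQs than usable vectors.
 */
static int sketch_create_eqs(struct sketch_gc *gc, int num_eqs)
{
	unsigned int msix;
	int i;

	sk_get(gc, 0);

	for (i = 0; i < num_eqs; i++) {
		msix = (i + 1) % gc->num_msix_usable;
		sk_get(gc, msix);

		if (i == gc->fail_at) {
			/* EQ creation failed: drop the reference just taken */
			sk_put(gc, msix);
			goto unwind;
		}
	}

	return 0;

unwind:
	/* Release the references of the EQs created before the failure. */
	while (i-- > 0)
		sk_put(gc, (i + 1) % gc->num_msix_usable);
	sk_put(gc, 0);
	return -ENOMEM;
}
```

Recomputing the index from i during unwinding works because the mapping
from EQ number to MSI-X index is a pure function of i.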
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 33 ++++++++++++++++++++++++++-----
include/net/mana/gdma.h | 7 +++++--
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index d51dd0ee85f4..0b74dd093b41 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -787,6 +787,7 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
struct gdma_queue_spec spec = {};
+ struct gdma_irq_context *gic;
int err, i;
spec.type = GDMA_EQ;
@@ -797,9 +798,15 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
spec.eq.msix_index = 0;
+ gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+ if (!gic)
+ return -ENOMEM;
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, 0);
return err;
+ }
mdev->eqs = kzalloc_objs(struct gdma_queue *,
mdev->ib_dev.num_comp_vectors);
@@ -810,31 +817,47 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.callback = NULL;
for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+ if (!gic) {
+ err = -ENOMEM;
+ goto destroy_eqs;
+ }
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, spec.eq.msix_index);
goto destroy_eqs;
+ }
}
return 0;
destroy_eqs:
- while (i-- > 0)
+ while (i-- > 0) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+ }
kfree(mdev->eqs);
destroy_fatal_eq:
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
return err;
}
void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
- int i;
+ int i, msi;
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
- for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+ for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ msi = (i + 1) % gc->num_msix_usable;
+ mana_gd_put_gic(gc, false, msi);
+ }
kfree(mdev->eqs);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 4e0278b00bbb..662e58f51e87 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -612,6 +612,7 @@ enum {
#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
#define GDMA_DRV_CAP_FLAG_1_GDMA_PAGES_4MB_1GB_2GB BIT(4)
#define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
+#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver can handle holes (zeros) in the device list */
#define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
@@ -628,7 +629,8 @@ enum {
/* Driver detects stalled send queues and recovers them */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
-#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
#define GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE BIT(20)
@@ -656,7 +658,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
- GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
#define GDMA_DRV_CAP_FLAGS2 0
--
2.43.0
^ permalink raw reply related
* [PATCH net-next v3 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
From: Long Li @ 2026-03-06 21:33 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Shradha Gupta, Simon Horman, Konstantin Taranov,
Souradeep Chakrabarti, Erick Archer, linux-hyperv, netdev,
linux-kernel, linux-rdma
Cc: Long Li
In-Reply-To: <20260306213302.544681-1-longli@microsoft.com>
Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++++++++++-
include/net/mana/gdma.h | 1 +
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index bdc9dc437fb7..81c0be96c94b 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -809,7 +809,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
}
spin_unlock_irqrestore(&gic->lock, flags);
- queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
synchronize_rcu();
}
@@ -924,6 +923,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
out:
dev_err(dev, "Failed to create EQ: %d\n", err);
mana_gd_destroy_eq(gc, false, queue);
+ queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
return err;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index bfa0f354355d..8a60e7567951 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1598,6 +1598,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
+ unsigned int msi;
if (!apc->eqs)
return;
@@ -1610,7 +1611,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
if (!eq)
continue;
+ msi = eq->eq.msix_index;
mana_gd_destroy_queue(gc, eq);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
}
kfree(apc->eqs);
@@ -1627,6 +1630,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+ debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
@@ -1637,6 +1641,7 @@ int mana_create_eq(struct mana_port_context *apc)
struct gdma_queue_spec spec = {};
int err;
int i;
+ struct gdma_irq_context *gic;
WARN_ON(apc->eqs);
apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
@@ -1653,12 +1658,22 @@ int mana_create_eq(struct mana_port_context *apc)
apc->mana_eqs_debugfs = debugfs_create_dir("EQs", apc->mana_port_debugfs);
for (i = 0; i < apc->num_queues; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ if (gc->msi_sharing)
+ spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, !gc->msi_sharing, &spec.eq.msix_index);
+ if (!gic) {
+ err = -ENOMEM;
+ goto out;
+ }
+
err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+ mana_gd_put_gic(gc, !gc->msi_sharing, spec.eq.msix_index);
goto out;
}
+ apc->eqs[i].eq->eq.irq = gic->irq;
mana_create_eq_debugfs(apc, i);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 1a7f4abe7a8b..4e0278b00bbb 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -342,6 +342,7 @@ struct gdma_queue {
void *context;
unsigned int msix_index;
+ unsigned int irq;
u32 log2_throttle_limit;
} eq;
--
2.43.0
* [PATCH net-next v3 4/6] net: mana: Use GIC functions to allocate global EQs
From: Long Li @ 2026-03-06 21:33 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Shradha Gupta, Simon Horman, Konstantin Taranov,
Souradeep Chakrabarti, Erick Archer, linux-hyperv, netdev,
linux-kernel, linux-rdma
Cc: Long Li
In-Reply-To: <20260306213302.544681-1-longli@microsoft.com>
Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 83 +++----------------
1 file changed, 10 insertions(+), 73 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index c43fd8089e77..bdc9dc437fb7 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1831,30 +1831,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
* further used in irq_setup()
*/
for (i = 1; i <= nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- /* one pci vector is already allocated for HWC */
- irqs[i - 1] = pci_irq_vector(pdev, i);
- if (irqs[i - 1] < 0) {
- err = irqs[i - 1];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i - 1] = gic->irq;
}
/*
@@ -1876,19 +1859,11 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
kfree(irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
for (i -= 1; i > 0; i--) {
irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
kfree(irqs);
return err;
@@ -1909,34 +1884,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
start_irqs = irqs;
for (i = 0; i < nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- if (!i)
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
- pci_name(pdev));
- else
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- irqs[i] = pci_irq_vector(pdev, i);
- if (irqs[i] < 0) {
- err = irqs[i];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
-
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i] = gic->irq;
}
/* If number of IRQ is one extra than number of online CPUs,
@@ -1965,19 +1919,11 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
kfree(start_irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
for (i -= 1; i >= 0; i--) {
irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
kfree(start_irqs);
@@ -2052,26 +1998,17 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
static void mana_gd_remove_irqs(struct pci_dev *pdev)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
- struct gdma_irq_context *gic;
int irq, i;
if (gc->max_num_msix < 1)
return;
- for (i = 0; i < gc->max_num_msix; i++) {
- irq = pci_irq_vector(pdev, i);
- if (irq < 0)
- continue;
-
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
+ for (i = 0; i < (gc->msi_sharing ? gc->max_num_msix : 1); i++) {
/* Need to clear the hint before free_irq */
+ irq = pci_irq_vector(pdev, i);
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+
+ mana_gd_put_gic(gc, false, i);
}
pci_free_irq_vectors(pdev);
--
2.43.0
* [PATCH net-next v3 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
From: Long Li @ 2026-03-06 21:32 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Shradha Gupta, Simon Horman, Konstantin Taranov,
Souradeep Chakrabarti, Erick Archer, linux-hyperv, netdev,
linux-kernel, linux-rdma
Cc: Long Li
In-Reply-To: <20260306213302.544681-1-longli@microsoft.com>
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.
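The get/put pattern described above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the driver implementation: irq_ctx, ctx_table, ctx_get() and ctx_put() are hypothetical names, a fixed array stands in for the xarray, and locking, IRQ registration and the MSI bitmap are left out.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_MSI 8	/* hypothetical table size for this sketch */

/* Simplified stand-in for an interrupt context shared by several EQs. */
struct irq_ctx {
	int msi;
	unsigned int refcount;
};

/* One slot per MSI index; NULL means no context exists yet. */
static struct irq_ctx *ctx_table[MAX_MSI];

/* Return the context for @msi, creating it on first use. */
static struct irq_ctx *ctx_get(int msi)
{
	struct irq_ctx *ctx;

	assert(msi >= 0 && msi < MAX_MSI);

	ctx = ctx_table[msi];
	if (ctx) {
		/* Already set up by another consumer: take a reference. */
		ctx->refcount++;
		return ctx;
	}

	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return NULL;

	ctx->msi = msi;
	ctx->refcount = 1;
	ctx_table[msi] = ctx;
	return ctx;
}

/* Drop one reference; free the context when the last user is gone. */
static void ctx_put(int msi)
{
	struct irq_ctx *ctx;

	assert(msi >= 0 && msi < MAX_MSI);

	ctx = ctx_table[msi];
	if (!ctx)
		return;

	if (--ctx->refcount)
		return;

	ctx_table[msi] = NULL;
	free(ctx);
}
```

Two consumers calling ctx_get(3) receive the same pointer, and the slot is only released after both have called ctx_put(3).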
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 158 ++++++++++++++++++
include/net/mana/gdma.h | 10 ++
2 files changed, 168 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a6ab2f053fe9..c43fd8089e77 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1559,6 +1559,163 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
return IRQ_HANDLED;
}
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map;
+ struct gdma_irq_context *gic;
+ int irq;
+
+ mutex_lock(&gc->gic_mutex);
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (WARN_ON(!gic)) {
+ mutex_unlock(&gc->gic_mutex);
+ return;
+ }
+
+ if (use_msi_bitmap)
+ gic->bitmap_refs--;
+
+ if (use_msi_bitmap && gic->bitmap_refs == 0)
+ clear_bit(msi, gc->msi_bitmap);
+
+ if (!refcount_dec_and_test(&gic->refcount))
+ goto out;
+
+ irq = pci_irq_vector(dev, msi);
+
+ irq_update_affinity_hint(irq, NULL);
+ free_irq(irq, gic);
+
+ if (pci_msix_can_alloc_dyn(dev)) {
+ irq_map.virq = irq;
+ irq_map.index = msi;
+ pci_msix_free_irq(dev, irq_map);
+ }
+
+ xa_erase(&gc->irq_contexts, msi);
+ kfree(gic);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on an MSI vector.
+ * An MSI can be shared between different EQs. This function supports setting
+ * up separate MSIs using a bitmap, or directly using the MSI index.
+ *
+ * @use_msi_bitmap:
+ * True if MSI is assigned by this function on available slots from bitmap.
+ * False if MSI is passed from *msi_requested
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested)
+{
+ struct gdma_irq_context *gic;
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map = { };
+ int irq;
+ int msi;
+ int err;
+
+ mutex_lock(&gc->gic_mutex);
+
+ if (use_msi_bitmap) {
+ msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+ if (msi >= gc->num_msix_usable) {
+ dev_err(gc->dev, "No free MSI vectors available\n");
+ gic = NULL;
+ goto out;
+ }
+ *msi_requested = msi;
+ } else {
+ msi = *msi_requested;
+ }
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (gic) {
+ refcount_inc(&gic->refcount);
+ if (use_msi_bitmap) {
+ gic->bitmap_refs++;
+ set_bit(msi, gc->msi_bitmap);
+ }
+ goto out;
+ }
+
+ irq = pci_irq_vector(dev, msi);
+ if (irq == -EINVAL) {
+ irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+ if (!irq_map.virq) {
+ err = irq_map.index;
+ dev_err(gc->dev,
+ "Failed to alloc irq_map msi %d err %d\n",
+ msi, err);
+ gic = NULL;
+ goto out;
+ }
+ irq = irq_map.virq;
+ msi = irq_map.index;
+ }
+
+ gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+ if (!gic) {
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->handler = mana_gd_process_eq_events;
+ gic->msi = msi;
+ gic->irq = irq;
+ INIT_LIST_HEAD(&gic->eq_list);
+ spin_lock_init(&gic->lock);
+
+ if (!gic->msi)
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+ pci_name(dev));
+ else
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+ gic->msi, pci_name(dev));
+
+ err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+ if (err) {
+ dev_err(gc->dev, "Failed to request irq %d %s\n",
+ irq, gic->name);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ refcount_set(&gic->refcount, 1);
+ gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+ err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+ if (err) {
+ dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+ msi, err);
+ free_irq(irq, gic);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ if (use_msi_bitmap)
+ set_bit(msi, gc->msi_bitmap);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+ return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
{
r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2044,6 +2201,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto release_region;
mutex_init(&gc->eq_test_event_mutex);
+ mutex_init(&gc->gic_mutex);
pci_set_drvdata(pdev, gc);
gc->bar0_pa = pci_resource_start(pdev, 0);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index b744253b44e8..1a7f4abe7a8b 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -388,6 +388,10 @@ struct gdma_irq_context {
spinlock_t lock;
struct list_head eq_list;
char name[MANA_IRQ_NAME_SZ];
+ unsigned int msi;
+ unsigned int irq;
+ refcount_t refcount;
+ unsigned int bitmap_refs;
};
enum gdma_context_flags {
@@ -447,6 +451,9 @@ struct gdma_context {
unsigned long flags;
+ /* Protect access to GIC context */
+ struct mutex gic_mutex;
+
/* Indicate if this device is sharing MSI for EQs on MANA */
bool msi_sharing;
@@ -1019,6 +1026,9 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc, bool use_msi_bitmap,
+ int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
u32 proto_minor_ver, u32 proto_micro_ver,
u16 *max_num_vports, u8 *bm_hostmode);
--
2.43.0
* [PATCH net-next v3 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-03-06 21:32 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Shradha Gupta, Simon Horman, Konstantin Taranov,
Souradeep Chakrabarti, Erick Archer, linux-hyperv, netdev,
linux-kernel, linux-rdma
Cc: Long Li
In-Reply-To: <20260306213302.544681-1-longli@microsoft.com>
When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The number of queues per vPort
is rounded up to a power of two, raised to at least 16, and then capped
at the hardware maximum. MSI-X sharing among vPorts is disabled by
default and is only enabled when there are not enough MSI-X vectors
for dedicated allocation.
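For illustration, the arithmetic described above can be modeled in user space as below. This is a minimal sketch that mirrors the calculation in mana_gd_query_max_resources() from this patch. The names plan_queues(), struct queue_plan and roundup_pow2() are hypothetical, and the sketch assumes one vector is reserved for the HWC and that num_ports is nonzero.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's roundup_pow_of_two(); expects v >= 1. */
static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Result of the planning step: queues per vPort and the sharing mode. */
struct queue_plan {
	unsigned int queues_per_vport;
	bool msi_sharing;
};

static struct queue_plan plan_queues(unsigned int num_msix_usable,
				     unsigned int hw_max_queues,
				     unsigned int num_ports)
{
	struct queue_plan plan = { 0, false };
	unsigned int avail;
	unsigned int q;

	assert(num_msix_usable > 1);
	assert(num_ports > 0);

	/* One vector is reserved for the HWC. */
	avail = num_msix_usable - 1;
	if (hw_max_queues > avail)
		hw_max_queues = avail;

	/* Even split across vPorts, rounded up to a power of two. */
	q = avail / num_ports;
	q = roundup_pow2(q ? q : 1);

	/* Raise to at least 16, then cap at the hardware maximum. */
	if (q < 16)
		q = 16;
	if (q > hw_max_queues)
		q = hw_max_queues;

	/* Not enough vectors for dedicated allocation: fall back to sharing. */
	if (q * num_ports > avail)
		plan.msi_sharing = true;

	if (plan.msi_sharing)
		plan.queues_per_vport = avail < hw_max_queues ? avail : hw_max_queues;
	else
		plan.queues_per_vport = q;

	return plan;
}
```

For example, with 65 usable vectors, a hardware maximum of 64 queues and 2 vPorts, each vPort gets 32 queues on dedicated vectors. With 17 usable vectors and 4 vPorts, the minimum of 16 queues per vPort no longer fits, so sharing is enabled.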
Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 66 ++++++++++++++++---
drivers/net/ethernet/microsoft/mana/mana_en.c | 36 +++++-----
include/net/mana/gdma.h | 13 +++-
3 files changed, 91 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index aef8612b73cb..a6ab2f053fe9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -107,6 +107,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_query_max_resources_resp resp = {};
struct gdma_general_req req = {};
+ unsigned int max_num_queues;
+ u8 bm_hostmode;
+ u16 num_ports;
int err;
mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
@@ -152,6 +155,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues > gc->num_msix_usable - 1)
gc->max_num_queues = gc->num_msix_usable - 1;
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ if (err)
+ return err;
+
+ if (!num_ports)
+ return -EINVAL;
+
+ /*
+ * Adjust gc->max_num_queues returned from the SOC to allow dedicated MSIx
+ * for each vPort. Reduce max_num_queues to no less than 16 if necessary
+ */
+ max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+ max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
+ if (max_num_queues < 16)
+ max_num_queues = 16;
+
+ /*
+ * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
+ * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
+ */
+ max_num_queues = min(gc->max_num_queues, max_num_queues);
+ if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+ gc->msi_sharing = true;
+
+ /* If MSI is shared, use max allowed value */
+ if (gc->msi_sharing)
+ gc->max_num_queues_vport = min(gc->num_msix_usable - 1, gc->max_num_queues);
+ else
+ gc->max_num_queues_vport = max_num_queues;
+
+ dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+ gc->msi_sharing, gc->max_num_queues);
+
return 0;
}
@@ -1803,6 +1840,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
/* Need 1 interrupt for HWC */
max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
min_irqs = 2;
+ gc->msi_sharing = true;
}
nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1881,6 +1919,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
pci_free_irq_vectors(pdev);
+ bitmap_free(gc->msi_bitmap);
+ gc->msi_bitmap = NULL;
gc->max_num_msix = 0;
gc->num_msix_usable = 0;
}
@@ -1912,20 +1952,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
if (err)
goto destroy_hwc;
- err = mana_gd_query_max_resources(pdev);
+ err = mana_gd_detect_devices(pdev);
if (err)
goto destroy_hwc;
- err = mana_gd_setup_remaining_irqs(pdev);
- if (err) {
- dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
- goto destroy_hwc;
- }
-
- err = mana_gd_detect_devices(pdev);
+ err = mana_gd_query_max_resources(pdev);
if (err)
goto destroy_hwc;
+ if (!gc->msi_sharing) {
+ gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+ if (!gc->msi_bitmap) {
+ err = -ENOMEM;
+ goto destroy_hwc;
+ }
+ /* Set bit for HWC */
+ set_bit(0, gc->msi_bitmap);
+ } else {
+ err = mana_gd_setup_remaining_irqs(pdev);
+ if (err) {
+ dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
+ goto destroy_hwc;
+ }
+ }
+
dev_dbg(&pdev->dev, "mana gdma setup successful\n");
return 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 428dafaf315b..bfa0f354355d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1000,10 +1000,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
return !apc->rxqs ? -ENOMEM : 0;
}
-static int mana_send_request(struct mana_context *ac, void *in_buf,
- u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_resp_hdr *resp = out_buf;
struct gdma_req_hdr *req = in_buf;
struct device *dev = gc->dev;
@@ -1037,6 +1036,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
return 0;
}
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
+{
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
const enum mana_command_code expected_code,
const u32 min_size)
@@ -1170,11 +1177,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
err, resp.hdr.status);
}
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
- u32 proto_minor_ver, u32 proto_micro_ver,
- u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct mana_query_device_cfg_resp resp = {};
struct mana_query_device_cfg_req req = {};
struct device *dev = gc->dev;
@@ -1189,7 +1195,7 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
req.proto_minor_ver = proto_minor_ver;
req.proto_micro_ver = proto_micro_ver;
- err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+ err = gdma_mana_send_request(gc, &req, sizeof(req), &resp, sizeof(resp));
if (err) {
dev_err(dev, "Failed to query config: %d", err);
return err;
@@ -1217,8 +1223,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
else
*bm_hostmode = 0;
- debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
return 0;
}
@@ -3329,7 +3333,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
int err;
ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
- gc->max_num_queues);
+ gc->max_num_queues_vport);
if (!ndev)
return -ENOMEM;
@@ -3338,8 +3342,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc = netdev_priv(ndev);
apc->ac = ac;
apc->ndev = ndev;
- apc->max_queues = gc->max_num_queues;
- apc->num_queues = gc->max_num_queues;
+ apc->max_queues = gc->max_num_queues_vport;
+ apc->num_queues = gc->max_num_queues_vport;
apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
apc->port_handle = INVALID_MANA_HANDLE;
@@ -3598,13 +3602,15 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
gd->driver_data = ac;
}
- err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
- MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
goto out;
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
+
if (!resuming) {
ac->num_ports = num_ports;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index ec17004b10c0..b744253b44e8 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -399,8 +399,10 @@ struct gdma_context {
struct device *dev;
struct dentry *mana_pci_debugfs;
- /* Per-vPort max number of queues */
+ /* Hardware max number of queues */
unsigned int max_num_queues;
+ /* Per-vPort max number of queues */
+ unsigned int max_num_queues_vport;
unsigned int max_num_msix;
unsigned int num_msix_usable;
struct xarray irq_contexts;
@@ -444,6 +446,12 @@ struct gdma_context {
struct workqueue_struct *service_wq;
unsigned long flags;
+
+ /* Indicate if this device is sharing MSI for EQs on MANA */
+ bool msi_sharing;
+
+ /* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+ unsigned long *msi_bitmap;
};
static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -1011,4 +1019,7 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode);
#endif /* _GDMA_H */
--
2.43.0