* [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
@ 2026-05-08 22:11 Long Li
2026-05-08 22:11 ` [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort Long Li
` (5 more replies)
0 siblings, 6 replies; 9+ messages in thread
From: Long Li @ 2026-05-08 22:11 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
This series moves EQ ownership from the shared mana_context to the per-vPort
mana_port_context, enabling each vPort to have dedicated MSI-X vectors
when the hardware provides enough vectors. When vectors are limited, the
driver falls back to sharing MSI-X among vPorts.
The series introduces a GDMA IRQ Context (GIC) abstraction with reference
counting to manage interrupt context lifecycle. This allows both Ethernet
and RDMA EQs to dynamically acquire dedicated or shared MSI-X vectors at
vPort creation time rather than pre-allocating all vectors at probe time.
Key changes:
- Per-vPort EQ allocation with exported lifecycle functions for RDMA use
- Device capability query to determine dedicated vs shared MSI-X mode
- GIC context with refcounting for flexible interrupt management
- On-demand interrupt context allocation when creating vPort EQs
- RDMA EQ integration with the GIC framework
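For orientation, the core of the ownership move is the relocation of the EQ
array between contexts (a sketch of the struct changes; the full hunks are in
patch 1):

    /* before: adapter-wide, shared by all ports */
    struct mana_context {
            ...
            struct mana_eq *eqs;
            struct dentry *mana_eqs_debugfs;
    };

    /* after: owned by each port */
    struct mana_port_context {
            ...
            struct mana_eq *eqs;
            struct dentry *mana_eqs_debugfs;
    };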
Changes in v8:
- Fix comment to reference per-vPort queue count instead of
gc->max_num_queues (patch 2)
- Remove duplicate irq_update_affinity_hint() calls from error paths
and mana_gd_remove_irqs(); the clearing is now centralized in
mana_gd_put_gic() (patch 4)
- Note the IRQ name change (mana_q -> mana_msi) in the commit
message (patch 4)
- Remove dead conditional write to spec.eq.msix_index (patch 5)
- Document GIC ownership contract and msix_index invariant change
in commit message (patch 5)
- Populate eq.irq on RDMA EQs for consistency with the Ethernet
path (patch 6)
- Document BIT(6) relocation and capability flag semantics in
commit message (patch 6)
- Fix checkpatch --strict alignment and line length warnings
Changes in v7:
- Use rounddown_pow_of_two() instead of roundup_pow_of_two() when
computing per-vPort queue count to avoid unnecessarily forcing shared
MSI-X mode (patch 2)
- Call mana_gd_setup_remaining_irqs() unconditionally to ensure
irq_contexts are populated in both dedicated and shared MSI-X modes,
fixing bisectability between patches 2 and 5 (patch 2)
- Guard ibdev_dbg() in mana_ib_cfg_vport() with error check so the
vport handle is not logged on the failure path (patch 1)
- Use cached gic->irq instead of pci_irq_vector() lookup in
mana_gd_put_gic() for consistency with the allocation path (patch 3)
- Fix unsigned int* to int* pointer type mismatch when calling
mana_gd_get_gic() by using a local int variable for the MSI index
(patches 5, 6)
Changes in v6:
- Rebased on net-next/main (v7.1-rc1)
Changes in v5:
- Rebased on net-next/main
Changes in v4:
- Rebased on net-next/main (v7.0-rc4)
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
pci_msix_can_alloc_dyn() on each call; improved remove_irqs iteration
to skip unallocated entries
Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
prevent NULL pointer dereference when RSS QP is created before a raw QP
has configured the vport and allocated EQs
Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")
Long Li (6):
net: mana: Create separate EQs for each vPort
net: mana: Query device capabilities and configure MSI-X sharing for
EQs
net: mana: Introduce GIC context with refcounting for interrupt
management
net: mana: Use GIC functions to allocate global EQs
net: mana: Allocate interrupt context for each EQ when creating vPort
RDMA/mana_ib: Allocate interrupt contexts on EQs
drivers/infiniband/hw/mana/main.c | 62 +++-
drivers/infiniband/hw/mana/qp.c | 16 +-
.../net/ethernet/microsoft/mana/gdma_main.c | 316 +++++++++++++-----
drivers/net/ethernet/microsoft/mana/mana_en.c | 169 ++++++----
include/net/mana/gdma.h | 33 +-
include/net/mana/mana.h | 7 +-
6 files changed, 434 insertions(+), 169 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
@ 2026-05-08 22:11 ` Long Li
2026-05-12 11:34 ` Paolo Abeni
2026-05-08 22:11 ` [PATCH net-next v8 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
` (4 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: Long Li @ 2026-05-08 22:11 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 19 ++-
drivers/infiniband/hw/mana/qp.c | 16 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 111 ++++++++++--------
include/net/mana/mana.h | 7 +-
4 files changed, 98 insertions(+), 55 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index ac5e75dd3494..8000ab6e8beb 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
pd->vport_use_count--;
WARN_ON(pd->vport_use_count < 0);
- if (!pd->vport_use_count)
+ if (!pd->vport_use_count) {
+ mana_destroy_eq(mpc);
mana_uncfg_vport(mpc);
+ }
mutex_unlock(&pd->vport_mutex);
}
@@ -55,15 +57,22 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
return err;
}
- mutex_unlock(&pd->vport_mutex);
pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
pd->tx_vp_offset = mpc->tx_vp_offset;
+ err = mana_create_eq(mpc);
+ if (err) {
+ mana_uncfg_vport(mpc);
+ pd->vport_use_count--;
+ }
- ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
- mpc->port_handle, pd->pdn, doorbell_id);
+ mutex_unlock(&pd->vport_mutex);
- return 0;
+ if (!err)
+ ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
+ mpc->port_handle, pd->pdn, doorbell_id);
+
+ return err;
}
int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 645581359cee..6f1043383e8c 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
cq_spec.gdma_region = cq->queue.gdma_region;
cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- eq = &mpc->ac->eqs[cq->comp_vector];
+ /* EQs are created when a raw QP configures the vport.
+ * A raw QP must be created before creating rwq_ind_tbl.
+ */
+ if (!mpc->eqs) {
+ ret = -EINVAL;
+ i--;
+ goto fail;
+ }
+ eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
@@ -317,7 +325,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
eq_vec = send_cq->comp_vector;
- eq = &mpc->ac->eqs[eq_vec];
+ if (!mpc->eqs) {
+ err = -EINVAL;
+ goto err_destroy_queue;
+ }
+ eq = &mpc->eqs[eq_vec % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 462a457e7d53..2f3d619e0f2e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1615,78 +1615,83 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
}
EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
{
+ struct mana_context *ac = apc->ac;
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
- if (!ac->eqs)
+ if (!apc->eqs)
return;
- debugfs_remove_recursive(ac->mana_eqs_debugfs);
- ac->mana_eqs_debugfs = NULL;
+ debugfs_remove_recursive(apc->mana_eqs_debugfs);
+ apc->mana_eqs_debugfs = NULL;
- for (i = 0; i < gc->max_num_queues; i++) {
- eq = ac->eqs[i].eq;
+ for (i = 0; i < apc->num_queues; i++) {
+ eq = apc->eqs[i].eq;
if (!eq)
continue;
mana_gd_destroy_queue(gc, eq);
}
- kfree(ac->eqs);
- ac->eqs = NULL;
+ kfree(apc->eqs);
+ apc->eqs = NULL;
}
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
{
- struct mana_eq eq = ac->eqs[i];
+ struct mana_eq eq = apc->eqs[i];
char eqnum[32];
sprintf(eqnum, "eq%d", i);
- eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+ eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
{
- struct gdma_dev *gd = ac->gdma_dev;
+ struct gdma_dev *gd = apc->ac->gdma_dev;
struct gdma_context *gc = gd->gdma_context;
struct gdma_queue_spec spec = {};
int err;
int i;
- ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
- if (!ac->eqs)
+ WARN_ON(apc->eqs);
+ apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
+ if (!apc->eqs)
return -ENOMEM;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
spec.queue_size = EQ_SIZE;
spec.eq.callback = NULL;
- spec.eq.context = ac->eqs;
+ spec.eq.context = apc->eqs;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+ apc->mana_eqs_debugfs =
+ debugfs_create_dir("EQs", apc->mana_port_debugfs);
- for (i = 0; i < gc->max_num_queues; i++) {
+ for (i = 0; i < apc->num_queues; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
- err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+ err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
goto out;
}
- mana_create_eq_debugfs(ac, i);
+ mana_create_eq_debugfs(apc, i);
}
return 0;
out:
- mana_destroy_eq(ac);
+ mana_destroy_eq(apc);
return err;
}
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
{
@@ -2451,7 +2456,7 @@ static int mana_create_txq(struct mana_port_context *apc,
spec.monitor_avl_buf = false;
spec.queue_size = cq_size;
spec.cq.callback = mana_schedule_napi;
- spec.cq.parent_eq = ac->eqs[i].eq;
+ spec.cq.parent_eq = apc->eqs[i].eq;
spec.cq.context = cq;
err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
if (err)
@@ -2844,13 +2849,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
static int mana_add_rx_queues(struct mana_port_context *apc,
struct net_device *ndev)
{
- struct mana_context *ac = apc->ac;
struct mana_rxq *rxq;
int err = 0;
int i;
for (i = 0; i < apc->num_queues; i++) {
- rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+ rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
if (!rxq) {
err = -ENOMEM;
netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2869,9 +2873,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
return err;
}
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
{
- struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_rxq *rxq;
u32 rxq_idx;
@@ -2883,8 +2886,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
mana_destroy_rxq(apc, rxq, true);
apc->rxqs[rxq_idx] = NULL;
}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+ struct gdma_dev *gd = apc->ac->gdma_dev;
- mana_destroy_txq(apc);
mana_uncfg_vport(apc);
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2905,11 +2912,7 @@ static int mana_create_vport(struct mana_port_context *apc,
return err;
}
- err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
- if (err)
- return err;
-
- return mana_create_txq(apc, net);
+ return mana_cfg_vport(apc, gd->pdid, gd->doorbell);
}
static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3195,21 +3198,36 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_create_vport(apc, ndev);
if (err) {
- netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+ netdev_err(ndev, "Failed to create vPort %u : %d\n",
+ apc->port_idx, err);
return err;
}
+ err = mana_create_eq(apc);
+ if (err) {
+ netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_vport;
+ }
+
+ err = mana_create_txq(apc, ndev);
+ if (err) {
+ netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_eq;
+ }
+
err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
if (err) {
netdev_err(ndev,
"netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_txq;
}
err = mana_add_rx_queues(apc, ndev);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
@@ -3218,7 +3236,7 @@ int mana_alloc_queues(struct net_device *ndev)
netdev_err(ndev,
"netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_rss_table_init(apc);
@@ -3226,19 +3244,25 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
if (err) {
netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
- goto destroy_vport;
+ goto destroy_rxq;
}
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
err = mana_pf_register_filter(apc);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_chn_setxdp(apc, mana_xdp_get(apc));
return 0;
+destroy_rxq:
+ mana_destroy_rxqs(apc);
+destroy_txq:
+ mana_destroy_txq(apc);
+destroy_eq:
+ mana_destroy_eq(apc);
destroy_vport:
mana_destroy_vport(apc);
return err;
@@ -3343,6 +3367,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
mana_fence_rqs(apc);
/* Even in err case, still need to cleanup the vPort */
+ mana_destroy_rxqs(apc);
+ mana_destroy_txq(apc);
+ mana_destroy_eq(apc);
mana_destroy_vport(apc);
return 0;
@@ -3663,12 +3690,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
- err = mana_create_eq(ac);
- if (err) {
- dev_err(dev, "Failed to create EQs: %d\n", err);
- goto out;
- }
-
err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
@@ -3808,8 +3829,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
free_netdev(ndev);
}
- mana_destroy_eq(ac);
-
if (ac->per_port_queue_reset_wq) {
destroy_workqueue(ac->per_port_queue_reset_wq);
ac->per_port_queue_reset_wq = NULL;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aa90a858c8e3..c8e7d16f6685 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,8 +480,6 @@ struct mana_context {
u8 bm_hostmode;
struct mana_ethtool_hc_stats hc_stats;
- struct mana_eq *eqs;
- struct dentry *mana_eqs_debugfs;
struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
@@ -501,6 +499,9 @@ struct mana_port_context {
u8 mac_addr[ETH_ALEN];
+ struct mana_eq *eqs;
+ struct dentry *mana_eqs_debugfs;
+
enum TRI_STATE rss_state;
mana_handle_t default_rxobj;
@@ -1034,6 +1035,8 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
u32 doorbell_pg_id);
void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
struct net_device *mana_get_primary_netdev(struct mana_context *ac,
u32 port_index,
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v8 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-08 22:11 ` [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort Long Li
@ 2026-05-08 22:11 ` Long Li
2026-05-08 22:11 ` [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-08 22:11 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The number of queues per vPort
is clamped to no less than MANA_DEF_NUM_QUEUES. MSI-X sharing among
vPorts is disabled by default and is only enabled when there are not
enough MSI-X vectors for dedicated allocation.
Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.
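A worked example of the sizing logic above (hypothetical vector counts,
assuming MANA_DEF_NUM_QUEUES == 16 and a hardware limit of
gc->max_num_queues == 32):

    num_msix_usable = 65, num_ports = 2:
      (65 - 1) / 2 = 32 -> rounddown_pow_of_two(32) = 32
      32 >= 16, no clamp; min(32, 32) = 32
      32 * 2 = 64 <= 65 - 1  -> dedicated MSI-X, max_num_queues_vport = 32

    num_msix_usable = 17, num_ports = 4:
      (17 - 1) / 4 = 4 -> rounddown_pow_of_two(4) = 4
      4 < 16, clamp to 16; min(32, 16) = 16
      16 * 4 = 64 > 17 - 1   -> msi_sharing = true,
                                max_num_queues_vport = min(16, 32) = 16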
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 59 ++++++++++++++++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 40 ++++++++-----
include/net/mana/gdma.h | 13 +++-
3 files changed, 93 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index f3316e929175..4673ff62e6d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -149,6 +149,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_query_max_resources_resp resp = {};
struct gdma_general_req req = {};
+ unsigned int max_num_queues;
+ u8 bm_hostmode;
+ u16 num_ports;
int err;
mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
@@ -197,6 +200,43 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues == 0)
return -ENOSPC;
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+ MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION,
+ &num_ports, &bm_hostmode);
+ if (err)
+ return err;
+
+ if (!num_ports)
+ return -EINVAL;
+
+ /*
+ * Adjust the per-vPort max queue count to allow dedicated
+ * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
+ */
+ max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+ max_num_queues = rounddown_pow_of_two(max(max_num_queues, 1U));
+ if (max_num_queues < MANA_DEF_NUM_QUEUES)
+ max_num_queues = MANA_DEF_NUM_QUEUES;
+
+ /*
+ * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
+ * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
+ */
+ max_num_queues = min(gc->max_num_queues, max_num_queues);
+ if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+ gc->msi_sharing = true;
+
+ /* If MSI is shared, use max allowed value */
+ if (gc->msi_sharing)
+ gc->max_num_queues_vport = min(gc->num_msix_usable - 1,
+ gc->max_num_queues);
+ else
+ gc->max_num_queues_vport = max_num_queues;
+
+ dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+ gc->msi_sharing, gc->max_num_queues);
+
return 0;
}
@@ -1859,6 +1899,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
/* Need 1 interrupt for HWC */
max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
min_irqs = 2;
+ gc->msi_sharing = true;
}
nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1937,6 +1978,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
pci_free_irq_vectors(pdev);
+ bitmap_free(gc->msi_bitmap);
+ gc->msi_bitmap = NULL;
gc->max_num_msix = 0;
gc->num_msix_usable = 0;
}
@@ -1971,6 +2014,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
if (err)
goto destroy_hwc;
+ err = mana_gd_detect_devices(pdev);
+ if (err)
+ goto destroy_hwc;
+
err = mana_gd_query_max_resources(pdev);
if (err)
goto destroy_hwc;
@@ -1981,9 +2028,15 @@ static int mana_gd_setup(struct pci_dev *pdev)
goto destroy_hwc;
}
- err = mana_gd_detect_devices(pdev);
- if (err)
- goto destroy_hwc;
+ if (!gc->msi_sharing) {
+ gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+ if (!gc->msi_bitmap) {
+ err = -ENOMEM;
+ goto destroy_hwc;
+ }
+ /* Set bit for HWC */
+ set_bit(0, gc->msi_bitmap);
+ }
dev_dbg(&pdev->dev, "mana gdma setup successful\n");
return 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 2f3d619e0f2e..3f6cdc2cd82d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1007,10 +1007,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
return !apc->rxqs ? -ENOMEM : 0;
}
-static int mana_send_request(struct mana_context *ac, void *in_buf,
- u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_resp_hdr *resp = out_buf;
struct gdma_req_hdr *req = in_buf;
struct device *dev = gc->dev;
@@ -1044,6 +1043,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
return 0;
}
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
+{
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
const enum mana_command_code expected_code,
const u32 min_size)
@@ -1177,11 +1184,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
err, resp.hdr.status);
}
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
- u32 proto_minor_ver, u32 proto_micro_ver,
- u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct mana_query_device_cfg_resp resp = {};
struct mana_query_device_cfg_req req = {};
struct device *dev = gc->dev;
@@ -1196,7 +1202,8 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
req.proto_minor_ver = proto_minor_ver;
req.proto_micro_ver = proto_micro_ver;
- err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+ err = gdma_mana_send_request(gc, &req, sizeof(req),
+ &resp, sizeof(resp));
if (err) {
dev_err(dev, "Failed to query config: %d", err);
return err;
@@ -1230,8 +1237,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
else
*bm_hostmode = 0;
- debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
return 0;
}
@@ -3415,7 +3420,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
int err;
ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
- gc->max_num_queues);
+ gc->max_num_queues_vport);
if (!ndev)
return -ENOMEM;
@@ -3424,9 +3429,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc = netdev_priv(ndev);
apc->ac = ac;
apc->ndev = ndev;
- apc->max_queues = gc->max_num_queues;
+ apc->max_queues = gc->max_num_queues_vport;
/* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
- apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
+ apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
apc->port_handle = INVALID_MANA_HANDLE;
@@ -3690,13 +3695,18 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
- err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
- MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+ MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION,
+ &num_ports, &bm_hostmode);
if (err)
goto out;
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("adapter-MTU", 0400,
+ gc->mana_pci_debugfs, &gc->adapter_mtu);
+
if (!resuming) {
ac->num_ports = num_ports;
} else {
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6d836060976a..9c05b1e15c3e 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -399,8 +399,10 @@ struct gdma_context {
struct device *dev;
struct dentry *mana_pci_debugfs;
- /* Per-vPort max number of queues */
+ /* Hardware max number of queues */
unsigned int max_num_queues;
+ /* Per-vPort max number of queues */
+ unsigned int max_num_queues_vport;
unsigned int max_num_msix;
unsigned int num_msix_usable;
struct xarray irq_contexts;
@@ -446,6 +448,12 @@ struct gdma_context {
struct workqueue_struct *service_wq;
unsigned long flags;
+
+ /* Indicate if this device is sharing MSI for EQs on MANA */
+ bool msi_sharing;
+
+ /* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+ unsigned long *msi_bitmap;
};
static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -1018,4 +1026,7 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode);
#endif /* _GDMA_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-08 22:11 ` [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort Long Li
2026-05-08 22:11 ` [PATCH net-next v8 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
@ 2026-05-08 22:11 ` Long Li
2026-05-12 11:36 ` Paolo Abeni
2026-05-08 22:12 ` [PATCH net-next v8 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
` (2 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: Long Li @ 2026-05-08 22:11 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.
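A minimal usage sketch (error handling abbreviated; the signatures are the
ones introduced in this patch, and the refcount keeps a shared context alive
until the last put):

    struct gdma_irq_context *gic;
    int msi;

    /* dedicated mode: pick a free MSI from the bitmap, written back to msi */
    gic = mana_gd_get_gic(gc, true, &msi);
    if (!gic)
            return -ENOMEM;

    /* ... EQs attach to gic->eq_list; the Linux IRQ number is gic->irq ... */

    /* release with matching arguments; frees the context on last reference */
    mana_gd_put_gic(gc, true, msi);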
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 159 ++++++++++++++++++
include/net/mana/gdma.h | 12 ++
2 files changed, 171 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 4673ff62e6d9..78cb89c46ff3 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1618,6 +1618,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
return IRQ_HANDLED;
}
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map;
+ struct gdma_irq_context *gic;
+ int irq;
+
+ mutex_lock(&gc->gic_mutex);
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (WARN_ON(!gic)) {
+ mutex_unlock(&gc->gic_mutex);
+ return;
+ }
+
+ if (use_msi_bitmap)
+ gic->bitmap_refs--;
+
+ if (use_msi_bitmap && gic->bitmap_refs == 0)
+ clear_bit(msi, gc->msi_bitmap);
+
+ if (!refcount_dec_and_test(&gic->refcount))
+ goto out;
+
+ irq = gic->irq;
+
+ irq_update_affinity_hint(irq, NULL);
+ free_irq(irq, gic);
+
+ if (gic->dyn_msix) {
+ irq_map.virq = irq;
+ irq_map.index = msi;
+ pci_msix_free_irq(dev, irq_map);
+ }
+
+ xa_erase(&gc->irq_contexts, msi);
+ kfree(gic);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on an MSI vector.
+ * An MSI can be shared among different EQs. This function either picks a
+ * free MSI from the bitmap or uses the MSI index supplied by the caller.
+ *
+ * @use_msi_bitmap:
+ * True if the MSI is assigned by this function from a free slot in the bitmap.
+ * False if the MSI is passed in via *msi_requested.
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested)
+{
+ struct gdma_irq_context *gic;
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map = { };
+ int irq;
+ int msi;
+ int err;
+
+ mutex_lock(&gc->gic_mutex);
+
+ if (use_msi_bitmap) {
+ msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+ if (msi >= gc->num_msix_usable) {
+ dev_err(gc->dev, "No free MSI vectors available\n");
+ gic = NULL;
+ goto out;
+ }
+ *msi_requested = msi;
+ } else {
+ msi = *msi_requested;
+ }
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (gic) {
+ refcount_inc(&gic->refcount);
+ if (use_msi_bitmap) {
+ gic->bitmap_refs++;
+ set_bit(msi, gc->msi_bitmap);
+ }
+ goto out;
+ }
+
+ irq = pci_irq_vector(dev, msi);
+ if (irq == -EINVAL) {
+ irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+ if (!irq_map.virq) {
+ err = irq_map.index;
+ dev_err(gc->dev,
+ "Failed to alloc irq_map msi %d err %d\n",
+ msi, err);
+ gic = NULL;
+ goto out;
+ }
+ irq = irq_map.virq;
+ msi = irq_map.index;
+ }
+
+ gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+ if (!gic) {
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->handler = mana_gd_process_eq_events;
+ gic->msi = msi;
+ gic->irq = irq;
+ INIT_LIST_HEAD(&gic->eq_list);
+ spin_lock_init(&gic->lock);
+
+ if (!gic->msi)
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+ pci_name(dev));
+ else
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+ gic->msi, pci_name(dev));
+
+ err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+ if (err) {
+ dev_err(gc->dev, "Failed to request irq %d %s\n",
+ irq, gic->name);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->dyn_msix = !!irq_map.virq;
+ refcount_set(&gic->refcount, 1);
+ gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+ err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+ if (err) {
+ dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+ msi, err);
+ free_irq(irq, gic);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ if (use_msi_bitmap)
+ set_bit(msi, gc->msi_bitmap);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+ return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
{
r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2107,6 +2265,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto release_region;
mutex_init(&gc->eq_test_event_mutex);
+ mutex_init(&gc->gic_mutex);
pci_set_drvdata(pdev, gc);
gc->bar0_pa = pci_resource_start(pdev, 0);
gc->bar0_size = pci_resource_len(pdev, 0);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 9c05b1e15c3e..fbe3c1427b45 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -388,6 +388,11 @@ struct gdma_irq_context {
spinlock_t lock;
struct list_head eq_list;
char name[MANA_IRQ_NAME_SZ];
+ unsigned int msi;
+ unsigned int irq;
+ refcount_t refcount;
+ unsigned int bitmap_refs;
+ bool dyn_msix;
};
enum gdma_context_flags {
@@ -449,6 +454,9 @@ struct gdma_context {
unsigned long flags;
+ /* Protect access to GIC context */
+ struct mutex gic_mutex;
+
/* Indicate if this device is sharing MSI for EQs on MANA */
bool msi_sharing;
@@ -1026,6 +1034,10 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
u32 proto_minor_ver, u32 proto_micro_ver,
u16 *max_num_vports, u8 *bm_hostmode);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v8 4/6] net: mana: Use GIC functions to allocate global EQs
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (2 preceding siblings ...)
2026-05-08 22:11 ` [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
@ 2026-05-08 22:12 ` Long Li
2026-05-08 22:12 ` [PATCH net-next v8 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
2026-05-08 22:12 ` [PATCH net-next v8 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-08 22:12 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.
This changes the per-queue interrupt names in /proc/interrupts from
mana_q0, mana_q1, ... to mana_msi1, mana_msi2, ... to reflect the
MSI-X index rather than a zero-based queue number. The HWC interrupt
name (mana_hwc) is unchanged.
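For illustration, on a hypothetical device at PCI address 7870:00:00.0 with
two queue vectors, the /proc/interrupts names change roughly as follows:

    Before:  mana_hwc@pci:7870:00:00.0
             mana_q0@pci:7870:00:00.0
             mana_q1@pci:7870:00:00.0
    After:   mana_hwc@pci:7870:00:00.0   (unchanged)
             mana_msi1@pci:7870:00:00.0
             mana_msi2@pci:7870:00:00.0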
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 96 +++----------------
1 file changed, 13 insertions(+), 83 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 78cb89c46ff3..3408bc1fd6ab 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1878,7 +1878,7 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
bool skip_first_cpu = false;
- int *irqs, irq, err, i;
+ int *irqs, err, i;
irqs = kmalloc_objs(int, nvec);
if (!irqs)
@@ -1891,30 +1891,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
* further used in irq_setup()
*/
for (i = 1; i <= nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- /* one pci vector is already allocated for HWC */
- irqs[i - 1] = pci_irq_vector(pdev, i);
- if (irqs[i - 1] < 0) {
- err = irqs[i - 1];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i - 1] = gic->irq;
}
/*
@@ -1936,20 +1919,9 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
kfree(irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
- for (i -= 1; i > 0; i--) {
- irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
- }
+ for (i -= 1; i > 0; i--)
+ mana_gd_put_gic(gc, false, i);
kfree(irqs);
return err;
}
@@ -1958,7 +1930,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
- int *irqs, *start_irqs, irq;
+ int *irqs, *start_irqs;
unsigned int cpu;
int err, i;
@@ -1969,34 +1941,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
start_irqs = irqs;
for (i = 0; i < nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- if (!i)
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
- pci_name(pdev));
- else
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- irqs[i] = pci_irq_vector(pdev, i);
- if (irqs[i] < 0) {
- err = irqs[i];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
-
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i] = gic->irq;
}
/* If number of IRQ is one extra than number of online CPUs,
@@ -2025,20 +1976,9 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
kfree(start_irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
- for (i -= 1; i >= 0; i--) {
- irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
- }
+ for (i -= 1; i >= 0; i--)
+ mana_gd_put_gic(gc, false, i);
kfree(start_irqs);
return err;
@@ -2112,26 +2052,16 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
static void mana_gd_remove_irqs(struct pci_dev *pdev)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
- struct gdma_irq_context *gic;
- int irq, i;
+ int i;
if (gc->max_num_msix < 1)
return;
for (i = 0; i < gc->max_num_msix; i++) {
- irq = pci_irq_vector(pdev, i);
- if (irq < 0)
- continue;
-
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
+ if (!xa_load(&gc->irq_contexts, i))
continue;
- /* Need to clear the hint before free_irq */
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
pci_free_irq_vectors(pdev);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v8 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (3 preceding siblings ...)
2026-05-08 22:12 ` [PATCH net-next v8 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
@ 2026-05-08 22:12 ` Long Li
2026-05-08 22:12 ` [PATCH net-next v8 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-08 22:12 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.
The caller now owns the GIC reference across the EQ create/destroy
lifecycle: mana_create_eq() calls mana_gd_get_gic() before creating
each EQ and mana_destroy_eq() calls mana_gd_put_gic() after destroying
it. The msix_index invalidation is moved from mana_gd_deregister_irq()
to the mana_gd_create_eq() error path so that mana_destroy_eq() can
read the index before teardown.
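Condensed from the hunks below, the pairing looks like this (a sketch, not
literal code):

    /* create path, mana_create_eq() */
    msi = (i + 1) % gc->num_msix_usable;
    gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);
    spec.eq.msix_index = msi;
    err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
    if (err)
            mana_gd_put_gic(gc, !gc->msi_sharing, msi);

    /* destroy path, mana_destroy_eq(): read the index before teardown */
    msi = eq->eq.msix_index;
    mana_gd_destroy_queue(gc, eq);
    mana_gd_put_gic(gc, !gc->msi_sharing, msi);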
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 18 +++++++++++++++++-
include/net/mana/gdma.h | 1 +
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 3408bc1fd6ab..b70271a0624f 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -857,7 +857,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
}
spin_unlock_irqrestore(&gic->lock, flags);
- queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
synchronize_rcu();
}
@@ -972,6 +971,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
out:
dev_err(dev, "Failed to create EQ: %d\n", err);
mana_gd_destroy_eq(gc, false, queue);
+ queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
return err;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 3f6cdc2cd82d..42fd517e56d2 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1626,6 +1626,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
+ unsigned int msi;
if (!apc->eqs)
return;
@@ -1638,7 +1639,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
if (!eq)
continue;
+ msi = eq->eq.msix_index;
mana_gd_destroy_queue(gc, eq);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
}
kfree(apc->eqs);
@@ -1655,6 +1658,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+ debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
@@ -1665,6 +1669,8 @@ int mana_create_eq(struct mana_port_context *apc)
struct gdma_queue_spec spec = {};
int err;
int i;
+ int msi;
+ struct gdma_irq_context *gic;
WARN_ON(apc->eqs);
apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
@@ -1682,12 +1688,22 @@ int mana_create_eq(struct mana_port_context *apc)
debugfs_create_dir("EQs", apc->mana_port_debugfs);
for (i = 0; i < apc->num_queues; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ msi = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);
+ if (!gic) {
+ err = -ENOMEM;
+ goto out;
+ }
+ spec.eq.msix_index = msi;
+
err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
goto out;
}
+ apc->eqs[i].eq->eq.irq = gic->irq;
mana_create_eq_debugfs(apc, i);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index fbe3c1427b45..6c138cc77407 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -342,6 +342,7 @@ struct gdma_queue {
void *context;
unsigned int msix_index;
+ unsigned int irq;
u32 log2_throttle_limit;
} eq;
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v8 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (4 preceding siblings ...)
2026-05-08 22:12 ` [PATCH net-next v8 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
@ 2026-05-08 22:12 ` Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-08 22:12 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits. The RDMA EQs pass
use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the
capability flag advertises that the driver supports per-vPort EQ
separation when hardware has sufficient vectors.
Populate eq.irq on all RDMA EQs for consistency with the Ethernet path.
Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its
numeric BIT(6) position among the other capability flags.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 43 +++++++++++++++++++++++++------
include/net/mana/gdma.h | 7 +++--
2 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 8000ab6e8beb..7adab0457a66 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -749,7 +749,8 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
struct gdma_queue_spec spec = {};
- int err, i;
+ struct gdma_irq_context *gic;
+ int err, i, msi;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
@@ -757,11 +758,19 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.callback = mana_ib_event_handler;
spec.eq.context = mdev;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- spec.eq.msix_index = 0;
+
+ msi = 0;
+ gic = mana_gd_get_gic(gc, false, &msi);
+ if (!gic)
+ return -ENOMEM;
+ spec.eq.msix_index = msi;
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, 0);
return err;
+ }
+ mdev->fatal_err_eq->eq.irq = gic->irq;
mdev->eqs = kzalloc_objs(struct gdma_queue *,
mdev->ib_dev.num_comp_vectors);
@@ -771,32 +780,50 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
}
spec.eq.callback = NULL;
for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ msi = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, false, &msi);
+ if (!gic) {
+ err = -ENOMEM;
+ goto destroy_eqs;
+ }
+ spec.eq.msix_index = msi;
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, msi);
goto destroy_eqs;
+ }
+ mdev->eqs[i]->eq.irq = gic->irq;
}
return 0;
destroy_eqs:
- while (i-- > 0)
+ while (i-- > 0) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+ }
kfree(mdev->eqs);
destroy_fatal_eq:
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
return err;
}
void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
- int i;
+ int i, msi;
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
- for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+ for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ msi = (i + 1) % gc->num_msix_usable;
+ mana_gd_put_gic(gc, false, msi);
+ }
kfree(mdev->eqs);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6c138cc77407..d84e474309a3 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -615,6 +615,7 @@ enum {
#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
#define GDMA_DRV_CAP_FLAG_1_GDMA_PAGES_4MB_1GB_2GB BIT(4)
#define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
+#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver can handle holes (zeros) in the device list */
#define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
@@ -631,7 +632,8 @@ enum {
/* Driver detects stalled send queues and recovers them */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
-#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
#define GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE BIT(20)
@@ -659,7 +661,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
- GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
#define GDMA_DRV_CAP_FLAGS2 0
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort
2026-05-08 22:11 ` [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort Long Li
@ 2026-05-12 11:34 ` Paolo Abeni
0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2026-05-12 11:34 UTC (permalink / raw)
To: longli
Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
leon, haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
linux-rdma, linux-hyperv, linux-kernel
From: AI Reviewer <ai@example.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
Subject: net: mana: Create separate EQs for each vPort
This patch moves EQ ownership from struct mana_context to
struct mana_port_context, creates per-vPort EQs in mana_alloc_queues()
and mana_ib_cfg_vport(), and exports mana_create_eq()/mana_destroy_eq()
for use by the MANA RDMA driver.
A few questions below about the new lifetime and ordering rules for
mpc->eqs.
> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..8000ab6e8beb 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
> pd->vport_use_count--;
> WARN_ON(pd->vport_use_count < 0);
>
> - if (!pd->vport_use_count)
> + if (!pd->vport_use_count) {
> + mana_destroy_eq(mpc);
> mana_uncfg_vport(mpc);
> + }
>
> mutex_unlock(&pd->vport_mutex);
> }
[High]
After this change, mpc->eqs lifetime is governed by pd->vport_use_count,
and only raw QPs call mana_ib_cfg_vport()/mana_ib_uncfg_vport(). RSS
QPs in mana_ib_create_qp_rss() read mpc->eqs and latch eq->eq->id into
the HW CQ via cq_spec.attached_eq, but they do not bump
pd->vport_use_count.
So if a user creates a raw QP on a PD (use_count=1, mpc->eqs allocated),
then creates an RSS QP on the same PD whose CQs now reference those EQ
ids, and finally destroys the raw QP first, this path runs with
use_count dropping to 0 and calls mana_destroy_eq(mpc) -> kfree(apc->eqs)
and mana_gd_destroy_queue() for each EQ while the surviving RSS QP's HW
CQs are still attached to those EQs.
Is there an intended rule that enforces destruction ordering between raw
and RSS QPs on the same PD, or should RSS QPs also participate in the
refcount that guards mpc->eqs?
> @@ -55,15 +57,22 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
> return err;
> }
>
> - mutex_unlock(&pd->vport_mutex);
>
> pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> pd->tx_vp_offset = mpc->tx_vp_offset;
> + err = mana_create_eq(mpc);
> + if (err) {
> + mana_uncfg_vport(mpc);
> + pd->vport_use_count--;
> + }
>
> - ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> - mpc->port_handle, pd->pdn, doorbell_id);
> + mutex_unlock(&pd->vport_mutex);
>
> - return 0;
> + if (!err)
> + ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> + mpc->port_handle, pd->pdn, doorbell_id);
> +
> + return err;
> }
[Low]
pd->tx_shortform_allowed and pd->tx_vp_offset are written before
mana_create_eq() is attempted. On the mana_create_eq() failure branch,
mana_uncfg_vport(mpc) is called and pd->vport_use_count is decremented
back to 0, but pd->tx_shortform_allowed/pd->tx_vp_offset are left set to
the values derived from a vport whose HW-side configuration has been
reverted. Should these fields only be committed to pd after
mana_create_eq() succeeds?
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..6167742df9d2 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -171,7 +171,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
> cq_spec.gdma_region = cq->queue.gdma_region;
> cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
> cq_spec.modr_ctx_id = 0;
> - eq = &mpc->ac->eqs[cq->comp_vector];
> + /* EQs are created when a raw QP configures the vport.
> + * A raw QP must be created before creating rwq_ind_tbl.
> + */
> + if (!mpc->eqs) {
> + ret = -EINVAL;
> + i--;
> + goto fail;
> + }
> + eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
> cq_spec.attached_eq = eq->eq->id;
[Critical]
mana_ib_create_qp_rss() does not acquire pd->vport_mutex and does not
touch pd->vport_use_count. The NULL check above reads mpc->eqs without
synchronization; immediately after it, the same mpc->eqs is dereferenced
and eq->eq->id is latched into cq_spec.attached_eq.
Can the following sequence occur?
  thread A (creating RSS QP on pd)      thread B (destroying last raw QP on pd)

  mana_ib_create_qp_rss()
    if (!mpc->eqs)  /* non-NULL */
    ...
                                        mana_ib_uncfg_vport()
                                          mutex_lock(&pd->vport_mutex);
                                          if (!--pd->vport_use_count)
                                            mana_destroy_eq(mpc);
                                              kfree(apc->eqs);
    eq = &mpc->eqs[... % mpc->num_queues];
    cq_spec.attached_eq = eq->eq->id;   /* reads freed memory */
Is there something else serializing these paths that this reviewer is
missing?
[Medium]
The off-by-one on the error path looks inconsistent with the other fail
sites in this function. Every other goto fail in this loop leaves i
alone and relies on:
    while (i-- > 0) {
            ibwq = ind_tbl->ind_tbl[i];
            ...
            mana_ib_remove_cq_cb(mdev, cq);
            mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
    }
Here the new branch does ret = -EINVAL; i--; goto fail, which drops one
extra iteration of cleanup. If this check ever fires at i > 0 (for
example if the lifetime issue above is fixed by making mpc->eqs
observable as NULL mid-loop), the WQ object and installed CQ callback
created in the previous iteration are not torn down. Should this match
the other fail sites and omit the i--?
[Medium]
Before this patch ac->eqs was populated unconditionally in mana_probe()
and existed for the lifetime of the device, so mana_ib_create_qp_rss()
and mana_ib_create_qp_raw() always had EQs available regardless of
whether any netdev was up or any raw QP had been created.
After this patch, mpc->eqs is populated only by mana_alloc_queues()
(netdev open) or by mana_ib_cfg_vport() (which is only invoked from
mana_ib_create_qp_raw()). An RDMA user who creates an RSS QP on a PD
while the netdev is down and without a prior raw QP on the same PD will
now see -EINVAL from this new check, where the same call succeeded
before.
The commit message describes this work as preparatory and does not
mention this user-visible ordering requirement. Should this semantic
change be called out in the commit log, and should the implicit "a raw
QP must be created first" rule be documented somewhere user-facing?
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index b2faa7cf398f..131a997e1083 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1615,78 +1615,83 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
> }
> EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
>
> -static void mana_destroy_eq(struct mana_context *ac)
> +void mana_destroy_eq(struct mana_port_context *apc)
> {
> + struct mana_context *ac = apc->ac;
> struct gdma_context *gc = ac->gdma_dev->gdma_context;
> struct gdma_queue *eq;
> int i;
>
> - if (!ac->eqs)
> + if (!apc->eqs)
> return;
>
> - debugfs_remove_recursive(ac->mana_eqs_debugfs);
> - ac->mana_eqs_debugfs = NULL;
> + debugfs_remove_recursive(apc->mana_eqs_debugfs);
> + apc->mana_eqs_debugfs = NULL;
>
> - for (i = 0; i < gc->max_num_queues; i++) {
> - eq = ac->eqs[i].eq;
> + for (i = 0; i < apc->num_queues; i++) {
> + eq = apc->eqs[i].eq;
> if (!eq)
> continue;
>
> mana_gd_destroy_queue(gc, eq);
> }
[High]
The loop bound is now apc->num_queues rather than the immutable
gc->max_num_queues. mana_create_eq() allocates apc->eqs sized to
apc->num_queues, and mana_destroy_eq() walks the same count.
When RDMA has called mana_ib_cfg_vport() -> mana_create_eq() while the
netdev is down, apc->num_queues can still be mutated underneath it.
mana_detach(false) skips mana_dealloc_queues() when port_st_save is
false, and then mana_set_channels() (ethtool -L) writes
apc->num_queues = new_count while mpc->eqs is already held by the RDMA
side. mana_attach() then also skips mana_alloc_queues().
When the RDMA user later destroys the last raw QP and this function
runs, i iterates 0..new_count-1. If new_count is larger than the count
used at kzalloc_objs() time, apc->eqs is read out of bounds and
potentially passed to mana_gd_destroy_queue(). If new_count is smaller,
the tail EQs (real gdma_queue objects with MSI-X and DMA regions) are
leaked.
Before this patch both sides used gc->max_num_queues so no mismatch was
possible. Should apc->num_queues be snapshotted into apc at
mana_create_eq() time (or should changing num_queues be forbidden while
mpc->eqs is populated by the RDMA path)?
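For example (a sketch only; num_eqs is a hypothetical new field):

    /* mana_create_eq(): snapshot the size actually allocated */
    apc->num_eqs = apc->num_queues;
    apc->eqs = kzalloc_objs(struct mana_eq, apc->num_eqs);

    /* mana_destroy_eq(): walk the snapshot, not the mutable num_queues */
    for (i = 0; i < apc->num_eqs; i++)
            ...;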
> -static int mana_create_eq(struct mana_context *ac)
> +int mana_create_eq(struct mana_port_context *apc)
> {
> - struct gdma_dev *gd = ac->gdma_dev;
> + struct gdma_dev *gd = apc->ac->gdma_dev;
> struct gdma_context *gc = gd->gdma_context;
> struct gdma_queue_spec spec = {};
> int err;
> int i;
>
> - ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
> - if (!ac->eqs)
> + WARN_ON(apc->eqs);
> + apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
> + if (!apc->eqs)
> return -ENOMEM;
[Low]
WARN_ON(apc->eqs) here is advisory only; execution continues into
apc->eqs = kzalloc_objs(...), which unconditionally overwrites any
non-NULL pointer. If the invariant is ever violated by a future caller
or error path, the prior EQ array and its gdma_queue objects are
orphaned (kernel memory and firmware EQ state leaked) with only a WARN
splat as the signal.
Would "if (WARN_ON(apc->eqs)) return -EEXIST;" express the intent more
safely?
--
This is an AI-generated review.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
2026-05-08 22:11 ` [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
@ 2026-05-12 11:36 ` Paolo Abeni
0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2026-05-12 11:36 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Leon Romanovsky,
Haiyang Zhang, K . Y . Srinivasan, Wei Liu, Dexuan Cui,
shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
On 5/9/26 12:11 AM, Long Li wrote:
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 4673ff62e6d9..78cb89c46ff3 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1618,6 +1618,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
> return IRQ_HANDLED;
> }
>
> +void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
> +{
> + struct pci_dev *dev = to_pci_dev(gc->dev);
> + struct msi_map irq_map;
> + struct gdma_irq_context *gic;
> + int irq;
Since a new revision is needed, please fix the reverse christmas tree
above and elsewhere, thanks!
/P
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-05-12 11:36 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-08 22:11 [PATCH net-next v8 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-08 22:11 ` [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort Long Li
2026-05-12 11:34 ` Paolo Abeni
2026-05-08 22:11 ` [PATCH net-next v8 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
2026-05-08 22:11 ` [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
2026-05-12 11:36 ` Paolo Abeni
2026-05-08 22:12 ` [PATCH net-next v8 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
2026-05-08 22:12 ` [PATCH net-next v8 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
2026-05-08 22:12 ` [PATCH net-next v8 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox