* [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
@ 2026-05-13 22:09 Long Li
2026-05-13 22:09 ` [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort Long Li
` (5 more replies)
0 siblings, 6 replies; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
This series moves EQ ownership from the shared mana_context to the
per-vPort mana_port_context, enabling each vPort to have dedicated MSI-X
vectors when the hardware provides enough vectors. When vectors are
limited, the driver falls back to sharing MSI-X among vPorts.
The series introduces a GDMA IRQ Context (GIC) abstraction with reference
counting to manage interrupt context lifecycle. This allows both Ethernet
and RDMA EQs to dynamically acquire dedicated or shared MSI-X vectors at
vPort creation time rather than pre-allocating all vectors at probe time.
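In rough terms, the resulting ownership looks like this (a sketch
assembled from the individual patches; see each patch for the exact
fields):

  gdma_context        msi_bitmap + xarray of refcounted GICs (one per MSI-X)
  mana_port_context   per-vPort EQ array (apc->eqs), sized to num_queues
  each EQ             one GIC reference, taken at create and dropped at destroy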
This series touches both the net and RDMA MANA drivers and is intended
to go through the net-next tree. The patches are available on a shared
branch for both netdev and RDMA maintainers to review.
The following changes since commit 73d587ae684d176fac9db94173f77d78a794ea4f:
net: ethtool: fix missing closing paren in rings_reply_size() (2026-05-11 18:42:25 -0700)
are available in the Git repository at:
https://github.com/longlimsft/linux.git tags/mana-eq-msi-v9
for you to fetch changes up to 8249f52c3a065d92d24f27ab12c0b4d197ba14c4:
RDMA/mana_ib: Allocate interrupt contexts on EQs
Changes in v9:
- RSS QPs now take a vport reference via pd->vport_use_count to ensure
EQs outlive all QP consumers. EQs are only destroyed when the last
QP (raw or RSS) on the PD releases its reference (patch 1)
- Serialize mana_set_channels() against RDMA vport configuration via
apc->vport_mutex when the port is down. When the port is up, Ethernet
owns the vport exclusively so no locking is needed (patch 1)
- Change WARN_ON(apc->eqs) to bail out with -EEXIST to prevent
leaking prior EQ array if invariant is violated (patch 1)
- Only commit pd->tx_shortform_allowed and pd->tx_vp_offset after
mana_create_eq() succeeds (patch 1)
- Reset gc->msi_sharing at the top of mana_gd_query_max_resources()
so it is recomputed from current hardware state on resume (patch 2)
- Fix reverse Christmas tree variable declaration ordering (patches
1, 3, 5)
Changes in v8:
- Fix comment to reference per-vPort queue count instead of
gc->max_num_queues (patch 2)
- Remove duplicate irq_update_affinity_hint() calls from error paths
and mana_gd_remove_irqs(); the clearing is now centralized in
mana_gd_put_gic() (patch 4)
- Note the IRQ name change (mana_q -> mana_msi) in the commit
message (patch 4)
- Remove dead conditional write to spec.eq.msix_index (patch 5)
- Document GIC ownership contract and msix_index invariant change
in commit message (patch 5)
- Populate eq.irq on RDMA EQs for consistency with the Ethernet
path (patch 6)
- Document BIT(6) relocation and capability flag semantics in
commit message (patch 6)
- Fix checkpatch --strict alignment and line length warnings
Changes in v7:
- Use rounddown_pow_of_two() instead of roundup_pow_of_two() when
computing per-vPort queue count to avoid unnecessarily forcing shared
MSI-X mode (patch 2)
- Call mana_gd_setup_remaining_irqs() unconditionally to ensure
irq_contexts are populated in both dedicated and shared MSI-X modes,
fixing bisectability between patches 2 and 5 (patch 2)
- Guard ibdev_dbg() in mana_ib_cfg_vport() with error check so the
vport handle is not logged on the failure path (patch 1)
- Use cached gic->irq instead of pci_irq_vector() lookup in
mana_gd_put_gic() for consistency with the allocation path (patch 3)
- Fix unsigned int* to int* pointer type mismatch when calling
mana_gd_get_gic() by using a local int variable for the MSI index
(patches 5, 6)
Changes in v6:
- Rebased on net-next/main (v7.1-rc1)
Changes in v5:
- Rebased on net-next/main
Changes in v4:
- Rebased on net-next/main 7.0-rc4
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
pci_msix_can_alloc_dyn() on each call; improved remove_irqs iteration
to skip unallocated entries
Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
prevent NULL pointer dereference when RSS QP is created before a raw QP
has configured the vport and allocated EQs
Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")
Long Li (6):
net: mana: Create separate EQs for each vPort
net: mana: Query device capabilities and configure MSI-X sharing for
EQs
net: mana: Introduce GIC context with refcounting for interrupt
management
net: mana: Use GIC functions to allocate global EQs
net: mana: Allocate interrupt context for each EQ when creating vPort
RDMA/mana_ib: Allocate interrupt contexts on EQs
drivers/infiniband/hw/mana/main.c | 67 +++-
drivers/infiniband/hw/mana/qp.c | 37 +-
.../net/ethernet/microsoft/mana/gdma_main.c | 323 +++++++++++++-----
drivers/net/ethernet/microsoft/mana/mana_en.c | 170 +++++----
.../ethernet/microsoft/mana/mana_ethtool.c | 27 +-
include/net/mana/gdma.h | 33 +-
include/net/mana/mana.h | 7 +-
7 files changed, 488 insertions(+), 176 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
@ 2026-05-13 22:09 ` Long Li
2026-05-14 22:10 ` sashiko-bot
2026-05-13 22:09 ` [PATCH net-next v9 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
` (4 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
RSS QPs now take a vport reference via pd->vport_use_count to ensure
EQs outlive all QP consumers. The vport must already be configured by
a raw QP before an RSS QP can be created. EQs are only destroyed when
the last QP (raw or RSS) on the PD releases its reference.
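In sketch form, the intended lifetime is (as implemented in this patch):

  raw QP create  -> mana_ib_cfg_vport():   vport_use_count 0 -> 1,
                                           mana_cfg_vport() + mana_create_eq()
  RSS QP create  -> vport_use_count++      (fails with -EINVAL if still 0)
  QP destroy     -> mana_ib_uncfg_vport(): vport_use_count--
  last reference -> mana_destroy_eq() followed by mana_uncfg_vport()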
Serialize mana_set_channels() against RDMA vport configuration to
prevent num_queues from changing while RDMA holds EQs sized to the
current value. When the port is down, apc->vport_mutex is held for
the entire operation since mana_detach()/mana_attach() do not take
vport_mutex in that case. When the port is up, Ethernet owns the
vport exclusively so no additional locking is needed.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 24 ++--
drivers/infiniband/hw/mana/qp.c | 37 +++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 112 +++++++++++-------
.../ethernet/microsoft/mana/mana_ethtool.c | 27 ++++-
include/net/mana/mana.h | 7 +-
5 files changed, 145 insertions(+), 62 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index ac5e75dd3494..6159bd03a021 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
pd->vport_use_count--;
WARN_ON(pd->vport_use_count < 0);
- if (!pd->vport_use_count)
+ if (!pd->vport_use_count) {
+ mana_destroy_eq(mpc);
mana_uncfg_vport(mpc);
+ }
mutex_unlock(&pd->vport_mutex);
}
@@ -55,15 +57,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
return err;
}
- mutex_unlock(&pd->vport_mutex);
- pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
- pd->tx_vp_offset = mpc->tx_vp_offset;
+ err = mana_create_eq(mpc);
+ if (err) {
+ mana_uncfg_vport(mpc);
+ pd->vport_use_count--;
+ } else {
+ pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
+ pd->tx_vp_offset = mpc->tx_vp_offset;
+ }
- ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
- mpc->port_handle, pd->pdn, doorbell_id);
+ mutex_unlock(&pd->vport_mutex);
- return 0;
+ if (!err)
+ ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
+ mpc->port_handle, pd->pdn, doorbell_id);
+
+ return err;
}
int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 0fbcf449c134..108ec4c5ce51 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -79,6 +79,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
struct ib_qp_init_attr *attr,
struct ib_udata *udata)
{
+ struct mana_ib_pd *mana_pd = container_of(pd, struct mana_ib_pd, ibpd);
struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
struct mana_ib_dev *mdev =
container_of(pd->device, struct mana_ib_dev, ib_dev);
@@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
qp->port = port;
+ /* Take a reference on the vport to ensure EQs outlive this QP.
+ * The vport must already be configured by a raw QP.
+ */
+ mutex_lock(&mana_pd->vport_mutex);
+ if (!mana_pd->vport_use_count) {
+ mutex_unlock(&mana_pd->vport_mutex);
+ ret = -EINVAL;
+ goto fail;
+ }
+ mana_pd->vport_use_count++;
+ mutex_unlock(&mana_pd->vport_mutex);
+
for (i = 0; i < ind_tbl_size; i++) {
struct mana_obj_spec wq_spec = {};
struct mana_obj_spec cq_spec = {};
@@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
cq_spec.gdma_region = cq->queue.gdma_region;
cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- eq = &mpc->ac->eqs[cq->comp_vector];
+ eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
&wq_spec, &cq_spec, &wq->rx_object);
if (ret)
- goto fail;
+ goto free_vport;
/* The GDMA regions are now owned by the WQ object */
wq->queue.gdma_region = GDMA_INVALID_DMA_REGION;
@@ -199,7 +212,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
ret = mana_ib_install_cq_cb(mdev, cq);
if (ret) {
mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
- goto fail;
+ goto free_vport;
}
}
resp.num_entries = i;
@@ -210,7 +223,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
ucmd.rx_hash_key_len,
ucmd.rx_hash_key);
if (ret)
- goto fail;
+ goto free_vport;
ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
if (ret) {
@@ -226,7 +239,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
err_disable_vport_rx:
mana_disable_vport_rx(mpc);
-fail:
+free_vport:
while (i-- > 0) {
ibwq = ind_tbl->ind_tbl[i];
ibcq = ibwq->cq;
@@ -237,6 +250,9 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
}
+ mana_ib_uncfg_vport(mdev, mana_pd, port);
+
+fail:
kfree(mana_ind_table);
return ret;
@@ -321,7 +337,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
eq_vec = send_cq->comp_vector;
- eq = &mpc->ac->eqs[eq_vec];
+ if (!mpc->eqs) {
+ err = -EINVAL;
+ goto err_destroy_queue;
+ }
+ eq = &mpc->eqs[eq_vec % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
@@ -785,14 +805,17 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
{
struct mana_ib_dev *mdev =
container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+ struct ib_pd *ibpd = qp->ibqp.pd;
struct mana_port_context *mpc;
struct net_device *ndev;
+ struct mana_ib_pd *pd;
struct mana_ib_wq *wq;
struct ib_wq *ibwq;
int i;
ndev = mana_ib_get_netdev(qp->ibqp.device, qp->port);
mpc = netdev_priv(ndev);
+ pd = container_of(ibpd, struct mana_ib_pd, ibpd);
/* Disable vPort RX steering before destroying RX WQ objects.
* Otherwise firmware still routes traffic to the destroyed queues,
@@ -817,6 +840,8 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
}
+ mana_ib_uncfg_vport(mdev, pd, qp->port);
+
return 0;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b2faa7cf398f..f1f6f7940b61 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1615,78 +1615,84 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
}
EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
{
+ struct mana_context *ac = apc->ac;
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
- if (!ac->eqs)
+ if (!apc->eqs)
return;
- debugfs_remove_recursive(ac->mana_eqs_debugfs);
- ac->mana_eqs_debugfs = NULL;
+ debugfs_remove_recursive(apc->mana_eqs_debugfs);
+ apc->mana_eqs_debugfs = NULL;
- for (i = 0; i < gc->max_num_queues; i++) {
- eq = ac->eqs[i].eq;
+ for (i = 0; i < apc->num_queues; i++) {
+ eq = apc->eqs[i].eq;
if (!eq)
continue;
mana_gd_destroy_queue(gc, eq);
}
- kfree(ac->eqs);
- ac->eqs = NULL;
+ kfree(apc->eqs);
+ apc->eqs = NULL;
}
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
{
- struct mana_eq eq = ac->eqs[i];
+ struct mana_eq eq = apc->eqs[i];
char eqnum[32];
sprintf(eqnum, "eq%d", i);
- eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+ eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
{
- struct gdma_dev *gd = ac->gdma_dev;
+ struct gdma_dev *gd = apc->ac->gdma_dev;
struct gdma_context *gc = gd->gdma_context;
struct gdma_queue_spec spec = {};
int err;
int i;
- ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
- if (!ac->eqs)
+ if (WARN_ON(apc->eqs))
+ return -EEXIST;
+ apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
+ if (!apc->eqs)
return -ENOMEM;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
spec.queue_size = EQ_SIZE;
spec.eq.callback = NULL;
- spec.eq.context = ac->eqs;
+ spec.eq.context = apc->eqs;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+ apc->mana_eqs_debugfs =
+ debugfs_create_dir("EQs", apc->mana_port_debugfs);
- for (i = 0; i < gc->max_num_queues; i++) {
+ for (i = 0; i < apc->num_queues; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
- err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+ err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
goto out;
}
- mana_create_eq_debugfs(ac, i);
+ mana_create_eq_debugfs(apc, i);
}
return 0;
out:
- mana_destroy_eq(ac);
+ mana_destroy_eq(apc);
return err;
}
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
{
@@ -2451,7 +2457,7 @@ static int mana_create_txq(struct mana_port_context *apc,
spec.monitor_avl_buf = false;
spec.queue_size = cq_size;
spec.cq.callback = mana_schedule_napi;
- spec.cq.parent_eq = ac->eqs[i].eq;
+ spec.cq.parent_eq = apc->eqs[i].eq;
spec.cq.context = cq;
err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
if (err)
@@ -2844,13 +2850,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
static int mana_add_rx_queues(struct mana_port_context *apc,
struct net_device *ndev)
{
- struct mana_context *ac = apc->ac;
struct mana_rxq *rxq;
int err = 0;
int i;
for (i = 0; i < apc->num_queues; i++) {
- rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+ rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
if (!rxq) {
err = -ENOMEM;
netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2869,9 +2874,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
return err;
}
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
{
- struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_rxq *rxq;
u32 rxq_idx;
@@ -2883,8 +2887,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
mana_destroy_rxq(apc, rxq, true);
apc->rxqs[rxq_idx] = NULL;
}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+ struct gdma_dev *gd = apc->ac->gdma_dev;
- mana_destroy_txq(apc);
mana_uncfg_vport(apc);
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2905,11 +2913,7 @@ static int mana_create_vport(struct mana_port_context *apc,
return err;
}
- err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
- if (err)
- return err;
-
- return mana_create_txq(apc, net);
+ return mana_cfg_vport(apc, gd->pdid, gd->doorbell);
}
static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3195,21 +3199,36 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_create_vport(apc, ndev);
if (err) {
- netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+ netdev_err(ndev, "Failed to create vPort %u : %d\n",
+ apc->port_idx, err);
return err;
}
+ err = mana_create_eq(apc);
+ if (err) {
+ netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_vport;
+ }
+
+ err = mana_create_txq(apc, ndev);
+ if (err) {
+ netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_eq;
+ }
+
err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
if (err) {
netdev_err(ndev,
"netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_txq;
}
err = mana_add_rx_queues(apc, ndev);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
@@ -3218,7 +3237,7 @@ int mana_alloc_queues(struct net_device *ndev)
netdev_err(ndev,
"netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_rss_table_init(apc);
@@ -3226,19 +3245,25 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
if (err) {
netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
- goto destroy_vport;
+ goto destroy_rxq;
}
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
err = mana_pf_register_filter(apc);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_chn_setxdp(apc, mana_xdp_get(apc));
return 0;
+destroy_rxq:
+ mana_destroy_rxqs(apc);
+destroy_txq:
+ mana_destroy_txq(apc);
+destroy_eq:
+ mana_destroy_eq(apc);
destroy_vport:
mana_destroy_vport(apc);
return err;
@@ -3343,6 +3368,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
mana_fence_rqs(apc);
/* Even in err case, still need to cleanup the vPort */
+ mana_destroy_rxqs(apc);
+ mana_destroy_txq(apc);
+ mana_destroy_eq(apc);
mana_destroy_vport(apc);
return 0;
@@ -3663,12 +3691,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
- err = mana_create_eq(ac);
- if (err) {
- dev_err(dev, "Failed to create EQs: %d\n", err);
- goto out;
- }
-
err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
@@ -3808,8 +3830,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
free_netdev(ndev);
}
- mana_destroy_eq(ac);
-
if (ac->per_port_queue_reset_wq) {
destroy_workqueue(ac->per_port_queue_reset_wq);
ac->per_port_queue_reset_wq = NULL;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 04350973e19e..e121834d17f3 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -454,18 +454,40 @@ static int mana_set_coalesce(struct net_device *ndev,
return err;
}
+/* mana_set_channels - change the number of queues on a port
+ *
+ * Returns -EBUSY if the port is down and RDMA holds the vport with
+ * EQs sized to the current num_queues.
+ */
static int mana_set_channels(struct net_device *ndev,
struct ethtool_channels *channels)
{
struct mana_port_context *apc = netdev_priv(ndev);
unsigned int new_count = channels->combined_count;
unsigned int old_count = apc->num_queues;
+ bool locked = false;
int err;
+ /* When the port is down, hold vport_mutex for the entire
+ * operation to serialize against RDMA's mana_cfg_vport().
+ * This is safe because mana_detach()/mana_attach() skip
+ * vport teardown/setup when port_st_save is false.
+ * When the port is up, Ethernet owns the vport exclusively
+ * so no locking against RDMA is needed.
+ */
+ if (!apc->port_is_up) {
+ mutex_lock(&apc->vport_mutex);
+ if (apc->vport_use_count) {
+ mutex_unlock(&apc->vport_mutex);
+ return -EBUSY;
+ }
+ locked = true;
+ }
+
err = mana_pre_alloc_rxbufs(apc, ndev->mtu, new_count);
if (err) {
netdev_err(ndev, "Insufficient memory for new allocations");
- return err;
+ goto unlock;
}
err = mana_detach(ndev, false);
@@ -483,6 +505,9 @@ static int mana_set_channels(struct net_device *ndev,
out:
mana_pre_dealloc_rxbufs(apc);
+unlock:
+ if (locked)
+ mutex_unlock(&apc->vport_mutex);
return err;
}
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aa90a858c8e3..c8e7d16f6685 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,8 +480,6 @@ struct mana_context {
u8 bm_hostmode;
struct mana_ethtool_hc_stats hc_stats;
- struct mana_eq *eqs;
- struct dentry *mana_eqs_debugfs;
struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
@@ -501,6 +499,9 @@ struct mana_port_context {
u8 mac_addr[ETH_ALEN];
+ struct mana_eq *eqs;
+ struct dentry *mana_eqs_debugfs;
+
enum TRI_STATE rss_state;
mana_handle_t default_rxobj;
@@ -1034,6 +1035,8 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
u32 doorbell_pg_id);
void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
struct net_device *mana_get_primary_netdev(struct mana_context *ac,
u32 port_index,
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v9 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-13 22:09 ` [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort Long Li
@ 2026-05-13 22:09 ` Long Li
2026-05-14 22:10 ` sashiko-bot
2026-05-13 22:09 ` [PATCH net-next v9 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
` (3 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The per-vPort queue count is
clamped to no less than MANA_DEF_NUM_QUEUES but will not exceed the
hardware maximum reported by the device.
MSI-X sharing among vPorts is enabled when there are not enough MSI-X
vectors for dedicated allocation, or when the platform does not support
dynamic MSI-X allocation (in which case all vectors are pre-allocated
at probe time and sharing is always used). The msi_sharing flag is
reset at the top of mana_gd_query_max_resources() so it is recomputed
from current hardware state on each probe or resume cycle.
A device reporting zero ports now results in a fatal probe error since
the per-vPort MSI-X math requires at least one port.
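As a worked example with hypothetical numbers (taking MANA_DEF_NUM_QUEUES
as 16, per the v4 changelog note): with num_msix_usable = 64 and 2 ports,
(64 - 1) / 2 = 31 rounds down to 16 queues per vPort, and 16 * 2 = 32 fits
in the 63 non-HWC vectors, so dedicated MSI-X is used. With 4 ports,
(64 - 1) / 4 = 15 rounds down to 8 and is raised to the 16-queue floor;
16 * 4 = 64 exceeds 63, so msi_sharing is enabled.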
Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 66 ++++++++++++++++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 40 ++++++-----
include/net/mana/gdma.h | 13 +++-
3 files changed, 100 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 3bc3fff55999..bbd055849e36 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -179,8 +179,18 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_query_max_resources_resp resp = {};
struct gdma_general_req req = {};
+ unsigned int max_num_queues;
+ u8 bm_hostmode;
+ u16 num_ports;
int err;
+ /* Reset msi_sharing so it is recomputed from current hardware
+ * state. On resume, num_online_cpus() or num_msix_usable may
+ * have changed, making dedicated MSI-X feasible where it was
+ * not before.
+ */
+ gc->msi_sharing = false;
+
mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
sizeof(req), sizeof(resp));
@@ -227,6 +237,43 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues == 0)
return -ENOSPC;
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+ MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION,
+ &num_ports, &bm_hostmode);
+ if (err)
+ return err;
+
+ if (!num_ports)
+ return -EINVAL;
+
+ /*
+ * Adjust the per-vPort max queue count to allow dedicated
+ * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
+ */
+ max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+ max_num_queues = rounddown_pow_of_two(max(max_num_queues, 1U));
+ if (max_num_queues < MANA_DEF_NUM_QUEUES)
+ max_num_queues = MANA_DEF_NUM_QUEUES;
+
+ /*
+ * Use dedicated MSIx for EQs whenever possible; use MSIx sharing for
+ * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1).
+ */
+ max_num_queues = min(gc->max_num_queues, max_num_queues);
+ if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+ gc->msi_sharing = true;
+
+ /* If MSI is shared, use max allowed value */
+ if (gc->msi_sharing)
+ gc->max_num_queues_vport = min(gc->num_msix_usable - 1,
+ gc->max_num_queues);
+ else
+ gc->max_num_queues_vport = max_num_queues;
+
+ dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+ gc->msi_sharing, gc->max_num_queues);
+
return 0;
}
@@ -1889,6 +1936,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
/* Need 1 interrupt for HWC */
max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
min_irqs = 2;
+ gc->msi_sharing = true;
}
nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1967,6 +2015,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
pci_free_irq_vectors(pdev);
+ bitmap_free(gc->msi_bitmap);
+ gc->msi_bitmap = NULL;
gc->max_num_msix = 0;
gc->num_msix_usable = 0;
}
@@ -2001,6 +2051,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
if (err)
goto destroy_hwc;
+ err = mana_gd_detect_devices(pdev);
+ if (err)
+ goto destroy_hwc;
+
err = mana_gd_query_max_resources(pdev);
if (err)
goto destroy_hwc;
@@ -2011,9 +2065,15 @@ static int mana_gd_setup(struct pci_dev *pdev)
goto destroy_hwc;
}
- err = mana_gd_detect_devices(pdev);
- if (err)
- goto destroy_hwc;
+ if (!gc->msi_sharing) {
+ gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+ if (!gc->msi_bitmap) {
+ err = -ENOMEM;
+ goto destroy_hwc;
+ }
+ /* Set bit for HWC */
+ set_bit(0, gc->msi_bitmap);
+ }
dev_dbg(&pdev->dev, "mana gdma setup successful\n");
return 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index f1f6f7940b61..3ee74e7e300c 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1007,10 +1007,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
return !apc->rxqs ? -ENOMEM : 0;
}
-static int mana_send_request(struct mana_context *ac, void *in_buf,
- u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_resp_hdr *resp = out_buf;
struct gdma_req_hdr *req = in_buf;
struct device *dev = gc->dev;
@@ -1044,6 +1043,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
return 0;
}
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
+{
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
const enum mana_command_code expected_code,
const u32 min_size)
@@ -1177,11 +1184,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
err, resp.hdr.status);
}
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
- u32 proto_minor_ver, u32 proto_micro_ver,
- u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct mana_query_device_cfg_resp resp = {};
struct mana_query_device_cfg_req req = {};
struct device *dev = gc->dev;
@@ -1196,7 +1202,8 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
req.proto_minor_ver = proto_minor_ver;
req.proto_micro_ver = proto_micro_ver;
- err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+ err = gdma_mana_send_request(gc, &req, sizeof(req),
+ &resp, sizeof(resp));
if (err) {
dev_err(dev, "Failed to query config: %d", err);
return err;
@@ -1230,8 +1237,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
else
*bm_hostmode = 0;
- debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
return 0;
}
@@ -3416,7 +3421,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
int err;
ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
- gc->max_num_queues);
+ gc->max_num_queues_vport);
if (!ndev)
return -ENOMEM;
@@ -3425,9 +3430,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc = netdev_priv(ndev);
apc->ac = ac;
apc->ndev = ndev;
- apc->max_queues = gc->max_num_queues;
+ apc->max_queues = gc->max_num_queues_vport;
/* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
- apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
+ apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
apc->port_handle = INVALID_MANA_HANDLE;
@@ -3691,13 +3696,18 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
- err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
- MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
+ MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION,
+ &num_ports, &bm_hostmode);
if (err)
goto out;
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("adapter-MTU", 0400,
+ gc->mana_pci_debugfs, &gc->adapter_mtu);
+
if (!resuming) {
ac->num_ports = num_ports;
} else {
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6d836060976a..9c05b1e15c3e 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -399,8 +399,10 @@ struct gdma_context {
struct device *dev;
struct dentry *mana_pci_debugfs;
- /* Per-vPort max number of queues */
+ /* Hardware max number of queues */
unsigned int max_num_queues;
+ /* Per-vPort max number of queues */
+ unsigned int max_num_queues_vport;
unsigned int max_num_msix;
unsigned int num_msix_usable;
struct xarray irq_contexts;
@@ -446,6 +448,12 @@ struct gdma_context {
struct workqueue_struct *service_wq;
unsigned long flags;
+
+ /* Indicate if this device is sharing MSI for EQs on MANA */
+ bool msi_sharing;
+
+ /* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+ unsigned long *msi_bitmap;
};
static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -1018,4 +1026,7 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode);
#endif /* _GDMA_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v9 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-13 22:09 ` [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort Long Li
2026-05-13 22:09 ` [PATCH net-next v9 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
@ 2026-05-13 22:09 ` Long Li
2026-05-13 22:09 ` [PATCH net-next v9 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This lets the driver create an interrupt context on a
pre-allocated or dynamically allocated MSI-X vector and share it across
multiple EQ consumers.
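A minimal usage sketch of the new API (consumer side; gc and the error
handling around it are assumed from the caller's context):

	struct gdma_irq_context *gic;
	int msi = 3;	/* caller-chosen MSI-X index for this example */

	/* Acquire: takes a reference; allocates and requests the IRQ on
	 * first use of this MSI-X index.
	 */
	gic = mana_gd_get_gic(gc, false, &msi);
	if (!gic)
		return -ENOMEM;

	/* ... attach EQs to gic->eq_list, use gic->irq for affinity ... */

	/* Release: drops the reference; the IRQ is freed when it reaches
	 * zero. Pass use_msi_bitmap = true on both calls instead to have
	 * the function pick a free vector from gc->msi_bitmap.
	 */
	mana_gd_put_gic(gc, false, msi);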
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 159 ++++++++++++++++++
include/net/mana/gdma.h | 12 ++
2 files changed, 171 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index bbd055849e36..fdd2ef24414b 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1655,6 +1655,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
return IRQ_HANDLED;
}
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct gdma_irq_context *gic;
+ struct msi_map irq_map;
+ int irq;
+
+ mutex_lock(&gc->gic_mutex);
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (WARN_ON(!gic)) {
+ mutex_unlock(&gc->gic_mutex);
+ return;
+ }
+
+ if (use_msi_bitmap)
+ gic->bitmap_refs--;
+
+ if (use_msi_bitmap && gic->bitmap_refs == 0)
+ clear_bit(msi, gc->msi_bitmap);
+
+ if (!refcount_dec_and_test(&gic->refcount))
+ goto out;
+
+ irq = gic->irq;
+
+ irq_update_affinity_hint(irq, NULL);
+ free_irq(irq, gic);
+
+ if (gic->dyn_msix) {
+ irq_map.virq = irq;
+ irq_map.index = msi;
+ pci_msix_free_irq(dev, irq_map);
+ }
+
+ xa_erase(&gc->irq_contexts, msi);
+ kfree(gic);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on an MSI vector.
+ * An MSI can be shared between different EQs; this function supports setting
+ * up separate MSIs using a bitmap, or directly using a given MSI index.
+ *
+ * @use_msi_bitmap:
+ * True if the MSI is assigned by this function from free slots in the bitmap.
+ * False if the MSI index is passed in via *msi_requested
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested)
+{
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct gdma_irq_context *gic;
+ struct msi_map irq_map = { };
+ int irq;
+ int msi;
+ int err;
+
+ mutex_lock(&gc->gic_mutex);
+
+ if (use_msi_bitmap) {
+ msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+ if (msi >= gc->num_msix_usable) {
+ dev_err(gc->dev, "No free MSI vectors available\n");
+ gic = NULL;
+ goto out;
+ }
+ *msi_requested = msi;
+ } else {
+ msi = *msi_requested;
+ }
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (gic) {
+ refcount_inc(&gic->refcount);
+ if (use_msi_bitmap) {
+ gic->bitmap_refs++;
+ set_bit(msi, gc->msi_bitmap);
+ }
+ goto out;
+ }
+
+ irq = pci_irq_vector(dev, msi);
+ if (irq == -EINVAL) {
+ irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+ if (!irq_map.virq) {
+ err = irq_map.index;
+ dev_err(gc->dev,
+ "Failed to alloc irq_map msi %d err %d\n",
+ msi, err);
+ gic = NULL;
+ goto out;
+ }
+ irq = irq_map.virq;
+ msi = irq_map.index;
+ }
+
+ gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+ if (!gic) {
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->handler = mana_gd_process_eq_events;
+ gic->msi = msi;
+ gic->irq = irq;
+ INIT_LIST_HEAD(&gic->eq_list);
+ spin_lock_init(&gic->lock);
+
+ if (!gic->msi)
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+ pci_name(dev));
+ else
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+ gic->msi, pci_name(dev));
+
+ err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+ if (err) {
+ dev_err(gc->dev, "Failed to request irq %d %s\n",
+ irq, gic->name);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->dyn_msix = !!irq_map.virq;
+ refcount_set(&gic->refcount, 1);
+ gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+ err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+ if (err) {
+ dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+ msi, err);
+ free_irq(irq, gic);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ if (use_msi_bitmap)
+ set_bit(msi, gc->msi_bitmap);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+ return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
{
r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2144,6 +2302,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto release_region;
mutex_init(&gc->eq_test_event_mutex);
+ mutex_init(&gc->gic_mutex);
pci_set_drvdata(pdev, gc);
gc->bar0_pa = pci_resource_start(pdev, 0);
gc->bar0_size = pci_resource_len(pdev, 0);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 9c05b1e15c3e..fbe3c1427b45 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -388,6 +388,11 @@ struct gdma_irq_context {
spinlock_t lock;
struct list_head eq_list;
char name[MANA_IRQ_NAME_SZ];
+ unsigned int msi;
+ unsigned int irq;
+ refcount_t refcount;
+ unsigned int bitmap_refs;
+ bool dyn_msix;
};
enum gdma_context_flags {
@@ -449,6 +454,9 @@ struct gdma_context {
unsigned long flags;
+ /* Protect access to GIC context */
+ struct mutex gic_mutex;
+
/* Indicate if this device is sharing MSI for EQs on MANA */
bool msi_sharing;
@@ -1026,6 +1034,10 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
u32 proto_minor_ver, u32 proto_micro_ver,
u16 *max_num_vports, u8 *bm_hostmode);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v9 4/6] net: mana: Use GIC functions to allocate global EQs
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (2 preceding siblings ...)
2026-05-13 22:09 ` [PATCH net-next v9 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
@ 2026-05-13 22:09 ` Long Li
2026-05-13 22:09 ` [PATCH net-next v9 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
2026-05-13 22:09 ` [PATCH net-next v9 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.
This changes the per-queue interrupt names in /proc/interrupts from
mana_q0, mana_q1, ... to mana_msi1, mana_msi2, ... to reflect the
MSI-X index rather than a zero-based queue number. The HWC interrupt
name (mana_hwc) is unchanged.
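For example (hypothetical PCI address), an EQ that previously showed up as
mana_q0@pci:7870:00:00.0 now shows up as mana_msi1@pci:7870:00:00.0, using
the naming scheme added in patch 3.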
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 96 +++----------------
1 file changed, 13 insertions(+), 83 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index fdd2ef24414b..e1a0e897b1b9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1915,7 +1915,7 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
bool skip_first_cpu = false;
- int *irqs, irq, err, i;
+ int *irqs, err, i;
irqs = kmalloc_objs(int, nvec);
if (!irqs)
@@ -1928,30 +1928,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
* further used in irq_setup()
*/
for (i = 1; i <= nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- /* one pci vector is already allocated for HWC */
- irqs[i - 1] = pci_irq_vector(pdev, i);
- if (irqs[i - 1] < 0) {
- err = irqs[i - 1];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i - 1] = gic->irq;
}
/*
@@ -1973,20 +1956,9 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
kfree(irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
- for (i -= 1; i > 0; i--) {
- irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
- }
+ for (i -= 1; i > 0; i--)
+ mana_gd_put_gic(gc, false, i);
kfree(irqs);
return err;
}
@@ -1995,7 +1967,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
- int *irqs, *start_irqs, irq;
+ int *irqs, *start_irqs;
unsigned int cpu;
int err, i;
@@ -2006,34 +1978,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
start_irqs = irqs;
for (i = 0; i < nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- if (!i)
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
- pci_name(pdev));
- else
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- irqs[i] = pci_irq_vector(pdev, i);
- if (irqs[i] < 0) {
- err = irqs[i];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
-
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i] = gic->irq;
}
/* If number of IRQ is one extra than number of online CPUs,
@@ -2062,20 +2013,9 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
kfree(start_irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
- for (i -= 1; i >= 0; i--) {
- irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
- }
+ for (i -= 1; i >= 0; i--)
+ mana_gd_put_gic(gc, false, i);
kfree(start_irqs);
return err;
@@ -2149,26 +2089,16 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
static void mana_gd_remove_irqs(struct pci_dev *pdev)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
- struct gdma_irq_context *gic;
- int irq, i;
+ int i;
if (gc->max_num_msix < 1)
return;
for (i = 0; i < gc->max_num_msix; i++) {
- irq = pci_irq_vector(pdev, i);
- if (irq < 0)
- continue;
-
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
+ if (!xa_load(&gc->irq_contexts, i))
continue;
- /* Need to clear the hint before free_irq */
- irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
pci_free_irq_vectors(pdev);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v9 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (3 preceding siblings ...)
2026-05-13 22:09 ` [PATCH net-next v9 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
@ 2026-05-13 22:09 ` Long Li
2026-05-13 22:09 ` [PATCH net-next v9 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.
The caller now owns the GIC reference across the EQ create/destroy
lifecycle: mana_create_eq() calls mana_gd_get_gic() before creating
each EQ and mana_destroy_eq() calls mana_gd_put_gic() after destroying
it. The msix_index invalidation is moved from mana_gd_deregister_irq()
to the mana_gd_create_eq() error path so that mana_destroy_eq() can
read the index before teardown.
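In sketch form, the contract on the Ethernet path is (mirroring the hunks
below):

	msi = (i + 1) % gc->num_msix_usable;
	gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);	/* ref++ */
	spec.eq.msix_index = msi;
	err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
	if (err)
		mana_gd_put_gic(gc, !gc->msi_sharing, msi);	/* undo */

	/* ... and on teardown, read the index before the queue is gone: */
	msi = eq->eq.msix_index;
	mana_gd_destroy_queue(gc, eq);
	mana_gd_put_gic(gc, !gc->msi_sharing, msi);		/* ref-- */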
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 18 +++++++++++++++++-
include/net/mana/gdma.h | 1 +
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index e1a0e897b1b9..53281fef2ccd 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -894,7 +894,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
}
spin_unlock_irqrestore(&gic->lock, flags);
- queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
synchronize_rcu();
}
@@ -1009,6 +1008,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
out:
dev_err(dev, "Failed to create EQ: %d\n", err);
mana_gd_destroy_eq(gc, false, queue);
+ queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
return err;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 3ee74e7e300c..433ec88d0d69 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1625,6 +1625,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
struct mana_context *ac = apc->ac;
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
+ unsigned int msi;
int i;
if (!apc->eqs)
@@ -1638,7 +1639,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
if (!eq)
continue;
+ msi = eq->eq.msix_index;
mana_gd_destroy_queue(gc, eq);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
}
kfree(apc->eqs);
@@ -1655,6 +1658,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+ debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
@@ -1663,7 +1667,9 @@ int mana_create_eq(struct mana_port_context *apc)
struct gdma_dev *gd = apc->ac->gdma_dev;
struct gdma_context *gc = gd->gdma_context;
struct gdma_queue_spec spec = {};
+ struct gdma_irq_context *gic;
int err;
+ int msi;
int i;
if (WARN_ON(apc->eqs))
@@ -1683,12 +1689,22 @@ int mana_create_eq(struct mana_port_context *apc)
debugfs_create_dir("EQs", apc->mana_port_debugfs);
for (i = 0; i < apc->num_queues; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ msi = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, !gc->msi_sharing, &msi);
+ if (!gic) {
+ err = -ENOMEM;
+ goto out;
+ }
+ spec.eq.msix_index = msi;
+
err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
goto out;
}
+ apc->eqs[i].eq->eq.irq = gic->irq;
mana_create_eq_debugfs(apc, i);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index fbe3c1427b45..6c138cc77407 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -342,6 +342,7 @@ struct gdma_queue {
void *context;
unsigned int msix_index;
+ unsigned int irq;
u32 log2_throttle_limit;
} eq;
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v9 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
2026-05-13 22:09 [PATCH net-next v9 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
` (4 preceding siblings ...)
2026-05-13 22:09 ` [PATCH net-next v9 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
@ 2026-05-13 22:09 ` Long Li
5 siblings, 0 replies; 9+ messages in thread
From: Long Li @ 2026-05-13 22:09 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits. The RDMA EQs pass
use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the
capability flag advertises that the driver supports per-vPort EQ
separation when hardware has sufficient vectors.
Populate eq.irq on all RDMA EQs for consistency with the Ethernet path.
Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its
numeric BIT(6) position among the other capability flags.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 43 +++++++++++++++++++++++++------
include/net/mana/gdma.h | 7 +++--
2 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 6159bd03a021..47e5322bebca 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -750,7 +750,8 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
struct gdma_queue_spec spec = {};
- int err, i;
+ struct gdma_irq_context *gic;
+ int err, i, msi;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
@@ -758,11 +759,19 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.callback = mana_ib_event_handler;
spec.eq.context = mdev;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- spec.eq.msix_index = 0;
+
+ msi = 0;
+ gic = mana_gd_get_gic(gc, false, &msi);
+ if (!gic)
+ return -ENOMEM;
+ spec.eq.msix_index = msi;
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, 0);
return err;
+ }
+ mdev->fatal_err_eq->eq.irq = gic->irq;
mdev->eqs = kzalloc_objs(struct gdma_queue *,
mdev->ib_dev.num_comp_vectors);
@@ -772,32 +781,50 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
}
spec.eq.callback = NULL;
for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ msi = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, false, &msi);
+ if (!gic) {
+ err = -ENOMEM;
+ goto destroy_eqs;
+ }
+ spec.eq.msix_index = msi;
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, msi);
goto destroy_eqs;
+ }
+ mdev->eqs[i]->eq.irq = gic->irq;
}
return 0;
destroy_eqs:
- while (i-- > 0)
+ while (i-- > 0) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+ }
kfree(mdev->eqs);
destroy_fatal_eq:
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
return err;
}
void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
- int i;
+ int i, msi;
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
- for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+ for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ msi = (i + 1) % gc->num_msix_usable;
+ mana_gd_put_gic(gc, false, msi);
+ }
kfree(mdev->eqs);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6c138cc77407..d84e474309a3 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -615,6 +615,7 @@ enum {
#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
#define GDMA_DRV_CAP_FLAG_1_GDMA_PAGES_4MB_1GB_2GB BIT(4)
#define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
+#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver can handle holes (zeros) in the device list */
#define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
@@ -631,7 +632,8 @@ enum {
/* Driver detects stalled send queues and recovers them */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
-#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
#define GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE BIT(20)
@@ -659,7 +661,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
- GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
#define GDMA_DRV_CAP_FLAGS2 0
--
2.43.0
* Re: [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort
2026-05-13 22:09 ` [PATCH net-next v9 1/6] net: mana: Create separate EQs for each vPort Long Li
@ 2026-05-14 22:10 ` sashiko-bot
0 siblings, 0 replies; 9+ messages in thread
From: sashiko-bot @ 2026-05-14 22:10 UTC (permalink / raw)
To: Long Li; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 2 potential issues to consider:
- [Critical] Race condition in `mana_set_channels()` allows an RDMA application to steal vport ownership during an Ethernet reconfiguration, leading to out-of-bounds memory accesses and interface breakage.
- [Critical] Per-PD reference counting for per-port EQs causes a scope mismatch, leading to NULL pointer dereferences and resource leaks when a Protection Domain is shared across multiple ports.
--
commit 31a11c151b6c97fb0cd32cd8787a2fa1cd92856b
Author: Long Li <longli@microsoft.com>
Subject: net: mana: Create separate EQs for each vPort
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494d..6159bd03a0217 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
[ ... ]
> @@ -55,15 +57,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
> return err;
> }
>
> - mutex_unlock(&pd->vport_mutex);
>
> - pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> - pd->tx_vp_offset = mpc->tx_vp_offset;
> + err = mana_create_eq(mpc);
> + if (err) {
> + mana_uncfg_vport(mpc);
> + pd->vport_use_count--;
> + } else {
> + pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> + pd->tx_vp_offset = mpc->tx_vp_offset;
> + }
>
> - ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> - mpc->port_handle, pd->pdn, doorbell_id);
> + mutex_unlock(&pd->vport_mutex);
Could this lead to a NULL pointer dereference or resource leak if a Protection
Domain is shared across multiple ports?
The vport_use_count is tracked per-PD (pd->vport_use_count), but Event Queues
are now managed per-port (mpc->eqs).
If a user creates a QP on Port 1, pd->vport_use_count becomes 1 and mpc1->eqs
are allocated. If the user then creates a QP on Port 2 using the same PD, the
code in mana_ib_cfg_vport() will see pd->vport_use_count > 1 and bypass
configuration for Port 2.
This leaves mpc2->eqs as NULL if the Ethernet interface is down, leading to a
NULL dereference when the EQs are accessed. When the QPs are later destroyed,
the teardown path destroys only the context of whichever port the last QP
belonged to, permanently leaking the hardware configuration and EQs of the
port that was actually initialized.
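For concreteness, a minimal sketch of the pattern being questioned, assuming
the v9 shape of mana_ib_cfg_vport(); the early return for an already
configured PD is inferred from the quoted hunks and this description, so the
details may differ:

    /* sketch only: a per-PD count guards per-port (mpc) state */
    mutex_lock(&pd->vport_mutex);
    pd->vport_use_count++;
    if (pd->vport_use_count > 1) {
        /* PD already configured -- but possibly against a different
         * port, so this port's mpc->eqs can still be NULL
         */
        mutex_unlock(&pd->vport_mutex);
        return 0;
    }
    err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
    ...
    err = mana_create_eq(mpc);  /* EQs live in mpc, not in pd */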
[ ... ]
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e3..e121834d17f3a 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -454,18 +454,40 @@ static int mana_set_coalesce(struct net_device *ndev,
> return err;
> }
>
> +/* mana_set_channels - change the number of queues on a port
> + *
> + * Returns -EBUSY if the port is down and RDMA holds the vport with
> + * EQs sized to the current num_queues.
> + */
> static int mana_set_channels(struct net_device *ndev,
> struct ethtool_channels *channels)
> {
> struct mana_port_context *apc = netdev_priv(ndev);
> unsigned int new_count = channels->combined_count;
> unsigned int old_count = apc->num_queues;
> + bool locked = false;
> int err;
>
> + /* When the port is down, hold vport_mutex for the entire
> + * operation to serialize against RDMA's mana_cfg_vport().
> + * This is safe because mana_detach()/mana_attach() skip
> + * vport teardown/setup when port_st_save is false.
> + * When the port is up, Ethernet owns the vport exclusively
> + * so no locking against RDMA is needed.
> + */
> + if (!apc->port_is_up) {
> + mutex_lock(&apc->vport_mutex);
> + if (apc->vport_use_count) {
> + mutex_unlock(&apc->vport_mutex);
> + return -EBUSY;
> + }
> + locked = true;
> + }
>
> err = mana_pre_alloc_rxbufs(apc, ndev->mtu, new_count);
> if (err) {
> netdev_err(ndev, "Insufficient memory for new allocations");
> - return err;
> + goto unlock;
> }
>
> err = mana_detach(ndev, false);
Can this lead to a race condition with RDMA operations when the port is up?
If apc->port_is_up is true, the function does not acquire apc->vport_mutex.
It proceeds to call mana_detach(), which calls mana_uncfg_vport() and lowers
apc->vport_use_count to 0.
Because no lock is held across the detach/attach window, a concurrent RDMA
operation like mana_ib_create_qp_raw() could grab apc->vport_mutex, observe
a use count of 0, and steal ownership of the vport.
If the RDMA thread allocates apc->eqs based on old_count, and then
mana_set_channels() updates apc->num_queues = new_count, subsequent RDMA
queue indexing operations may read out of bounds. The Ethernet interface will
also fail to reattach because RDMA now owns the vport, leaving the interface
broken.
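For concreteness, one interleaving that appears possible under the quoted
code (the RDMA call chain is inferred from the hunks above, so treat this as
a sketch rather than a trace):

    ethtool thread (port up)            RDMA thread
    ------------------------            -----------
    mana_set_channels()
      /* port_is_up, so vport_mutex is not taken */
      mana_detach()
        mana_uncfg_vport()
          /* apc->vport_use_count -> 0 */
                                        mana_ib_create_qp_raw()
                                          mutex_lock(&apc->vport_mutex)
                                          /* sees use count 0, takes
                                             the vport, sizes EQs to
                                             old_count */
      apc->num_queues = new_count
      mana_attach()
        /* vport now owned by RDMA; reattach fails, and RDMA
           indexing runs against the stale queue count */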
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260513220956.402058-1-longli@microsoft.com?part=1
* Re: [PATCH net-next v9 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
2026-05-13 22:09 ` [PATCH net-next v9 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
@ 2026-05-14 22:10 ` sashiko-bot
0 siblings, 0 replies; 9+ messages in thread
From: sashiko-bot @ 2026-05-14 22:10 UTC (permalink / raw)
To: Long Li; +Cc: linux-hyperv
Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] The explicit `msi_sharing = true` requirement for platforms without dynamic MSI-X support is unconditionally overwritten, leading to incorrect MSI-X configuration.
--
commit 113e4b9419d10774e0a86653bf123b1add7cbc2e
Author: Long Li <longli@microsoft.com>
Subject: net: mana: Query device capabilities and configure MSI-X sharing for EQs
This commit adjusts the max number of queues to allow dedicated MSI-X vectors
for each vPort and enables MSI-X sharing among vPorts when there are not
enough vectors or the platform does not support dynamic MSI-X allocation.
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 3bc3fff559991..bbd055849e363 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -179,8 +179,18 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> struct gdma_context *gc = pci_get_drvdata(pdev);
> struct gdma_query_max_resources_resp resp = {};
> struct gdma_general_req req = {};
> + unsigned int max_num_queues;
> + u8 bm_hostmode;
> + u16 num_ports;
> int err;
>
> + /* Reset msi_sharing so it is recomputed from current hardware
> + * state. On resume, num_online_cpus() or num_msix_usable may
> + * have changed, making dedicated MSI-X feasible where it was
> + * not before.
> + */
> + gc->msi_sharing = false;
> +
Does this unconditional reset unintentionally overwrite the platform constraint
for systems without dynamic MSI-X support?
[ ... ]
> @@ -1889,6 +1936,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
> /* Need 1 interrupt for HWC */
> max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
> min_irqs = 2;
> + gc->msi_sharing = true;
> }
>
> nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
The commit message states that sharing is always used when the platform does
not support dynamic MSI-X allocation, and the flag is indeed set here in
mana_gd_setup_hwc_irqs().
However, looking at mana_gd_setup(), mana_gd_query_max_resources() is called
after mana_gd_setup_hwc_irqs().
Will this execution order cause the driver to ignore the limitation and
attempt to use dedicated MSI-X vectors if there happen to be enough
pre-allocated vectors to satisfy the per-vPort math later in the function?
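A sketch of the ordering in question, assuming mana_gd_setup() invokes these
helpers in the order described above (the exact call sites are an
assumption):

    /* mana_gd_setup(), order as described above */
    err = mana_gd_setup_hwc_irqs(pdev);
    /* platform lacks dynamic MSI-X: gc->msi_sharing = true here */
    ...
    err = mana_gd_query_max_resources(pdev);
    /* first thing it does: gc->msi_sharing = false;
     * the platform constraint set above is silently discarded, and
     * the per-vPort math later in the function may then pick
     * dedicated vectors
     */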
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260513220956.402058-1-longli@microsoft.com?part=2