From: Long Li <longli@microsoft.com>
To: Long Li <longli@microsoft.com>,
	Konstantin Taranov <kotaranov@microsoft.com>,
	Jakub Kicinski <kuba@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	"K . Y . Srinivasan" <kys@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
	shradhagupta@linux.microsoft.com
Cc: Simon Horman <horms@kernel.org>,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH net-next v12 0/6] net: mana: Per-vPort EQ and MSI-X management
Date: Thu,  4 Jun 2026 17:57:09 -0700
Message-ID: <20260605005717.2059954-1-longli@microsoft.com>

This series moves EQ ownership from the shared mana_context to per-vPort
mana_port_context, enabling each vPort to have dedicated MSI-X vectors
when the hardware provides enough vectors. When vectors are limited, the
driver falls back to sharing MSI-X among vPorts.
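
As a rough illustration of the dedicated-versus-shared decision described
above, the check can be sketched as follows. This is not the driver code:
msix_needs_sharing() and its parameters are hypothetical names, and the
real accounting (for example, vectors reserved for other uses) is not
modeled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative only: decide whether the vPorts must share MSI-X vectors.
 * All names here are hypothetical and do not appear in the series.
 */
static bool msix_needs_sharing(unsigned int avail_vectors,
			       unsigned int num_ports,
			       unsigned int queues_per_port)
{
	assert(num_ports > 0 && queues_per_port > 0);

	/*
	 * Dedicated mode needs one vector per queue on every port.
	 * Dividing instead of multiplying avoids integer overflow.
	 */
	return avail_vectors / num_ports < queues_per_port;
}
```

With 4 ports and 16 queues per port, 64 available vectors are enough for
dedicated mode, while 63 would force sharing.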

The series introduces a GDMA IRQ Context (GIC) abstraction with reference
counting to manage interrupt context lifecycle. This allows both Ethernet
and RDMA EQs to dynamically acquire dedicated or shared MSI-X vectors at
vPort creation time rather than pre-allocating all vectors at probe time.
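
The reference-counting idea behind the GIC abstraction can be sketched in
user space as below. This is a minimal model under stated assumptions, not
the implementation in the series: gic_get(), gic_put(), struct gic and the
fixed-size table are hypothetical, the sketch returns NULL on failure for
brevity, and no real interrupt is requested or freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MSI 8	/* hypothetical table size */

/* Hypothetical per-vector interrupt context. */
struct gic {
	unsigned int refcount;
	int msi;
};

static struct gic *gic_table[MAX_MSI];

/* Return the context for @msi, creating it on first use. */
static struct gic *gic_get(int msi)
{
	struct gic *g;

	if (msi < 0 || msi >= MAX_MSI)
		return NULL;

	g = gic_table[msi];
	if (!g) {
		g = calloc(1, sizeof(*g));
		if (!g)
			return NULL;
		g->msi = msi;
		/* A real driver would set up the interrupt here. */
		gic_table[msi] = g;
	}
	g->refcount++;
	return g;
}

/* Drop one reference; tear the context down when the last user leaves. */
static void gic_put(int msi)
{
	struct gic *g;

	assert(msi >= 0 && msi < MAX_MSI);
	g = gic_table[msi];
	assert(g && g->refcount > 0);

	if (--g->refcount == 0) {
		gic_table[msi] = NULL;
		free(g);
	}
}
```

Two EQs that share a vector each call gic_get() with the same index and
receive the same context; the context is released only after both have
called gic_put().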

This series is intended to go through the net-next tree.

The following changes since commit 93790c374b9d77f3db15786d7d432872d92751cf:

  net/mlx5: convert miss_list allocation to kvmalloc_array() (2026-06-04 09:33:24 -0700)

are available in the Git repository at:

  https://github.com/longlimsft/linux.git tags/mana-eq-msi-v12

for you to fetch changes up to 18505b11dcf052442cdeba5e208a85219776206a:

  RDMA/mana_ib: Allocate interrupt contexts on EQs (2026-06-05 00:26:56 +0000)

Changes in v12:
- Restrict each vport to a single RSS QP. The hardware only supports
  one steering config per vport, and destroy disables RX globally.
  Previously a second RSS QP would silently overwrite the first.
  Track via pd->has_rss_qp under vport_mutex (patch 1)
- Validate comp_vector against per-vPort EQ count with modulo mapping.
  Document the rationale: when RDMA-advertised num_comp_vectors exceeds
  the port's num_queues, the vector is remapped to an available EQ
  rather than failing QP creation (patch 1)
- Extend channel_changing serialization to the async per-port queue
  reset handler, preventing RDMA from claiming the vport during
  the reset detach/attach window (patch 1)
- Fix HW vport registration leak: roll back
  mana_pf_register_hw_vport() when mana_cfg_vport() fails in
  mana_create_vport() (patch 1)
- Cap num_ports to MAX_PORTS_IN_MANA_DEV before the per-vPort MSI-X
  budget calculation so it matches the port count that is later
  instantiated by mana_probe() (patch 2)
- Use a local msi variable in mana_gd_setup_irqs() and
  mana_gd_setup_dyn_irqs() to decouple the loop counter from
  the callee-updated mana_gd_get_gic() parameter (patch 4)
- Add WARN_ON(!xa_empty()) assertion in mana_gd_remove_irqs()
  before pci_free_irq_vectors() to catch leaked GIC references
  (patch 4)
- Log gc->max_num_queues_vport (per-vPort value) instead of
  gc->max_num_queues (global) in the MSI sharing mode message,
  use %u format specifiers (patch 2)
- Clarify comment about MANA_DEF_NUM_QUEUES clamping vs hardware
  max precedence (patch 2)
- Rebase onto net-next/main (2026-06-04)
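
The comp_vector remapping mentioned in the v12 notes amounts to a modulo
over the port's EQ count. A minimal sketch, assuming a hypothetical helper
name (comp_vector_to_eq() is not a function in the series):

```c
#include <assert.h>

/*
 * Illustrative only: map a requested completion vector onto one of the
 * port's EQs instead of rejecting values beyond the port's queue count.
 */
static unsigned int comp_vector_to_eq(unsigned int comp_vector,
				      unsigned int num_queues)
{
	assert(num_queues > 0);
	return comp_vector % num_queues;
}
```

With 4 queues, comp_vector 5 lands on EQ 1 and comp_vector 8 wraps back to
EQ 0, so QP creation does not fail for out-of-range vectors.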

Changes in v11:
- Address AI reviewer feedback from Paolo on patch 1: add cross-port
  PD-sharing check in mana_ib_create_qp_rss() to match the guard
  already present in mana_ib_cfg_vport(), preventing NULL deref on
  mpc->eqs when an RSS QP is created on a different port than the
  PD's bound port (patch 1)
- Document that pd->vport_port is only valid when vport_use_count > 0
  in the struct mana_ib_pd comment, as suggested by the AI reviewer
  (patch 1)
- Propagate actual error code from mana_ib_cfg_vport() instead of
  hardcoding -ENODEV in the raw QP creation path (patch 1)
- Switch mana_gd_get_gic() from returning NULL to IS_ERR/PTR_ERR on
  failure so callers can propagate the actual error code (-ENOSPC,
  -ENOMEM, etc.) instead of always returning -ENOMEM (patch 3)
- Update all mana_gd_get_gic() callers (patches 2, 4, 5, 6) to use
  IS_ERR()/PTR_ERR() error checking
- Set *msi_requested after pci_msix_alloc_irq_at() returns the actual
  assigned index, so the caller gets the correct MSI vector when
  dynamic allocation remaps it (patch 3)
- Add comments documenting the GIC refcount ownership contract in
  mana_gd_register_irq() and mana_gd_deregister_irq() (patch 5)
- Move the zero-port detection error message from mana_probe() to
  mana_gd_query_max_resources() where the actual check occurs (patch 2)
- Clamp apc->max_queues to gc->max_num_queues_vport in
  mana_init_port() so that on resume, if max_num_queues_vport has
  decreased, num_queues is reduced before EQ allocation (patch 2)
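
The NULL-to-IS_ERR/PTR_ERR conversion noted in the v11 list follows the
usual kernel error-pointer pattern. The user-space sketch below re-creates
ERR_PTR(), IS_ERR() and PTR_ERR() locally so it compiles outside the
kernel; irq_ctx_get(), setup_queue(), struct irq_ctx and the slot limit
are hypothetical stand-ins, not code from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space stand-ins for the helpers in the kernel's <linux/err.h>. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical interrupt context and a tiny vector budget. */
struct irq_ctx {
	int msi;
};

#define NUM_SLOTS 2
static int slots_used;

/* Return a context, or an error pointer carrying the precise reason. */
static struct irq_ctx *irq_ctx_get(int msi)
{
	struct irq_ctx *ctx;

	if (slots_used >= NUM_SLOTS)
		return ERR_PTR(-ENOSPC);

	ctx = malloc(sizeof(*ctx));
	if (!ctx)
		return ERR_PTR(-ENOMEM);

	ctx->msi = msi;
	slots_used++;
	return ctx;
}

/* The caller propagates the real error code instead of a fixed -ENOMEM. */
static int setup_queue(int msi, struct irq_ctx **out)
{
	struct irq_ctx *ctx;

	assert(out != NULL);
	ctx = irq_ctx_get(msi);
	if (IS_ERR(ctx))
		return (int)PTR_ERR(ctx);

	*out = ctx;
	return 0;
}
```

Once both slots are taken, a third setup_queue() call returns -ENOSPC, which
a NULL-returning helper could not have distinguished from an allocation
failure.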

Changes in v10:
- Add channel_changing flag to block RDMA from grabbing the vport
  during mana_set_channels() detach/attach window. The flag is checked
  in mana_cfg_vport() only when called from the RDMA path via a new
  check_channel_changing parameter (patch 1)
- Bind each PD to a single physical port via pd->vport_port to prevent
  cross-port PD sharing which would cause EQ scope mismatch. Returns
  -EINVAL if a second port tries to use an already-bound PD (patch 1)
- Guard gc->msi_sharing reset with pci_msix_can_alloc_dyn() to avoid
  overwriting the non-dyn platform constraint set by
  mana_gd_setup_hwc_irqs() (patch 2)

Changes in v9:
- RSS QPs now take a vport reference via pd->vport_use_count to ensure
  EQs outlive all QP consumers. EQs are only destroyed when the last
  QP (raw or RSS) on the PD releases its reference (patch 1)
- Serialize mana_set_channels() against RDMA vport configuration via
  apc->vport_mutex when the port is down. When the port is up, Ethernet
  owns the vport exclusively so no locking is needed (patch 1)
- Change WARN_ON(apc->eqs) to bail out with -EEXIST to prevent
  leaking prior EQ array if invariant is violated (patch 1)
- Only commit pd->tx_shortform_allowed and pd->tx_vp_offset after
  mana_create_eq() succeeds (patch 1)
- Reset gc->msi_sharing at the top of mana_gd_query_max_resources()
  so it is recomputed from current hardware state on resume (patch 2)
- Fix reverse Christmas tree variable declaration ordering (patches
  1, 3, 5)

Changes in v8:
- Fix comment to reference per-vPort queue count instead of
  gc->max_num_queues (patch 2)
- Remove duplicate irq_update_affinity_hint() calls from error paths
  and mana_gd_remove_irqs(); the clearing is now centralized in
  mana_gd_put_gic() (patch 4)
- Note the IRQ name change (mana_q -> mana_msi) in the commit
  message (patch 4)
- Remove dead conditional write to spec.eq.msix_index (patch 5)
- Document GIC ownership contract and msix_index invariant change
  in commit message (patch 5)
- Populate eq.irq on RDMA EQs for consistency with the Ethernet
  path (patch 6)
- Document BIT(6) relocation and capability flag semantics in
  commit message (patch 6)
- Fix checkpatch --strict alignment and line length warnings

Changes in v7:
- Use rounddown_pow_of_two() instead of roundup_pow_of_two() when
  computing per-vPort queue count to avoid unnecessarily forcing shared
  MSI-X mode (patch 2)
- Call mana_gd_setup_remaining_irqs() unconditionally to ensure
  irq_contexts are populated in both dedicated and shared MSI-X modes,
  fixing bisectability between patches 2 and 5 (patch 2)
- Guard ibdev_dbg() in mana_ib_cfg_vport() with error check so the
  vport handle is not logged on the failure path (patch 1)
- Use cached gic->irq instead of pci_irq_vector() lookup in
  mana_gd_put_gic() for consistency with the allocation path (patch 3)
- Fix unsigned int* to int* pointer type mismatch when calling
  mana_gd_get_gic() by using a local int variable for the MSI index
  (patches 5, 6)
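
To illustrate why the rounding direction in the v7 notes matters, consider
a made-up budget of 40 vectors across 4 ports, i.e. 10 per port. The
helpers below are user-space stand-ins for the kernel's
rounddown_pow_of_two() and roundup_pow_of_two(); the numbers are
hypothetical and the helpers assume small inputs.

```c
#include <assert.h>

/* Largest power of two that is <= n (n must be non-zero). */
static unsigned int rounddown_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n > 0);
	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Smallest power of two that is >= n (assumes the result fits). */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = rounddown_pow2(n);

	return p == n ? p : p * 2;
}
```

Rounding 10 up gives 16 queues per port, which needs 64 vectors and
exceeds the budget of 40, so sharing would be forced. Rounding down gives
8 queues per port, which needs 32 vectors and still fits dedicated mode.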

Changes in v6:
- Rebased on net-next/main (v7.1-rc1)

Changes in v5:
- Rebased on net-next/main

Changes in v4:
- Rebased on net-next/main (v7.0-rc4)
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
  max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
  pci_msix_can_alloc_dyn() on each call; improve remove_irqs iteration
  to skip unallocated entries

Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
  prevent NULL pointer dereference when RSS QP is created before a raw QP
  has configured the vport and allocated EQs

Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
  new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
  max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")

Long Li (6):
  net: mana: Create separate EQs for each vPort
  net: mana: Query device capabilities and configure MSI-X sharing for
    EQs
  net: mana: Introduce GIC context with refcounting for interrupt
    management
  net: mana: Use GIC functions to allocate global EQs
  net: mana: Allocate interrupt context for each EQ when creating vPort
  RDMA/mana_ib: Allocate interrupt contexts on EQs

 drivers/infiniband/hw/mana/main.c             |  83 +++-
 drivers/infiniband/hw/mana/mana_ib.h          |  14 +
 drivers/infiniband/hw/mana/qp.c               |  68 +++-
 .../net/ethernet/microsoft/mana/gdma_main.c   | 359 +++++++++++++-----
 drivers/net/ethernet/microsoft/mana/mana_en.c | 198 ++++++----
 .../ethernet/microsoft/mana/mana_ethtool.c    |  23 +-
 include/net/mana/gdma.h                       |  33 +-
 include/net/mana/mana.h                       |  15 +-
 8 files changed, 604 insertions(+), 189 deletions(-)


base-commit: 93790c374b9d77f3db15786d7d432872d92751cf
-- 
2.43.0

