Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort

Linux-HyperV List
 help / color / mirror / Atom feed

From: Paolo Abeni <pabeni@redhat.com>
To: longli@microsoft.com
Cc: kotaranov@microsoft.com, kuba@kernel.org, davem@davemloft.net,
	pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
	jgg@ziepe.ca, leon@kernel.org, haiyangz@microsoft.com,
	kys@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	shradhagupta@linux.microsoft.com, horms@kernel.org,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
Date: Tue, 19 May 2026 15:28:25 +0200	[thread overview]
Message-ID: <20260519132825.172490-1-pabeni@redhat.com> (raw)
In-Reply-To: <20260515040508.491748-2-longli@microsoft.com>

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: mana: Create separate EQs for each vPort

This patch moves EQ allocation from per-device to per-vPort, exports
mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
a vport reference for RSS QPs, and adds cross-port PD-sharing
rejection for raw QPs along with serialization between
mana_set_channels() and the RDMA vport configuration.

> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..f8a9013f0ca3 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  
>  	pd->vport_use_count++;
>  	if (pd->vport_use_count > 1) {
> +		/* Reject cross-port PD sharing. EQs and vport config
> +		 * are per-port, so the PD must stay bound to the port
> +		 * that was configured on the first raw QP creation.
> +		 */
> +		if (pd->vport_port != port) {
> +			pd->vport_use_count--;
> +			mutex_unlock(&pd->vport_mutex);
> +			ibdev_dbg(&dev->ib_dev,
> +				  "PD already bound to port %u\n",
> +				  pd->vport_port);
> +			return -EINVAL;
> +		}
>  		ibdev_dbg(&dev->ib_dev,
>  			  "Skip as this PD is already configured vport\n");
>  		mutex_unlock(&pd->vport_mutex);
>  		return 0;
>  	}
>  
> -	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
> +	pd->vport_port = port;
> +
> +	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
>  	if (err) {
>  		pd->vport_use_count--;
>  		mutex_unlock(&pd->vport_mutex);

[Low]
In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
later mana_create_eq() fails, vport_use_count is rolled back to 0 but
pd->vport_port keeps the value just written.

Is the contract that pd->vport_port is only meaningful while
vport_use_count > 0?  That invariant is not documented in the struct
mana_ib_pd comment, and a future reader who consults pd->vport_port
without first checking vport_use_count under vport_mutex would see stale
data.  Would it be worth either resetting pd->vport_port on the rollback
path or documenting the rule explicitly?

> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  		return err;
>  	}
>  
> -	mutex_unlock(&pd->vport_mutex);
>  
> -	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> -	pd->tx_vp_offset = mpc->tx_vp_offset;
> +	err = mana_create_eq(mpc);
> +	if (err) {
> +		mana_uncfg_vport(mpc);
> +		pd->vport_use_count--;
> +	} else {
> +		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> +		pd->tx_vp_offset = mpc->tx_vp_offset;
> +	}

[ ... ]

> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..108ec4c5ce51 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  
>  	qp->port = port;
>  
> +	/* Take a reference on the vport to ensure EQs outlive this QP.
> +	 * The vport must already be configured by a raw QP.
> +	 */
> +	mutex_lock(&mana_pd->vport_mutex);
> +	if (!mana_pd->vport_use_count) {
> +		mutex_unlock(&mana_pd->vport_mutex);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	mana_pd->vport_use_count++;
> +	mutex_unlock(&mana_pd->vport_mutex);
> +

[Critical]
The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
mirrored here.  mana_ib_create_qp_rss() takes the user-supplied port
from ucmd.port and only checks that vport_use_count is non-zero, so the
RSS path will happily attach to a different port than the one bound to
the PD by the first raw QP.

If a user creates a raw QP on port A (which sets pd->vport_port=A and
allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
the same PD with ucmd.port=B where port B has no Ethernet up and no
RDMA activity, mpc_B->eqs is NULL.  The subsequent code in this same
function:

	eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
	cq_spec.attached_eq = eq->eq->id;

would then dereference NULL through mpc_B->eqs.

If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
driver and the RSS QP attaches to those EQs.  When the QP is destroyed
mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
on a port whose apc->vport_use_count was never incremented by this PD,
tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
vport state.  Meanwhile port A's EQs and vport configuration are leaked
because nothing on this PD will destroy them.

Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
check that mana_ib_cfg_vport() now performs, before bumping
vport_use_count?

[Medium]
The new "vport must already be configured by a raw QP" requirement is a
visible uAPI change.  Before this patch, EQs were allocated device-wide
on mana_context at probe time, so an RSS QP could be created without any
prior raw QP on the PD.  After the patch, applications that legitimately
use only RSS QPs (RX-only RDMA workloads) will see -EINVAL here unless
they also create a raw QP first.

Was breaking the RSS-only case considered, and would lazy EQ creation on
the first QP regardless of type be feasible to avoid the
incompatibility?

>  	for (i = 0; i < ind_tbl_size; i++) {
>  		struct mana_obj_spec wq_spec = {};
>  		struct mana_obj_spec cq_spec = {};
> @@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  		cq_spec.gdma_region = cq->queue.gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		eq = &mpc->ac->eqs[cq->comp_vector];
> +		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>  		cq_spec.attached_eq = eq->eq->id;

[ ... ]

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e..4633acc976f0 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
>  	unsigned int new_count = channels->combined_count;
>  	unsigned int old_count = apc->num_queues;
>  	int err;
>  
> +	/* Set channel_changing to block RDMA from grabbing the vport
> +	 * during the detach/attach window. mana_cfg_vport() checks
> +	 * this flag under vport_mutex and returns -EBUSY if set.
> +	 */
> +	mutex_lock(&apc->vport_mutex);
> +	if (!apc->port_is_up && apc->vport_use_count) {
> +		mutex_unlock(&apc->vport_mutex);
> +		return -EBUSY;
> +	}
> +	apc->channel_changing = true;
> +	mutex_unlock(&apc->vport_mutex);

[Low]
The commit message says:

    When the port is down, apc->vport_mutex is held for the entire
    operation since mana_detach()/mana_attach() do not take vport_mutex
    in that case. When the port is up, Ethernet owns the vport
    exclusively so no additional locking is needed.

Does this description match the code?  mana_set_channels() drops
apc->vport_mutex immediately after setting channel_changing=true and
runs mana_pre_alloc_rxbufs(), mana_detach() and mana_attach() without
the mutex held, then re-acquires the mutex only at the end to clear the
flag.

The actual serialization mechanism is the channel_changing boolean
checked by mana_cfg_vport() under vport_mutex, not a long mutex hold.
Could the commit message be updated to match the implemented design so
that future readers do not assume a different invariant?
-- 
This is an AI-generated review.

next prev parent reply	other threads:[~2026-05-19 13:29 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-15  4:05 [PATCH net-next v10 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management Long Li
2026-05-15  4:05 ` [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort Long Li
2026-05-16  4:05   ` sashiko-bot
2026-05-19 13:28   ` Paolo Abeni [this message]
2026-05-19 13:30     ` Paolo Abeni
2026-05-15  4:05 ` [PATCH net-next v10 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs Long Li
2026-05-16  4:05   ` sashiko-bot
2026-05-15  4:05 ` [PATCH net-next v10 3/6] net: mana: Introduce GIC context with refcounting for interrupt management Long Li
2026-05-15  4:05 ` [PATCH net-next v10 4/6] net: mana: Use GIC functions to allocate global EQs Long Li
2026-05-15  4:05 ` [PATCH net-next v10 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort Long Li
2026-05-16  4:05   ` sashiko-bot
2026-05-15  4:05 ` [PATCH net-next v10 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260519132825.172490-1-pabeni@redhat.com \
    --to=pabeni@redhat.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=decui@microsoft.com \
    --cc=edumazet@google.com \
    --cc=haiyangz@microsoft.com \
    --cc=horms@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=kotaranov@microsoft.com \
    --cc=kuba@kernel.org \
    --cc=kys@microsoft.com \
    --cc=leon@kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=longli@microsoft.com \
    --cc=netdev@vger.kernel.org \
    --cc=shradhagupta@linux.microsoft.com \
    --cc=wei.liu@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox