From: Jakub Kicinski
To: tariqt@nvidia.com
Cc: edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, saeedm@nvidia.com, leon@kernel.org,
	mbloch@nvidia.com, moshe@nvidia.com, agoldberger@nvidia.com,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, gal@nvidia.com, dtatulea@nvidia.com
Subject: Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
Date: Sat, 2 May 2026 18:45:01 -0700
Message-ID: <20260503014501.4098393-1-kuba@kernel.org>
In-Reply-To: <20260501044156.260875-4-tariqt@nvidia.com>
References: <20260501044156.260875-4-tariqt@nvidia.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

---

net/mlx5: Add VHCA_ID page management mode support

This change adds a VHCA_ID-based page management mode that uses vhca_id
instead of function_id as the effective function identifier when the
firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
semantics for backward compatibility, and tracked pages are migrated to
the vhca_id key after set_hca_cap().

A few questions below on the per-type counter accounting, the boot-page
migration helper, and locking on the FW-driven page path.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index e0eafcf0c52a..d3eaefc5c0e0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>  	return true;
>  }
>  
> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
> +{
> +	struct mlx5_eswitch *esw = dev->priv.eswitch;
> +	struct mlx5_vport *vport;
> +	unsigned long i;
> +	u16 type;
> +
> +	if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
> +		return MLX5_SELF;
> +
> +	if (!esw)
> +		return MLX5_FUNC_TYPE_NONE;
> +
> +	mutex_lock(&esw->state_lock);

This function is reached from give_pages()/reclaim_pages()/release_all_pages()
via func_vhca_id_to_type(), which in turn runs from the pg_wq work
handler triggered by firmware page-request EQE events. Does acquiring
esw->state_lock on that path introduce a new lock dependency? Several
eswitch paths (for example mlx5_esw_vport_enable(),
mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold state_lock
while synchronously issuing firmware commands. Before this patch, the
page path held no eswitch locks.

Would it be safer to resolve the func_type outside of state_lock, for
example by caching the vhca_id-to-type mapping separately, or by
attaching the resolved type to the fw_page at give time so reclaim does
not need to look it up again?

> +	mlx5_esw_for_each_vport(esw, i, vport) {
> +		if (vport->vhca_id != vhca_id)
> +			continue;
> +
> +		if (vport->vport == MLX5_VPORT_HOST_PF) {
> +			type = MLX5_HOST_PF;
> +			goto unlock;
> +		}
> +
> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
> +			type = MLX5_SF;
> +			goto unlock;
> +		}
> +
> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
> +			type = MLX5_VF;
> +			goto unlock;
> +		}
> +
> +		type = MLX5_EC_VF;
> +		goto unlock;

Can this fall-through misattribute counters?
If a matching vport is not HOST_PF and has neither the VPT_SF nor the
VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens to
match end up charged to the EC_VF bucket here?

> +	}
> +	type = MLX5_FUNC_TYPE_NONE;
> +unlock:
> +	mutex_unlock(&esw->state_lock);
> +	return type;
> +}
> +

[ ... ]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index a242053f3a58..52cf341ad6b3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>  	if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>  		MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>  
> +	if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
> +		MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
> +

[ ... ]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> index 77ffa31cc505..7ebe88aa3b3e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c

[ ... ]

> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>  	return MLX5_SF;
>  }
>  
> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
> +				bool ec_function)
> +{
> +	if (mlx5_page_mgt_mode_is_vhca_id(dev))
> +		return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
> +
> +	return func_id_to_type(dev, func_vhca_id, ec_function);
> +}

Can the give/reclaim accounting become asymmetric in VHCA_ID mode?
The per-type counters are only updated when the resolved type is not
MLX5_FUNC_TYPE_NONE:

	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
	if (func_type != MLX5_FUNC_TYPE_NONE)
		dev->priv.page_counters[func_type] += npages;

Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
dynamically on every call, can the give and the corresponding reclaim
resolve to different types? For example, if a give runs before
vport->vhca_id is populated in mlx5_esw_vport_caps_get(), the lookup
returns MLX5_FUNC_TYPE_NONE and the counter is not incremented. Later,
when the vport is fully populated, reclaim resolves to a real type and
decrements the counter below the amount ever added, which on a u32
drives it to a very large value. Similarly, if a vport is removed before
reclaim, the increment at give time is recorded but the decrement at
reclaim is skipped, so the counter leaks upward.

Would caching the func_type on the fw_page at give time and reusing it
on reclaim make the accounting symmetric by construction?

[ ... ]

> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>  	 * req->npages (and not min ()).
>  	 */
>  	req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
> -	req->ec_function = ec_function;
> +	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> +		req->ec_function = ec_function;
>  	req->release_all = release_all;
>  	INIT_WORK(&req->work, pages_work_handler);
>  	queue_work(dev->priv.pg_wq, &req->work);
>  	return NOTIFY_OK;
>  }
>  
> +/*
> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
> + */
> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
> +{
> +	u32 vhca_id_key, old_key;
> +	struct rb_root *root;
> +	struct fw_page *fwp;
> +	struct rb_node *p;
> +	bool ec_function;
> +	int err;
> +
> +	if (xa_empty(&dev->priv.page_root_xa))
> +		return 0;
> +
> +	vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
> +	ec_function = mlx5_core_is_ecpf(dev);
> +
> +	old_key = ec_function ? (1U << 16) : 0;
> +	root = xa_load(&dev->priv.page_root_xa, old_key);
> +	if (!root)
> +		return 0;

Does this assume the boot-path func_vhca_id was always 0? The boot call
to mlx5_cmd_query_pages() reads func_vhca_id directly from the firmware
output, and give_pages() then uses that value to compute the key. The
migration here instead hardcodes old_key as ec_function ? (1U << 16) : 0.
If firmware returned a non-zero boot function_id, xa_load(old_key)
returns NULL, the function silently returns 0, the caller flips the mode
to VHCA_ID, and the original rb_root is orphaned in page_root_xa under
the old key. Subsequent free_fwp()/find_fw_page() paths would then hit
WARN_ON_ONCE(!root) and leak the DMA mappings and pages.

Would it be more robust to look up the actual key used at boot (derived
from the stored func_vhca_id), and to treat the "xa not empty but
old_key absent" case as an invariant violation rather than silently
succeeding?

> +
> +	if (old_key == vhca_id_key)
> +		return 0;
> +
> +	err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
> +	if (err) {
> +		mlx5_core_warn(dev,
> +			       "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
> +			       old_key, vhca_id_key);
> +		return err;
> +	}
> +
> +	xa_erase(&dev->priv.page_root_xa, old_key);
> +
> +	for (p = rb_first(root); p; p = rb_next(p)) {
> +		fwp = rb_entry(p, struct fw_page, rb_node);
> +		fwp->function = vhca_id_key;
> +	}

Is the ordering here safe against any concurrent free_fwp()?
Between xa_erase(old_key) and the loop that updates fwp->function, every
fw_page still carries the old key while page_root_xa no longer resolves
it. If a free_fwp() were to run in that window:

	root = xa_load(&dev->priv.page_root_xa, fwp->function);
	if (WARN_ON_ONCE(!root))
		return;

it would return early, skipping dma_unmap_page(), __free_page(), and
kfree(fwp), leaking the DMA mapping and the backing page. No concurrent
free path is structurally reachable today because this runs before the
EQ notifier is registered in mlx5_pagealloc_start(), but would it be
cleaner to update the fwp->function values first, then swap the xarray
entries (or store the new value at a single key) so the two views cannot
disagree?

> +
> +	return 0;
> +}
> +
>  int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>  {
> -	u16 func_id;
> +	bool ec_function = false;
> +	u16 func_vhca_id;
>  	s32 npages;
>  	int err;
>  
> -	err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
> +	/* When boot flag is set, the icm_mng_function_id_mode capability is
> +	 * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
> +	 * for backward compatibility. When boot is false, set mode from
> +	 * cap (set_hca_cap has run successfully).
> +	 */
> +	if (boot) {
> +		mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
> +	} else {
> +		if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
> +		    MLX5_ID_MODE_FUNCTION_VHCA_ID) {

The comment just above says "set mode from cap (set_hca_cap has run
successfully)", which reads as "use the current/negotiated cap value",
but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic check.
Could the comment and the _MAX usage be made consistent?
If anyone later adds a conditional around the MLX5_SET() in
handle_hca_cap() (for example a module parameter), the driver would
start flipping to VHCA_ID mode based on _MAX without actually having
enabled the feature in firmware.

> +			err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
> +			if (err)
> +				return err;
> +			mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
> +		}
> +	}

Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
set in dev->priv and is only reset on the next boot=1 call. Between
teardown and the next boot=1 path (for example during health recovery,
PCI reset, or any async flow that runs page work before
mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request EQE
observe a stale mode? The req_pages EQ notifier is registered via
mlx5_pagealloc_start() and unregistered via mlx5_pagealloc_stop(), so
this is not reachable today, but would it be worth resetting the mode
explicitly on the teardown side rather than relying on the next reinit?

[ ... ]

> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>  	WARN(dev->priv.fw_pages,
>  	     "FW pages counter is %d after reclaiming all pages\n",
>  	     dev->priv.fw_pages);
> +	if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
> +		return 0;
> +

Does this guard address the asymmetric-counter case raised above? It
only skips the per-type WARNs when the eswitch is entirely absent. In
the common case where the eswitch is present but a vport's vhca_id/marks
change between give and reclaim, the counters can still drift and these
WARNs would still fire on normal teardown paths.

[ ... ]