From: Moshe Shemesh <moshe@nvidia.com>
To: Jakub Kicinski <kuba@kernel.org>, <tariqt@nvidia.com>
Cc: <edumazet@google.com>, <pabeni@redhat.com>,
<andrew+netdev@lunn.ch>, <davem@davemloft.net>,
<saeedm@nvidia.com>, <leon@kernel.org>, <mbloch@nvidia.com>,
<agoldberger@nvidia.com>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<gal@nvidia.com>, <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
Date: Mon, 4 May 2026 14:33:57 +0300 [thread overview]
Message-ID: <2b06dc5a-542e-440e-af7e-73c29204cda5@nvidia.com> (raw)
In-Reply-To: <20260503014501.4098393-1-kuba@kernel.org>
On 5/3/2026 4:45 AM, Jakub Kicinski wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: Add VHCA_ID page management mode support
>
> This change adds a VHCA_ID-based page management mode that uses vhca_id
> instead of function_id as the effective function identifier when the
> firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
> semantics for backward compatibility, and tracked pages are migrated to
> the vhca_id key after set_hca_cap().
>
> A few questions below on the per-type counter accounting, the boot-page
> migration helper, and locking on the FW-driven page path.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index e0eafcf0c52a..d3eaefc5c0e0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>> return true;
>> }
>>
>> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
>> +{
>> + struct mlx5_eswitch *esw = dev->priv.eswitch;
>> + struct mlx5_vport *vport;
>> + unsigned long i;
>> + u16 type;
>> +
>> + if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
>> + return MLX5_SELF;
>> +
>> + if (!esw)
>> + return MLX5_FUNC_TYPE_NONE;
>> +
>> + mutex_lock(&esw->state_lock);
>
> This function is reached from give_pages()/reclaim_pages()/
> release_all_pages() via func_vhca_id_to_type(), which in turn runs
> from the pg_wq work handler triggered by firmware page-request EQE
> events. Does acquiring esw->state_lock on that path introduce a new
> lock dependency?
>
> Several eswitch paths (for example mlx5_esw_vport_enable(),
> mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold
> state_lock while synchronously issuing firmware commands. Before this
> patch, the page path held no eswitch locks.
>
> Would it be safer to resolve the func_type outside of state_lock, for
> example by caching the vhca_id-to-type mapping separately, or by
> attaching the resolved type to the fw_page at give time so reclaim
> does not need to look it up again?
>
I don't think there is a real issue with the lock, but for sure better
without, I will try caching vhca_id to type mapping.
>> + mlx5_esw_for_each_vport(esw, i, vport) {
>> + if (vport->vhca_id != vhca_id)
>> + continue;
>> +
>> + if (vport->vport == MLX5_VPORT_HOST_PF) {
>> + type = MLX5_HOST_PF;
>> + goto unlock;
>> + }
>> +
>> + if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
>> + type = MLX5_SF;
>> + goto unlock;
>> + }
>> +
>> + if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
>> + type = MLX5_VF;
>> + goto unlock;
>> + }
>> +
>> + type = MLX5_EC_VF;
>> + goto unlock;
>
> Can this fall-through misattribute counters?
>
> If a matching vport is not HOST_PF and has neither the VPT_SF nor the
> VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
> UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens
> to match end up charged to the EC_VF bucket here?
>
The fallthrough is correct in practice - uplink is the physical port,
not a separate HCA function, FW doesn't request pages for it. ECPF is
the esw manager, if requested here can be only as self, will catch the
MLX5_SELF condition. Adjacent VFs are caught correctly by the VF mark.
>> + }
>> + type = MLX5_FUNC_TYPE_NONE;
>> +unlock:
>> + mutex_unlock(&esw->state_lock);
>> + return type;
>> +}
>> +
>
> [ ... ]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index a242053f3a58..52cf341ad6b3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>> if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>> MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>>
>> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
>> + MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
>> +
>
> [ ... ]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> index 77ffa31cc505..7ebe88aa3b3e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>
> [ ... ]
>
>> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>> return MLX5_SF;
>> }
>>
>> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
>> + bool ec_function)
>> +{
>> + if (mlx5_page_mgt_mode_is_vhca_id(dev))
>> + return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
>> +
>> + return func_id_to_type(dev, func_vhca_id, ec_function);
>> +}
>
> Can the give/reclaim accounting become asymmetric in VHCA_ID mode?
>
> The per-type counters are only updated when the resolved type is not
> MLX5_FUNC_TYPE_NONE:
>
> func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
> if (func_type != MLX5_FUNC_TYPE_NONE)
> dev->priv.page_counters[func_type] += npages;
>
> Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
> dynamically on every call, can the give and the corresponding reclaim
> resolve to different types?
>
> For example, if a give runs before vport->vhca_id is populated in
> mlx5_esw_vport_caps_get(), the lookup returns MLX5_FUNC_TYPE_NONE and
> the counter is not incremented. Later, when the vport is fully
> populated, reclaim resolves to a real type and decrements the counter
> below the amount ever added, which on a u32 drives it to a very large
> value.
>
> Similarly, if a vport is removed before reclaim, the increment at
> give time is recorded but the decrement at reclaim is skipped, so the
> counter leaks upward.
>
> Would caching the func_type on the fw_page at give time and reusing
> it on reclaim make the accounting symmetric by construction?
Following first comment, I will try caching.
>
> [ ... ]
>
>> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>> * req->npages (and not min ()).
>> */
>> req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
>> - req->ec_function = ec_function;
>> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
>> + req->ec_function = ec_function;
>> req->release_all = release_all;
>> INIT_WORK(&req->work, pages_work_handler);
>> queue_work(dev->priv.pg_wq, &req->work);
>> return NOTIFY_OK;
>> }
>>
>> +/*
>> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
>> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
>> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
>> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
>> + */
>> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
>> +{
>> + u32 vhca_id_key, old_key;
>> + struct rb_root *root;
>> + struct fw_page *fwp;
>> + struct rb_node *p;
>> + bool ec_function;
>> + int err;
>> +
>> + if (xa_empty(&dev->priv.page_root_xa))
>> + return 0;
>> +
>> + vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
>> + ec_function = mlx5_core_is_ecpf(dev);
>> +
>> + old_key = ec_function ? (1U << 16) : 0;
>> + root = xa_load(&dev->priv.page_root_xa, old_key);
>> + if (!root)
>> + return 0;
>
> Does this assume the boot-path func_vhca_id was always 0?
>
> The boot call to mlx5_cmd_query_pages() reads func_vhca_id directly
> from the firmware output, and give_pages() then uses that value to
> compute the key. The migration here instead hardcodes old_key as
> ec_function ? (1U << 16) : 0.
>
> If firmware returned a non-zero boot function_id, xa_load(old_key)
> returns NULL, the function silently returns 0, the caller flips the
> mode to VHCA_ID, and the original rb_root is orphaned in page_root_xa
> under the old key. Subsequent free_fwp()/find_fw_page() paths would
> then hit WARN_ON_ONCE(!root) and leak the DMA mappings and pages.
>
> Would it be more robust to look up the actual key used at boot
> (derived from the stored func_vhca_id), and to treat the "xa not
> empty but old_key absent" case as an invariant violation rather than
> silently succeeding?
>
During boot (mlx5_satisfy_startup_pages(dev, 1)), the mode is FUNC_ID.
The mlx5_cmd_query_pages() call sets embedded_cpu_function in the
request but leaves function_id as 0 (the PF queries its own pages).
Firmware returns function_id in the output, which for the PF's own pages
is always 0.
The "silently returns 0" when xa_load returns NULL is also fine, it
means no boot pages were allocated.
>> +
>> + if (old_key == vhca_id_key)
>> + return 0;
>> +
>> + err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
>> + if (err) {
>> + mlx5_core_warn(dev,
>> + "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
>> + old_key, vhca_id_key);
>> + return err;
>> + }
>> +
>> + xa_erase(&dev->priv.page_root_xa, old_key);
>> +
>> + for (p = rb_first(root); p; p = rb_next(p)) {
>> + fwp = rb_entry(p, struct fw_page, rb_node);
>> + fwp->function = vhca_id_key;
>> + }
>
> Is the ordering here safe against any concurrent free_fwp()?
>
> Between xa_erase(old_key) and the loop that updates fwp->function,
> every fw_page still carries the old key while page_root_xa no longer
> resolves it. If a free_fwp() were to run in that window:
>
> root = xa_load(&dev->priv.page_root_xa, fwp->function);
> if (WARN_ON_ONCE(!root))
> return;
>
> it would return early, skipping dma_unmap_page(), __free_page(), and
> kfree(fwp), leaking the DMA mapping and the backing page.
>
> No concurrent free path is structurally reachable today because this
> runs before the EQ notifier is registered in mlx5_pagealloc_start(),
> but would it be cleaner to update the fwp->function values first,
> then swap the xarray entries (or store the new value at a single key)
> so the two views cannot disagree?
>
Reordering is safer, I will reorder.
>> +
>> + return 0;
>> +}
>> +
>> int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>> {
>> - u16 func_id;
>> + bool ec_function = false;
>> + u16 func_vhca_id;
>> s32 npages;
>> int err;
>>
>> - err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
>> + /* When boot flag is set, the icm_mng_function_id_mode capability is
>> + * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
>> + * for backward compatibility. When boot is false, set mode from
>> + * cap (set_hca_cap has run successfully).
>> + */
>> + if (boot) {
>> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
>> + } else {
>> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
>> + MLX5_ID_MODE_FUNCTION_VHCA_ID) {
>
> The comment just above says "set mode from cap (set_hca_cap has run
> successfully)", which reads as "use the current/negotiated cap value",
> but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
> sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
> MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic
> check.
>
> Could the comment and the _MAX usage be made consistent? If anyone
> later adds a conditional around the MLX5_SET() in handle_hca_cap()
> (for example a module parameter), the driver would start flipping to
> VHCA_ID mode based on _MAX without actually having enabled the feature
> in firmware.
OK, I will rewrite the comment.
>
>> + err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
>> + if (err)
>> + return err;
>> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
>> + }
>> + }
>
> Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
> set in dev->priv and is only reset on the next boot=1 call. Between
> teardown and the next boot=1 path (for example during health
> recovery, PCI reset, or any async flow that runs page work before
> mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request
> EQE observe a stale mode?
>
> The req_pages EQ notifier is registered via mlx5_pagealloc_start()
> and unregistered via mlx5_pagealloc_stop(), so this is not reachable
> today, but would it be worth resetting the mode explicitly on the
> teardown side rather than relying on the next reinit?
No. Once flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, both driver and firmware
keep using it till reopen function (next boot=1 call).
>
> [ ... ]
>
>> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>> WARN(dev->priv.fw_pages,
>> "FW pages counter is %d after reclaiming all pages\n",
>> dev->priv.fw_pages);
>> + if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
>> + return 0;
>> +
>
> Does this guard address the asymmetric-counter case raised above?
>
> It only skips the per-type WARNs when the eswitch is entirely absent.
> In the common case where the eswitch is present but a vport's
> vhca_id/marks change between give and reclaim, the counters can still
> drift and these WARNs would still fire on normal teardown paths.
>
Following first comment, will try caching.
> [ ... ]
next prev parent reply other threads:[~2026-05-04 11:34 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-01 4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
2026-05-03 1:45 ` Jakub Kicinski
2026-05-04 11:33 ` Moshe Shemesh [this message]
2026-05-03 1:45 ` Jakub Kicinski
2026-05-04 11:41 ` Moshe Shemesh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2b06dc5a-542e-440e-af7e-73c29204cda5@nvidia.com \
--to=moshe@nvidia.com \
--cc=agoldberger@nvidia.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=saeedm@nvidia.com \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.