From: Moshe Shemesh <moshe@nvidia.com>
To: Jakub Kicinski <kuba@kernel.org>, <tariqt@nvidia.com>
Cc: <edumazet@google.com>, <pabeni@redhat.com>,
<andrew+netdev@lunn.ch>, <davem@davemloft.net>,
<saeedm@nvidia.com>, <leon@kernel.org>, <mbloch@nvidia.com>,
<agoldberger@nvidia.com>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<gal@nvidia.com>, <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
Date: Mon, 4 May 2026 14:33:57 +0300 [thread overview]
Message-ID: <2b06dc5a-542e-440e-af7e-73c29204cda5@nvidia.com> (raw)
In-Reply-To: <20260503014501.4098393-1-kuba@kernel.org>
On 5/3/2026 4:45 AM, Jakub Kicinski wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: Add VHCA_ID page management mode support
>
> This change adds a VHCA_ID-based page management mode that uses vhca_id
> instead of function_id as the effective function identifier when the
> firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
> semantics for backward compatibility, and tracked pages are migrated to
> the vhca_id key after set_hca_cap().
>
> A few questions below on the per-type counter accounting, the boot-page
> migration helper, and locking on the FW-driven page path.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index e0eafcf0c52a..d3eaefc5c0e0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>> return true;
>> }
>>
>> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
>> +{
>> + struct mlx5_eswitch *esw = dev->priv.eswitch;
>> + struct mlx5_vport *vport;
>> + unsigned long i;
>> + u16 type;
>> +
>> + if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
>> + return MLX5_SELF;
>> +
>> + if (!esw)
>> + return MLX5_FUNC_TYPE_NONE;
>> +
>> + mutex_lock(&esw->state_lock);
>
> This function is reached from give_pages()/reclaim_pages()/
> release_all_pages() via func_vhca_id_to_type(), which in turn runs
> from the pg_wq work handler triggered by firmware page-request EQE
> events. Does acquiring esw->state_lock on that path introduce a new
> lock dependency?
>
> Several eswitch paths (for example mlx5_esw_vport_enable(),
> mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold
> state_lock while synchronously issuing firmware commands. Before this
> patch, the page path held no eswitch locks.
>
> Would it be safer to resolve the func_type outside of state_lock, for
> example by caching the vhca_id-to-type mapping separately, or by
> attaching the resolved type to the fw_page at give time so reclaim
> does not need to look it up again?
>
I don't think there is a real issue with the lock, but for sure better
without, I will try caching vhca_id to type mapping.
>> + mlx5_esw_for_each_vport(esw, i, vport) {
>> + if (vport->vhca_id != vhca_id)
>> + continue;
>> +
>> + if (vport->vport == MLX5_VPORT_HOST_PF) {
>> + type = MLX5_HOST_PF;
>> + goto unlock;
>> + }
>> +
>> + if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
>> + type = MLX5_SF;
>> + goto unlock;
>> + }
>> +
>> + if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
>> + type = MLX5_VF;
>> + goto unlock;
>> + }
>> +
>> + type = MLX5_EC_VF;
>> + goto unlock;
>
> Can this fall-through misattribute counters?
>
> If a matching vport is not HOST_PF and has neither the VPT_SF nor the
> VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
> UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens
> to match end up charged to the EC_VF bucket here?
>
The fallthrough is correct in practice - uplink is the physical port,
not a separate HCA function, FW doesn't request pages for it. ECPF is
the esw manager, if requested here can be only as self, will catch the
MLX5_SELF condition. Adjacent VFs are caught correctly by the VF mark.
>> + }
>> + type = MLX5_FUNC_TYPE_NONE;
>> +unlock:
>> + mutex_unlock(&esw->state_lock);
>> + return type;
>> +}
>> +
>
> [ ... ]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index a242053f3a58..52cf341ad6b3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>> if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>> MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>>
>> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
>> + MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
>> +
>
> [ ... ]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> index 77ffa31cc505..7ebe88aa3b3e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>
> [ ... ]
>
>> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>> return MLX5_SF;
>> }
>>
>> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
>> + bool ec_function)
>> +{
>> + if (mlx5_page_mgt_mode_is_vhca_id(dev))
>> + return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
>> +
>> + return func_id_to_type(dev, func_vhca_id, ec_function);
>> +}
>
> Can the give/reclaim accounting become asymmetric in VHCA_ID mode?
>
> The per-type counters are only updated when the resolved type is not
> MLX5_FUNC_TYPE_NONE:
>
> func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
> if (func_type != MLX5_FUNC_TYPE_NONE)
> dev->priv.page_counters[func_type] += npages;
>
> Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
> dynamically on every call, can the give and the corresponding reclaim
> resolve to different types?
>
> For example, if a give runs before vport->vhca_id is populated in
> mlx5_esw_vport_caps_get(), the lookup returns MLX5_FUNC_TYPE_NONE and
> the counter is not incremented. Later, when the vport is fully
> populated, reclaim resolves to a real type and decrements the counter
> below the amount ever added, which on a u32 drives it to a very large
> value.
>
> Similarly, if a vport is removed before reclaim, the increment at
> give time is recorded but the decrement at reclaim is skipped, so the
> counter leaks upward.
>
> Would caching the func_type on the fw_page at give time and reusing
> it on reclaim make the accounting symmetric by construction?
Following first comment, I will try caching.
>
> [ ... ]
>
>> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>> * req->npages (and not min ()).
>> */
>> req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
>> - req->ec_function = ec_function;
>> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
>> + req->ec_function = ec_function;
>> req->release_all = release_all;
>> INIT_WORK(&req->work, pages_work_handler);
>> queue_work(dev->priv.pg_wq, &req->work);
>> return NOTIFY_OK;
>> }
>>
>> +/*
>> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
>> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
>> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
>> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
>> + */
>> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
>> +{
>> + u32 vhca_id_key, old_key;
>> + struct rb_root *root;
>> + struct fw_page *fwp;
>> + struct rb_node *p;
>> + bool ec_function;
>> + int err;
>> +
>> + if (xa_empty(&dev->priv.page_root_xa))
>> + return 0;
>> +
>> + vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
>> + ec_function = mlx5_core_is_ecpf(dev);
>> +
>> + old_key = ec_function ? (1U << 16) : 0;
>> + root = xa_load(&dev->priv.page_root_xa, old_key);
>> + if (!root)
>> + return 0;
>
> Does this assume the boot-path func_vhca_id was always 0?
>
> The boot call to mlx5_cmd_query_pages() reads func_vhca_id directly
> from the firmware output, and give_pages() then uses that value to
> compute the key. The migration here instead hardcodes old_key as
> ec_function ? (1U << 16) : 0.
>
> If firmware returned a non-zero boot function_id, xa_load(old_key)
> returns NULL, the function silently returns 0, the caller flips the
> mode to VHCA_ID, and the original rb_root is orphaned in page_root_xa
> under the old key. Subsequent free_fwp()/find_fw_page() paths would
> then hit WARN_ON_ONCE(!root) and leak the DMA mappings and pages.
>
> Would it be more robust to look up the actual key used at boot
> (derived from the stored func_vhca_id), and to treat the "xa not
> empty but old_key absent" case as an invariant violation rather than
> silently succeeding?
>
During boot (mlx5_satisfy_startup_pages(dev, 1)), the mode is FUNC_ID.
The mlx5_cmd_query_pages() call sets embedded_cpu_function in the
request but leaves function_id as 0 (the PF queries its own pages).
Firmware returns function_id in the output, which for the PF's own pages
is always 0.
The "silently returns 0" when xa_load returns NULL is also fine, it
means no boot pages were allocated.
>> +
>> + if (old_key == vhca_id_key)
>> + return 0;
>> +
>> + err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
>> + if (err) {
>> + mlx5_core_warn(dev,
>> + "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
>> + old_key, vhca_id_key);
>> + return err;
>> + }
>> +
>> + xa_erase(&dev->priv.page_root_xa, old_key);
>> +
>> + for (p = rb_first(root); p; p = rb_next(p)) {
>> + fwp = rb_entry(p, struct fw_page, rb_node);
>> + fwp->function = vhca_id_key;
>> + }
>
> Is the ordering here safe against any concurrent free_fwp()?
>
> Between xa_erase(old_key) and the loop that updates fwp->function,
> every fw_page still carries the old key while page_root_xa no longer
> resolves it. If a free_fwp() were to run in that window:
>
> root = xa_load(&dev->priv.page_root_xa, fwp->function);
> if (WARN_ON_ONCE(!root))
> return;
>
> it would return early, skipping dma_unmap_page(), __free_page(), and
> kfree(fwp), leaking the DMA mapping and the backing page.
>
> No concurrent free path is structurally reachable today because this
> runs before the EQ notifier is registered in mlx5_pagealloc_start(),
> but would it be cleaner to update the fwp->function values first,
> then swap the xarray entries (or store the new value at a single key)
> so the two views cannot disagree?
>
Reordering is safer, I will reorder.
>> +
>> + return 0;
>> +}
>> +
>> int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>> {
>> - u16 func_id;
>> + bool ec_function = false;
>> + u16 func_vhca_id;
>> s32 npages;
>> int err;
>>
>> - err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
>> + /* When boot flag is set, the icm_mng_function_id_mode capability is
>> + * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
>> + * for backward compatibility. When boot is false, set mode from
>> + * cap (set_hca_cap has run successfully).
>> + */
>> + if (boot) {
>> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
>> + } else {
>> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
>> + MLX5_ID_MODE_FUNCTION_VHCA_ID) {
>
> The comment just above says "set mode from cap (set_hca_cap has run
> successfully)", which reads as "use the current/negotiated cap value",
> but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
> sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
> MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic
> check.
>
> Could the comment and the _MAX usage be made consistent? If anyone
> later adds a conditional around the MLX5_SET() in handle_hca_cap()
> (for example a module parameter), the driver would start flipping to
> VHCA_ID mode based on _MAX without actually having enabled the feature
> in firmware.
OK, I will rewrite the comment.
>
>> + err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
>> + if (err)
>> + return err;
>> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
>> + }
>> + }
>
> Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
> set in dev->priv and is only reset on the next boot=1 call. Between
> teardown and the next boot=1 path (for example during health
> recovery, PCI reset, or any async flow that runs page work before
> mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request
> EQE observe a stale mode?
>
> The req_pages EQ notifier is registered via mlx5_pagealloc_start()
> and unregistered via mlx5_pagealloc_stop(), so this is not reachable
> today, but would it be worth resetting the mode explicitly on the
> teardown side rather than relying on the next reinit?
No. Once flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, both driver and firmware
keep using it till reopen function (next boot=1 call).
>
> [ ... ]
>
>> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>> WARN(dev->priv.fw_pages,
>> "FW pages counter is %d after reclaiming all pages\n",
>> dev->priv.fw_pages);
>> + if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
>> + return 0;
>> +
>
> Does this guard address the asymmetric-counter case raised above?
>
> It only skips the per-type WARNs when the eswitch is entirely absent.
> In the common case where the eswitch is present but a vport's
> vhca_id/marks change between give and reclaim, the counters can still
> drift and these WARNs would still fire on normal teardown paths.
>
Following first comment, will try caching.
> [ ... ]
next prev parent reply other threads:[~2026-05-04 11:34 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-01 4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic Tariq Toukan
2026-05-01 4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
2026-05-03 1:45 ` Jakub Kicinski
2026-05-04 11:33 ` Moshe Shemesh [this message]
2026-05-03 1:45 ` Jakub Kicinski
2026-05-04 11:41 ` Moshe Shemesh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2b06dc5a-542e-440e-af7e-73c29204cda5@nvidia.com \
--to=moshe@nvidia.com \
--cc=agoldberger@nvidia.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=saeedm@nvidia.com \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox