public inbox for netdev@vger.kernel.org
* [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode
@ 2026-05-01  4:41 Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Tariq Toukan @ 2026-05-01  4:41 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Moshe Shemesh, Akiva Goldberger, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Dragos Tatulea

Hi,

Please find a detailed description by Moshe below.

Regards,
Tariq

This series adds driver support for the VHCA_ID page management mode.
When firmware and driver support this mode, ICM (Interconnect Context
Memory) page management uses the device vhca_id as the function
identifier in MANAGE_PAGES, QUERY_PAGES, and page request events instead
of the legacy function_id + ec_function pair.

Background
Firmware can operate page management in two modes:

- FUNC_ID mode (current): Function identity is (function_id, ec_function).
  This remains the default and is used for boot pages and when the new
  mode capability is not set.

- VHCA_ID mode (new): Function identity is vhca_id only; ec_function is
  ignored. This aligns page management with the vhca_id-based model used
  by other firmware commands and simplifies identification on SmartNIC
  and multi-function setups.


Moshe Shemesh (3):
  net/mlx5: Relax capability check for eswitch query paths
  net/mlx5: Make debugfs page counters by function type dynamic
  net/mlx5: Add VHCA_ID page management mode support

 .../net/ethernet/mellanox/mlx5/core/debugfs.c |  39 ++-
 .../ethernet/mellanox/mlx5/core/esw/ipsec.c   |   2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  48 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   7 +
 .../mellanox/mlx5/core/eswitch_offloads.c     |  14 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  10 +-
 .../ethernet/mellanox/mlx5/core/pagealloc.c   | 226 ++++++++++++++----
 include/linux/mlx5/driver.h                   |   9 +
 8 files changed, 289 insertions(+), 66 deletions(-)


base-commit: 4e37987362bcac8909f2d4b4458f3aa645e41641
-- 
2.44.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths
  2026-05-01  4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
@ 2026-05-01  4:41 ` Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
  2 siblings, 0 replies; 8+ messages in thread
From: Tariq Toukan @ 2026-05-01  4:41 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Moshe Shemesh, Akiva Goldberger, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Dragos Tatulea

From: Moshe Shemesh <moshe@nvidia.com>

Several eswitch functions that only query other functions' HCA
capabilities or read cached vport state are guarded by the
vhca_resource_manager capability. This capability is required for
set_hca_cap operations but query_hca_cap of other functions only
requires the vport_group_manager capability.

Relax the capability check from vhca_resource_manager to
vport_group_manager in the following query-only paths:
- mlx5_esw_vport_caps_get() - queries other function general caps
- esw_ipsec_vf_query_generic() - queries other function ipsec cap
- mlx5_devlink_port_fn_migratable_get() - reads cached vport state
- mlx5_devlink_port_fn_roce_get() - reads cached vport state
- mlx5_devlink_port_fn_max_io_eqs_get() - queries other function caps
- mlx5_esw_vport_enable/disable() - vhca_id map/unmap

Functions that also perform set_hca_cap (migratable_set, roce_set,
max_io_eqs_set, esw_ipsec_vf_set_generic, esw_ipsec_vf_set_bytype)
retain the vhca_resource_manager requirement.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/esw/ipsec.c    |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  6 +++---
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 14 ++++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/ipsec.c
index 8b12c3ae0cf7..4811b60ea430 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/ipsec.c
@@ -12,7 +12,7 @@ static int esw_ipsec_vf_query_generic(struct mlx5_core_dev *dev, u16 vport_num,
 	void *hca_cap, *query_cap;
 	int err;
 
-	if (!MLX5_CAP_GEN(dev, vhca_resource_manager))
+	if (!MLX5_CAP_GEN(dev, vport_group_manager))
 		return -EOPNOTSUPP;
 
 	if (!mlx5_esw_ipsec_vf_offload_supported(dev)) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 66a773a99876..e0eafcf0c52a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -806,7 +806,7 @@ static int mlx5_esw_vport_caps_get(struct mlx5_eswitch *esw, struct mlx5_vport *
 	void *hca_caps;
 	int err;
 
-	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager))
+	if (!MLX5_CAP_GEN(esw->dev, vport_group_manager))
 		return 0;
 
 	query_ctx = kzalloc(query_out_sz, GFP_KERNEL);
@@ -938,7 +938,7 @@ int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, struct mlx5_vport *vport,
 		vport->info.trusted = true;
 
 	if (!mlx5_esw_is_manager_vport(esw, vport_num) &&
-	    MLX5_CAP_GEN(esw->dev, vhca_resource_manager)) {
+	    MLX5_CAP_GEN(esw->dev, vport_group_manager)) {
 		ret = mlx5_esw_vport_vhca_id_map(esw, vport);
 		if (ret)
 			goto err_vhca_mapping;
@@ -976,7 +976,7 @@ void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
 		arm_vport_context_events_cmd(esw->dev, vport_num, 0);
 
 	if (!mlx5_esw_is_manager_vport(esw, vport_num) &&
-	    MLX5_CAP_GEN(esw->dev, vhca_resource_manager))
+	    MLX5_CAP_GEN(esw->dev, vport_group_manager))
 		mlx5_esw_vport_vhca_id_unmap(esw, vport);
 
 	if (vport->vport != MLX5_VPORT_HOST_PF &&
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 69ddf56e2fc9..392d8f364db6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -4677,8 +4677,9 @@ int mlx5_devlink_port_fn_migratable_get(struct devlink_port *port, bool *is_enab
 		return -EOPNOTSUPP;
 	}
 
-	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager)) {
-		NL_SET_ERR_MSG_MOD(extack, "Device doesn't support VHCA management");
+	if (!MLX5_CAP_GEN(esw->dev, vport_group_manager)) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Device doesn't support vport group management");
 		return -EOPNOTSUPP;
 	}
 
@@ -4753,8 +4754,9 @@ int mlx5_devlink_port_fn_roce_get(struct devlink_port *port, bool *is_enabled,
 	struct mlx5_eswitch *esw = mlx5_devlink_eswitch_nocheck_get(port->devlink);
 	struct mlx5_vport *vport = mlx5_devlink_port_vport_get(port);
 
-	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager)) {
-		NL_SET_ERR_MSG_MOD(extack, "Device doesn't support VHCA management");
+	if (!MLX5_CAP_GEN(esw->dev, vport_group_manager)) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Device doesn't support vport group management");
 		return -EOPNOTSUPP;
 	}
 
@@ -5076,9 +5078,9 @@ mlx5_devlink_port_fn_max_io_eqs_get(struct devlink_port *port, u32 *max_io_eqs,
 	int err;
 
 	esw = mlx5_devlink_eswitch_nocheck_get(port->devlink);
-	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager)) {
+	if (!MLX5_CAP_GEN(esw->dev, vport_group_manager)) {
 		NL_SET_ERR_MSG_MOD(extack,
-				   "Device doesn't support VHCA management");
+				   "Device doesn't support vport group management");
 		return -EOPNOTSUPP;
 	}
 
-- 
2.44.0



* [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic
  2026-05-01  4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
@ 2026-05-01  4:41 ` Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
  2 siblings, 0 replies; 8+ messages in thread
From: Tariq Toukan @ 2026-05-01  4:41 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Moshe Shemesh, Akiva Goldberger, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Dragos Tatulea

From: Moshe Shemesh <moshe@nvidia.com>

Create the per-function-type debugfs page counters dynamically, after
mlx5_eswitch_init(). When page management operates in VHCA_ID mode, only
a function acting as eSwitch or vport manager can initialize the eSwitch
structure and translate a vhca_id to the function type of the functions
it supplies pages to. The next patch adds support for page management in
VHCA_ID mode.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/debugfs.c | 39 +++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/main.c    |  7 +++-
 include/linux/mlx5/driver.h                   |  2 +
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
index 8fe263190d38..6347957fefcb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
@@ -285,10 +285,6 @@ void mlx5_pages_debugfs_init(struct mlx5_core_dev *dev)
 	pages = dev->priv.dbg.pages_debugfs;
 
 	debugfs_create_u32("fw_pages_total", 0400, pages, &dev->priv.fw_pages);
-	debugfs_create_u32("fw_pages_vfs", 0400, pages, &dev->priv.page_counters[MLX5_VF]);
-	debugfs_create_u32("fw_pages_ec_vfs", 0400, pages, &dev->priv.page_counters[MLX5_EC_VF]);
-	debugfs_create_u32("fw_pages_sfs", 0400, pages, &dev->priv.page_counters[MLX5_SF]);
-	debugfs_create_u32("fw_pages_host_pf", 0400, pages, &dev->priv.page_counters[MLX5_HOST_PF]);
 	debugfs_create_u32("fw_pages_alloc_failed", 0400, pages, &dev->priv.fw_pages_alloc_failed);
 	debugfs_create_u32("fw_pages_give_dropped", 0400, pages, &dev->priv.give_pages_dropped);
 	debugfs_create_u32("fw_pages_reclaim_discard", 0400, pages,
@@ -300,6 +296,41 @@ void mlx5_pages_debugfs_cleanup(struct mlx5_core_dev *dev)
 	debugfs_remove_recursive(dev->priv.dbg.pages_debugfs);
 }
 
+void mlx5_pages_by_func_type_debugfs_init(struct mlx5_core_dev *dev)
+{
+	struct dentry *pages = dev->priv.dbg.pages_debugfs;
+
+	if (!pages)
+		return;
+
+	if (!dev->priv.eswitch &&
+	    MLX5_CAP_GEN(dev, icm_mng_function_id_mode) ==
+	    MLX5_ID_MODE_FUNCTION_VHCA_ID)
+		return;
+
+	debugfs_create_u32("fw_pages_vfs", 0400, pages,
+			   &dev->priv.page_counters[MLX5_VF]);
+	debugfs_create_u32("fw_pages_ec_vfs", 0400, pages,
+			   &dev->priv.page_counters[MLX5_EC_VF]);
+	debugfs_create_u32("fw_pages_sfs", 0400, pages,
+			   &dev->priv.page_counters[MLX5_SF]);
+	debugfs_create_u32("fw_pages_host_pf", 0400, pages,
+			   &dev->priv.page_counters[MLX5_HOST_PF]);
+}
+
+void mlx5_pages_by_func_type_debugfs_cleanup(struct mlx5_core_dev *dev)
+{
+	struct dentry *pages = dev->priv.dbg.pages_debugfs;
+
+	if (!pages)
+		return;
+
+	debugfs_lookup_and_remove("fw_pages_vfs", pages);
+	debugfs_lookup_and_remove("fw_pages_ec_vfs", pages);
+	debugfs_lookup_and_remove("fw_pages_sfs", pages);
+	debugfs_lookup_and_remove("fw_pages_host_pf", pages);
+}
+
 static u64 qp_read_field(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp,
 			 int index, int *is_str)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 74827e8ca125..a242053f3a58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -987,11 +987,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		mlx5_core_err(dev, "Failed to init eswitch %d\n", err);
 		goto err_sriov_cleanup;
 	}
+	mlx5_pages_by_func_type_debugfs_init(dev);
 
 	err = mlx5_fpga_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init fpga device %d\n", err);
-		goto err_eswitch_cleanup;
+		goto err_page_debugfs_cleanup;
 	}
 
 	err = mlx5_vhca_event_init(dev);
@@ -1034,7 +1035,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 	mlx5_vhca_event_cleanup(dev);
 err_fpga_cleanup:
 	mlx5_fpga_cleanup(dev);
-err_eswitch_cleanup:
+err_page_debugfs_cleanup:
+	mlx5_pages_by_func_type_debugfs_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 err_sriov_cleanup:
 	mlx5_sriov_cleanup(dev);
@@ -1072,6 +1074,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_sf_hw_table_cleanup(dev);
 	mlx5_vhca_event_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
+	mlx5_pages_by_func_type_debugfs_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 	mlx5_sriov_cleanup(dev);
 	mlx5_mpfs_cleanup(dev);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 04b96c5abb57..b460b3bae195 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1034,6 +1034,8 @@ void mlx5_pagealloc_start(struct mlx5_core_dev *dev);
 void mlx5_pagealloc_stop(struct mlx5_core_dev *dev);
 void mlx5_pages_debugfs_init(struct mlx5_core_dev *dev);
 void mlx5_pages_debugfs_cleanup(struct mlx5_core_dev *dev);
+void mlx5_pages_by_func_type_debugfs_init(struct mlx5_core_dev *dev);
+void mlx5_pages_by_func_type_debugfs_cleanup(struct mlx5_core_dev *dev);
 int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot);
 int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev);
 void mlx5_register_debugfs(void);
-- 
2.44.0



* [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
  2026-05-01  4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
  2026-05-01  4:41 ` [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic Tariq Toukan
@ 2026-05-01  4:41 ` Tariq Toukan
  2026-05-03  1:45   ` Jakub Kicinski
  2026-05-03  1:45   ` Jakub Kicinski
  2 siblings, 2 replies; 8+ messages in thread
From: Tariq Toukan @ 2026-05-01  4:41 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Moshe Shemesh, Akiva Goldberger, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Dragos Tatulea

From: Moshe Shemesh <moshe@nvidia.com>

Add support for VHCA_ID-based page management mode. When the device
firmware advertises the icm_mng_function_id_mode capability with
MLX5_ID_MODE_FUNCTION_VHCA_ID, page management operations between the
driver and firmware may use vhca_id instead of function_id as the
effective function identifier, and the ec_function field is ignored.

Update the page management commands to set the ec_function field only in
FUNC_ID mode. Boot page allocation always uses FUNC_ID mode semantics
for backward compatibility, as the capability bit becomes available only
after set_hca_cap(). If VHCA_ID mode is set after set_hca_cap(), migrate
the tracking of the boot pages in page_root_xa to the vhca_id key as
well.

Add mlx5_esw_vhca_id_to_func_type() to resolve the function type via
vport lookup in VHCA_ID mode, enabling per-type debugfs counters.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  42 ++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   3 +
 .../ethernet/mellanox/mlx5/core/pagealloc.c   | 226 ++++++++++++++----
 include/linux/mlx5/driver.h                   |   7 +
 5 files changed, 235 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index e0eafcf0c52a..d3eaefc5c0e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
 	return true;
 }
 
+u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	struct mlx5_vport *vport;
+	unsigned long i;
+	u16 type;
+
+	if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
+		return MLX5_SELF;
+
+	if (!esw)
+		return MLX5_FUNC_TYPE_NONE;
+
+	mutex_lock(&esw->state_lock);
+	mlx5_esw_for_each_vport(esw, i, vport) {
+		if (vport->vhca_id != vhca_id)
+			continue;
+
+		if (vport->vport == MLX5_VPORT_HOST_PF) {
+			type = MLX5_HOST_PF;
+			goto unlock;
+		}
+
+		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
+			type = MLX5_SF;
+			goto unlock;
+		}
+
+		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
+			type = MLX5_VF;
+			goto unlock;
+		}
+
+		type = MLX5_EC_VF;
+		goto unlock;
+	}
+	type = MLX5_FUNC_TYPE_NONE;
+unlock:
+	mutex_unlock(&esw->state_lock);
+	return type;
+}
+
 static int esw_vport_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
 {
 	bool vst_mode_steering = esw_vst_mode_is_steering(esw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 2fd601bd102f..5940b4cbfd77 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -863,6 +863,7 @@ void mlx5_esw_vport_vhca_id_unmap(struct mlx5_eswitch *esw,
 				  struct mlx5_vport *vport);
 int mlx5_eswitch_vhca_id_to_vport(struct mlx5_eswitch *esw, u16 vhca_id, u16 *vport_num);
 bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id);
+u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id);
 
 void mlx5_esw_offloads_rep_remove(struct mlx5_eswitch *esw,
 				  const struct mlx5_vport *vport);
@@ -1034,6 +1035,12 @@ mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
 	return false;
 }
 
+static inline u16
+mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
+{
+	return MLX5_FUNC_TYPE_NONE;
+}
+
 static inline void
 mlx5_eswitch_safe_aux_devs_remove(struct mlx5_core_dev *dev) {}
 static inline struct mlx5_flow_handle *
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index a242053f3a58..52cf341ad6b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
 
+	if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
+
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index 77ffa31cc505..7ebe88aa3b3e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -38,6 +38,7 @@
 #include "mlx5_core.h"
 #include "lib/eq.h"
 #include "lib/tout.h"
+#include "eswitch.h"
 
 enum {
 	MLX5_PAGES_CANT_GIVE	= 0,
@@ -69,9 +70,24 @@ enum {
 	MLX5_NUM_4K_IN_PAGE		= PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE,
 };
 
-static u32 get_function(u16 func_id, bool ec_function)
+static bool mlx5_page_mgt_mode_is_vhca_id(const struct mlx5_core_dev *dev)
 {
-	return (u32)func_id | (ec_function << 16);
+	return dev->priv.page_mgt_mode == MLX5_PAGE_MGT_MODE_VHCA_ID;
+}
+
+static void mlx5_page_mgt_mode_set(struct mlx5_core_dev *dev,
+				   enum mlx5_page_mgt_mode mode)
+{
+	dev->priv.page_mgt_mode = mode;
+}
+
+static u32 get_function_key(struct mlx5_core_dev *dev, u16 func_vhca_id,
+			    bool ec_function)
+{
+	if (mlx5_page_mgt_mode_is_vhca_id(dev))
+		return (u32)func_vhca_id;
+
+	return (u32)func_vhca_id | (ec_function << 16);
 }
 
 static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_function)
@@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
 	return MLX5_SF;
 }
 
+static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
+				bool ec_function)
+{
+	if (mlx5_page_mgt_mode_is_vhca_id(dev))
+		return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
+
+	return func_id_to_type(dev, func_vhca_id, ec_function);
+}
+
 static u32 mlx5_get_ec_function(u32 function)
 {
 	return function >> 16;
 }
 
-static u32 mlx5_get_func_id(u32 function)
+static u32 mlx5_get_func_vhca_id(u32 function)
 {
 	return function & 0xffff;
 }
@@ -196,7 +221,7 @@ static struct fw_page *find_fw_page(struct mlx5_core_dev *dev, u64 addr,
 	return result;
 }
 
-static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_id,
+static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_vhca_id,
 				s32 *npages, int boot)
 {
 	u32 out[MLX5_ST_SZ_DW(query_pages_out)] = {};
@@ -207,14 +232,20 @@ static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_id,
 	MLX5_SET(query_pages_in, in, op_mod, boot ?
 		 MLX5_QUERY_PAGES_IN_OP_MOD_BOOT_PAGES :
 		 MLX5_QUERY_PAGES_IN_OP_MOD_INIT_PAGES);
-	MLX5_SET(query_pages_in, in, embedded_cpu_function, mlx5_core_is_ecpf(dev));
+
+	if (mlx5_page_mgt_mode_is_vhca_id(dev))
+		MLX5_SET(query_pages_in, in, function_id,
+			 MLX5_CAP_GEN(dev, vhca_id));
+	else
+		MLX5_SET(query_pages_in, in, embedded_cpu_function,
+			 mlx5_core_is_ecpf(dev));
 
 	err = mlx5_cmd_exec_inout(dev, query_pages, in, out);
 	if (err)
 		return err;
 
 	*npages = MLX5_GET(query_pages_out, out, num_pages);
-	*func_id = MLX5_GET(query_pages_out, out, function_id);
+	*func_vhca_id = MLX5_GET(query_pages_out, out, function_id);
 
 	return err;
 }
@@ -334,7 +365,7 @@ static int alloc_system_page(struct mlx5_core_dev *dev, u32 function)
 	return err;
 }
 
-static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_id,
+static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_vhca_id,
 			     bool ec_function)
 {
 	u32 in[MLX5_ST_SZ_DW(manage_pages_in)] = {};
@@ -342,19 +373,23 @@ static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_id,
 
 	MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
 	MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_CANT_GIVE);
-	MLX5_SET(manage_pages_in, in, function_id, func_id);
-	MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
+	MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
+
+	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
+		MLX5_SET(manage_pages_in, in, embedded_cpu_function,
+			 ec_function);
 
 	err = mlx5_cmd_exec_in(dev, manage_pages, in);
 	if (err)
-		mlx5_core_warn(dev, "page notify failed func_id(%d) err(%d)\n",
-			       func_id, err);
+		mlx5_core_warn(dev,
+			       "page notify failed func_vhca_id(%d) err(%d)\n",
+			       func_vhca_id, err);
 }
 
-static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
+static int give_pages(struct mlx5_core_dev *dev, u16 func_vhca_id, int npages,
 		      int event, bool ec_function)
 {
-	u32 function = get_function(func_id, ec_function);
+	u32 function = get_function_key(dev, func_vhca_id, ec_function);
 	u32 out[MLX5_ST_SZ_DW(manage_pages_out)] = {0};
 	int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);
 	int notify_fail = event;
@@ -390,9 +425,12 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
 
 	MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
 	MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_GIVE);
-	MLX5_SET(manage_pages_in, in, function_id, func_id);
+	MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
 	MLX5_SET(manage_pages_in, in, input_num_entries, npages);
-	MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
+
+	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
+		MLX5_SET(manage_pages_in, in, embedded_cpu_function,
+			 ec_function);
 
 	err = mlx5_cmd_do(dev, in, inlen, out, sizeof(out));
 	if (err == -EREMOTEIO) {
@@ -405,17 +443,20 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
 	}
 	err = mlx5_cmd_check(dev, err, in, out);
 	if (err) {
-		mlx5_core_warn(dev, "func_id 0x%x, npages %d, err %d\n",
-			       func_id, npages, err);
+		mlx5_core_warn(dev, "func_vhca_id 0x%x, npages %d, err %d\n",
+			       func_vhca_id, npages, err);
 		goto out_dropped;
 	}
 
-	func_type = func_id_to_type(dev, func_id, ec_function);
-	dev->priv.page_counters[func_type] += npages;
+	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
+	if (func_type != MLX5_FUNC_TYPE_NONE)
+		dev->priv.page_counters[func_type] += npages;
 	dev->priv.fw_pages += npages;
 
-	mlx5_core_dbg(dev, "npages %d, ec_function %d, func_id 0x%x, err %d\n",
-		      npages, ec_function, func_id, err);
+	mlx5_core_dbg(dev,
+		      "npages %d, ec_function %d, func 0x%x, mode %d, err %d\n",
+		      npages, ec_function, func_vhca_id,
+		      mlx5_page_mgt_mode_is_vhca_id(dev), err);
 
 	kvfree(in);
 	return 0;
@@ -428,14 +469,14 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
 out_free:
 	kvfree(in);
 	if (notify_fail)
-		page_notify_fail(dev, func_id, ec_function);
+		page_notify_fail(dev, func_vhca_id, ec_function);
 	return err;
 }
 
-static void release_all_pages(struct mlx5_core_dev *dev, u16 func_id,
+static void release_all_pages(struct mlx5_core_dev *dev, u16 func_vhca_id,
 			      bool ec_function)
 {
-	u32 function = get_function(func_id, ec_function);
+	u32 function = get_function_key(dev, func_vhca_id, ec_function);
 	struct rb_root *root;
 	struct rb_node *p;
 	int npages = 0;
@@ -454,12 +495,14 @@ static void release_all_pages(struct mlx5_core_dev *dev, u16 func_id,
 		free_fwp(dev, fwp, fwp->free_count);
 	}
 
-	func_type = func_id_to_type(dev, func_id, ec_function);
-	dev->priv.page_counters[func_type] -= npages;
+	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
+	if (func_type != MLX5_FUNC_TYPE_NONE)
+		dev->priv.page_counters[func_type] -= npages;
 	dev->priv.fw_pages -= npages;
 
-	mlx5_core_dbg(dev, "npages %d, ec_function %d, func_id 0x%x\n",
-		      npages, ec_function, func_id);
+	mlx5_core_dbg(dev, "npages %d, ec_function %d, func 0x%x, mode %d\n",
+		      npages, ec_function, func_vhca_id,
+		      mlx5_page_mgt_mode_is_vhca_id(dev));
 }
 
 static u32 fwp_fill_manage_pages_out(struct fw_page *fwp, u32 *out, u32 index,
@@ -487,7 +530,7 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
 	struct fw_page *fwp;
 	struct rb_node *p;
 	bool ec_function;
-	u32 func_id;
+	u32 func_vhca_id;
 	u32 npages;
 	u32 i = 0;
 	int err;
@@ -499,10 +542,11 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
 
 	/* No hard feelings, we want our pages back! */
 	npages = MLX5_GET(manage_pages_in, in, input_num_entries);
-	func_id = MLX5_GET(manage_pages_in, in, function_id);
+	func_vhca_id = MLX5_GET(manage_pages_in, in, function_id);
 	ec_function = MLX5_GET(manage_pages_in, in, embedded_cpu_function);
 
-	root = xa_load(&dev->priv.page_root_xa, get_function(func_id, ec_function));
+	root = xa_load(&dev->priv.page_root_xa,
+		       get_function_key(dev, func_vhca_id, ec_function));
 	if (WARN_ON_ONCE(!root))
 		return -EEXIST;
 
@@ -518,10 +562,11 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
 	return 0;
 }
 
-static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
-			 int *nclaimed, bool event, bool ec_function)
+static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_vhca_id,
+			 int npages, int *nclaimed, bool event,
+			 bool ec_function)
 {
-	u32 function = get_function(func_id, ec_function);
+	u32 function = get_function_key(dev, func_vhca_id, ec_function);
 	int outlen = MLX5_ST_SZ_BYTES(manage_pages_out);
 	u32 in[MLX5_ST_SZ_DW(manage_pages_in)] = {};
 	int num_claimed;
@@ -540,12 +585,16 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
 
 	MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
 	MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_TAKE);
-	MLX5_SET(manage_pages_in, in, function_id, func_id);
+	MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
 	MLX5_SET(manage_pages_in, in, input_num_entries, npages);
-	MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
 
-	mlx5_core_dbg(dev, "func 0x%x, npages %d, outlen %d\n",
-		      func_id, npages, outlen);
+	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
+		MLX5_SET(manage_pages_in, in, embedded_cpu_function,
+			 ec_function);
+
+	mlx5_core_dbg(dev, "func 0x%x, npages %d, outlen %d mode %d\n",
+		      func_vhca_id, npages, outlen,
+		      mlx5_page_mgt_mode_is_vhca_id(dev));
 	err = reclaim_pages_cmd(dev, in, sizeof(in), out, outlen);
 	if (err) {
 		npages = MLX5_GET(manage_pages_in, in, input_num_entries);
@@ -577,8 +626,9 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
 	if (nclaimed)
 		*nclaimed = num_claimed;
 
-	func_type = func_id_to_type(dev, func_id, ec_function);
-	dev->priv.page_counters[func_type] -= num_claimed;
+	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
+	if (func_type != MLX5_FUNC_TYPE_NONE)
+		dev->priv.page_counters[func_type] -= num_claimed;
 	dev->priv.fw_pages -= num_claimed;
 
 out_free:
@@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
 	 * req->npages (and not min ()).
 	 */
 	req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
-	req->ec_function = ec_function;
+	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
+		req->ec_function = ec_function;
 	req->release_all = release_all;
 	INIT_WORK(&req->work, pages_work_handler);
 	queue_work(dev->priv.pg_wq, &req->work);
 	return NOTIFY_OK;
 }
 
+/*
+ * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
+ * VHCA_ID mode. If page_root_xa already has the PF entry from the first
+ * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
+ * entry to the device vhca_id key so lookups use VHCA_ID semantics.
+ */
+static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
+{
+	u32 vhca_id_key, old_key;
+	struct rb_root *root;
+	struct fw_page *fwp;
+	struct rb_node *p;
+	bool ec_function;
+	int err;
+
+	if (xa_empty(&dev->priv.page_root_xa))
+		return 0;
+
+	vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
+	ec_function = mlx5_core_is_ecpf(dev);
+
+	old_key = ec_function ? (1U << 16) : 0;
+	root = xa_load(&dev->priv.page_root_xa, old_key);
+	if (!root)
+		return 0;
+
+	if (old_key == vhca_id_key)
+		return 0;
+
+	err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
+	if (err) {
+		mlx5_core_warn(dev,
+			       "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
+			       old_key, vhca_id_key);
+		return err;
+	}
+
+	xa_erase(&dev->priv.page_root_xa, old_key);
+
+	for (p = rb_first(root); p; p = rb_next(p)) {
+		fwp = rb_entry(p, struct fw_page, rb_node);
+		fwp->function = vhca_id_key;
+	}
+
+	return 0;
+}
+
 int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
 {
-	u16 func_id;
+	bool ec_function = false;
+	u16 func_vhca_id;
 	s32 npages;
 	int err;
 
-	err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
+	/* When boot flag is set, the icm_mng_function_id_mode capability is
+	 * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
+	 * for backward compatibility. When boot is false, set mode from
+	 * cap (set_hca_cap has run successfully).
+	 */
+	if (boot) {
+		mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
+	} else {
+		if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
+		    MLX5_ID_MODE_FUNCTION_VHCA_ID) {
+			err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
+			if (err)
+				return err;
+			mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
+		}
+	}
+
+	err = mlx5_cmd_query_pages(dev, &func_vhca_id, &npages, boot);
 	if (err)
 		return err;
 
-	mlx5_core_dbg(dev, "requested %d %s pages for func_id 0x%x\n",
-		      npages, boot ? "boot" : "init", func_id);
+	mlx5_core_dbg(dev,
+		      "requested %d %s pages for func_vhca_id 0x%x\n",
+		      npages, boot ? "boot" : "init", func_vhca_id);
 
 	if (!npages)
 		return 0;
 
-	return give_pages(dev, func_id, npages, 0, mlx5_core_is_ecpf(dev));
+	/* In VHCA_ID mode, ec_function remains false (not used). */
+	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
+		ec_function = mlx5_core_is_ecpf(dev);
+
+	return give_pages(dev, func_vhca_id, npages, 0, ec_function);
 }
 
 enum {
@@ -709,15 +830,17 @@ static int mlx5_reclaim_root_pages(struct mlx5_core_dev *dev,
 
 	while (!RB_EMPTY_ROOT(root)) {
 		u32 ec_function = mlx5_get_ec_function(function);
-		u32 function_id = mlx5_get_func_id(function);
+		u32 func_vhca_id = mlx5_get_func_vhca_id(function);
 		int nclaimed;
 		int err;
 
-		err = reclaim_pages(dev, function_id, optimal_reclaimed_pages(),
+		err = reclaim_pages(dev, func_vhca_id,
+				    optimal_reclaimed_pages(),
 				    &nclaimed, false, ec_function);
 		if (err) {
-			mlx5_core_warn(dev, "reclaim_pages err (%d) func_id=0x%x ec_func=0x%x\n",
-				       err, function_id, ec_function);
+			mlx5_core_warn(dev,
+				       "reclaim_pages err (%d) func_vhca_id=0x%x ec_func=0x%x\n",
+				       err, func_vhca_id, ec_function);
 			return err;
 		}
 
@@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
 	WARN(dev->priv.fw_pages,
 	     "FW pages counter is %d after reclaiming all pages\n",
 	     dev->priv.fw_pages);
+	if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
+		return 0;
+
 	WARN(dev->priv.page_counters[MLX5_VF],
 	     "VFs FW pages counter is %d after reclaiming all pages\n",
 	     dev->priv.page_counters[MLX5_VF]);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index b460b3bae195..8e22ea662644 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -556,6 +556,12 @@ enum mlx5_func_type {
 	MLX5_HOST_PF,
 	MLX5_EC_VF,
 	MLX5_FUNC_TYPE_NUM,
+	MLX5_FUNC_TYPE_NONE = MLX5_FUNC_TYPE_NUM,
+};
+
+enum mlx5_page_mgt_mode {
+	MLX5_PAGE_MGT_MODE_FUNC_ID,
+	MLX5_PAGE_MGT_MODE_VHCA_ID,
 };
 
 struct mlx5_ft_pool;
@@ -575,6 +581,7 @@ struct mlx5_priv {
 	u32			fw_pages_alloc_failed;
 	u32			give_pages_dropped;
 	u32			reclaim_pages_discard;
+	enum mlx5_page_mgt_mode	page_mgt_mode;
 
 	struct mlx5_core_health health;
 	struct list_head	traps;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
  2026-05-01  4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
@ 2026-05-03  1:45   ` Jakub Kicinski
  2026-05-04 11:33     ` Moshe Shemesh
  2026-05-03  1:45   ` Jakub Kicinski
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2026-05-03  1:45 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	leon, mbloch, moshe, agoldberger, netdev, linux-rdma,
	linux-kernel, gal, dtatulea

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: Add VHCA_ID page management mode support

This change adds a VHCA_ID-based page management mode that uses vhca_id
instead of function_id as the effective function identifier when the
firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
semantics for backward compatibility, and tracked pages are migrated to
the vhca_id key after set_hca_cap().

A few questions below on the per-type counter accounting, the boot-page
migration helper, and locking on the FW-driven page path.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index e0eafcf0c52a..d3eaefc5c0e0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>  	return true;
>  }
>  
> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
> +{
> +	struct mlx5_eswitch *esw = dev->priv.eswitch;
> +	struct mlx5_vport *vport;
> +	unsigned long i;
> +	u16 type;
> +
> +	if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
> +		return MLX5_SELF;
> +
> +	if (!esw)
> +		return MLX5_FUNC_TYPE_NONE;
> +
> +	mutex_lock(&esw->state_lock);

This function is reached from give_pages()/reclaim_pages()/
release_all_pages() via func_vhca_id_to_type(), which in turn runs
from the pg_wq work handler triggered by firmware page-request EQE
events. Does acquiring esw->state_lock on that path introduce a new
lock dependency?

Several eswitch paths (for example mlx5_esw_vport_enable(),
mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold
state_lock while synchronously issuing firmware commands. Before this
patch, the page path held no eswitch locks.

Would it be safer to resolve the func_type outside of state_lock, for
example by caching the vhca_id-to-type mapping separately, or by
attaching the resolved type to the fw_page at give time so reclaim
does not need to look it up again?

> +	mlx5_esw_for_each_vport(esw, i, vport) {
> +		if (vport->vhca_id != vhca_id)
> +			continue;
> +
> +		if (vport->vport == MLX5_VPORT_HOST_PF) {
> +			type = MLX5_HOST_PF;
> +			goto unlock;
> +		}
> +
> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
> +			type = MLX5_SF;
> +			goto unlock;
> +		}
> +
> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
> +			type = MLX5_VF;
> +			goto unlock;
> +		}
> +
> +		type = MLX5_EC_VF;
> +		goto unlock;

Can this fall-through misattribute counters?

If a matching vport is not HOST_PF and has neither the VPT_SF nor the
VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens
to match end up charged to the EC_VF bucket here?

> +	}
> +	type = MLX5_FUNC_TYPE_NONE;
> +unlock:
> +	mutex_unlock(&esw->state_lock);
> +	return type;
> +}
> +

[ ... ]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index a242053f3a58..52cf341ad6b3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>  	if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>  		MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>  
> +	if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
> +		MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
> +

[ ... ]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> index 77ffa31cc505..7ebe88aa3b3e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c

[ ... ]

> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>  	return MLX5_SF;
>  }
>  
> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
> +				bool ec_function)
> +{
> +	if (mlx5_page_mgt_mode_is_vhca_id(dev))
> +		return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
> +
> +	return func_id_to_type(dev, func_vhca_id, ec_function);
> +}

Can the give/reclaim accounting become asymmetric in VHCA_ID mode?

The per-type counters are only updated when the resolved type is not
MLX5_FUNC_TYPE_NONE:

	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
	if (func_type != MLX5_FUNC_TYPE_NONE)
		dev->priv.page_counters[func_type] += npages;

Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
dynamically on every call, can the give and the corresponding reclaim
resolve to different types?

For example, if a give runs before vport->vhca_id is populated in
mlx5_esw_vport_caps_get(), the lookup returns MLX5_FUNC_TYPE_NONE and
the counter is not incremented. Later, when the vport is fully
populated, reclaim resolves to a real type and decrements the counter
below the amount ever added, which on a u32 drives it to a very large
value.

Similarly, if a vport is removed before reclaim, the increment at
give time is recorded but the decrement at reclaim is skipped, so the
counter leaks upward.

Would caching the func_type on the fw_page at give time and reusing
it on reclaim make the accounting symmetric by construction?
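For illustration, the caching idea can be sketched in userspace (all names here are hypothetical stand-ins, not the mlx5 structures): resolve the type once at give time, store it on the page, and reuse it at reclaim, so increment and decrement always hit the same counter regardless of later vport-table changes.

```c
#include <assert.h>
#include <stdint.h>

enum { SIM_VF, SIM_SF, SIM_NONE, SIM_NUM = SIM_NONE };

static uint32_t counters[SIM_NUM];

struct fw_page_sim {
	uint16_t func_type;	/* cached at give time */
};

/* give: resolve the type once and cache it on the page */
static void give_sim(struct fw_page_sim *p, uint16_t resolved_type)
{
	p->func_type = resolved_type;
	if (resolved_type != SIM_NONE)
		counters[resolved_type]++;
}

/* reclaim: reuse the cached type; no second lookup, so the decrement
 * is symmetric with the increment even if the vport table changed
 * in between */
static void reclaim_sim(struct fw_page_sim *p)
{
	if (p->func_type != SIM_NONE)
		counters[p->func_type]--;
}
```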

[ ... ]

> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>  	 * req->npages (and not min ()).
>  	 */
>  	req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
> -	req->ec_function = ec_function;
> +	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> +		req->ec_function = ec_function;
>  	req->release_all = release_all;
>  	INIT_WORK(&req->work, pages_work_handler);
>  	queue_work(dev->priv.pg_wq, &req->work);
>  	return NOTIFY_OK;
>  }
>  
> +/*
> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
> + */
> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
> +{
> +	u32 vhca_id_key, old_key;
> +	struct rb_root *root;
> +	struct fw_page *fwp;
> +	struct rb_node *p;
> +	bool ec_function;
> +	int err;
> +
> +	if (xa_empty(&dev->priv.page_root_xa))
> +		return 0;
> +
> +	vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
> +	ec_function = mlx5_core_is_ecpf(dev);
> +
> +	old_key = ec_function ? (1U << 16) : 0;
> +	root = xa_load(&dev->priv.page_root_xa, old_key);
> +	if (!root)
> +		return 0;

Does this assume the boot-path func_vhca_id was always 0?

The boot call to mlx5_cmd_query_pages() reads func_vhca_id directly
from the firmware output, and give_pages() then uses that value to
compute the key. The migration here instead hardcodes old_key as
ec_function ? (1U << 16) : 0.

If firmware returned a non-zero boot function_id, xa_load(old_key)
returns NULL, the function silently returns 0, the caller flips the
mode to VHCA_ID, and the original rb_root is orphaned in page_root_xa
under the old key. Subsequent free_fwp()/find_fw_page() paths would
then hit WARN_ON_ONCE(!root) and leak the DMA mappings and pages.

Would it be more robust to look up the actual key used at boot
(derived from the stored func_vhca_id), and to treat the "xa not
empty but old_key absent" case as an invariant violation rather than
silently succeeding?

> +
> +	if (old_key == vhca_id_key)
> +		return 0;
> +
> +	err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
> +	if (err) {
> +		mlx5_core_warn(dev,
> +			       "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
> +			       old_key, vhca_id_key);
> +		return err;
> +	}
> +
> +	xa_erase(&dev->priv.page_root_xa, old_key);
> +
> +	for (p = rb_first(root); p; p = rb_next(p)) {
> +		fwp = rb_entry(p, struct fw_page, rb_node);
> +		fwp->function = vhca_id_key;
> +	}

Is the ordering here safe against any concurrent free_fwp()?

Between xa_erase(old_key) and the loop that updates fwp->function,
every fw_page still carries the old key while page_root_xa no longer
resolves it. If a free_fwp() were to run in that window:

	root = xa_load(&dev->priv.page_root_xa, fwp->function);
	if (WARN_ON_ONCE(!root))
		return;

it would return early, skipping dma_unmap_page(), __free_page(), and
kfree(fwp), leaking the DMA mapping and the backing page.

No concurrent free path is structurally reachable today because this
runs before the EQ notifier is registered in mlx5_pagealloc_start(),
but would it be cleaner to update the fwp->function values first,
then swap the xarray entries (or store the new value at a single key)
so the two views cannot disagree?
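For illustration, one ordering in which the two views can never disagree is: publish the root under the new key first (both keys resolve), retag the entries, and only then erase the old key. A userspace sketch, with a toy fixed-size map standing in for page_root_xa and all names hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry { uint32_t key; };
struct slot  { uint32_t key; void *val; int used; };

static struct slot xa_sim[4];

static void *xa_load_sim(uint32_t key)
{
	for (size_t i = 0; i < 4; i++)
		if (xa_sim[i].used && xa_sim[i].key == key)
			return xa_sim[i].val;
	return NULL;
}

static void xa_store_sim(uint32_t key, void *val)
{
	for (size_t i = 0; i < 4; i++) {
		if (!xa_sim[i].used) {
			xa_sim[i] = (struct slot){ key, val, 1 };
			return;
		}
	}
}

static void xa_erase_sim(uint32_t key)
{
	for (size_t i = 0; i < 4; i++)
		if (xa_sim[i].used && xa_sim[i].key == key)
			xa_sim[i].used = 0;
}

static void migrate_safe(struct entry *e, size_t n,
			 uint32_t old_key, uint32_t new_key)
{
	void *root = xa_load_sim(old_key);

	xa_store_sim(new_key, root);	/* both keys now resolve */
	for (size_t i = 0; i < n; i++)	/* retag while both resolve */
		e[i].key = new_key;
	xa_erase_sim(old_key);		/* drop the old key last */
}
```

At every point in migrate_safe(), whichever key an entry carries still resolves in the map, so a lookup by entry key cannot fail mid-migration.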

> +
> +	return 0;
> +}
> +
>  int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>  {
> -	u16 func_id;
> +	bool ec_function = false;
> +	u16 func_vhca_id;
>  	s32 npages;
>  	int err;
>  
> -	err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
> +	/* When boot flag is set, the icm_mng_function_id_mode capability is
> +	 * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
> +	 * for backward compatibility. When boot is false, set mode from
> +	 * cap (set_hca_cap has run successfully).
> +	 */
> +	if (boot) {
> +		mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
> +	} else {
> +		if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
> +		    MLX5_ID_MODE_FUNCTION_VHCA_ID) {

The comment just above says "set mode from cap (set_hca_cap has run
successfully)", which reads as "use the current/negotiated cap value",
but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic
check.

Could the comment and the _MAX usage be made consistent? If anyone
later adds a conditional around the MLX5_SET() in handle_hca_cap()
(for example a module parameter), the driver would start flipping to
VHCA_ID mode based on _MAX without actually having enabled the feature
in firmware.

> +			err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
> +			if (err)
> +				return err;
> +			mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
> +		}
> +	}

Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
set in dev->priv and is only reset on the next boot=1 call. Between
teardown and the next boot=1 path (for example during health
recovery, PCI reset, or any async flow that runs page work before
mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request
EQE observe a stale mode?

The req_pages EQ notifier is registered via mlx5_pagealloc_start()
and unregistered via mlx5_pagealloc_stop(), so this is not reachable
today, but would it be worth resetting the mode explicitly on the
teardown side rather than relying on the next reinit?

[ ... ]

> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>  	WARN(dev->priv.fw_pages,
>  	     "FW pages counter is %d after reclaiming all pages\n",
>  	     dev->priv.fw_pages);
> +	if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
> +		return 0;
> +

Does this guard address the asymmetric-counter case raised above?

It only skips the per-type WARNs when the eswitch is entirely absent.
In the common case where the eswitch is present but a vport's
vhca_id/marks change between give and reclaim, the counters can still
drift and these WARNs would still fire on normal teardown paths.

[ ... ]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
  2026-05-01  4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
  2026-05-03  1:45   ` Jakub Kicinski
@ 2026-05-03  1:45   ` Jakub Kicinski
  2026-05-04 11:41     ` Moshe Shemesh
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2026-05-03  1:45 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	leon, mbloch, moshe, agoldberger, netdev, linux-rdma,
	linux-kernel, gal, dtatulea

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net/mlx5: Add VHCA_ID page management mode support

This commit adds support for VHCA_ID-based page management mode. It
updates page management commands to conditionally use vhca_id and adds
a function to resolve the function type via vport lookup.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>  	return true;
>  }
>  
> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
> +{
> +	struct mlx5_eswitch *esw = dev->priv.eswitch;
> +	struct mlx5_vport *vport;
> +	unsigned long i;
> +	u16 type;
> +
> +	if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
> +		return MLX5_SELF;
> +
> +	if (!esw)
> +		return MLX5_FUNC_TYPE_NONE;
> +
> +	mutex_lock(&esw->state_lock);

Can this race with driver unload?

Since the page allocator workqueue (pg_wq) is asynchronous and destroyed
after the eswitch structure is freed, could a pending page request evaluate
esw as valid just before it's freed, and then attempt to lock it here?

Also, is there a potential deadlock when acquiring esw->state_lock here?

If a thread holds the lock while sending a firmware command, and the
firmware issues a page request to complete it, the pg_wq worker will block
here waiting for the lock. Would this prevent the firmware command from
completing?

[ ... ]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
[ ... ]
> -static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> +static int give_pages(struct mlx5_core_dev *dev, u16 func_vhca_id, int npages,
>  		      int event, bool ec_function)
>  {
> -	u32 function = get_function(func_id, ec_function);
> +	u32 function = get_function_key(dev, func_vhca_id, ec_function);
>  	u32 out[MLX5_ST_SZ_DW(manage_pages_out)] = {0};
>  	int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);

This isn't a new issue introduced by this patch, but does the npages
calculation risk an integer overflow?

The npages parameter is derived from the firmware payload, and the
allocation size is calculated as:

inlen += npages * MLX5_FLD_SZ_BYTES(manage_pages_in, pas[0]);

If a compromised firmware provides a massive npages value, could this
overflow the 32-bit inlen and result in an undersized kvzalloc allocation,
followed by an out-of-bounds write in the subsequent loop?
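To make the concern concrete, a minimal userspace sketch (not the mlx5 code; the header size and entry size are hypothetical) of widening the arithmetic before it can wrap a 32-bit length:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: inlen starts at a small header size and grows by
 * npages * 8 bytes of PAS entries. With 32-bit arithmetic, an npages
 * near 0x20000000 wraps the multiplication and yields an undersized
 * allocation length. Computing in 64 bits and range-checking avoids
 * the wrap. */
#define HDR_BYTES	0x110u	/* hypothetical command header size */
#define PAS_ENTRY	8u	/* one 64-bit physical address per page */

static bool calc_inlen(uint32_t npages, uint32_t *inlen)
{
	uint64_t total = (uint64_t)HDR_BYTES + (uint64_t)npages * PAS_ENTRY;

	if (total > UINT32_MAX)
		return false;	/* would overflow a 32-bit inlen */
	*inlen = (uint32_t)total;
	return true;
}
```

In-kernel, the check_mul_overflow()/size_add() helpers from overflow.h serve the same purpose.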

[ ... ]

> @@ -577,8 +626,9 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
>  	if (nclaimed)
>  		*nclaimed = num_claimed;
>  
> -	func_type = func_id_to_type(dev, func_id, ec_function);
> -	dev->priv.page_counters[func_type] -= num_claimed;
> +	func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
> +	if (func_type != MLX5_FUNC_TYPE_NONE)
> +		dev->priv.page_counters[func_type] -= num_claimed;

Does this dynamically look up the vport during page reclaim?

If an SF or VF has been destroyed and its vport removed, the lookup will
return MLX5_FUNC_TYPE_NONE. Will this cause the decrement to be skipped,
leading to a counter leak and triggering the WARN() statements in
mlx5_reclaim_startup_pages() during driver unload?

[ ... ]

> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
[ ... ]
> +	} else {
> +		if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
> +		    MLX5_ID_MODE_FUNCTION_VHCA_ID) {
> +			err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
> +			if (err)
> +				return err;
> +			mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
> +		}
> +	}

What happens to the device state if mlx5_pagealloc_migrate_pf_to_vhca_id()
fails?

The firmware has already been instructed to use VHCA_ID mode during driver
load, but if the migration fails, the function returns early and leaves
dev->priv.page_mgt_mode as FUNC_ID.

When the initialization aborts and mlx5_reclaim_startup_pages() runs, will
the driver send reclaim commands using the old ID, causing the firmware
to reject them and leak the allocated DMA memory?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
  2026-05-03  1:45   ` Jakub Kicinski
@ 2026-05-04 11:33     ` Moshe Shemesh
  0 siblings, 0 replies; 8+ messages in thread
From: Moshe Shemesh @ 2026-05-04 11:33 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, leon, mbloch,
	agoldberger, netdev, linux-rdma, linux-kernel, gal, dtatulea



On 5/3/2026 4:45 AM, Jakub Kicinski wrote:
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: Add VHCA_ID page management mode support
> 
> This change adds a VHCA_ID-based page management mode that uses vhca_id
> instead of function_id as the effective function identifier when the
> firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
> semantics for backward compatibility, and tracked pages are migrated to
> the vhca_id key after set_hca_cap().
> 
> A few questions below on the per-type counter accounting, the boot-page
> migration helper, and locking on the FW-driven page path.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index e0eafcf0c52a..d3eaefc5c0e0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>>        return true;
>>   }
>>
>> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
>> +{
>> +     struct mlx5_eswitch *esw = dev->priv.eswitch;
>> +     struct mlx5_vport *vport;
>> +     unsigned long i;
>> +     u16 type;
>> +
>> +     if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
>> +             return MLX5_SELF;
>> +
>> +     if (!esw)
>> +             return MLX5_FUNC_TYPE_NONE;
>> +
>> +     mutex_lock(&esw->state_lock);
> 
> This function is reached from give_pages()/reclaim_pages()/
> release_all_pages() via func_vhca_id_to_type(), which in turn runs
> from the pg_wq work handler triggered by firmware page-request EQE
> events. Does acquiring esw->state_lock on that path introduce a new
> lock dependency?
> 
> Several eswitch paths (for example mlx5_esw_vport_enable(),
> mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold
> state_lock while synchronously issuing firmware commands. Before this
> patch, the page path held no eswitch locks.
> 
> Would it be safer to resolve the func_type outside of state_lock, for
> example by caching the vhca_id-to-type mapping separately, or by
> attaching the resolved type to the fw_page at give time so reclaim
> does not need to look it up again?
> 

I don't think there is a real issue with the lock, but it is certainly 
better to avoid it. I will try caching the vhca_id-to-type mapping.

>> +     mlx5_esw_for_each_vport(esw, i, vport) {
>> +             if (vport->vhca_id != vhca_id)
>> +                     continue;
>> +
>> +             if (vport->vport == MLX5_VPORT_HOST_PF) {
>> +                     type = MLX5_HOST_PF;
>> +                     goto unlock;
>> +             }
>> +
>> +             if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
>> +                     type = MLX5_SF;
>> +                     goto unlock;
>> +             }
>> +
>> +             if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
>> +                     type = MLX5_VF;
>> +                     goto unlock;
>> +             }
>> +
>> +             type = MLX5_EC_VF;
>> +             goto unlock;
> 
> Can this fall-through misattribute counters?
> 
> If a matching vport is not HOST_PF and has neither the VPT_SF nor the
> VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
> UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens
> to match end up charged to the EC_VF bucket here?
> 

The fall-through is correct in practice: the uplink is the physical 
port, not a separate HCA function, so FW doesn't request pages for it. 
The ECPF is the eswitch manager; if it requests pages here, it can only 
be for itself, which is caught by the MLX5_SELF condition. Adjacent VFs 
are caught correctly by the VF mark.

>> +     }
>> +     type = MLX5_FUNC_TYPE_NONE;
>> +unlock:
>> +     mutex_unlock(&esw->state_lock);
>> +     return type;
>> +}
>> +
> 
> [ ... ]
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index a242053f3a58..52cf341ad6b3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>>        if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>>                MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>>
>> +     if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
>> +             MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
>> +
> 
> [ ... ]
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> index 77ffa31cc505..7ebe88aa3b3e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> 
> [ ... ]
> 
>> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>>        return MLX5_SF;
>>   }
>>
>> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
>> +                             bool ec_function)
>> +{
>> +     if (mlx5_page_mgt_mode_is_vhca_id(dev))
>> +             return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
>> +
>> +     return func_id_to_type(dev, func_vhca_id, ec_function);
>> +}
> 
> Can the give/reclaim accounting become asymmetric in VHCA_ID mode?
> 
> The per-type counters are only updated when the resolved type is not
> MLX5_FUNC_TYPE_NONE:
> 
>          func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
>          if (func_type != MLX5_FUNC_TYPE_NONE)
>                  dev->priv.page_counters[func_type] += npages;
> 
> Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
> dynamically on every call, can the give and the corresponding reclaim
> resolve to different types?
> 
> For example, if a give runs before vport->vhca_id is populated in
> mlx5_esw_vport_caps_get(), the lookup returns MLX5_FUNC_TYPE_NONE and
> the counter is not incremented. Later, when the vport is fully
> populated, reclaim resolves to a real type and decrements the counter
> below the amount ever added, which on a u32 drives it to a very large
> value.
> 
> Similarly, if a vport is removed before reclaim, the increment at
> give time is recorded but the decrement at reclaim is skipped, so the
> counter leaks upward.
> 
> Would caching the func_type on the fw_page at give time and reusing
> it on reclaim make the accounting symmetric by construction?

Following first comment, I will try caching.

> 
> [ ... ]
> 
>> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>>         * req->npages (and not min ()).
>>         */
>>        req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
>> -     req->ec_function = ec_function;
>> +     if (!mlx5_page_mgt_mode_is_vhca_id(dev))
>> +             req->ec_function = ec_function;
>>        req->release_all = release_all;
>>        INIT_WORK(&req->work, pages_work_handler);
>>        queue_work(dev->priv.pg_wq, &req->work);
>>        return NOTIFY_OK;
>>   }
>>
>> +/*
>> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
>> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
>> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
>> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
>> + */
>> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
>> +{
>> +     u32 vhca_id_key, old_key;
>> +     struct rb_root *root;
>> +     struct fw_page *fwp;
>> +     struct rb_node *p;
>> +     bool ec_function;
>> +     int err;
>> +
>> +     if (xa_empty(&dev->priv.page_root_xa))
>> +             return 0;
>> +
>> +     vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
>> +     ec_function = mlx5_core_is_ecpf(dev);
>> +
>> +     old_key = ec_function ? (1U << 16) : 0;
>> +     root = xa_load(&dev->priv.page_root_xa, old_key);
>> +     if (!root)
>> +             return 0;
> 
> Does this assume the boot-path func_vhca_id was always 0?
> 
> The boot call to mlx5_cmd_query_pages() reads func_vhca_id directly
> from the firmware output, and give_pages() then uses that value to
> compute the key. The migration here instead hardcodes old_key as
> ec_function ? (1U << 16) : 0.
> 
> If firmware returned a non-zero boot function_id, xa_load(old_key)
> returns NULL, the function silently returns 0, the caller flips the
> mode to VHCA_ID, and the original rb_root is orphaned in page_root_xa
> under the old key. Subsequent free_fwp()/find_fw_page() paths would
> then hit WARN_ON_ONCE(!root) and leak the DMA mappings and pages.
> 
> Would it be more robust to look up the actual key used at boot
> (derived from the stored func_vhca_id), and to treat the "xa not
> empty but old_key absent" case as an invariant violation rather than
> silently succeeding?
> 

During boot (mlx5_satisfy_startup_pages(dev, 1)), the mode is FUNC_ID. 
The mlx5_cmd_query_pages() call sets embedded_cpu_function in the 
request but leaves function_id as 0 (the PF queries its own pages). 
Firmware returns function_id in the output, which for the PF's own pages 
is always 0.
The "silently returns 0" case when xa_load() returns NULL is also fine; 
it means no boot pages were allocated.

>> +
>> +     if (old_key == vhca_id_key)
>> +             return 0;
>> +
>> +     err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
>> +     if (err) {
>> +             mlx5_core_warn(dev,
>> +                            "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
>> +                            old_key, vhca_id_key);
>> +             return err;
>> +     }
>> +
>> +     xa_erase(&dev->priv.page_root_xa, old_key);
>> +
>> +     for (p = rb_first(root); p; p = rb_next(p)) {
>> +             fwp = rb_entry(p, struct fw_page, rb_node);
>> +             fwp->function = vhca_id_key;
>> +     }
> 
> Is the ordering here safe against any concurrent free_fwp()?
> 
> Between xa_erase(old_key) and the loop that updates fwp->function,
> every fw_page still carries the old key while page_root_xa no longer
> resolves it. If a free_fwp() were to run in that window:
> 
>          root = xa_load(&dev->priv.page_root_xa, fwp->function);
>          if (WARN_ON_ONCE(!root))
>                  return;
> 
> it would return early, skipping dma_unmap_page(), __free_page(), and
> kfree(fwp), leaking the DMA mapping and the backing page.
> 
> No concurrent free path is structurally reachable today because this
> runs before the EQ notifier is registered in mlx5_pagealloc_start(),
> but would it be cleaner to update the fwp->function values first,
> then swap the xarray entries (or store the new value at a single key)
> so the two views cannot disagree?
> 

Reordering is safer; I will reorder.
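Roughly, one reorder-safe sequence keeps both keys resolvable while the 
pages are retagged: publish the root under the new key first, retag every 
page, and only then retire the old key. Userspace sketch only — page_root[] 
stands in for page_root_xa, and all names and sizes are illustrative, not 
the driver's:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the migration: page_root[] plays the role of
 * page_root_xa, and struct fw_page keeps only its 'function' key. */
#define MAX_KEYS 64

struct fw_page {
	unsigned int function;		/* key this page resolves under */
	struct fw_page *next;
};

static struct fw_page *page_root[MAX_KEYS];	/* key -> page list */

static int migrate_root(unsigned int old_key, unsigned int new_key)
{
	struct fw_page *root = page_root[old_key];
	struct fw_page *p;

	if (!root || old_key == new_key)
		return 0;
	if (page_root[new_key])
		return -1;			/* like xa_insert() failing */

	page_root[new_key] = root;		/* 1: both keys now resolve */
	for (p = root; p; p = p->next)
		p->function = new_key;		/* 2: retag every page */
	page_root[old_key] = NULL;		/* 3: retire the old key */
	return 0;
}
```

With this ordering, a free path that looks up fwp->function always finds 
the root, whether it observes the old key or the new one.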

>> +
>> +     return 0;
>> +}
>> +
>>   int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>>   {
>> -     u16 func_id;
>> +     bool ec_function = false;
>> +     u16 func_vhca_id;
>>        s32 npages;
>>        int err;
>>
>> -     err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
>> +     /* When boot flag is set, the icm_mng_function_id_mode capability is
>> +      * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
>> +      * for backward compatibility. When boot is false, set mode from
>> +      * cap (set_hca_cap has run successfully).
>> +      */
>> +     if (boot) {
>> +             mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
>> +     } else {
>> +             if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
>> +                 MLX5_ID_MODE_FUNCTION_VHCA_ID) {
> 
> The comment just above says "set mode from cap (set_hca_cap has run
> successfully)", which reads as "use the current/negotiated cap value",
> but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
> sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
> MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic
> check.
> 
> Could the comment and the _MAX usage be made consistent? If anyone
> later adds a conditional around the MLX5_SET() in handle_hca_cap()
> (for example a module parameter), the driver would start flipping to
> VHCA_ID mode based on _MAX without actually having enabled the feature
> in firmware.

OK, I will rewrite the comment.

> 
>> +                     err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
>> +                     if (err)
>> +                             return err;
>> +                     mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
>> +             }
>> +     }
> 
> Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
> set in dev->priv and is only reset on the next boot=1 call. Between
> teardown and the next boot=1 path (for example during health
> recovery, PCI reset, or any async flow that runs page work before
> mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request
> EQE observe a stale mode?
> 
> The req_pages EQ notifier is registered via mlx5_pagealloc_start()
> and unregistered via mlx5_pagealloc_stop(), so this is not reachable
> today, but would it be worth resetting the mode explicitly on the
> teardown side rather than relying on the next reinit?

No. Once flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, both driver and firmware 
keep using it until the function is reopened (the next boot=1 call).
> 
> [ ... ]
> 
>> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>>        WARN(dev->priv.fw_pages,
>>             "FW pages counter is %d after reclaiming all pages\n",
>>             dev->priv.fw_pages);
>> +     if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
>> +             return 0;
>> +
> 
> Does this guard address the asymmetric-counter case raised above?
> 
> It only skips the per-type WARNs when the eswitch is entirely absent.
> In the common case where the eswitch is present but a vport's
> vhca_id/marks change between give and reclaim, the counters can still
> drift and these WARNs would still fire on normal teardown paths.
> 

Following the first comment, I will try caching.

> [ ... ]


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
  2026-05-03  1:45   ` Jakub Kicinski
@ 2026-05-04 11:41     ` Moshe Shemesh
  0 siblings, 0 replies; 8+ messages in thread
From: Moshe Shemesh @ 2026-05-04 11:41 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, leon, mbloch,
	agoldberger, netdev, linux-rdma, linux-kernel, gal, dtatulea



On 5/3/2026 4:45 AM, Jakub Kicinski wrote:
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> Subject: net/mlx5: Add VHCA_ID page management mode support
> 
> This commit adds support for VHCA_ID-based page management mode. It
> updates page management commands to conditionally use vhca_id and adds
> a function to resolve the function type via vport lookup.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>>        return true;
>>   }
>>
>> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
>> +{
>> +     struct mlx5_eswitch *esw = dev->priv.eswitch;
>> +     struct mlx5_vport *vport;
>> +     unsigned long i;
>> +     u16 type;
>> +
>> +     if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
>> +             return MLX5_SELF;
>> +
>> +     if (!esw)
>> +             return MLX5_FUNC_TYPE_NONE;
>> +
>> +     mutex_lock(&esw->state_lock);
> 
> Can this race with driver unload?
> 
> Since the page allocator workqueue (pg_wq) is asynchronous and destroyed
> after the eswitch structure is freed, could a pending page request evaluate
> esw as valid just before it's freed, and then attempt to lock it here?
> 
> Also, is there a potential deadlock when acquiring esw->state_lock here?
> 
> If a thread holds the lock while sending a firmware command, and the
> firmware issues a page request to complete it, the pg_wq worker will block
> here waiting for the lock. Would this prevent the firmware command from
> completing?

In give_pages(), mlx5_esw_vhca_id_to_func_type() is called only after the 
firmware got the pages. Following the comments in the other mail, I will 
use caching instead of the lock.
> 
> [ ... ]
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> [ ... ]
>> -static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
>> +static int give_pages(struct mlx5_core_dev *dev, u16 func_vhca_id, int npages,
>>                      int event, bool ec_function)
>>   {
>> -     u32 function = get_function(func_id, ec_function);
>> +     u32 function = get_function_key(dev, func_vhca_id, ec_function);
>>        u32 out[MLX5_ST_SZ_DW(manage_pages_out)] = {0};
>>        int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);
> 
> This isn't a new issue introduced by this patch, but does the npages
> calculation risk an integer overflow?
> 
> The npages parameter is derived from the firmware payload, and the
> allocation size is calculated as:
> 
> inlen += npages * MLX5_FLD_SZ_BYTES(manage_pages_in, pas[0]);
> 
> If a compromised firmware provides a massive npages value, could this
> overflow the 32-bit inlen and result in an undersized kvzalloc allocation,
> followed by an out-of-bounds write in the subsequent loop?

Technically yes, but as Sashiko writes here, it is not relevant to this 
patch or this series. It could be addressed in a separate hardening patch.
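For the record, such a hardening patch could bound npages and size the 
buffer with an overflow-checked multiply. A userspace sketch — the real 
sizes come from MLX5_ST_SZ_BYTES()/MLX5_FLD_SZ_BYTES(), so the constants 
below are assumptions, and __builtin_mul_overflow() stands in for the 
kernel's check_mul_overflow() helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define MANAGE_PAGES_IN_BASE	272	/* base command struct size */
#define PAS_ENTRY_SZ		8	/* one 64-bit page address  */

/* Returns the buffer length to allocate, or 0 if npages is negative
 * or the computation would overflow. */
static size_t manage_pages_inlen(int npages)
{
	size_t pas_bytes;

	if (npages < 0)
		return 0;
	/* Userspace analogue of the kernel's check_mul_overflow(). */
	if (__builtin_mul_overflow((size_t)npages, (size_t)PAS_ENTRY_SZ,
				   &pas_bytes))
		return 0;
	if (pas_bytes > SIZE_MAX - MANAGE_PAGES_IN_BASE)
		return 0;
	return MANAGE_PAGES_IN_BASE + pas_bytes;
}
```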

> 
> [ ... ]
> 
>> @@ -577,8 +626,9 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
>>        if (nclaimed)
>>                *nclaimed = num_claimed;
>>
>> -     func_type = func_id_to_type(dev, func_id, ec_function);
>> -     dev->priv.page_counters[func_type] -= num_claimed;
>> +     func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
>> +     if (func_type != MLX5_FUNC_TYPE_NONE)
>> +             dev->priv.page_counters[func_type] -= num_claimed;
> 
> Does this dynamically look up the vport during page reclaim?
> 
> If an SF or VF has been destroyed and its vport removed, the lookup will
> return MLX5_FUNC_TYPE_NONE. Will this cause the decrement to be skipped,
> leading to a counter leak and triggering the WARN() statements in
> mlx5_reclaim_startup_pages() during driver unload?

VF and SF vports are removed after reclaiming pages, but as I already 
wrote in the other mail, I will try caching.
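A minimal sketch of what such a cache could look like — all names here 
are hypothetical, not the actual driver API. The idea is that 
give_pages() stores the type resolved via the (locked) vport lookup, and 
reclaim_pages() reads it lock-free, so the type survives VF/SF vport 
removal:

```c
#include <assert.h>

/* Hypothetical vhca_id -> function-type cache: filled when pages are
 * given, read at reclaim time without touching the eswitch. */
enum func_type { FUNC_TYPE_NONE, FUNC_TYPE_PF, FUNC_TYPE_VF, FUNC_TYPE_SF };

#define CACHE_SLOTS 256

struct type_cache {
	unsigned short vhca_id[CACHE_SLOTS];
	unsigned char type[CACHE_SLOTS];	/* FUNC_TYPE_NONE == empty */
};

static struct type_cache cache;

static void cache_store(struct type_cache *c, unsigned short vhca_id,
			enum func_type t)
{
	unsigned int slot = vhca_id % CACHE_SLOTS;

	c->vhca_id[slot] = vhca_id;
	c->type[slot] = (unsigned char)t;
}

static enum func_type cache_lookup(const struct type_cache *c,
				   unsigned short vhca_id)
{
	unsigned int slot = vhca_id % CACHE_SLOTS;

	if (c->type[slot] == FUNC_TYPE_NONE || c->vhca_id[slot] != vhca_id)
		return FUNC_TYPE_NONE;
	return (enum func_type)c->type[slot];
}
```

The real patch would of course have to size the table and handle slot 
collisions against the driver's actual vhca_id space.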

> 
> [ ... ]
> 
>> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
> [ ... ]
>> +     } else {
>> +             if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
>> +                 MLX5_ID_MODE_FUNCTION_VHCA_ID) {
>> +                     err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
>> +                     if (err)
>> +                             return err;
>> +                     mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
>> +             }
>> +     }
> 
> What happens to the device state if mlx5_pagealloc_migrate_pf_to_vhca_id()
> fails?
> 
> The firmware has already been instructed to use VHCA_ID mode during driver
> load, but if the migration fails, the function returns early and leaves
> dev->priv.page_mgt_mode as FUNC_ID.
> 
> When the initialization aborts and mlx5_reclaim_startup_pages() runs, will
> the driver send reclaim commands using the old ID, causing the firmware
> to reject them and leak the allocated DMA memory?

mlx5_pagealloc_migrate_pf_to_vhca_id() can only fail on xa_insert() of 
the new key, i.e. an OOM where we couldn't set the new key. If it does 
fail, device init fails entirely anyway.




end of thread, other threads:[~2026-05-04 11:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-01  4:41 [PATCH net-next 0/3] net/mlx5: ICM page management in VHCA_ID mode Tariq Toukan
2026-05-01  4:41 ` [PATCH net-next 1/3] net/mlx5: Relax capability check for eswitch query paths Tariq Toukan
2026-05-01  4:41 ` [PATCH net-next 2/3] net/mlx5: Make debugfs page counters by function type dynamic Tariq Toukan
2026-05-01  4:41 ` [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support Tariq Toukan
2026-05-03  1:45   ` Jakub Kicinski
2026-05-04 11:33     ` Moshe Shemesh
2026-05-03  1:45   ` Jakub Kicinski
2026-05-04 11:41     ` Moshe Shemesh
