public inbox for netdev@vger.kernel.org
* [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08
@ 2026-03-08  6:55 Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 1/8] net/mlx5: Add IFC bits for shared headroom pool PBMC support Tariq Toukan
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

Hi,

This series contains mlx5 shared updates as preparation for upcoming
features.

First patch by Alex contains IFC changes as preparation for an upcoming
feature.

Patches 2 and up by Shay introduce mlx5 infrastructure for SD switchdev
and LAG support.
Detailed description by Shay below.

Regards,
Tariq

This series adds shared infrastructure to enable Socket Direct (SD)
single-netdev switchdev transition and LAG support in subsequent patches.

Currently, LAG is not supported in Socket Direct configurations, and
BlueField-3/4 devices using SD for North-South traffic operate with two
distinct eSwitches per physical port. This forces separate IP and MAC
addresses for each NUMA node, complicating network configuration and
requiring firmware to handle MPFS with different inner and outer
packets for communication.

The goal is to expose a single external IP address (and a single MAC
address) per physical port while maintaining SD's bandwidth and latency
benefits. This means having a single eSwitch per physical port that
manages all the port's devices, via a merged eSwitch with multiple
vports. This enables creating a single FDB, which results in a single
RDMA device to be used by DOCA/HWS/OVS.

To achieve this, the LAG infrastructure needs changes since the current
implementation assumes a fixed mapping between device indices and LAG
ports, which breaks with SD's multi-device-per-port model.

This series prepares the groundwork by:

1. Adding IFC bits for silent mode query and VHCA RX destination type,
   needed for SD device coordination and cross-VHCA traffic steering.

2. Converting the LAG pf array to an xarray and using xa_alloc for
   dynamic index management. This decouples LAG indexing from physical
   device indices, allowing flexible device membership (see the sketch
   after this list).

3. Converting the peer_miss_rule array to an xarray keyed by vhca_id.

4. Introducing LAG variants of the device index helpers that produce
   unique identifiers even when multiple devices share the same
   physical port.

5. Adding VHCA RX flow destination support for steering traffic to a
   specific VHCA's receive path.

6. Moving LAG demux table ownership to the LAG layer with APIs for
   SW-only LAG modes where firmware cannot create the demux table.
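
To make item 2 concrete, here is a minimal sketch, illustrative only
and not verbatim from the patches (lag_pf_add() is a hypothetical
name), of how xa_alloc() hands out a free LAG index that is independent
of the physical device index:

    /* Assumes ldev->pfs was set up with xa_init_flags(..., XA_FLAGS_ALLOC) */
    static int lag_pf_add(struct mlx5_lag *ldev, struct lag_func *pf)
    {
        u32 lag_idx;
        int err;

        /* Take the lowest free slot in [0, MLX5_MAX_PORTS - 1]; the
         * slot number no longer has to match the device index.
         */
        err = xa_alloc(&ldev->pfs, &lag_idx, pf,
                       XA_LIMIT(0, MLX5_MAX_PORTS - 1), GFP_KERNEL);
        if (err)
            return err;

        return lag_idx;
    }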

A follow-up series will build on this infrastructure to implement:
- SD single-netdev switchdev mode transition, with a shared FDB
  corresponding to the SD group.
- LAG support, enabling bonding of SD groups.

Since the follow-up series is large (~20 patches), the shared code
between RDMA and net is sent in advance to avoid overloading the
shared branch tree.

Alexei Lazar (1):
  net/mlx5: Add IFC bits for shared headroom pool PBMC support

Shay Drory (6):
  net/mlx5: Add silent mode set/query and VHCA RX IFC bits
  net/mlx5: LAG, replace pf array with xarray
  net/mlx5: E-switch, modify peer miss rule index to vhca_id
  net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number
  net/mlx5: Add VHCA RX flow destination support for FW steering
  {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules

Tariq Toukan (1):
  net/mlx5: LAG, use xa_alloc to manage LAG device indices

 drivers/infiniband/hw/mlx5/ib_rep.c           |  24 +-
 drivers/infiniband/hw/mlx5/main.c             |  21 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |   1 -
 .../mellanox/mlx5/core/diag/fs_tracepoint.c   |   3 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  14 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     | 103 ++-
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.c  |   6 +-
 .../net/ethernet/mellanox/mlx5/core/fs_core.c |  17 +-
 .../ethernet/mellanox/mlx5/core/lag/debugfs.c |   3 +-
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 684 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  49 +-
 .../net/ethernet/mellanox/mlx5/core/lag/mp.c  |  20 +-
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  15 +-
 .../mellanox/mlx5/core/lag/port_sel.c         |  28 +-
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  |   2 +-
 include/linux/mlx5/fs.h                       |  10 +-
 include/linux/mlx5/lag.h                      |  21 +
 include/linux/mlx5/mlx5_ifc.h                 |  26 +-
 19 files changed, 849 insertions(+), 207 deletions(-)
 create mode 100644 include/linux/mlx5/lag.h


base-commit: 385a06f74ff7a03e3fb0b15fb87cfeb052d75073
-- 
2.44.0



* [PATCH mlx5-next 1/8] net/mlx5: Add IFC bits for shared headroom pool PBMC support
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 2/8] net/mlx5: Add silent mode set/query and VHCA RX IFC bits Tariq Toukan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Alexei Lazar <alazar@nvidia.com>

Add hardware interface definitions for shared headroom pool (SHP) in
port buffer management:

- shp_pbmc_pbsr_support: capability bit in PCAM enhanced features
  indicating device support for shared headroom pool in PBMC/PBSR.
- shared_headroom_pool: buffer entry in the PBMC register
  (pbmc_reg_bits) for the shared headroom pool configuration, reusing
  the bufferx layout; the trailing reserved region is reduced
  accordingly.
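
For illustration only (this patch adds the layout bits; the consumer
below is a sketch built on the existing mlx5 register-access helpers,
and query_shp_size() is a hypothetical name):

    static int query_shp_size(struct mlx5_core_dev *mdev, u32 *shp_size)
    {
        u32 out[MLX5_ST_SZ_DW(pbmc_reg)] = {};
        u32 in[MLX5_ST_SZ_DW(pbmc_reg)] = {};
        void *shp;
        int err;

        /* Gate access to the new layout on the PCAM capability bit */
        if (!MLX5_CAP_PCAM_FEATURE(mdev, shp_pbmc_pbsr_support))
            return -EOPNOTSUPP;

        MLX5_SET(pbmc_reg, in, local_port, 1);
        err = mlx5_core_access_reg(mdev, in, sizeof(in), out, sizeof(out),
                                   MLX5_REG_PBMC, 0, 0);
        if (err)
            return err;

        /* The new entry reuses the bufferx_reg layout */
        shp = MLX5_ADDR_OF(pbmc_reg, out, shared_headroom_pool);
        *shp_size = MLX5_GET(bufferx_reg, shp, size);
        return 0;
    }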

Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a3948b36820d..a76c54bf1927 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -10845,7 +10845,9 @@ struct mlx5_ifc_pcam_enhanced_features_bits {
 	u8         fec_200G_per_lane_in_pplm[0x1];
 	u8         reserved_at_1e[0x2a];
 	u8         fec_100G_per_lane_in_pplm[0x1];
-	u8         reserved_at_49[0xa];
+	u8         reserved_at_49[0x2];
+	u8         shp_pbmc_pbsr_support[0x1];
+	u8         reserved_at_4c[0x7];
 	u8	   buffer_ownership[0x1];
 	u8	   resereved_at_54[0x14];
 	u8         fec_50G_per_lane_in_pplm[0x1];
@@ -12090,8 +12092,9 @@ struct mlx5_ifc_pbmc_reg_bits {
 	u8         port_buffer_size[0x10];
 
 	struct mlx5_ifc_bufferx_reg_bits buffer[10];
+	struct mlx5_ifc_bufferx_reg_bits shared_headroom_pool;
 
-	u8         reserved_at_2e0[0x80];
+	u8         reserved_at_320[0x40];
 };
 
 struct mlx5_ifc_sbpr_reg_bits {
-- 
2.44.0



* [PATCH mlx5-next 2/8] net/mlx5: Add silent mode set/query and VHCA RX IFC bits
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 1/8] net/mlx5: Add IFC bits for shared headroom pool PBMC support Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 3/8] net/mlx5: LAG, replace pf array with xarray Tariq Toukan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Update the mlx5 IFC headers with newly defined capability and
command-layout bits:

- Add silent_mode_query and rename silent_mode to silent_mode_set cap
  fields.
- Add forward_vhca_rx and MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX.
- Expose silent mode fields in the L2 table query command structures.

Update the SD support check to use the new capability name
(silent_mode_set) to match the updated IFC definition.
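
For illustration only, not part of this patch
(query_l2_entry_silent_mode() is a hypothetical helper built on the
standard mlx5 command macros), a query using the new bits could look
like:

    static int query_l2_entry_silent_mode(struct mlx5_core_dev *dev,
                                          u32 table_index, u8 *silent)
    {
        u32 out[MLX5_ST_SZ_DW(query_l2_table_entry_out)] = {};
        u32 in[MLX5_ST_SZ_DW(query_l2_table_entry_in)] = {};
        int err;

        /* The query side is gated by its own capability bit */
        if (!MLX5_CAP_GEN(dev, silent_mode_query))
            return -EOPNOTSUPP;

        MLX5_SET(query_l2_table_entry_in, in, opcode,
                 MLX5_CMD_OP_QUERY_L2_TABLE_ENTRY);
        MLX5_SET(query_l2_table_entry_in, in, silent_mode_query, 1);
        MLX5_SET(query_l2_table_entry_in, in, table_index, table_index);

        err = mlx5_cmd_exec_inout(dev, query_l2_table_entry, in, out);
        if (err)
            return err;

        *silent = MLX5_GET(query_l2_table_entry_out, out, silent_mode);
        return 0;
    }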

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  |  2 +-
 include/linux/mlx5/mlx5_ifc.h                 | 19 ++++++++++++++-----
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index c348ee62cd3a..16b28028609d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -1183,7 +1183,7 @@ int mlx5_fs_cmd_set_l2table_entry_silent(struct mlx5_core_dev *dev, u8 silent_mo
 {
 	u32 in[MLX5_ST_SZ_DW(set_l2_table_entry_in)] = {};
 
-	if (silent_mode && !MLX5_CAP_GEN(dev, silent_mode))
+	if (silent_mode && !MLX5_CAP_GEN(dev, silent_mode_set))
 		return -EOPNOTSUPP;
 
 	MLX5_SET(set_l2_table_entry_in, in, opcode, MLX5_CMD_OP_SET_L2_TABLE_ENTRY);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 954942ad93c5..762c783156b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -107,7 +107,7 @@ static bool mlx5_sd_is_supported(struct mlx5_core_dev *dev, u8 host_buses)
 	/* Disconnect secondaries from the network */
 	if (!MLX5_CAP_GEN(dev, eswitch_manager))
 		return false;
-	if (!MLX5_CAP_GEN(dev, silent_mode))
+	if (!MLX5_CAP_GEN(dev, silent_mode_set))
 		return false;
 
 	/* RX steering from primary to secondaries */
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a76c54bf1927..8fa4fb3d36cf 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -469,7 +469,8 @@ struct mlx5_ifc_flow_table_prop_layout_bits {
 	u8	   table_miss_action_domain[0x1];
 	u8         termination_table[0x1];
 	u8         reformat_and_fwd_to_table[0x1];
-	u8         reserved_at_1a[0x2];
+	u8         forward_vhca_rx[0x1];
+	u8         reserved_at_1b[0x1];
 	u8         ipsec_encrypt[0x1];
 	u8         ipsec_decrypt[0x1];
 	u8         sw_owner_v2[0x1];
@@ -2012,12 +2013,14 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         disable_local_lb_mc[0x1];
 	u8         log_min_hairpin_wq_data_sz[0x5];
 	u8         reserved_at_3e8[0x1];
-	u8         silent_mode[0x1];
+	u8         silent_mode_set[0x1];
 	u8         vhca_state[0x1];
 	u8         log_max_vlan_list[0x5];
 	u8         reserved_at_3f0[0x3];
 	u8         log_max_current_mc_list[0x5];
-	u8         reserved_at_3f8[0x3];
+	u8         reserved_at_3f8[0x1];
+	u8         silent_mode_query[0x1];
+	u8         reserved_at_3fa[0x1];
 	u8         log_max_current_uc_list[0x5];
 
 	u8         general_obj_types[0x40];
@@ -2279,6 +2282,7 @@ enum mlx5_ifc_flow_destination_type {
 	MLX5_IFC_FLOW_DESTINATION_TYPE_VPORT        = 0x0,
 	MLX5_IFC_FLOW_DESTINATION_TYPE_FLOW_TABLE   = 0x1,
 	MLX5_IFC_FLOW_DESTINATION_TYPE_TIR          = 0x2,
+	MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX	    = 0x4,
 	MLX5_IFC_FLOW_DESTINATION_TYPE_FLOW_SAMPLER = 0x6,
 	MLX5_IFC_FLOW_DESTINATION_TYPE_UPLINK       = 0x8,
 	MLX5_IFC_FLOW_DESTINATION_TYPE_TABLE_TYPE   = 0xA,
@@ -6265,7 +6269,9 @@ struct mlx5_ifc_query_l2_table_entry_out_bits {
 
 	u8         reserved_at_40[0xa0];
 
-	u8         reserved_at_e0[0x13];
+	u8         reserved_at_e0[0x11];
+	u8         silent_mode[0x1];
+	u8         reserved_at_f2[0x1];
 	u8         vlan_valid[0x1];
 	u8         vlan[0xc];
 
@@ -6281,7 +6287,10 @@ struct mlx5_ifc_query_l2_table_entry_in_bits {
 	u8         reserved_at_20[0x10];
 	u8         op_mod[0x10];
 
-	u8         reserved_at_40[0x60];
+	u8         reserved_at_40[0x40];
+
+	u8         silent_mode_query[0x1];
+	u8         reserved_at_81[0x1f];
 
 	u8         reserved_at_a0[0x8];
 	u8         table_index[0x18];
-- 
2.44.0



* [PATCH mlx5-next 3/8] net/mlx5: LAG, replace pf array with xarray
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 1/8] net/mlx5: Add IFC bits for shared headroom pool PBMC support Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 2/8] net/mlx5: Add silent mode set/query and VHCA RX IFC bits Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 4/8] net/mlx5: LAG, use xa_alloc to manage LAG device indices Tariq Toukan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Replace the fixed-size array with a dynamic xarray.

This commit:
- Adds a mlx5_lag_pf() helper for consistent xarray access
- Converts all direct pf[] accesses to use the new helper
- Dynamically allocates lag_func entries via xa_store()/xa_load()

No functional changes intended. This prepares the LAG infrastructure
for future flexibility in device indexing.
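
In short, the access pattern changes as follows (condensed
illustration, not a verbatim hunk):

    /* Before: fixed-size array, slot == device index */
    dev = ldev->pf[i].dev;

    /* After: dynamic lookup via the helper (xa_load() under the hood);
     * call sites that can observe a missing entry check for NULL.
     */
    pf = mlx5_lag_pf(ldev, i);
    dev = pf ? pf->dev : NULL;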

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/lag/debugfs.c |   3 +-
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 300 ++++++++++++------
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |   8 +-
 .../net/ethernet/mellanox/mlx5/core/lag/mp.c  |  20 +-
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  12 +-
 .../mellanox/mlx5/core/lag/port_sel.c         |  20 +-
 6 files changed, 243 insertions(+), 120 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/debugfs.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/debugfs.c
index 62b6faa4276a..37de4be0e620 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/debugfs.c
@@ -145,7 +145,8 @@ static int members_show(struct seq_file *file, void *priv)
 	ldev = mlx5_lag_dev(dev);
 	mutex_lock(&ldev->lock);
 	mlx5_ldev_for_each(i, 0, ldev)
-		seq_printf(file, "%s\n", dev_name(ldev->pf[i].dev->device));
+		seq_printf(file, "%s\n",
+			   dev_name(mlx5_lag_pf(ldev, i)->dev->device));
 	mutex_unlock(&ldev->lock);
 
 	return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 044adfdf9aa2..81b1f84f902e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -232,6 +232,7 @@ static void mlx5_do_bond_work(struct work_struct *work);
 static void mlx5_ldev_free(struct kref *ref)
 {
 	struct mlx5_lag *ldev = container_of(ref, struct mlx5_lag, ref);
+	struct lag_func *pf;
 	struct net *net;
 	int i;
 
@@ -241,13 +242,16 @@ static void mlx5_ldev_free(struct kref *ref)
 	}
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (ldev->pf[i].dev &&
-		    ldev->pf[i].port_change_nb.nb.notifier_call) {
-			struct mlx5_nb *nb = &ldev->pf[i].port_change_nb;
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->port_change_nb.nb.notifier_call) {
+			struct mlx5_nb *nb = &pf->port_change_nb;
 
-			mlx5_eq_notifier_unregister(ldev->pf[i].dev, nb);
+			mlx5_eq_notifier_unregister(pf->dev, nb);
 		}
+		xa_erase(&ldev->pfs, i);
+		kfree(pf);
 	}
+	xa_destroy(&ldev->pfs);
 
 	mlx5_lag_mp_cleanup(ldev);
 	cancel_delayed_work_sync(&ldev->bond_work);
@@ -284,6 +288,7 @@ static struct mlx5_lag *mlx5_lag_dev_alloc(struct mlx5_core_dev *dev)
 
 	kref_init(&ldev->ref);
 	mutex_init(&ldev->lock);
+	xa_init(&ldev->pfs);
 	INIT_DELAYED_WORK(&ldev->bond_work, mlx5_do_bond_work);
 	INIT_WORK(&ldev->speed_update_work, mlx5_mpesw_speed_update_work);
 
@@ -309,11 +314,14 @@ static struct mlx5_lag *mlx5_lag_dev_alloc(struct mlx5_core_dev *dev)
 int mlx5_lag_dev_get_netdev_idx(struct mlx5_lag *ldev,
 				struct net_device *ndev)
 {
+	struct lag_func *pf;
 	int i;
 
-	mlx5_ldev_for_each(i, 0, ldev)
-		if (ldev->pf[i].netdev == ndev)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->netdev == ndev)
 			return i;
+	}
 
 	return -ENOENT;
 }
@@ -349,14 +357,17 @@ int mlx5_lag_num_devs(struct mlx5_lag *ldev)
 
 int mlx5_lag_num_netdevs(struct mlx5_lag *ldev)
 {
+	struct lag_func *pf;
 	int i, num = 0;
 
 	if (!ldev)
 		return 0;
 
-	mlx5_ldev_for_each(i, 0, ldev)
-		if (ldev->pf[i].netdev)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->netdev)
 			num++;
+	}
 	return num;
 }
 
@@ -424,25 +435,30 @@ static void mlx5_infer_tx_affinity_mapping(struct lag_tracker *tracker,
 
 static bool mlx5_lag_has_drop_rule(struct mlx5_lag *ldev)
 {
+	struct lag_func *pf;
 	int i;
 
-	mlx5_ldev_for_each(i, 0, ldev)
-		if (ldev->pf[i].has_drop)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->has_drop)
 			return true;
+	}
 	return false;
 }
 
 static void mlx5_lag_drop_rule_cleanup(struct mlx5_lag *ldev)
 {
+	struct lag_func *pf;
 	int i;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (!ldev->pf[i].has_drop)
+		pf = mlx5_lag_pf(ldev, i);
+		if (!pf->has_drop)
 			continue;
 
-		mlx5_esw_acl_ingress_vport_drop_rule_destroy(ldev->pf[i].dev->priv.eswitch,
+		mlx5_esw_acl_ingress_vport_drop_rule_destroy(pf->dev->priv.eswitch,
 							     MLX5_VPORT_UPLINK);
-		ldev->pf[i].has_drop = false;
+		pf->has_drop = false;
 	}
 }
 
@@ -451,6 +467,7 @@ static void mlx5_lag_drop_rule_setup(struct mlx5_lag *ldev,
 {
 	u8 disabled_ports[MLX5_MAX_PORTS] = {};
 	struct mlx5_core_dev *dev;
+	struct lag_func *pf;
 	int disabled_index;
 	int num_disabled;
 	int err;
@@ -468,11 +485,12 @@ static void mlx5_lag_drop_rule_setup(struct mlx5_lag *ldev,
 
 	for (i = 0; i < num_disabled; i++) {
 		disabled_index = disabled_ports[i];
-		dev = ldev->pf[disabled_index].dev;
+		pf = mlx5_lag_pf(ldev, disabled_index);
+		dev = pf->dev;
 		err = mlx5_esw_acl_ingress_vport_drop_rule_create(dev->priv.eswitch,
 								  MLX5_VPORT_UPLINK);
 		if (!err)
-			ldev->pf[disabled_index].has_drop = true;
+			pf->has_drop = true;
 		else
 			mlx5_core_err(dev,
 				      "Failed to create lag drop rule, error: %d", err);
@@ -504,7 +522,7 @@ static int _mlx5_modify_lag(struct mlx5_lag *ldev, u8 *ports)
 	if (idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[idx].dev;
+	dev0 = mlx5_lag_pf(ldev, idx)->dev;
 	if (test_bit(MLX5_LAG_MODE_FLAG_HASH_BASED, &ldev->mode_flags)) {
 		ret = mlx5_lag_port_sel_modify(ldev, ports);
 		if (ret ||
@@ -521,6 +539,7 @@ static int _mlx5_modify_lag(struct mlx5_lag *ldev, u8 *ports)
 static struct net_device *mlx5_lag_active_backup_get_netdev(struct mlx5_core_dev *dev)
 {
 	struct net_device *ndev = NULL;
+	struct lag_func *pf;
 	struct mlx5_lag *ldev;
 	unsigned long flags;
 	int i, last_idx;
@@ -531,14 +550,17 @@ static struct net_device *mlx5_lag_active_backup_get_netdev(struct mlx5_core_dev
 	if (!ldev)
 		goto unlock;
 
-	mlx5_ldev_for_each(i, 0, ldev)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
 		if (ldev->tracker.netdev_state[i].tx_enabled)
-			ndev = ldev->pf[i].netdev;
+			ndev = pf->netdev;
+	}
 	if (!ndev) {
 		last_idx = mlx5_lag_get_dev_index_by_seq(ldev, ldev->ports - 1);
 		if (last_idx < 0)
 			goto unlock;
-		ndev = ldev->pf[last_idx].netdev;
+		pf = mlx5_lag_pf(ldev, last_idx);
+		ndev = pf->netdev;
 	}
 
 	dev_hold(ndev);
@@ -563,7 +585,7 @@ void mlx5_modify_lag(struct mlx5_lag *ldev,
 	if (first_idx < 0)
 		return;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 	mlx5_infer_tx_affinity_mapping(tracker, ldev, ldev->buckets, ports);
 
 	mlx5_ldev_for_each(i, 0, ldev) {
@@ -615,7 +637,7 @@ static int mlx5_lag_set_port_sel_mode(struct mlx5_lag *ldev,
 	    mode == MLX5_LAG_MODE_MULTIPATH)
 		return 0;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 
 	if (!MLX5_CAP_PORT_SELECTION(dev0, port_select_flow_table)) {
 		if (ldev->ports > 2)
@@ -670,10 +692,12 @@ static int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev)
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 	master_esw = dev0->priv.eswitch;
 	mlx5_ldev_for_each(i, first_idx + 1, ldev) {
-		struct mlx5_eswitch *slave_esw = ldev->pf[i].dev->priv.eswitch;
+		struct mlx5_eswitch *slave_esw;
+
+		slave_esw = mlx5_lag_pf(ldev, i)->dev->priv.eswitch;
 
 		err = mlx5_eswitch_offloads_single_fdb_add_one(master_esw,
 							       slave_esw, ldev->ports);
@@ -684,7 +708,7 @@ static int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev)
 err:
 	mlx5_ldev_for_each_reverse(j, i, first_idx + 1, ldev)
 		mlx5_eswitch_offloads_single_fdb_del_one(master_esw,
-							 ldev->pf[j].dev->priv.eswitch);
+							 mlx5_lag_pf(ldev, j)->dev->priv.eswitch);
 	return err;
 }
 
@@ -702,7 +726,7 @@ static int mlx5_create_lag(struct mlx5_lag *ldev,
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 	if (tracker)
 		mlx5_lag_print_mapping(dev0, ldev, tracker, flags);
 	mlx5_core_info(dev0, "shared_fdb:%d mode:%s\n",
@@ -749,7 +773,7 @@ int mlx5_activate_lag(struct mlx5_lag *ldev,
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 	err = mlx5_lag_set_flags(ldev, mode, tracker, shared_fdb, &flags);
 	if (err)
 		return err;
@@ -805,7 +829,7 @@ int mlx5_deactivate_lag(struct mlx5_lag *ldev)
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[first_idx].dev;
+	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
 	master_esw = dev0->priv.eswitch;
 	ldev->mode = MLX5_LAG_MODE_NONE;
 	ldev->mode_flags = 0;
@@ -814,7 +838,7 @@ int mlx5_deactivate_lag(struct mlx5_lag *ldev)
 	if (test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags)) {
 		mlx5_ldev_for_each(i, first_idx + 1, ldev)
 			mlx5_eswitch_offloads_single_fdb_del_one(master_esw,
-								 ldev->pf[i].dev->priv.eswitch);
+								 mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 		clear_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags);
 	}
 
@@ -849,6 +873,7 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 	struct mlx5_core_dev *dev;
 	u8 mode;
 #endif
+	struct lag_func *pf;
 	bool roce_support;
 	int i;
 
@@ -857,55 +882,66 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 
 #ifdef CONFIG_MLX5_ESWITCH
 	mlx5_ldev_for_each(i, 0, ldev) {
-		dev = ldev->pf[i].dev;
+		pf = mlx5_lag_pf(ldev, i);
+		dev = pf->dev;
 		if (mlx5_eswitch_num_vfs(dev->priv.eswitch) && !is_mdev_switchdev_mode(dev))
 			return false;
 	}
 
-	dev = ldev->pf[first_idx].dev;
+	pf = mlx5_lag_pf(ldev, first_idx);
+	dev = pf->dev;
 	mode = mlx5_eswitch_mode(dev);
-	mlx5_ldev_for_each(i, 0, ldev)
-		if (mlx5_eswitch_mode(ldev->pf[i].dev) != mode)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (mlx5_eswitch_mode(pf->dev) != mode)
 			return false;
+	}
 
 #else
-	mlx5_ldev_for_each(i, 0, ldev)
-		if (mlx5_sriov_is_enabled(ldev->pf[i].dev))
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (mlx5_sriov_is_enabled(pf->dev))
 			return false;
+	}
 #endif
-	roce_support = mlx5_get_roce_state(ldev->pf[first_idx].dev);
-	mlx5_ldev_for_each(i, first_idx + 1, ldev)
-		if (mlx5_get_roce_state(ldev->pf[i].dev) != roce_support)
+	pf = mlx5_lag_pf(ldev, first_idx);
+	roce_support = mlx5_get_roce_state(pf->dev);
+	mlx5_ldev_for_each(i, first_idx + 1, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (mlx5_get_roce_state(pf->dev) != roce_support)
 			return false;
+	}
 
 	return true;
 }
 
 void mlx5_lag_add_devices(struct mlx5_lag *ldev)
 {
+	struct lag_func *pf;
 	int i;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (ldev->pf[i].dev->priv.flags &
-		    MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
 			continue;
 
-		ldev->pf[i].dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
-		mlx5_rescan_drivers_locked(ldev->pf[i].dev);
+		pf->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+		mlx5_rescan_drivers_locked(pf->dev);
 	}
 }
 
 void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
 {
+	struct lag_func *pf;
 	int i;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (ldev->pf[i].dev->priv.flags &
-		    MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
 			continue;
 
-		ldev->pf[i].dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
-		mlx5_rescan_drivers_locked(ldev->pf[i].dev);
+		pf->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+		mlx5_rescan_drivers_locked(pf->dev);
 	}
 }
 
@@ -921,7 +957,7 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
 	if (idx < 0)
 		return;
 
-	dev0 = ldev->pf[idx].dev;
+	dev0 = mlx5_lag_pf(ldev, idx)->dev;
 	roce_lag = __mlx5_lag_is_roce(ldev);
 
 	if (shared_fdb) {
@@ -932,7 +968,7 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
 			mlx5_rescan_drivers_locked(dev0);
 		}
 		mlx5_ldev_for_each(i, idx + 1, ldev)
-			mlx5_nic_vport_disable_roce(ldev->pf[i].dev);
+			mlx5_nic_vport_disable_roce(mlx5_lag_pf(ldev, i)->dev);
 	}
 
 	err = mlx5_deactivate_lag(ldev);
@@ -944,8 +980,8 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
 
 	if (shared_fdb)
 		mlx5_ldev_for_each(i, 0, ldev)
-			if (!(ldev->pf[i].dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV))
-				mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+			if (!(mlx5_lag_pf(ldev, i)->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV))
+				mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 }
 
 bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
@@ -958,7 +994,7 @@ bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
 		return false;
 
 	mlx5_ldev_for_each(i, idx + 1, ldev) {
-		dev = ldev->pf[i].dev;
+		dev = mlx5_lag_pf(ldev, i)->dev;
 		if (is_mdev_switchdev_mode(dev) &&
 		    mlx5_eswitch_vport_match_metadata_enabled(dev->priv.eswitch) &&
 		    MLX5_CAP_GEN(dev, lag_native_fdb_selection) &&
@@ -969,7 +1005,7 @@ bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
 		return false;
 	}
 
-	dev = ldev->pf[idx].dev;
+	dev = mlx5_lag_pf(ldev, idx)->dev;
 	if (is_mdev_switchdev_mode(dev) &&
 	    mlx5_eswitch_vport_match_metadata_enabled(dev->priv.eswitch) &&
 	    mlx5_esw_offloads_devcom_is_ready(dev->priv.eswitch) &&
@@ -983,14 +1019,19 @@ bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
 static bool mlx5_lag_is_roce_lag(struct mlx5_lag *ldev)
 {
 	bool roce_lag = true;
+	struct lag_func *pf;
 	int i;
 
-	mlx5_ldev_for_each(i, 0, ldev)
-		roce_lag = roce_lag && !mlx5_sriov_is_enabled(ldev->pf[i].dev);
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		roce_lag = roce_lag && !mlx5_sriov_is_enabled(pf->dev);
+	}
 
 #ifdef CONFIG_MLX5_ESWITCH
-	mlx5_ldev_for_each(i, 0, ldev)
-		roce_lag = roce_lag && is_mdev_legacy_mode(ldev->pf[i].dev);
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		roce_lag = roce_lag && is_mdev_legacy_mode(pf->dev);
+	}
 #endif
 
 	return roce_lag;
@@ -1014,13 +1055,17 @@ mlx5_lag_sum_devices_speed(struct mlx5_lag *ldev, u32 *sum_speed,
 			   int (*get_speed)(struct mlx5_core_dev *, u32 *))
 {
 	struct mlx5_core_dev *pf_mdev;
+	struct lag_func *pf;
 	int pf_idx;
 	u32 speed;
 	int ret;
 
 	*sum_speed = 0;
 	mlx5_ldev_for_each(pf_idx, 0, ldev) {
-		pf_mdev = ldev->pf[pf_idx].dev;
+		pf = mlx5_lag_pf(ldev, pf_idx);
+		if (!pf)
+			continue;
+		pf_mdev = pf->dev;
 		if (!pf_mdev)
 			continue;
 
@@ -1086,6 +1131,7 @@ static void mlx5_lag_modify_device_vports_speed(struct mlx5_core_dev *mdev,
 void mlx5_lag_set_vports_agg_speed(struct mlx5_lag *ldev)
 {
 	struct mlx5_core_dev *mdev;
+	struct lag_func *pf;
 	u32 speed;
 	int pf_idx;
 
@@ -1105,7 +1151,10 @@ void mlx5_lag_set_vports_agg_speed(struct mlx5_lag *ldev)
 	speed = speed / MLX5_MAX_TX_SPEED_UNIT;
 
 	mlx5_ldev_for_each(pf_idx, 0, ldev) {
-		mdev = ldev->pf[pf_idx].dev;
+		pf = mlx5_lag_pf(ldev, pf_idx);
+		if (!pf)
+			continue;
+		mdev = pf->dev;
 		if (!mdev)
 			continue;
 
@@ -1116,12 +1165,16 @@ void mlx5_lag_set_vports_agg_speed(struct mlx5_lag *ldev)
 void mlx5_lag_reset_vports_speed(struct mlx5_lag *ldev)
 {
 	struct mlx5_core_dev *mdev;
+	struct lag_func *pf;
 	u32 speed;
 	int pf_idx;
 	int ret;
 
 	mlx5_ldev_for_each(pf_idx, 0, ldev) {
-		mdev = ldev->pf[pf_idx].dev;
+		pf = mlx5_lag_pf(ldev, pf_idx);
+		if (!pf)
+			continue;
+		mdev = pf->dev;
 		if (!mdev)
 			continue;
 
@@ -1152,7 +1205,7 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 	if (idx < 0)
 		return;
 
-	dev0 = ldev->pf[idx].dev;
+	dev0 = mlx5_lag_pf(ldev, idx)->dev;
 	if (!mlx5_lag_is_ready(ldev)) {
 		do_bond = false;
 	} else {
@@ -1182,16 +1235,19 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 				mlx5_lag_add_devices(ldev);
 			if (shared_fdb) {
 				mlx5_ldev_for_each(i, 0, ldev)
-					mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+					mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 			}
 
 			return;
 		} else if (roce_lag) {
+			struct mlx5_core_dev *dev;
+
 			dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
 			mlx5_rescan_drivers_locked(dev0);
 			mlx5_ldev_for_each(i, idx + 1, ldev) {
-				if (mlx5_get_roce_state(ldev->pf[i].dev))
-					mlx5_nic_vport_enable_roce(ldev->pf[i].dev);
+				dev = mlx5_lag_pf(ldev, i)->dev;
+				if (mlx5_get_roce_state(dev))
+					mlx5_nic_vport_enable_roce(dev);
 			}
 		} else if (shared_fdb) {
 			int i;
@@ -1200,7 +1256,7 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 			mlx5_rescan_drivers_locked(dev0);
 
 			mlx5_ldev_for_each(i, 0, ldev) {
-				err = mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+				err = mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 				if (err)
 					break;
 			}
@@ -1211,7 +1267,7 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 				mlx5_deactivate_lag(ldev);
 				mlx5_lag_add_devices(ldev);
 				mlx5_ldev_for_each(i, 0, ldev)
-					mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+					mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 				mlx5_core_err(dev0, "Failed to enable lag\n");
 				return;
 			}
@@ -1243,12 +1299,15 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 struct mlx5_devcom_comp_dev *mlx5_lag_get_devcom_comp(struct mlx5_lag *ldev)
 {
 	struct mlx5_devcom_comp_dev *devcom = NULL;
+	struct lag_func *pf;
 	int i;
 
 	mutex_lock(&ldev->lock);
 	i = mlx5_get_next_ldev_func(ldev, 0);
-	if (i < MLX5_MAX_PORTS)
-		devcom = ldev->pf[i].dev->priv.hca_devcom_comp;
+	if (i < MLX5_MAX_PORTS) {
+		pf = mlx5_lag_pf(ldev, i);
+		devcom = pf->dev->priv.hca_devcom_comp;
+	}
 	mutex_unlock(&ldev->lock);
 	return devcom;
 }
@@ -1297,6 +1356,7 @@ static int mlx5_handle_changeupper_event(struct mlx5_lag *ldev,
 	struct netdev_lag_upper_info *lag_upper_info = NULL;
 	bool is_bonded, is_in_lag, mode_supported;
 	bool has_inactive = 0;
+	struct lag_func *pf;
 	struct slave *slave;
 	u8 bond_status = 0;
 	int num_slaves = 0;
@@ -1317,7 +1377,8 @@ static int mlx5_handle_changeupper_event(struct mlx5_lag *ldev,
 	rcu_read_lock();
 	for_each_netdev_in_bond_rcu(upper, ndev_tmp) {
 		mlx5_ldev_for_each(i, 0, ldev) {
-			if (ldev->pf[i].netdev == ndev_tmp) {
+			pf = mlx5_lag_pf(ldev, i);
+			if (pf->netdev == ndev_tmp) {
 				idx++;
 				break;
 			}
@@ -1538,10 +1599,12 @@ static void mlx5_ldev_add_netdev(struct mlx5_lag *ldev,
 				struct net_device *netdev)
 {
 	unsigned int fn = mlx5_get_dev_index(dev);
+	struct lag_func *pf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&lag_lock, flags);
-	ldev->pf[fn].netdev = netdev;
+	pf = mlx5_lag_pf(ldev, fn);
+	pf->netdev = netdev;
 	ldev->tracker.netdev_state[fn].link_up = 0;
 	ldev->tracker.netdev_state[fn].tx_enabled = 0;
 	spin_unlock_irqrestore(&lag_lock, flags);
@@ -1550,46 +1613,69 @@ static void mlx5_ldev_add_netdev(struct mlx5_lag *ldev,
 static void mlx5_ldev_remove_netdev(struct mlx5_lag *ldev,
 				    struct net_device *netdev)
 {
+	struct lag_func *pf;
 	unsigned long flags;
 	int i;
 
 	spin_lock_irqsave(&lag_lock, flags);
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (ldev->pf[i].netdev == netdev) {
-			ldev->pf[i].netdev = NULL;
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->netdev == netdev) {
+			pf->netdev = NULL;
 			break;
 		}
 	}
 	spin_unlock_irqrestore(&lag_lock, flags);
 }
 
-static void mlx5_ldev_add_mdev(struct mlx5_lag *ldev,
+static int mlx5_ldev_add_mdev(struct mlx5_lag *ldev,
 			      struct mlx5_core_dev *dev)
 {
 	unsigned int fn = mlx5_get_dev_index(dev);
+	struct lag_func *pf;
+	int err;
+
+	pf = xa_load(&ldev->pfs, fn);
+	if (!pf) {
+		pf = kzalloc_obj(*pf);
+		if (!pf)
+			return -ENOMEM;
+
+		err = xa_err(xa_store(&ldev->pfs, fn, pf, GFP_KERNEL));
+		if (err) {
+			kfree(pf);
+			return err;
+		}
+	}
 
-	ldev->pf[fn].dev = dev;
+	pf->dev = dev;
 	dev->priv.lag = ldev;
 
-	MLX5_NB_INIT(&ldev->pf[fn].port_change_nb,
+	MLX5_NB_INIT(&pf->port_change_nb,
 		     mlx5_lag_mpesw_port_change_event, PORT_CHANGE);
-	mlx5_eq_notifier_register(dev, &ldev->pf[fn].port_change_nb);
+	mlx5_eq_notifier_register(dev, &pf->port_change_nb);
+
+	return 0;
 }
 
 static void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev,
 				  struct mlx5_core_dev *dev)
 {
+	struct lag_func *pf;
 	int fn;
 
 	fn = mlx5_get_dev_index(dev);
-	if (ldev->pf[fn].dev != dev)
+	pf = xa_load(&ldev->pfs, fn);
+	if (!pf || pf->dev != dev)
 		return;
 
-	if (ldev->pf[fn].port_change_nb.nb.notifier_call)
-		mlx5_eq_notifier_unregister(dev, &ldev->pf[fn].port_change_nb);
+	if (pf->port_change_nb.nb.notifier_call)
+		mlx5_eq_notifier_unregister(dev, &pf->port_change_nb);
 
-	ldev->pf[fn].dev = NULL;
+	pf->dev = NULL;
 	dev->priv.lag = NULL;
+	xa_erase(&ldev->pfs, fn);
+	kfree(pf);
 }
 
 /* Must be called with HCA devcom component lock held */
@@ -1598,6 +1684,7 @@ static int __mlx5_lag_dev_add_mdev(struct mlx5_core_dev *dev)
 	struct mlx5_devcom_comp_dev *pos = NULL;
 	struct mlx5_lag *ldev = NULL;
 	struct mlx5_core_dev *tmp_dev;
+	int err;
 
 	tmp_dev = mlx5_devcom_get_next_peer_data(dev->priv.hca_devcom_comp, &pos);
 	if (tmp_dev)
@@ -1609,7 +1696,12 @@ static int __mlx5_lag_dev_add_mdev(struct mlx5_core_dev *dev)
 			mlx5_core_err(dev, "Failed to alloc lag dev\n");
 			return 0;
 		}
-		mlx5_ldev_add_mdev(ldev, dev);
+		err = mlx5_ldev_add_mdev(ldev, dev);
+		if (err) {
+			mlx5_core_err(dev, "Failed to add mdev to lag dev\n");
+			mlx5_ldev_put(ldev);
+			return 0;
+		}
 		return 0;
 	}
 
@@ -1619,7 +1711,12 @@ static int __mlx5_lag_dev_add_mdev(struct mlx5_core_dev *dev)
 		return -EAGAIN;
 	}
 	mlx5_ldev_get(ldev);
-	mlx5_ldev_add_mdev(ldev, dev);
+	err = mlx5_ldev_add_mdev(ldev, dev);
+	if (err) {
+		mlx5_ldev_put(ldev);
+		mutex_unlock(&ldev->lock);
+		return err;
+	}
 	mutex_unlock(&ldev->lock);
 
 	return 0;
@@ -1746,21 +1843,25 @@ void mlx5_lag_add_netdev(struct mlx5_core_dev *dev,
 
 int mlx5_get_pre_ldev_func(struct mlx5_lag *ldev, int start_idx, int end_idx)
 {
+	struct lag_func *pf;
 	int i;
 
-	for (i = start_idx; i >= end_idx; i--)
-		if (ldev->pf[i].dev)
+	for (i = start_idx; i >= end_idx; i--) {
+		pf = xa_load(&ldev->pfs, i);
+		if (pf && pf->dev)
 			return i;
+	}
 	return -1;
 }
 
 int mlx5_get_next_ldev_func(struct mlx5_lag *ldev, int start_idx)
 {
-	int i;
+	struct lag_func *pf;
+	unsigned long idx;
 
-	for (i = start_idx; i < MLX5_MAX_PORTS; i++)
-		if (ldev->pf[i].dev)
-			return i;
+	xa_for_each_start(&ldev->pfs, idx, pf, start_idx)
+		if (pf->dev)
+			return idx;
 	return MLX5_MAX_PORTS;
 }
 
@@ -1814,13 +1915,17 @@ bool mlx5_lag_is_master(struct mlx5_core_dev *dev)
 {
 	struct mlx5_lag *ldev;
 	unsigned long flags;
+	struct lag_func *pf;
 	bool res = false;
 	int idx;
 
 	spin_lock_irqsave(&lag_lock, flags);
 	ldev = mlx5_lag_dev(dev);
 	idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
-	res = ldev && __mlx5_lag_is_active(ldev) && idx >= 0 && dev == ldev->pf[idx].dev;
+	if (ldev && __mlx5_lag_is_active(ldev) && idx >= 0) {
+		pf = mlx5_lag_pf(ldev, idx);
+		res = pf && dev == pf->dev;
+	}
 	spin_unlock_irqrestore(&lag_lock, flags);
 
 	return res;
@@ -1899,6 +2004,7 @@ u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev,
 {
 	struct mlx5_lag *ldev;
 	unsigned long flags;
+	struct lag_func *pf;
 	u8 port = 0;
 	int i;
 
@@ -1908,7 +2014,8 @@ u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev,
 		goto unlock;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		if (ldev->pf[i].netdev == slave) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->netdev == slave) {
 			port = i;
 			break;
 		}
@@ -1939,6 +2046,7 @@ struct mlx5_core_dev *mlx5_lag_get_next_peer_mdev(struct mlx5_core_dev *dev, int
 	struct mlx5_core_dev *peer_dev = NULL;
 	struct mlx5_lag *ldev;
 	unsigned long flags;
+	struct lag_func *pf;
 	int idx;
 
 	spin_lock_irqsave(&lag_lock, flags);
@@ -1948,9 +2056,11 @@ struct mlx5_core_dev *mlx5_lag_get_next_peer_mdev(struct mlx5_core_dev *dev, int
 
 	if (*i == MLX5_MAX_PORTS)
 		goto unlock;
-	mlx5_ldev_for_each(idx, *i, ldev)
-		if (ldev->pf[idx].dev != dev)
+	mlx5_ldev_for_each(idx, *i, ldev) {
+		pf = mlx5_lag_pf(ldev, idx);
+		if (pf->dev != dev)
 			break;
+	}
 
 	if (idx == MLX5_MAX_PORTS) {
 		*i = idx;
@@ -1958,7 +2068,8 @@ struct mlx5_core_dev *mlx5_lag_get_next_peer_mdev(struct mlx5_core_dev *dev, int
 	}
 	*i = idx + 1;
 
-	peer_dev = ldev->pf[idx].dev;
+	pf = mlx5_lag_pf(ldev, idx);
+	peer_dev = pf->dev;
 
 unlock:
 	spin_unlock_irqrestore(&lag_lock, flags);
@@ -1976,6 +2087,7 @@ int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev,
 	int ret = 0, i, j, idx = 0;
 	struct mlx5_lag *ldev;
 	unsigned long flags;
+	struct lag_func *pf;
 	int num_ports;
 	void *out;
 
@@ -1995,8 +2107,10 @@ int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev,
 	ldev = mlx5_lag_dev(dev);
 	if (ldev && __mlx5_lag_is_active(ldev)) {
 		num_ports = ldev->ports;
-		mlx5_ldev_for_each(i, 0, ldev)
-			mdev[idx++] = ldev->pf[i].dev;
+		mlx5_ldev_for_each(i, 0, ldev) {
+			pf = mlx5_lag_pf(ldev, i);
+			mdev[idx++] = pf->dev;
+		}
 	} else {
 		num_ports = 1;
 		mdev[MLX5_LAG_P1] = dev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index be1afece5fdc..09758871b3da 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -64,7 +64,7 @@ struct mlx5_lag {
 	int			  mode_changes_in_progress;
 	u8			  v2p_map[MLX5_MAX_PORTS * MLX5_LAG_MAX_HASH_BUCKETS];
 	struct kref               ref;
-	struct lag_func           pf[MLX5_MAX_PORTS];
+	struct xarray             pfs;
 	struct lag_tracker        tracker;
 	struct workqueue_struct   *wq;
 	struct delayed_work       bond_work;
@@ -84,6 +84,12 @@ mlx5_lag_dev(struct mlx5_core_dev *dev)
 	return dev->priv.lag;
 }
 
+static inline struct lag_func *
+mlx5_lag_pf(struct mlx5_lag *ldev, unsigned int idx)
+{
+	return xa_load(&ldev->pfs, idx);
+}
+
 static inline bool
 __mlx5_lag_is_active(struct mlx5_lag *ldev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mp.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mp.c
index c4c2bf33ef35..f42e051fa7e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mp.c
@@ -29,8 +29,8 @@ static bool mlx5_lag_multipath_check_prereq(struct mlx5_lag *ldev)
 	if (ldev->ports > MLX5_LAG_MULTIPATH_OFFLOADS_SUPPORTED_PORTS)
 		return false;
 
-	return mlx5_esw_multipath_prereq(ldev->pf[idx0].dev,
-					 ldev->pf[idx1].dev);
+	return mlx5_esw_multipath_prereq(mlx5_lag_pf(ldev, idx0)->dev,
+					 mlx5_lag_pf(ldev, idx1)->dev);
 }
 
 bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev)
@@ -80,18 +80,18 @@ static void mlx5_lag_set_port_affinity(struct mlx5_lag *ldev,
 		tracker.netdev_state[idx1].link_up = true;
 		break;
 	default:
-		mlx5_core_warn(ldev->pf[idx0].dev,
+		mlx5_core_warn(mlx5_lag_pf(ldev, idx0)->dev,
 			       "Invalid affinity port %d", port);
 		return;
 	}
 
 	if (tracker.netdev_state[idx0].tx_enabled)
-		mlx5_notifier_call_chain(ldev->pf[idx0].dev->priv.events,
+		mlx5_notifier_call_chain(mlx5_lag_pf(ldev, idx0)->dev->priv.events,
 					 MLX5_DEV_EVENT_PORT_AFFINITY,
 					 (void *)0);
 
 	if (tracker.netdev_state[idx1].tx_enabled)
-		mlx5_notifier_call_chain(ldev->pf[idx1].dev->priv.events,
+		mlx5_notifier_call_chain(mlx5_lag_pf(ldev, idx1)->dev->priv.events,
 					 MLX5_DEV_EVENT_PORT_AFFINITY,
 					 (void *)0);
 
@@ -146,7 +146,7 @@ mlx5_lag_get_next_fib_dev(struct mlx5_lag *ldev,
 		fib_dev = fib_info_nh(fi, i)->fib_nh_dev;
 		ldev_idx = mlx5_lag_dev_get_netdev_idx(ldev, fib_dev);
 		if (ldev_idx >= 0)
-			return ldev->pf[ldev_idx].netdev;
+			return mlx5_lag_pf(ldev, ldev_idx)->netdev;
 	}
 
 	return NULL;
@@ -178,7 +178,7 @@ static void mlx5_lag_fib_route_event(struct mlx5_lag *ldev, unsigned long event,
 	    mp->fib.dst_len <= fen_info->dst_len &&
 	    !(mp->fib.dst_len == fen_info->dst_len &&
 	      fi->fib_priority < mp->fib.priority)) {
-		mlx5_core_dbg(ldev->pf[idx].dev,
+		mlx5_core_dbg(mlx5_lag_pf(ldev, idx)->dev,
 			      "Multipath entry with lower priority was rejected\n");
 		return;
 	}
@@ -194,7 +194,7 @@ static void mlx5_lag_fib_route_event(struct mlx5_lag *ldev, unsigned long event,
 	}
 
 	if (nh_dev0 == nh_dev1) {
-		mlx5_core_warn(ldev->pf[idx].dev,
+		mlx5_core_warn(mlx5_lag_pf(ldev, idx)->dev,
 			       "Multipath offload doesn't support routes with multiple nexthops of the same device");
 		return;
 	}
@@ -203,7 +203,7 @@ static void mlx5_lag_fib_route_event(struct mlx5_lag *ldev, unsigned long event,
 		if (__mlx5_lag_is_active(ldev)) {
 			mlx5_ldev_for_each(i, 0, ldev) {
 				dev_idx++;
-				if (ldev->pf[i].netdev == nh_dev0)
+				if (mlx5_lag_pf(ldev, i)->netdev == nh_dev0)
 					break;
 			}
 			mlx5_lag_set_port_affinity(ldev, dev_idx);
@@ -240,7 +240,7 @@ static void mlx5_lag_fib_nexthop_event(struct mlx5_lag *ldev,
 	/* nh added/removed */
 	if (event == FIB_EVENT_NH_DEL) {
 		mlx5_ldev_for_each(i, 0, ldev) {
-			if (ldev->pf[i].netdev == fib_nh->fib_nh_dev)
+			if (mlx5_lag_pf(ldev, i)->netdev == fib_nh->fib_nh_dev)
 				break;
 			dev_idx++;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
index 74d5c2ed14ff..0e7d206cd594 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
@@ -16,7 +16,7 @@ static void mlx5_mpesw_metadata_cleanup(struct mlx5_lag *ldev)
 	int i;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		dev = ldev->pf[i].dev;
+		dev = mlx5_lag_pf(ldev, i)->dev;
 		esw = dev->priv.eswitch;
 		pf_metadata = ldev->lag_mpesw.pf_metadata[i];
 		if (!pf_metadata)
@@ -37,7 +37,7 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
 	int i, err;
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		dev = ldev->pf[i].dev;
+		dev = mlx5_lag_pf(ldev, i)->dev;
 		esw = dev->priv.eswitch;
 		pf_metadata = mlx5_esw_match_metadata_alloc(esw);
 		if (!pf_metadata) {
@@ -53,7 +53,7 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
 	}
 
 	mlx5_ldev_for_each(i, 0, ldev) {
-		dev = ldev->pf[i].dev;
+		dev = mlx5_lag_pf(ldev, i)->dev;
 		mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_MULTIPORT_ESW,
 					 (void *)0);
 	}
@@ -82,7 +82,7 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
 	if (idx < 0)
 		return -EINVAL;
 
-	dev0 = ldev->pf[idx].dev;
+	dev0 = mlx5_lag_pf(ldev, idx)->dev;
 	if (mlx5_eswitch_mode(dev0) != MLX5_ESWITCH_OFFLOADS ||
 	    !MLX5_CAP_PORT_SELECTION(dev0, port_select_flow_table) ||
 	    !MLX5_CAP_GEN(dev0, create_lag_when_not_master_up) ||
@@ -105,7 +105,7 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
 	dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
 	mlx5_rescan_drivers_locked(dev0);
 	mlx5_ldev_for_each(i, 0, ldev) {
-		err = mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+		err = mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 		if (err)
 			goto err_rescan_drivers;
 	}
@@ -121,7 +121,7 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
 err_add_devices:
 	mlx5_lag_add_devices(ldev);
 	mlx5_ldev_for_each(i, 0, ldev)
-		mlx5_eswitch_reload_ib_reps(ldev->pf[i].dev->priv.eswitch);
+		mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
 	mlx5_mpesw_metadata_cleanup(ldev);
 	return err;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
index 16c7d16215c4..7e9e3e81977d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
@@ -50,7 +50,7 @@ static int mlx5_lag_create_port_sel_table(struct mlx5_lag *ldev,
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev = ldev->pf[first_idx].dev;
+	dev = mlx5_lag_pf(ldev, first_idx)->dev;
 	ft_attr.max_fte = ldev->ports * ldev->buckets;
 	ft_attr.level = MLX5_LAG_FT_LEVEL_DEFINER;
 
@@ -84,8 +84,9 @@ static int mlx5_lag_create_port_sel_table(struct mlx5_lag *ldev,
 			idx = i * ldev->buckets + j;
 			affinity = ports[idx];
 
-			dest.vport.vhca_id = MLX5_CAP_GEN(ldev->pf[affinity - 1].dev,
-							  vhca_id);
+			dest.vport.vhca_id =
+				MLX5_CAP_GEN(mlx5_lag_pf(ldev, affinity - 1)->dev,
+					     vhca_id);
 			lag_definer->rules[idx] = mlx5_add_flow_rules(lag_definer->ft,
 								      NULL, &flow_act,
 								      &dest, 1);
@@ -307,7 +308,7 @@ mlx5_lag_create_definer(struct mlx5_lag *ldev, enum netdev_lag_hash hash,
 	if (first_idx < 0)
 		return ERR_PTR(-EINVAL);
 
-	dev = ldev->pf[first_idx].dev;
+	dev = mlx5_lag_pf(ldev, first_idx)->dev;
 	lag_definer = kzalloc_obj(*lag_definer);
 	if (!lag_definer)
 		return ERR_PTR(-ENOMEM);
@@ -356,7 +357,7 @@ static void mlx5_lag_destroy_definer(struct mlx5_lag *ldev,
 	if (first_idx < 0)
 		return;
 
-	dev = ldev->pf[first_idx].dev;
+	dev = mlx5_lag_pf(ldev, first_idx)->dev;
 	mlx5_ldev_for_each(i, first_idx, ldev) {
 		for (j = 0; j < ldev->buckets; j++) {
 			idx = i * ldev->buckets + j;
@@ -520,7 +521,7 @@ static int mlx5_lag_create_ttc_table(struct mlx5_lag *ldev)
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev = ldev->pf[first_idx].dev;
+	dev = mlx5_lag_pf(ldev, first_idx)->dev;
 	mlx5_lag_set_outer_ttc_params(ldev, &ttc_params);
 	port_sel->outer.ttc = mlx5_create_ttc_table(dev, &ttc_params);
 	return PTR_ERR_OR_ZERO(port_sel->outer.ttc);
@@ -536,7 +537,7 @@ static int mlx5_lag_create_inner_ttc_table(struct mlx5_lag *ldev)
 	if (first_idx < 0)
 		return -EINVAL;
 
-	dev = ldev->pf[first_idx].dev;
+	dev = mlx5_lag_pf(ldev, first_idx)->dev;
 	mlx5_lag_set_inner_ttc_params(ldev, &ttc_params);
 	port_sel->inner.ttc = mlx5_create_inner_ttc_table(dev, &ttc_params);
 	return PTR_ERR_OR_ZERO(port_sel->inner.ttc);
@@ -594,8 +595,9 @@ static int __mlx5_lag_modify_definers_destinations(struct mlx5_lag *ldev,
 			if (ldev->v2p_map[idx] == ports[idx])
 				continue;
 
-			dest.vport.vhca_id = MLX5_CAP_GEN(ldev->pf[ports[idx] - 1].dev,
-							  vhca_id);
+			dest.vport.vhca_id =
+				MLX5_CAP_GEN(mlx5_lag_pf(ldev, ports[idx] - 1)->dev,
+					     vhca_id);
 			err = mlx5_modify_rule_destination(def->rules[idx], &dest, NULL);
 			if (err)
 				return err;
-- 
2.44.0



* [PATCH mlx5-next 4/8] net/mlx5: LAG, use xa_alloc to manage LAG device indices
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
                   ` (2 preceding siblings ...)
  2026-03-08  6:55 ` [PATCH mlx5-next 3/8] net/mlx5: LAG, replace pf array with xarray Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 5/8] net/mlx5: E-switch, modify peer miss rule index to vhca_id Tariq Toukan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

Replace the use of mlx5_get_dev_index() for xarray indexing with
xa_alloc() to dynamically allocate indices. This decouples the LAG
xarray index from the physical device index.

Update mlx5_ldev_add_netdev/remove_mdev to find entries by dev pointer
and replace mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1) calls with
mlx5_lag_get_master_idx() where appropriate.

No functional changes intended.
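
For reference, a condensed sketch of the mark-based master lookup used
below (variable declarations are elided in the real hunks):

    unsigned long idx = 0;
    void *entry;

    /* Remember which xarray slot holds the master PF ... */
    xa_set_mark(&ldev->pfs, master_xa_idx, MLX5_LAG_XA_MARK_MASTER);

    /* ... and find it again later without assuming a fixed slot */
    entry = xa_find(&ldev->pfs, &idx, U8_MAX, MLX5_LAG_XA_MARK_MASTER);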

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 242 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  29 +++
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |   3 +-
 .../mellanox/mlx5/core/lag/port_sel.c         |  12 +-
 4 files changed, 230 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 81b1f84f902e..4beee64c937a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -288,7 +288,7 @@ static struct mlx5_lag *mlx5_lag_dev_alloc(struct mlx5_core_dev *dev)
 
 	kref_init(&ldev->ref);
 	mutex_init(&ldev->lock);
-	xa_init(&ldev->pfs);
+	xa_init_flags(&ldev->pfs, XA_FLAGS_ALLOC);
 	INIT_DELAYED_WORK(&ldev->bond_work, mlx5_do_bond_work);
 	INIT_WORK(&ldev->speed_update_work, mlx5_mpesw_speed_update_work);
 
@@ -326,14 +326,42 @@ int mlx5_lag_dev_get_netdev_idx(struct mlx5_lag *ldev,
 	return -ENOENT;
 }
 
+static int mlx5_lag_get_master_idx(struct mlx5_lag *ldev)
+{
+	unsigned long idx = 0;
+	void *entry;
+
+	if (!ldev)
+		return -ENOENT;
+
+	entry = xa_find(&ldev->pfs, &idx, U8_MAX, MLX5_LAG_XA_MARK_MASTER);
+	if (!entry)
+		return -ENOENT;
+
+	return (int)idx;
+}
+
 int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq)
 {
-	int i, num = 0;
+	int master_idx, i, num = 0;
 
 	if (!ldev)
 		return -ENOENT;
 
+	master_idx = mlx5_lag_get_master_idx(ldev);
+
+	/* If seq 0 is requested and there's a primary PF, return it */
+	if (master_idx >= 0) {
+		if (seq == 0)
+			return master_idx;
+		num++;
+	}
+
 	mlx5_ldev_for_each(i, 0, ldev) {
+		/* Skip the primary PF in the loop */
+		if (i == master_idx)
+			continue;
+
 		if (num == seq)
 			return i;
 		num++;
@@ -341,6 +369,75 @@ int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq)
 	return -ENOENT;
 }
 
+/* Devcom events for LAG master marking */
+#define LAG_DEVCOM_PAIR		(0)
+#define LAG_DEVCOM_UNPAIR	(1)
+
+static void mlx5_lag_mark_master(struct mlx5_lag *ldev)
+{
+	int lowest_dev_idx = INT_MAX;
+	struct lag_func *pf;
+	int master_xa_idx = -1;
+	int dev_idx;
+	int i;
+
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		dev_idx = mlx5_get_dev_index(pf->dev);
+		if (dev_idx < lowest_dev_idx) {
+			lowest_dev_idx = dev_idx;
+			master_xa_idx = i;
+		}
+	}
+
+	if (master_xa_idx >= 0)
+		xa_set_mark(&ldev->pfs, master_xa_idx, MLX5_LAG_XA_MARK_MASTER);
+}
+
+static void mlx5_lag_clear_master(struct mlx5_lag *ldev)
+{
+	unsigned long idx = 0;
+	void *entry;
+
+	entry = xa_find(&ldev->pfs, &idx, U8_MAX, MLX5_LAG_XA_MARK_MASTER);
+	if (!entry)
+		return;
+
+	xa_clear_mark(&ldev->pfs, idx, MLX5_LAG_XA_MARK_MASTER);
+}
+
+/* Devcom event handler to manage LAG master marking */
+static int mlx5_lag_devcom_event(int event, void *my_data, void *event_data)
+{
+	struct mlx5_core_dev *dev = my_data;
+	struct mlx5_lag *ldev;
+	int idx;
+
+	ldev = mlx5_lag_dev(dev);
+	if (!ldev)
+		return 0;
+
+	mutex_lock(&ldev->lock);
+	switch (event) {
+	case LAG_DEVCOM_PAIR:
+		/* No need to mark more than once */
+		idx = mlx5_lag_get_master_idx(ldev);
+		if (idx >= 0)
+			break;
+		/* Check if all LAG ports are now registered */
+		if (mlx5_lag_num_devs(ldev) == ldev->ports)
+			mlx5_lag_mark_master(ldev);
+		break;
+
+	case LAG_DEVCOM_UNPAIR:
+		/* Clear master mark when a device is removed */
+		mlx5_lag_clear_master(ldev);
+		break;
+	}
+	mutex_unlock(&ldev->lock);
+	return 0;
+}
+
 int mlx5_lag_num_devs(struct mlx5_lag *ldev)
 {
 	int i, num = 0;
@@ -411,11 +508,12 @@ static void mlx5_infer_tx_affinity_mapping(struct lag_tracker *tracker,
 
 	/* Use native mapping by default where each port's buckets
 	 * point the native port: 1 1 1 .. 1 2 2 2 ... 2 3 3 3 ... 3 etc
+	 * ports[] values are 1-indexed device indices for FW.
 	 */
 	mlx5_ldev_for_each(i, 0, ldev) {
 		for (j = 0; j < buckets; j++) {
 			idx = i * buckets + j;
-			ports[idx] = i + 1;
+			ports[idx] = mlx5_lag_xa_to_dev_idx(ldev, i) + 1;
 		}
 	}
 
@@ -427,8 +525,12 @@ static void mlx5_infer_tx_affinity_mapping(struct lag_tracker *tracker,
 	/* Go over the disabled ports and for each assign a random active port */
 	for (i = 0; i < disabled_ports_num; i++) {
 		for (j = 0; j < buckets; j++) {
+			int rand_xa_idx;
+
 			get_random_bytes(&rand, 4);
-			ports[disabled[i] * buckets + j] = enabled[rand % enabled_ports_num] + 1;
+			rand_xa_idx = enabled[rand % enabled_ports_num];
+			ports[disabled[i] * buckets + j] =
+				mlx5_lag_xa_to_dev_idx(ldev, rand_xa_idx) + 1;
 		}
 	}
 }
@@ -683,20 +785,23 @@ char *mlx5_get_str_port_sel_mode(enum mlx5_lag_mode mode, unsigned long flags)
 
 static int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev)
 {
-	int first_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
+	int master_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	struct mlx5_eswitch *master_esw;
 	struct mlx5_core_dev *dev0;
 	int i, j;
 	int err;
 
-	if (first_idx < 0)
+	if (master_idx < 0)
 		return -EINVAL;
 
-	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
+	dev0 = mlx5_lag_pf(ldev, master_idx)->dev;
 	master_esw = dev0->priv.eswitch;
-	mlx5_ldev_for_each(i, first_idx + 1, ldev) {
+	mlx5_ldev_for_each(i, 0, ldev) {
 		struct mlx5_eswitch *slave_esw;
 
+		if (i == master_idx)
+			continue;
+
 		slave_esw = mlx5_lag_pf(ldev, i)->dev->priv.eswitch;
 
 		err = mlx5_eswitch_offloads_single_fdb_add_one(master_esw,
@@ -706,9 +811,12 @@ static int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev)
 	}
 	return 0;
 err:
-	mlx5_ldev_for_each_reverse(j, i, first_idx + 1, ldev)
+	mlx5_ldev_for_each_reverse(j, i, 0, ldev) {
+		if (j == master_idx)
+			continue;
 		mlx5_eswitch_offloads_single_fdb_del_one(master_esw,
 							 mlx5_lag_pf(ldev, j)->dev->priv.eswitch);
+	}
 	return err;
 }
 
@@ -717,8 +825,8 @@ static int mlx5_create_lag(struct mlx5_lag *ldev,
 			   enum mlx5_lag_mode mode,
 			   unsigned long flags)
 {
-	bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags);
 	int first_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
+	bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags);
 	u32 in[MLX5_ST_SZ_DW(destroy_lag_in)] = {};
 	struct mlx5_core_dev *dev0;
 	int err;
@@ -764,16 +872,17 @@ int mlx5_activate_lag(struct mlx5_lag *ldev,
 		      enum mlx5_lag_mode mode,
 		      bool shared_fdb)
 {
-	int first_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	bool roce_lag = mode == MLX5_LAG_MODE_ROCE;
 	struct mlx5_core_dev *dev0;
 	unsigned long flags = 0;
+	int master_idx;
 	int err;
 
-	if (first_idx < 0)
+	master_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
+	if (master_idx < 0)
 		return -EINVAL;
 
-	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
+	dev0 = mlx5_lag_pf(ldev, master_idx)->dev;
 	err = mlx5_lag_set_flags(ldev, mode, tracker, shared_fdb, &flags);
 	if (err)
 		return err;
@@ -817,7 +926,7 @@ int mlx5_activate_lag(struct mlx5_lag *ldev,
 
 int mlx5_deactivate_lag(struct mlx5_lag *ldev)
 {
-	int first_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
+	int master_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	u32 in[MLX5_ST_SZ_DW(destroy_lag_in)] = {};
 	bool roce_lag = __mlx5_lag_is_roce(ldev);
 	unsigned long flags = ldev->mode_flags;
@@ -826,19 +935,22 @@ int mlx5_deactivate_lag(struct mlx5_lag *ldev)
 	int err;
 	int i;
 
-	if (first_idx < 0)
+	if (master_idx < 0)
 		return -EINVAL;
 
-	dev0 = mlx5_lag_pf(ldev, first_idx)->dev;
+	dev0 = mlx5_lag_pf(ldev, master_idx)->dev;
 	master_esw = dev0->priv.eswitch;
 	ldev->mode = MLX5_LAG_MODE_NONE;
 	ldev->mode_flags = 0;
 	mlx5_lag_mp_reset(ldev);
 
 	if (test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags)) {
-		mlx5_ldev_for_each(i, first_idx + 1, ldev)
+		mlx5_ldev_for_each(i, 0, ldev) {
+			if (i == master_idx)
+				continue;
 			mlx5_eswitch_offloads_single_fdb_del_one(master_esw,
 								 mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
+		}
 		clear_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &flags);
 	}
 
@@ -868,7 +980,7 @@ int mlx5_deactivate_lag(struct mlx5_lag *ldev)
 
 bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 {
-	int first_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
+	int master_idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 #ifdef CONFIG_MLX5_ESWITCH
 	struct mlx5_core_dev *dev;
 	u8 mode;
@@ -877,7 +989,7 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 	bool roce_support;
 	int i;
 
-	if (first_idx < 0 || mlx5_lag_num_devs(ldev) != ldev->ports)
+	if (master_idx < 0 || mlx5_lag_num_devs(ldev) != ldev->ports)
 		return false;
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -888,7 +1000,7 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 			return false;
 	}
 
-	pf = mlx5_lag_pf(ldev, first_idx);
+	pf = mlx5_lag_pf(ldev, master_idx);
 	dev = pf->dev;
 	mode = mlx5_eswitch_mode(dev);
 	mlx5_ldev_for_each(i, 0, ldev) {
@@ -904,9 +1016,11 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
 			return false;
 	}
 #endif
-	pf = mlx5_lag_pf(ldev, first_idx);
+	pf = mlx5_lag_pf(ldev, master_idx);
 	roce_support = mlx5_get_roce_state(pf->dev);
-	mlx5_ldev_for_each(i, first_idx + 1, ldev) {
+	mlx5_ldev_for_each(i, 0, ldev) {
+		if (i == master_idx)
+			continue;
 		pf = mlx5_lag_pf(ldev, i);
 		if (mlx5_get_roce_state(pf->dev) != roce_support)
 			return false;
@@ -967,8 +1081,11 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
 			dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
 			mlx5_rescan_drivers_locked(dev0);
 		}
-		mlx5_ldev_for_each(i, idx + 1, ldev)
+		mlx5_ldev_for_each(i, 0, ldev) {
+			if (i == idx)
+				continue;
 			mlx5_nic_vport_disable_roce(mlx5_lag_pf(ldev, i)->dev);
+		}
 	}
 
 	err = mlx5_deactivate_lag(ldev);
@@ -986,14 +1103,18 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
 
 bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
 {
-	int idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	struct mlx5_core_dev *dev;
+	bool ret = false;
+	int idx;
 	int i;
 
+	idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	if (idx < 0)
 		return false;
 
-	mlx5_ldev_for_each(i, idx + 1, ldev) {
+	mlx5_ldev_for_each(i, 0, ldev) {
+		if (i == idx)
+			continue;
 		dev = mlx5_lag_pf(ldev, i)->dev;
 		if (is_mdev_switchdev_mode(dev) &&
 		    mlx5_eswitch_vport_match_metadata_enabled(dev->priv.eswitch) &&
@@ -1011,9 +1132,9 @@ bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
 	    mlx5_esw_offloads_devcom_is_ready(dev->priv.eswitch) &&
 	    MLX5_CAP_ESW(dev, esw_shared_ingress_acl) &&
 	    mlx5_eswitch_get_npeers(dev->priv.eswitch) == MLX5_CAP_GEN(dev, num_lag_ports) - 1)
-		return true;
+		ret = true;
 
-	return false;
+	return ret;
 }
 
 static bool mlx5_lag_is_roce_lag(struct mlx5_lag *ldev)
@@ -1239,12 +1360,16 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 			}
 
 			return;
-		} else if (roce_lag) {
+		}
+
+		if (roce_lag) {
 			struct mlx5_core_dev *dev;
 
 			dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
 			mlx5_rescan_drivers_locked(dev0);
-			mlx5_ldev_for_each(i, idx + 1, ldev) {
+			mlx5_ldev_for_each(i, 0, ldev) {
+				if (i == idx)
+					continue;
 				dev = mlx5_lag_pf(ldev, i)->dev;
 				if (mlx5_get_roce_state(dev))
 					mlx5_nic_vport_enable_roce(dev);
@@ -1598,15 +1723,21 @@ static void mlx5_ldev_add_netdev(struct mlx5_lag *ldev,
 				struct mlx5_core_dev *dev,
 				struct net_device *netdev)
 {
-	unsigned int fn = mlx5_get_dev_index(dev);
 	struct lag_func *pf;
 	unsigned long flags;
+	int i;
 
 	spin_lock_irqsave(&lag_lock, flags);
-	pf = mlx5_lag_pf(ldev, fn);
-	pf->netdev = netdev;
-	ldev->tracker.netdev_state[fn].link_up = 0;
-	ldev->tracker.netdev_state[fn].tx_enabled = 0;
+	/* Find pf entry by matching dev pointer */
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->dev == dev) {
+			pf->netdev = netdev;
+			ldev->tracker.netdev_state[i].link_up = 0;
+			ldev->tracker.netdev_state[i].tx_enabled = 0;
+			break;
+		}
+	}
 	spin_unlock_irqrestore(&lag_lock, flags);
 }
 
@@ -1631,23 +1762,22 @@ static void mlx5_ldev_remove_netdev(struct mlx5_lag *ldev,
 static int mlx5_ldev_add_mdev(struct mlx5_lag *ldev,
 			      struct mlx5_core_dev *dev)
 {
-	unsigned int fn = mlx5_get_dev_index(dev);
 	struct lag_func *pf;
+	u32 idx;
 	int err;
 
-	pf = xa_load(&ldev->pfs, fn);
-	if (!pf) {
-		pf = kzalloc_obj(*pf);
-		if (!pf)
-			return -ENOMEM;
+	pf = kzalloc_obj(*pf);
+	if (!pf)
+		return -ENOMEM;
 
-		err = xa_err(xa_store(&ldev->pfs, fn, pf, GFP_KERNEL));
-		if (err) {
-			kfree(pf);
-			return err;
-		}
+	err = xa_alloc(&ldev->pfs, &idx, pf, XA_LIMIT(0, MLX5_MAX_PORTS - 1),
+		       GFP_KERNEL);
+	if (err) {
+		kfree(pf);
+		return err;
 	}
 
+	pf->idx = idx;
 	pf->dev = dev;
 	dev->priv.lag = ldev;
 
@@ -1662,11 +1792,14 @@ static void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev,
 				  struct mlx5_core_dev *dev)
 {
 	struct lag_func *pf;
-	int fn;
+	int i;
 
-	fn = mlx5_get_dev_index(dev);
-	pf = xa_load(&ldev->pfs, fn);
-	if (!pf || pf->dev != dev)
+	mlx5_ldev_for_each(i, 0, ldev) {
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->dev == dev)
+			break;
+	}
+	if (i >= MLX5_MAX_PORTS)
 		return;
 
 	if (pf->port_change_nb.nb.notifier_call)
@@ -1674,7 +1807,7 @@ static void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev,
 
 	pf->dev = NULL;
 	dev->priv.lag = NULL;
-	xa_erase(&ldev->pfs, fn);
+	xa_erase(&ldev->pfs, pf->idx);
 	kfree(pf);
 }
 
@@ -1744,7 +1877,8 @@ static int mlx5_lag_register_hca_devcom_comp(struct mlx5_core_dev *dev)
 	dev->priv.hca_devcom_comp =
 		mlx5_devcom_register_component(dev->priv.devc,
 					       MLX5_DEVCOM_HCA_PORTS,
-					       &attr, NULL, dev);
+					       &attr, mlx5_lag_devcom_event,
+					       dev);
 	if (!dev->priv.hca_devcom_comp) {
 		mlx5_core_err(dev,
 			      "Failed to register devcom HCA component.");
@@ -1775,6 +1909,9 @@ void mlx5_lag_remove_mdev(struct mlx5_core_dev *dev)
 	}
 	mlx5_ldev_remove_mdev(ldev, dev);
 	mutex_unlock(&ldev->lock);
+	/* Send devcom event to notify peers that a device is being removed */
+	mlx5_devcom_send_event(dev->priv.hca_devcom_comp,
+			       LAG_DEVCOM_UNPAIR, LAG_DEVCOM_UNPAIR, dev);
 	mlx5_lag_unregister_hca_devcom_comp(dev);
 	mlx5_ldev_put(ldev);
 }
@@ -1798,6 +1935,9 @@ void mlx5_lag_add_mdev(struct mlx5_core_dev *dev)
 		msleep(100);
 		goto recheck;
 	}
+	/* Send devcom event to notify peers that a device was added */
+	mlx5_devcom_send_event(dev->priv.hca_devcom_comp,
+			       LAG_DEVCOM_PAIR, LAG_DEVCOM_UNPAIR, dev);
 	mlx5_ldev_add_debugfs(dev);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 09758871b3da..30cbd61768f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -7,6 +7,12 @@
 #include <linux/debugfs.h>
 
 #define MLX5_LAG_MAX_HASH_BUCKETS 16
+/* XArray mark for the LAG master device
+ * (device with lowest mlx5_get_dev_index).
+ * Note: XA_MARK_0 is reserved by XA_FLAGS_ALLOC for free-slot tracking.
+ */
+#define MLX5_LAG_XA_MARK_MASTER XA_MARK_1
+
 #include "mlx5_core.h"
 #include "mp.h"
 #include "port_sel.h"
@@ -39,6 +45,7 @@ struct lag_func {
 	struct mlx5_core_dev *dev;
 	struct net_device    *netdev;
 	bool has_drop;
+	unsigned int idx; /* xarray index assigned by LAG */
 	struct mlx5_nb port_change_nb;
 };
 
@@ -90,6 +97,28 @@ mlx5_lag_pf(struct mlx5_lag *ldev, unsigned int idx)
 	return xa_load(&ldev->pfs, idx);
 }
 
+/* Get device index (mlx5_get_dev_index) from xarray index */
+static inline int mlx5_lag_xa_to_dev_idx(struct mlx5_lag *ldev, int xa_idx)
+{
+	struct lag_func *pf = mlx5_lag_pf(ldev, xa_idx);
+
+	return pf ? mlx5_get_dev_index(pf->dev) : -ENOENT;
+}
+
+/* Find lag_func by device index (reverse lookup from mlx5_get_dev_index) */
+static inline struct lag_func *
+mlx5_lag_pf_by_dev_idx(struct mlx5_lag *ldev, int dev_idx)
+{
+	struct lag_func *pf;
+	unsigned long idx;
+
+	xa_for_each(&ldev->pfs, idx, pf) {
+		if (mlx5_get_dev_index(pf->dev) == dev_idx)
+			return pf;
+	}
+	return NULL;
+}
+
 static inline bool
 __mlx5_lag_is_active(struct mlx5_lag *ldev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
index 0e7d206cd594..5eea12a6887a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
@@ -67,9 +67,9 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
 
 static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
 {
+	int idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	struct mlx5_core_dev *dev0;
 	int err;
-	int idx;
 	int i;
 
 	if (ldev->mode == MLX5_LAG_MODE_MPESW)
@@ -78,7 +78,6 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
 	if (ldev->mode != MLX5_LAG_MODE_NONE)
 		return -EINVAL;
 
-	idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
 	if (idx < 0)
 		return -EINVAL;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
index 7e9e3e81977d..2a034b2a3eee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/port_sel.c
@@ -84,8 +84,11 @@ static int mlx5_lag_create_port_sel_table(struct mlx5_lag *ldev,
 			idx = i * ldev->buckets + j;
 			affinity = ports[idx];
 
+			/* affinity is 1-indexed device index,
+			 * use reverse lookup.
+			 */
 			dest.vport.vhca_id =
-				MLX5_CAP_GEN(mlx5_lag_pf(ldev, affinity - 1)->dev,
+				MLX5_CAP_GEN(mlx5_lag_pf_by_dev_idx(ldev, affinity - 1)->dev,
 					     vhca_id);
 			lag_definer->rules[idx] = mlx5_add_flow_rules(lag_definer->ft,
 								      NULL, &flow_act,
@@ -358,7 +361,7 @@ static void mlx5_lag_destroy_definer(struct mlx5_lag *ldev,
 		return;
 
 	dev = mlx5_lag_pf(ldev, first_idx)->dev;
-	mlx5_ldev_for_each(i, first_idx, ldev) {
+	mlx5_ldev_for_each(i, 0, ldev) {
 		for (j = 0; j < ldev->buckets; j++) {
 			idx = i * ldev->buckets + j;
 			mlx5_del_flow_rules(lag_definer->rules[idx]);
@@ -595,8 +598,11 @@ static int __mlx5_lag_modify_definers_destinations(struct mlx5_lag *ldev,
 			if (ldev->v2p_map[idx] == ports[idx])
 				continue;
 
+			/* ports[] contains 1-indexed device indices,
+			 * use reverse lookup.
+			 */
 			dest.vport.vhca_id =
-				MLX5_CAP_GEN(mlx5_lag_pf(ldev, ports[idx] - 1)->dev,
+				MLX5_CAP_GEN(mlx5_lag_pf_by_dev_idx(ldev, ports[idx] - 1)->dev,
 					     vhca_id);
 			err = mlx5_modify_rule_destination(def->rules[idx], &dest, NULL);
 			if (err)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mlx5-next 5/8] net/mlx5: E-switch, modify peer miss rule index to vhca_id
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
                   ` (3 preceding siblings ...)
  2026-03-08  6:55 ` [PATCH mlx5-next 4/8] net/mlx5: LAG, use xa_alloc to manage LAG device indices Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 6/8] net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number Tariq Toukan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Replace the fixed-size array peer_miss_rules[MLX5_MAX_PORTS], indexed
by physical function index, with an xarray indexed by vhca_id.

This decouples peer_miss_rules from mlx5_get_dev_index(), removing the
dependency on a PF-derived index and the arbitrary MLX5_MAX_PORTS bounds
check. Using vhca_id as the key simplifies insertion/removal logic and
scales better across multi-peer topologies.

Initialize and destroy the xarray alongside the existing esw->paired
xarray in mlx5_esw_offloads_devcom_init/cleanup respectively.
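
For illustration, the resulting pattern reduces to the sketch below.
This is not the patch code itself; the helper names are hypothetical
and the rule creation/teardown around them is elided:

	/* Hedged sketch: bookkeeping once peer_miss_rules is an
	 * xarray keyed by the peer's vhca_id.
	 */
	static int peer_miss_rules_store(struct mlx5_eswitch *esw,
					 struct mlx5_core_dev *peer_dev,
					 struct mlx5_flow_handle **flows)
	{
		/* xa_insert() fails with -EBUSY on a duplicate key, so
		 * no MLX5_MAX_PORTS bounds check is needed anymore.
		 */
		return xa_insert(&esw->fdb_table.offloads.peer_miss_rules,
				 MLX5_CAP_GEN(peer_dev, vhca_id), flows,
				 GFP_KERNEL);
	}

	static struct mlx5_flow_handle **
	peer_miss_rules_del(struct mlx5_eswitch *esw,
			    struct mlx5_core_dev *peer_dev)
	{
		/* xa_erase() removes and returns the entry, or NULL */
		return xa_erase(&esw->fdb_table.offloads.peer_miss_rules,
				MLX5_CAP_GEN(peer_dev, vhca_id));
	}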

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  2 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     | 20 +++++++++----------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 6841caef02d1..96309a732d50 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -273,7 +273,7 @@ struct mlx5_eswitch_fdb {
 			struct mlx5_flow_group *send_to_vport_grp;
 			struct mlx5_flow_group *send_to_vport_meta_grp;
 			struct mlx5_flow_group *peer_miss_grp;
-			struct mlx5_flow_handle **peer_miss_rules[MLX5_MAX_PORTS];
+			struct xarray peer_miss_rules;
 			struct mlx5_flow_group *miss_grp;
 			struct mlx5_flow_handle **send_to_vport_meta_rules;
 			struct mlx5_flow_handle *miss_rule_uni;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 1366f6e489bd..90e6f97bdf4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1190,7 +1190,7 @@ static int esw_add_fdb_peer_miss_rules(struct mlx5_eswitch *esw,
 	struct mlx5_flow_handle *flow;
 	struct mlx5_vport *peer_vport;
 	struct mlx5_flow_spec *spec;
-	int err, pfindex;
+	int err;
 	unsigned long i;
 	void *misc;
 
@@ -1274,14 +1274,10 @@ static int esw_add_fdb_peer_miss_rules(struct mlx5_eswitch *esw,
 		}
 	}
 
-	pfindex = mlx5_get_dev_index(peer_dev);
-	if (pfindex >= MLX5_MAX_PORTS) {
-		esw_warn(esw->dev, "Peer dev index(%d) is over the max num defined(%d)\n",
-			 pfindex, MLX5_MAX_PORTS);
-		err = -EINVAL;
+	err = xa_insert(&esw->fdb_table.offloads.peer_miss_rules,
+			MLX5_CAP_GEN(peer_dev, vhca_id), flows, GFP_KERNEL);
+	if (err)
 		goto add_ec_vf_flow_err;
-	}
-	esw->fdb_table.offloads.peer_miss_rules[pfindex] = flows;
 
 	kvfree(spec);
 	return 0;
@@ -1323,12 +1319,13 @@ static void esw_del_fdb_peer_miss_rules(struct mlx5_eswitch *esw,
 					struct mlx5_core_dev *peer_dev)
 {
 	struct mlx5_eswitch *peer_esw = peer_dev->priv.eswitch;
-	u16 peer_index = mlx5_get_dev_index(peer_dev);
+	u16 peer_vhca_id = MLX5_CAP_GEN(peer_dev, vhca_id);
 	struct mlx5_flow_handle **flows;
 	struct mlx5_vport *peer_vport;
 	unsigned long i;
 
-	flows = esw->fdb_table.offloads.peer_miss_rules[peer_index];
+	flows = xa_erase(&esw->fdb_table.offloads.peer_miss_rules,
+			 peer_vhca_id);
 	if (!flows)
 		return;
 
@@ -1353,7 +1350,6 @@ static void esw_del_fdb_peer_miss_rules(struct mlx5_eswitch *esw,
 	}
 
 	kvfree(flows);
-	esw->fdb_table.offloads.peer_miss_rules[peer_index] = NULL;
 }
 
 static int esw_add_fdb_miss_rule(struct mlx5_eswitch *esw)
@@ -3250,6 +3246,7 @@ void mlx5_esw_offloads_devcom_init(struct mlx5_eswitch *esw,
 		return;
 
 	xa_init(&esw->paired);
+	xa_init(&esw->fdb_table.offloads.peer_miss_rules);
 	esw->num_peers = 0;
 	esw->devcom = mlx5_devcom_register_component(esw->dev->priv.devc,
 						     MLX5_DEVCOM_ESW_OFFLOADS,
@@ -3277,6 +3274,7 @@ void mlx5_esw_offloads_devcom_cleanup(struct mlx5_eswitch *esw)
 
 	mlx5_devcom_unregister_component(esw->devcom);
 	xa_destroy(&esw->paired);
+	xa_destroy(&esw->fdb_table.offloads.peer_miss_rules);
 	esw->devcom = NULL;
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mlx5-next 6/8] net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
                   ` (4 preceding siblings ...)
  2026-03-08  6:55 ` [PATCH mlx5-next 5/8] net/mlx5: E-switch, modify peer miss rule index to vhca_id Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 7/8] net/mlx5: Add VHCA RX flow destination support for FW steering Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules Tariq Toukan
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Introduce mlx5_lag_get_dev_seq() which returns a device's sequence
number within the LAG: master is always 0, remaining devices numbered
sequentially. This provides a stable index for peer flow tracking and
vport ordering without depending on native_port_num.

Replace mlx5_get_dev_index() usage in en_tc.c (peer flow array
indexing) and ib_rep.c (vport index ordering) with the new API.
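
As a rough usage sketch (hypothetical caller; only
mlx5_lag_get_dev_seq() below comes from this patch, the surrounding
names follow the existing en_tc.c cleanup loop):

	#include <linux/mlx5/lag.h>

	/* Sketch: index per-peer flow lists by LAG sequence number
	 * instead of the native device index.
	 */
	static void clean_peer_flows(struct mlx5_eswitch *esw,
				     struct mlx5_eswitch *peer_esw)
	{
		struct mlx5e_tc_flow *flow, *tmp;
		int i = mlx5_lag_get_dev_seq(peer_esw->dev);

		if (i < 0)	/* -ENOENT: device not in a LAG */
			return;

		list_for_each_entry_safe(flow, tmp,
					 &esw->offloads.peer_flows[i],
					 peer[i])
			mlx5e_tc_del_fdb_peers_flow(flow);
	}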

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/infiniband/hw/mlx5/ib_rep.c           |  4 ++-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  9 ++---
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 34 +++++++++++++++++++
 include/linux/mlx5/lag.h                      | 11 ++++++
 4 files changed, 53 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/mlx5/lag.h

diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index 621834d75205..df8f049c5806 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
  */
 
+#include <linux/mlx5/lag.h>
 #include <linux/mlx5/vport.h>
 #include "ib_rep.h"
 #include "srq.h"
@@ -134,7 +135,8 @@ mlx5_ib_vport_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
 				/* Only 1 ib port is the representor for all uplinks */
 					peer_n_ports--;
 
-				if (mlx5_get_dev_index(peer_dev) < mlx5_get_dev_index(dev))
+				if (mlx5_lag_get_dev_seq(peer_dev) <
+				    mlx5_lag_get_dev_seq(dev))
 					vport_index += peer_n_ports;
 			}
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1434b65d4746..397a93584fd6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -35,6 +35,7 @@
 #include <net/sch_generic.h>
 #include <net/pkt_cls.h>
 #include <linux/mlx5/fs.h>
+#include <linux/mlx5/lag.h>
 #include <linux/mlx5/device.h>
 #include <linux/rhashtable.h>
 #include <linux/refcount.h>
@@ -2131,7 +2132,7 @@ static void mlx5e_tc_del_fdb_peer_flow(struct mlx5e_tc_flow *flow,
 	mutex_unlock(&esw->offloads.peer_mutex);
 
 	list_for_each_entry_safe(peer_flow, tmp, &flow->peer_flows, peer_flows) {
-		if (peer_index != mlx5_get_dev_index(peer_flow->priv->mdev))
+		if (peer_index != mlx5_lag_get_dev_seq(peer_flow->priv->mdev))
 			continue;
 
 		list_del(&peer_flow->peer_flows);
@@ -2154,7 +2155,7 @@ static void mlx5e_tc_del_fdb_peers_flow(struct mlx5e_tc_flow *flow)
 
 	devcom = flow->priv->mdev->priv.eswitch->devcom;
 	mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
-		i = mlx5_get_dev_index(peer_esw->dev);
+		i = mlx5_lag_get_dev_seq(peer_esw->dev);
 		mlx5e_tc_del_fdb_peer_flow(flow, i);
 	}
 }
@@ -4584,7 +4585,7 @@ static int mlx5e_tc_add_fdb_peer_flow(struct flow_cls_offload *f,
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 	struct mlx5_esw_flow_attr *attr = flow->attr->esw_attr;
 	struct mlx5e_tc_flow_parse_attr *parse_attr;
-	int i = mlx5_get_dev_index(peer_esw->dev);
+	int i = mlx5_lag_get_dev_seq(peer_esw->dev);
 	struct mlx5e_rep_priv *peer_urpriv;
 	struct mlx5e_tc_flow *peer_flow;
 	struct mlx5_core_dev *in_mdev;
@@ -5525,7 +5526,7 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw)
 	devcom = esw->devcom;
 
 	mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
-		i = mlx5_get_dev_index(peer_esw->dev);
+		i = mlx5_lag_get_dev_seq(peer_esw->dev);
 		list_for_each_entry_safe(flow, tmp, &esw->offloads.peer_flows[i], peer[i])
 			mlx5e_tc_del_fdb_peers_flow(flow);
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 4beee64c937a..51ec8f61ecbb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/driver.h>
 #include <linux/mlx5/eswitch.h>
 #include <linux/mlx5/vport.h>
+#include <linux/mlx5/lag.h>
 #include "lib/mlx5.h"
 #include "lib/devcom.h"
 #include "mlx5_core.h"
@@ -369,6 +370,39 @@ int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq)
 	return -ENOENT;
 }
 
+/* Reverse of mlx5_lag_get_dev_index_by_seq: given a device, return its
+ * sequence number in the LAG. Master is always 0, others numbered
+ * sequentially starting from 1.
+ */
+int mlx5_lag_get_dev_seq(struct mlx5_core_dev *dev)
+{
+	struct mlx5_lag *ldev = mlx5_lag_dev(dev);
+	int master_idx, i, num = 1;
+	struct lag_func *pf;
+
+	if (!ldev)
+		return -ENOENT;
+
+	master_idx = mlx5_lag_get_master_idx(ldev);
+	if (master_idx < 0)
+		return -ENOENT;
+
+	pf = mlx5_lag_pf(ldev, master_idx);
+	if (pf && pf->dev == dev)
+		return 0;
+
+	mlx5_ldev_for_each(i, 0, ldev) {
+		if (i == master_idx)
+			continue;
+		pf = mlx5_lag_pf(ldev, i);
+		if (pf->dev == dev)
+			return num;
+		num++;
+	}
+	return -ENOENT;
+}
+EXPORT_SYMBOL(mlx5_lag_get_dev_seq);
+
 /* Devcom events for LAG master marking */
 #define LAG_DEVCOM_PAIR		(0)
 #define LAG_DEVCOM_UNPAIR	(1)
diff --git a/include/linux/mlx5/lag.h b/include/linux/mlx5/lag.h
new file mode 100644
index 000000000000..d370dfd19055
--- /dev/null
+++ b/include/linux/mlx5/lag.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5_LAG_API_H__
+#define __MLX5_LAG_API_H__
+
+struct mlx5_core_dev;
+
+int mlx5_lag_get_dev_seq(struct mlx5_core_dev *dev);
+
+#endif /* __MLX5_LAG_API_H__ */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mlx5-next 7/8] net/mlx5: Add VHCA RX flow destination support for FW steering
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
                   ` (5 preceding siblings ...)
  2026-03-08  6:55 ` [PATCH mlx5-next 6/8] net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08  6:55 ` [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules Tariq Toukan
  7 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Introduce MLX5_FLOW_DESTINATION_TYPE_VHCA_RX as a new flow steering
destination type.

Wire the new destination through flow steering command setup by mapping
it to MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX and passing the vhca id,
extend forward-destination validation to accept it, and teach the flow
steering tracepoint formatter to print rx_vhca_id.
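
A minimal sketch of a rule targeting the new destination type; the
wrapper function is illustrative and assumes ft, spec and flow_act
are prepared by the caller:

	#include <linux/mlx5/fs.h>

	static struct mlx5_flow_handle *
	fwd_to_vhca_rx(struct mlx5_flow_table *ft,
		       struct mlx5_flow_spec *spec,
		       struct mlx5_flow_act *flow_act, u16 vhca_id)
	{
		struct mlx5_flow_destination dest = {};

		flow_act->action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
		dest.type = MLX5_FLOW_DESTINATION_TYPE_VHCA_RX;
		dest.vhca.id = vhca_id;

		/* single destination: the target VHCA's RX pipeline */
		return mlx5_add_flow_rules(ft, spec, flow_act, &dest, 1);
	}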

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c   | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c           | 4 ++++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c          | 7 +++++--
 include/linux/mlx5/fs.h                                    | 4 ++++
 4 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
index 6d73127b7217..2cf1d3825def 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
@@ -282,6 +282,9 @@ const char *parse_fs_dst(struct trace_seq *p,
 	case MLX5_FLOW_DESTINATION_TYPE_NONE:
 		trace_seq_printf(p, "none\n");
 		break;
+	case MLX5_FLOW_DESTINATION_TYPE_VHCA_RX:
+		trace_seq_printf(p, "rx_vhca_id=%u\n", dst->vhca.id);
+		break;
 	}
 
 	trace_seq_putc(p, 0);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 16b28028609d..1cd4cd898ec2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -716,6 +716,10 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 				id = dst->dest_attr.ft->id;
 				ifc_type = MLX5_IFC_FLOW_DESTINATION_TYPE_TABLE_TYPE;
 				break;
+			case MLX5_FLOW_DESTINATION_TYPE_VHCA_RX:
+				id = dst->dest_attr.vhca.id;
+				ifc_type = MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX;
+				break;
 			default:
 				id = dst->dest_attr.tir_num;
 				ifc_type = MLX5_IFC_FLOW_DESTINATION_TYPE_TIR;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 2c3544880a30..003d211306a7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -503,7 +503,8 @@ static bool is_fwd_dest_type(enum mlx5_flow_destination_type type)
 		type == MLX5_FLOW_DESTINATION_TYPE_FLOW_SAMPLER ||
 		type == MLX5_FLOW_DESTINATION_TYPE_TIR ||
 		type == MLX5_FLOW_DESTINATION_TYPE_RANGE ||
-		type == MLX5_FLOW_DESTINATION_TYPE_TABLE_TYPE;
+		type == MLX5_FLOW_DESTINATION_TYPE_TABLE_TYPE ||
+		type == MLX5_FLOW_DESTINATION_TYPE_VHCA_RX;
 }
 
 static bool check_valid_spec(const struct mlx5_flow_spec *spec)
@@ -1890,7 +1891,9 @@ static bool mlx5_flow_dests_cmp(struct mlx5_flow_destination *d1,
 		     d1->range.hit_ft == d2->range.hit_ft &&
 		     d1->range.miss_ft == d2->range.miss_ft &&
 		     d1->range.min == d2->range.min &&
-		     d1->range.max == d2->range.max))
+		     d1->range.max == d2->range.max) ||
+		    (d1->type == MLX5_FLOW_DESTINATION_TYPE_VHCA_RX &&
+		     d1->vhca.id == d2->vhca.id))
 			return true;
 	}
 
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 9cadb1d5e6df..02064424e868 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -55,6 +55,7 @@ enum mlx5_flow_destination_type {
 	MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE_NUM,
 	MLX5_FLOW_DESTINATION_TYPE_RANGE,
 	MLX5_FLOW_DESTINATION_TYPE_TABLE_TYPE,
+	MLX5_FLOW_DESTINATION_TYPE_VHCA_RX,
 };
 
 enum {
@@ -189,6 +190,9 @@ struct mlx5_flow_destination {
 		u32			ft_num;
 		struct mlx5_flow_table	*ft;
 		struct mlx5_fc          *counter;
+		struct {
+			u16		id;
+		} vhca;
 		struct {
 			u16		num;
 			u16		vhca_id;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
                   ` (6 preceding siblings ...)
  2026-03-08  6:55 ` [PATCH mlx5-next 7/8] net/mlx5: Add VHCA RX flow destination support for FW steering Tariq Toukan
@ 2026-03-08  6:55 ` Tariq Toukan
  2026-03-08 15:52   ` Jakub Kicinski
  7 siblings, 1 reply; 14+ messages in thread
From: Tariq Toukan @ 2026-03-08  6:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Mark Bloch, linux-kernel, linux-rdma, netdev,
	Gal Pressman, Dragos Tatulea, Moshe Shemesh, Shay Drory,
	Alexei Lazar

From: Shay Drory <shayd@nvidia.com>

Downstream patches will introduce SW-only LAG (e.g. shared_fdb without
HW LAG). In this mode the firmware cannot create the LAG demux table,
but vport demuxing is still required.

Move LAG demux flow-table ownership to the LAG layer and introduce APIs
to init/cleanup the demux table and add/delete per-vport rules. Adjust
the RDMA driver to use the new APIs.

In this mode, the LAG layer will create a flow group that matches vport
metadata. Vports that are not native to the LAG master eswitch add the
demux rule during IB representor load and remove it on unload.
The demux rules forward traffic from said vports to their native eswitch
manager via a new dest type - MLX5_FLOW_DESTINATION_TYPE_VHCA_RX.
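
A hedged end-to-end sketch of the new API from a consumer's point of
view (function and variable names here are illustrative, not part of
the patch):

	#include <linux/mlx5/lag.h>

	/* mlx5_lag_demux_init() internally picks the FW demux table
	 * or the SW-only table + flow group, based on the SD state.
	 */
	static int demux_example(struct mlx5_core_dev *mdev,
				 struct mlx5_core_dev *vport_dev,
				 u16 vport_num, int vport_index,
				 int num_ports)
	{
		struct mlx5_flow_table_attr ft_attr = {};
		int err;

		ft_attr.max_fte = num_ports;	/* one FTE per vport */

		err = mlx5_lag_demux_init(mdev, &ft_attr);
		if (err)
			return err;

		/* non-native vport: steer to its native eswitch
		 * manager via MLX5_FLOW_DESTINATION_TYPE_VHCA_RX
		 */
		err = mlx5_lag_demux_rule_add(vport_dev, vport_num,
					      vport_index);
		if (err)
			mlx5_lag_demux_cleanup(mdev);
		return err;
	}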

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/infiniband/hw/mlx5/ib_rep.c           |  20 ++-
 drivers/infiniband/hw/mlx5/main.c             |  21 +--
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |   1 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  12 ++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  83 +++++++++-
 .../net/ethernet/mellanox/mlx5/core/fs_core.c |  10 +-
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 152 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  12 ++
 include/linux/mlx5/fs.h                       |   6 +-
 include/linux/mlx5/lag.h                      |  10 ++
 10 files changed, 300 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index df8f049c5806..abedc5e2f7b7 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -10,11 +10,13 @@
 
 static int
 mlx5_ib_set_vport_rep(struct mlx5_core_dev *dev,
+		      struct mlx5_core_dev *rep_dev,
 		      struct mlx5_eswitch_rep *rep,
 		      int vport_index)
 {
 	struct mlx5_ib_dev *ibdev;
 	struct net_device *ndev;
+	int ret;
 
 	ibdev = mlx5_eswitch_uplink_get_proto_dev(dev->priv.eswitch, REP_IB);
 	if (!ibdev)
@@ -24,7 +26,17 @@ mlx5_ib_set_vport_rep(struct mlx5_core_dev *dev,
 	rep->rep_data[REP_IB].priv = ibdev;
 	ndev = mlx5_ib_get_rep_netdev(rep->esw, rep->vport);
 
-	return ib_device_set_netdev(&ibdev->ib_dev, ndev, vport_index + 1);
+	ret = ib_device_set_netdev(&ibdev->ib_dev, ndev, vport_index + 1);
+	if (ret)
+		return ret;
+
+	/* Only Vports that are not native to the LAG master eswitch need to add
+	 * demux rule.
+	 */
+	if (mlx5_eswitch_get_total_vports(dev) >= vport_index)
+		return 0;
+
+	return mlx5_lag_demux_rule_add(rep_dev, rep->vport, vport_index);
 }
 
 static void mlx5_ib_register_peer_vport_reps(struct mlx5_core_dev *mdev);
@@ -131,7 +143,7 @@ mlx5_ib_vport_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
 
 				if (mlx5_lag_is_master(peer_dev))
 					lag_master = peer_dev;
-				else if (!mlx5_lag_is_mpesw(dev))
+				else if (!mlx5_lag_is_mpesw(peer_dev))
 				/* Only 1 ib port is the representor for all uplinks */
 					peer_n_ports--;
 
@@ -145,7 +157,7 @@ mlx5_ib_vport_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
 	if (rep->vport == MLX5_VPORT_UPLINK && !new_uplink)
 		profile = &raw_eth_profile;
 	else
-		return mlx5_ib_set_vport_rep(lag_master, rep, vport_index);
+		return mlx5_ib_set_vport_rep(lag_master, dev, rep, vport_index);
 
 	if (mlx5_lag_is_shared_fdb(dev)) {
 		ret = mlx5_ib_take_transport(lag_master);
@@ -233,6 +245,8 @@ mlx5_ib_vport_rep_unload(struct mlx5_eswitch_rep *rep)
 		vport_index = i;
 	}
 
+	mlx5_lag_demux_rule_del(mdev, vport_index);
+
 	port = &dev->port[vport_index];
 
 	ib_device_set_netdev(&dev->ib_dev, NULL, vport_index + 1);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 635002e684a5..9fb0629978bd 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -26,6 +26,7 @@
 #include <linux/mlx5/fs.h>
 #include <linux/mlx5/eswitch.h>
 #include <linux/mlx5/driver.h>
+#include <linux/mlx5/lag.h>
 #include <linux/list.h>
 #include <rdma/ib_smi.h>
 #include <rdma/ib_umem_odp.h>
@@ -3678,12 +3679,12 @@ static void mlx5e_lag_event_unregister(struct mlx5_ib_dev *dev)
 
 static int mlx5_eth_lag_init(struct mlx5_ib_dev *dev)
 {
+	struct mlx5_flow_table_attr ft_attr = {};
 	struct mlx5_core_dev *mdev = dev->mdev;
-	struct mlx5_flow_namespace *ns = mlx5_get_flow_namespace(mdev,
-								 MLX5_FLOW_NAMESPACE_LAG);
-	struct mlx5_flow_table *ft;
+	struct mlx5_flow_namespace *ns;
 	int err;
 
+	ns = mlx5_get_flow_namespace(mdev, MLX5_FLOW_NAMESPACE_LAG);
 	if (!ns || !mlx5_lag_is_active(mdev))
 		return 0;
 
@@ -3691,14 +3692,15 @@ static int mlx5_eth_lag_init(struct mlx5_ib_dev *dev)
 	if (err)
 		return err;
 
-	ft = mlx5_create_lag_demux_flow_table(ns, 0, 0);
-	if (IS_ERR(ft)) {
-		err = PTR_ERR(ft);
+	ft_attr.level = 0;
+	ft_attr.prio = 0;
+	ft_attr.max_fte = dev->num_ports;
+
+	err = mlx5_lag_demux_init(mdev, &ft_attr);
+	if (err)
 		goto err_destroy_vport_lag;
-	}
 
 	mlx5e_lag_event_register(dev);
-	dev->flow_db->lag_demux_ft = ft;
 	dev->lag_ports = mlx5_lag_get_num_ports(mdev);
 	dev->lag_active = true;
 	return 0;
@@ -3716,8 +3718,7 @@ static void mlx5_eth_lag_cleanup(struct mlx5_ib_dev *dev)
 		dev->lag_active = false;
 
 		mlx5e_lag_event_unregister(dev);
-		mlx5_destroy_flow_table(dev->flow_db->lag_demux_ft);
-		dev->flow_db->lag_demux_ft = NULL;
+		mlx5_lag_demux_cleanup(mdev);
 
 		mlx5_cmd_destroy_vport_lag(mdev);
 	}
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 4f4114d95130..3fc31415e107 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -306,7 +306,6 @@ struct mlx5_ib_flow_db {
 	struct mlx5_ib_flow_prio	rdma_rx[MLX5_IB_NUM_FLOW_FT];
 	struct mlx5_ib_flow_prio	rdma_tx[MLX5_IB_NUM_FLOW_FT];
 	struct mlx5_ib_flow_prio	opfcs[MLX5_IB_OPCOUNTER_MAX];
-	struct mlx5_flow_table		*lag_demux_ft;
 	struct mlx5_ib_flow_prio        *rdma_transport_rx[MLX5_RDMA_TRANSPORT_BYPASS_PRIO];
 	struct mlx5_ib_flow_prio        *rdma_transport_tx[MLX5_RDMA_TRANSPORT_BYPASS_PRIO];
 	/* Protect flow steering bypass flow tables
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 96309a732d50..9b729789537f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -940,6 +940,12 @@ int mlx5_esw_ipsec_vf_packet_offload_supported(struct mlx5_core_dev *dev,
 					       u16 vport_num);
 bool mlx5_esw_host_functions_enabled(const struct mlx5_core_dev *dev);
 void mlx5_eswitch_safe_aux_devs_remove(struct mlx5_core_dev *dev);
+struct mlx5_flow_group *
+mlx5_esw_lag_demux_fg_create(struct mlx5_eswitch *esw,
+			     struct mlx5_flow_table *ft);
+struct mlx5_flow_handle *
+mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
+			       struct mlx5_flow_table *lag_ft);
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
@@ -1025,6 +1031,12 @@ mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
 
 static inline void
 mlx5_eswitch_safe_aux_devs_remove(struct mlx5_core_dev *dev) {}
+static inline struct mlx5_flow_handle *
+mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
+			       struct mlx5_flow_table *lag_ft)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
 
 #endif /* CONFIG_MLX5_ESWITCH */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 90e6f97bdf4a..0d907fb7f290 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1459,6 +1459,83 @@ esw_add_restore_rule(struct mlx5_eswitch *esw, u32 tag)
 	return flow_rule;
 }
 
+struct mlx5_flow_group *
+mlx5_esw_lag_demux_fg_create(struct mlx5_eswitch *esw,
+			     struct mlx5_flow_table *ft)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5_flow_group *fg;
+	void *match_criteria;
+	void *flow_group_in;
+
+	if (!mlx5_eswitch_vport_match_metadata_enabled(esw))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (IS_ERR(ft))
+		return ERR_CAST(ft);
+
+	flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+	if (!flow_group_in)
+		return ERR_PTR(-ENOMEM);
+
+	match_criteria = MLX5_ADDR_OF(create_flow_group_in, flow_group_in,
+				      match_criteria);
+	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable,
+		 MLX5_MATCH_MISC_PARAMETERS_2);
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 0);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index,
+		 ft->max_fte - 1);
+
+	MLX5_SET(fte_match_param, match_criteria,
+		 misc_parameters_2.metadata_reg_c_0,
+		 mlx5_eswitch_get_vport_metadata_mask());
+
+	fg = mlx5_create_flow_group(ft, flow_group_in);
+	kvfree(flow_group_in);
+	if (IS_ERR(fg))
+		esw_warn(esw->dev, "Can't create LAG demux flow group\n");
+
+	return fg;
+}
+
+struct mlx5_flow_handle *
+mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
+			       struct mlx5_flow_table *lag_ft)
+{
+	struct mlx5_flow_spec *spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
+	struct mlx5_flow_destination dest = {};
+	struct mlx5_flow_act flow_act = {};
+	struct mlx5_flow_handle *ret;
+	void *misc;
+
+	if (!spec)
+		return ERR_PTR(-ENOMEM);
+
+	if (!mlx5_eswitch_vport_match_metadata_enabled(esw)) {
+		kfree(spec);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
+			    misc_parameters_2);
+	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
+		 mlx5_eswitch_get_vport_metadata_mask());
+	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS_2;
+
+	misc = MLX5_ADDR_OF(fte_match_param, spec->match_value,
+			    misc_parameters_2);
+	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
+		 mlx5_eswitch_get_vport_metadata_for_match(esw, vport_num));
+
+	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_VHCA_RX;
+	dest.vhca.id = MLX5_CAP_GEN(esw->dev, vhca_id);
+
+	ret = mlx5_add_flow_rules(lag_ft, spec, &flow_act, &dest, 1);
+	kfree(spec);
+	return ret;
+}
+
 #define MAX_PF_SQ 256
 #define MAX_SQ_NVPORTS 32
 
@@ -2047,7 +2124,8 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw)
 
 	if (IS_ERR(g)) {
 		err = PTR_ERR(g);
-		mlx5_core_warn(esw->dev, "Failed to create vport rx group err %d\n", err);
+		esw_warn(esw->dev, "Failed to create vport rx group err %d\n",
+			 err);
 		goto out;
 	}
 
@@ -2092,7 +2170,8 @@ static int esw_create_vport_rx_drop_group(struct mlx5_eswitch *esw)
 
 	if (IS_ERR(g)) {
 		err = PTR_ERR(g);
-		mlx5_core_warn(esw->dev, "Failed to create vport rx drop group err %d\n", err);
+		esw_warn(esw->dev,
+			 "Failed to create vport rx drop group err %d\n", err);
 		goto out;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 003d211306a7..61a6ba1e49dd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1438,15 +1438,9 @@ mlx5_create_vport_flow_table(struct mlx5_flow_namespace *ns,
 
 struct mlx5_flow_table*
 mlx5_create_lag_demux_flow_table(struct mlx5_flow_namespace *ns,
-				 int prio, u32 level)
+				 struct mlx5_flow_table_attr *ft_attr)
 {
-	struct mlx5_flow_table_attr ft_attr = {};
-
-	ft_attr.level = level;
-	ft_attr.prio  = prio;
-	ft_attr.max_fte = 1;
-
-	return __mlx5_create_flow_table(ns, &ft_attr, FS_FT_OP_MOD_LAG_DEMUX, 0);
+	return __mlx5_create_flow_table(ns, ft_attr, FS_FT_OP_MOD_LAG_DEMUX, 0);
 }
 EXPORT_SYMBOL(mlx5_create_lag_demux_flow_table);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 51ec8f61ecbb..449e4bd86c06 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1471,6 +1471,158 @@ struct mlx5_devcom_comp_dev *mlx5_lag_get_devcom_comp(struct mlx5_lag *ldev)
 	return devcom;
 }
 
+static int mlx5_lag_demux_ft_fg_init(struct mlx5_core_dev *dev,
+				     struct mlx5_flow_table_attr *ft_attr,
+				     struct mlx5_lag *ldev)
+{
+#ifdef CONFIG_MLX5_ESWITCH
+	struct mlx5_flow_namespace *ns;
+	struct mlx5_flow_group *fg;
+	int err;
+
+	ns = mlx5_get_flow_namespace(dev, MLX5_FLOW_NAMESPACE_LAG);
+	if (!ns)
+		return 0;
+
+	ldev->lag_demux_ft = mlx5_create_flow_table(ns, ft_attr);
+	if (IS_ERR(ldev->lag_demux_ft))
+		return PTR_ERR(ldev->lag_demux_ft);
+
+	fg = mlx5_esw_lag_demux_fg_create(dev->priv.eswitch,
+					  ldev->lag_demux_ft);
+	if (IS_ERR(fg)) {
+		err = PTR_ERR(fg);
+		mlx5_destroy_flow_table(ldev->lag_demux_ft);
+		ldev->lag_demux_ft = NULL;
+		return err;
+	}
+
+	ldev->lag_demux_fg = fg;
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
+static int mlx5_lag_demux_fw_init(struct mlx5_core_dev *dev,
+				  struct mlx5_flow_table_attr *ft_attr,
+				  struct mlx5_lag *ldev)
+{
+	struct mlx5_flow_namespace *ns;
+	int err;
+
+	ns = mlx5_get_flow_namespace(dev, MLX5_FLOW_NAMESPACE_LAG);
+	if (!ns)
+		return 0;
+
+	ldev->lag_demux_fg = NULL;
+	ft_attr->max_fte = 1;
+	ldev->lag_demux_ft = mlx5_create_lag_demux_flow_table(ns, ft_attr);
+	if (IS_ERR(ldev->lag_demux_ft)) {
+		err = PTR_ERR(ldev->lag_demux_ft);
+		ldev->lag_demux_ft = NULL;
+		return err;
+	}
+
+	return 0;
+}
+
+int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
+			struct mlx5_flow_table_attr *ft_attr)
+{
+	struct mlx5_lag *ldev;
+
+	if (!ft_attr)
+		return -EINVAL;
+
+	ldev = mlx5_lag_dev(dev);
+	if (!ldev)
+		return -ENODEV;
+
+	xa_init(&ldev->lag_demux_rules);
+
+	if (mlx5_get_sd(dev))
+		return mlx5_lag_demux_ft_fg_init(dev, ft_attr, ldev);
+
+	return mlx5_lag_demux_fw_init(dev, ft_attr, ldev);
+}
+EXPORT_SYMBOL(mlx5_lag_demux_init);
+
+void mlx5_lag_demux_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_flow_handle *rule;
+	struct mlx5_lag *ldev;
+	unsigned long vport_num;
+
+	ldev = mlx5_lag_dev(dev);
+	if (!ldev)
+		return;
+
+	xa_for_each(&ldev->lag_demux_rules, vport_num, rule)
+		mlx5_del_flow_rules(rule);
+	xa_destroy(&ldev->lag_demux_rules);
+
+	if (ldev->lag_demux_fg)
+		mlx5_destroy_flow_group(ldev->lag_demux_fg);
+	if (ldev->lag_demux_ft)
+		mlx5_destroy_flow_table(ldev->lag_demux_ft);
+	ldev->lag_demux_fg = NULL;
+	ldev->lag_demux_ft = NULL;
+}
+EXPORT_SYMBOL(mlx5_lag_demux_cleanup);
+
+int mlx5_lag_demux_rule_add(struct mlx5_core_dev *vport_dev, u16 vport_num,
+			    int index)
+{
+	struct mlx5_flow_handle *rule;
+	struct mlx5_lag *ldev;
+	int err;
+
+	ldev = mlx5_lag_dev(vport_dev);
+	if (!ldev || !ldev->lag_demux_fg)
+		return 0;
+
+	if (xa_load(&ldev->lag_demux_rules, index))
+		return 0;
+
+	rule = mlx5_esw_lag_demux_rule_create(vport_dev->priv.eswitch,
+					      vport_num, ldev->lag_demux_ft);
+	if (IS_ERR(rule)) {
+		err = PTR_ERR(rule);
+		mlx5_core_warn(vport_dev,
+			       "Failed to create LAG demux rule for vport %u, err %d\n",
+			       vport_num, err);
+		return err;
+	}
+
+	err = xa_err(xa_store(&ldev->lag_demux_rules, index, rule,
+			      GFP_KERNEL));
+	if (err) {
+		mlx5_del_flow_rules(rule);
+		mlx5_core_warn(vport_dev,
+			       "Failed to store LAG demux rule for vport %u, err %d\n",
+			       vport_num, err);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL(mlx5_lag_demux_rule_add);
+
+void mlx5_lag_demux_rule_del(struct mlx5_core_dev *dev, int index)
+{
+	struct mlx5_flow_handle *rule;
+	struct mlx5_lag *ldev;
+
+	ldev = mlx5_lag_dev(dev);
+	if (!ldev || !ldev->lag_demux_fg)
+		return;
+
+	rule = xa_erase(&ldev->lag_demux_rules, index);
+	if (rule)
+		mlx5_del_flow_rules(rule);
+}
+EXPORT_SYMBOL(mlx5_lag_demux_rule_del);
+
 static void mlx5_queue_bond_work(struct mlx5_lag *ldev, unsigned long delay)
 {
 	queue_delayed_work(ldev->wq, &ldev->bond_work, delay);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 30cbd61768f8..6c911374f409 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -5,6 +5,9 @@
 #define __MLX5_LAG_H__
 
 #include <linux/debugfs.h>
+#include <linux/errno.h>
+#include <linux/xarray.h>
+#include <linux/mlx5/fs.h>
 
 #define MLX5_LAG_MAX_HASH_BUCKETS 16
 /* XArray mark for the LAG master device
@@ -83,6 +86,9 @@ struct mlx5_lag {
 	/* Protect lag fields/state changes */
 	struct mutex		  lock;
 	struct lag_mpesw	  lag_mpesw;
+	struct mlx5_flow_table   *lag_demux_ft;
+	struct mlx5_flow_group   *lag_demux_fg;
+	struct xarray		  lag_demux_rules;
 };
 
 static inline struct mlx5_lag *
@@ -133,6 +139,12 @@ mlx5_lag_is_ready(struct mlx5_lag *ldev)
 
 bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev);
 bool mlx5_lag_check_prereq(struct mlx5_lag *ldev);
+int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
+			struct mlx5_flow_table_attr *ft_attr);
+void mlx5_lag_demux_cleanup(struct mlx5_core_dev *dev);
+int mlx5_lag_demux_rule_add(struct mlx5_core_dev *dev, u16 vport_num,
+			    int vport_index);
+void mlx5_lag_demux_rule_del(struct mlx5_core_dev *dev, int vport_index);
 void mlx5_modify_lag(struct mlx5_lag *ldev,
 		     struct lag_tracker *tracker);
 int mlx5_activate_lag(struct mlx5_lag *ldev,
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 02064424e868..d8f3b7ef319e 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -252,9 +252,9 @@ mlx5_create_auto_grouped_flow_table(struct mlx5_flow_namespace *ns,
 struct mlx5_flow_table *
 mlx5_create_vport_flow_table(struct mlx5_flow_namespace *ns,
 			     struct mlx5_flow_table_attr *ft_attr, u16 vport);
-struct mlx5_flow_table *mlx5_create_lag_demux_flow_table(
-					       struct mlx5_flow_namespace *ns,
-					       int prio, u32 level);
+struct mlx5_flow_table *
+mlx5_create_lag_demux_flow_table(struct mlx5_flow_namespace *ns,
+				 struct mlx5_flow_table_attr *ft_attr);
 int mlx5_destroy_flow_table(struct mlx5_flow_table *ft);
 
 /* inbox should be set with the following values:
diff --git a/include/linux/mlx5/lag.h b/include/linux/mlx5/lag.h
index d370dfd19055..ab9f754664e5 100644
--- a/include/linux/mlx5/lag.h
+++ b/include/linux/mlx5/lag.h
@@ -4,8 +4,18 @@
 #ifndef __MLX5_LAG_API_H__
 #define __MLX5_LAG_API_H__
 
+#include <linux/types.h>
+
 struct mlx5_core_dev;
+struct mlx5_flow_table;
+struct mlx5_flow_table_attr;
 
+int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
+			struct mlx5_flow_table_attr *ft_attr);
+void mlx5_lag_demux_cleanup(struct mlx5_core_dev *dev);
+int mlx5_lag_demux_rule_add(struct mlx5_core_dev *dev, u16 vport_num,
+			    int vport_index);
+void mlx5_lag_demux_rule_del(struct mlx5_core_dev *dev, int vport_index);
 int mlx5_lag_get_dev_seq(struct mlx5_core_dev *dev);
 
 #endif /* __MLX5_LAG_API_H__ */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-08  6:55 ` [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules Tariq Toukan
@ 2026-03-08 15:52   ` Jakub Kicinski
  2026-03-08 18:34     ` Mark Bloch
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2026-03-08 15:52 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Eric Dumazet,
	Paolo Abeni, Andrew Lunn, David S. Miller, Mark Bloch,
	linux-kernel, linux-rdma, netdev, Gal Pressman, Dragos Tatulea,
	Moshe Shemesh, Shay Drory, Alexei Lazar

On Sun, 8 Mar 2026 08:55:59 +0200 Tariq Toukan wrote:
> +struct mlx5_flow_handle *
> +mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
> +			       struct mlx5_flow_table *lag_ft)
> +{
> +	struct mlx5_flow_spec *spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
> +	struct mlx5_flow_destination dest = {};
> +	struct mlx5_flow_act flow_act = {};
> +	struct mlx5_flow_handle *ret;
> +	void *misc;
> +
> +	if (!spec)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (!mlx5_eswitch_vport_match_metadata_enabled(esw)) {
> +		kfree(spec);
> +		return ERR_PTR(-EOPNOTSUPP);
> +	}
> +
> +	misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
> +			    misc_parameters_2);
> +	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
> +		 mlx5_eswitch_get_vport_metadata_mask());
> +	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS_2;
> +
> +	misc = MLX5_ADDR_OF(fte_match_param, spec->match_value,
> +			    misc_parameters_2);
> +	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
> +		 mlx5_eswitch_get_vport_metadata_for_match(esw, vport_num));
> +
> +	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
> +	dest.type = MLX5_FLOW_DESTINATION_TYPE_VHCA_RX;
> +	dest.vhca.id = MLX5_CAP_GEN(esw->dev, vhca_id);
> +
> +	ret = mlx5_add_flow_rules(lag_ft, spec, &flow_act, &dest, 1);
> +	kfree(spec);
> +	return ret;
> +}

drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:1512:12-13: WARNING kvmalloc is used to allocate this memory at line 1502
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:1532:11-12: WARNING kvmalloc is used to allocate this memory at line 1502
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-08 15:52   ` Jakub Kicinski
@ 2026-03-08 18:34     ` Mark Bloch
  2026-03-09 21:33       ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Bloch @ 2026-03-08 18:34 UTC (permalink / raw)
  To: Jakub Kicinski, Tariq Toukan
  Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Eric Dumazet,
	Paolo Abeni, Andrew Lunn, David S. Miller, linux-kernel,
	linux-rdma, netdev, Gal Pressman, Dragos Tatulea, Moshe Shemesh,
	Shay Drory, Alexei Lazar



On 08/03/2026 17:52, Jakub Kicinski wrote:
> On Sun, 8 Mar 2026 08:55:59 +0200 Tariq Toukan wrote:
>> +struct mlx5_flow_handle *
>> +mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
>> +			       struct mlx5_flow_table *lag_ft)
>> +{
>> +	struct mlx5_flow_spec *spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
>> +	struct mlx5_flow_destination dest = {};
>> +	struct mlx5_flow_act flow_act = {};
>> +	struct mlx5_flow_handle *ret;
>> +	void *misc;
>> +
>> +	if (!spec)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	if (!mlx5_eswitch_vport_match_metadata_enabled(esw)) {
>> +		kfree(spec);
>> +		return ERR_PTR(-EOPNOTSUPP);
>> +	}
>> +
>> +	misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
>> +			    misc_parameters_2);
>> +	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
>> +		 mlx5_eswitch_get_vport_metadata_mask());
>> +	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS_2;
>> +
>> +	misc = MLX5_ADDR_OF(fte_match_param, spec->match_value,
>> +			    misc_parameters_2);
>> +	MLX5_SET(fte_match_set_misc2, misc, metadata_reg_c_0,
>> +		 mlx5_eswitch_get_vport_metadata_for_match(esw, vport_num));
>> +
>> +	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
>> +	dest.type = MLX5_FLOW_DESTINATION_TYPE_VHCA_RX;
>> +	dest.vhca.id = MLX5_CAP_GEN(esw->dev, vhca_id);
>> +
>> +	ret = mlx5_add_flow_rules(lag_ft, spec, &flow_act, &dest, 1);
>> +	kfree(spec);
>> +	return ret;
>> +}
> 
> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:1512:12-13: WARNING kvmalloc is used to allocate this memory at line 1502
> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:1532:11-12: WARNING kvmalloc is used to allocate this memory at line 1502
Hi Jakub,

Thanks for catching this. We’ll address it.

Also, I saw AI flagged issues on
“net/mlx5: LAG, replace pf array with xarray”.
Just for context, lag_lock is already a known problematic
area for us, and we do have plans to remove it. I ran the
review prompts locally in ORC mode, so I assume I saw the
same comments as NIPA.

So the issue raised there is not really a new one. lag_lock
already has some known issues today, but we do not expect to
hit this particular case in practice, since by the time
execution reaches mdev removal, the LAG should already have
been destroyed and the netdevs already removed from the driver's
internal structures.

Mark

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-08 18:34     ` Mark Bloch
@ 2026-03-09 21:33       ` Jakub Kicinski
  2026-03-10  6:05         ` Mark Bloch
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2026-03-09 21:33 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Tariq Toukan, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
	Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	linux-kernel, linux-rdma, netdev, Gal Pressman, Dragos Tatulea,
	Moshe Shemesh, Shay Drory, Alexei Lazar

On Sun, 8 Mar 2026 20:34:26 +0200 Mark Bloch wrote:
> Thanks for catching this. We’ll address it.
> 
> Also, I saw AI flagged issues on
> “net/mlx5: LAG, replace pf array with xarray”.
> Just for context, lag_lock is already a known problematic
> area for us, and we do have plans to remove it. I ran the
> review prompts locally in ORC mode, so I assume I saw the
> same comments as NIPA.
> 
> So the issue raised there is not really a new one. lag_lock
> already has some known issues today, but we do not expect to
> hit this particular case in practice, since by the time
> execution reaches mdev removal, the LAG should already have
> been destroyed and the netdevs already removed from the driver's
> internal structures.

Ack, I haven't looked at the AI review TBH.
As usual with known AI flags - should the explanation be part 
of the commit message?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-09 21:33       ` Jakub Kicinski
@ 2026-03-10  6:05         ` Mark Bloch
  2026-03-10 23:58           ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Bloch @ 2026-03-10  6:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
	Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	linux-kernel, linux-rdma, netdev, Gal Pressman, Dragos Tatulea,
	Moshe Shemesh, Shay Drory, Alexei Lazar



On 09/03/2026 23:33, Jakub Kicinski wrote:
> On Sun, 8 Mar 2026 20:34:26 +0200 Mark Bloch wrote:
>> Thanks for catching this. We’ll address it.
>>
>> Also, I saw AI flagged issues on
>> “net/mlx5: LAG, replace pf array with xarray”.
>> Just for context, lag_lock is already a known problematic
>> area for us, and we do have plans to remove it. I ran the
>> review prompts locally in ORC mode, so I assume I saw the
>> same comments as NIPA.
>>
>> So the issue raised there is not really a new one. lag_lock
>> already has some known issues today, but we do not expect to
>> hit this particular case in practice, since by the time
>> execution reaches mdev removal, the LAG should already have
>> been destroyed and the netdevs already removed from the driver's
>> internal structures.
> 
> Ack, I haven't looked at the AI review TBH.
> As usual with known AI flags - should the explanation be part 
> of the commit message?

That's an interesting question.
I'll try to give my $0.02 on the general case.
Out of curiosity I ran one of our upcoming internal series
through both Mason's prompts with Claude and our internal
AI review tool.

Mason's + Claude reported 3 false positives.

Our internal AI tool also reported 3 false positives (interestingly,
they were different issues) and 1 real issue, which I already knew
about since the author hasn't fixed it yet.

So in theory we could add a note like “AI tools may flag issues
X/Y/Z but those are not valid here”, but in practice it really
depends on which tool is used and how it's configured.

At the moment it seems that netdev/NIPA is using Mason's prompts
with Claude, so if anything that would probably be the default
reference.

The larger question is that running NIPA before submission is
not currently required. Are there any plans to make that part
of the submission expectations, rather than just encouraged?

Mark

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
  2026-03-10  6:05         ` Mark Bloch
@ 2026-03-10 23:58           ` Jakub Kicinski
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2026-03-10 23:58 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Tariq Toukan, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
	Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	linux-kernel, linux-rdma, netdev, Gal Pressman, Dragos Tatulea,
	Moshe Shemesh, Shay Drory, Alexei Lazar

On Tue, 10 Mar 2026 08:05:39 +0200 Mark Bloch wrote:
> On 09/03/2026 23:33, Jakub Kicinski wrote:
> > On Sun, 8 Mar 2026 20:34:26 +0200 Mark Bloch wrote:  
> >> Thanks for catching this. We’ll address it.
> >>
> >> Also, I saw the AI flagged issues on
> >> “net/mlx5: LAG, replace pf array with xarray”.
> >> Just for context, lag_lock is already a known problematic
> >> area for us, and we do have plans to remove it. I ran the
> >> review prompts locally in ORC mode, so I assume I saw the
> >> same comments as NIPA.
> >>
> >> So the issue raised there is not really a new one. lag_lock
> >> already has some known issues today, but we do not expect to
> >> hit this particular case in practice, since by the time
> >> execution reaches mdev removal, the LAG should already have
> >> been destroyed and the netdevs already removed from the driver's
> >> internal structures.
> > 
> > Ack, I haven't looked at the AI review, TBH.
> > As usual with known AI flags - should the explanation be part 
> > of the commit message?  
> 
> That's an interesting question.
> I'll try to give my $0.02 on the general case.
> Out of curiosity I ran one of our upcoming internal series
> through both Mason's prompts with Claude and our internal
> AI review tool.
> 
> Mason's + Claude reported 3 false positives.
> 
> Our internal AI tool also reported 3 false positives (interestingly,
> they were different issues) and 1 real issue, which I already knew
> about since the author hasn't fixed it yet.
> 
> So in theory we could add a note like “AI tools may flag issues
> X/Y/Z but those are not valid here”, but in practice it really
> depends on which tool is used and how it's configured.
> 
> At the moment it seems that netdev/NIPA is using Mason's prompts
> with Claude, so if anything that would probably be the default
> reference.
> 
> The larger question is that running NIPA before submission is
> not currently required. Are there any plans to make that part
> of the submission expectations, rather than just encouraged?

No, no, the process angle is not how I look at this.
We should only add comments to the commit message or code if there's
genuine ambiguity. Basically, if someone reading the code might also
get confused, there should be an explanation somewhere. We should not
be adding any code or explanations just to make tools happy.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-03-10 23:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-08  6:55 [PATCH mlx5-next 0/8] mlx5-next updates 2026-03-08 Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 1/8] net/mlx5: Add IFC bits for shared headroom pool PBMC support Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 2/8] net/mlx5: Add silent mode set/query and VHCA RX IFC bits Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 3/8] net/mlx5: LAG, replace pf array with xarray Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 4/8] net/mlx5: LAG, use xa_alloc to manage LAG device indices Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 5/8] net/mlx5: E-switch, modify peer miss rule index to vhca_id Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 6/8] net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 7/8] net/mlx5: Add VHCA RX flow destination support for FW steering Tariq Toukan
2026-03-08  6:55 ` [PATCH mlx5-next 8/8] {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules Tariq Toukan
2026-03-08 15:52   ` Jakub Kicinski
2026-03-08 18:34     ` Mark Bloch
2026-03-09 21:33       ` Jakub Kicinski
2026-03-10  6:05         ` Mark Bloch
2026-03-10 23:58           ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox