* [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default
@ 2026-05-01 4:16 Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
Hi,
Find a detailed description by Mark below.
Regards,
Tariq
This series addresses three problems that have been present for years, and
fixes one representor reload error-unwind case exposed while making the
reload path reusable.
First, there is no coordination between E-Switch reconfiguration and
representor registration. The E-Switch can be mid-way through a mode
change or VF count update while mlx5_ib walks in and registers or
unregisters representors. Nothing stops them. The race window is small
and there are no field reports, but it is clearly wrong.
Second, loading mlx5_ib while the device is already in switchdev mode
does not bring up the IB representors. mlx5_eswitch_register_vport_reps()
only stores callbacks; nobody triggers the actual load after registration.
Third, deployments that always use switchdev mode still need a manual
devlink mode change after probe. That makes automated provisioning more
fragile than needed.
The series fixes the registration race with a per-E-Switch representor
mutex. The lock is introduced first, then LAG shared-FDB and multiport
E-Switch transitions are adjusted so auxiliary device rescans and IB
representor reloads do not hold ldev->lock while taking the representor
lock. This keeps the intermediate commits bisectable before the stricter
E-Switch serialization and lock assertions are enabled.
After the LAG ordering is fixed, all E-Switch reconfiguration paths that
create, destroy, load, or unload representors take the representor mutex.
esw_mode_change() deliberately drops the mutex around
mlx5_rescan_drivers_locked(), because auxiliary probe and remove paths
re-enter mlx5_eswitch_register_vport_reps() and
mlx5_eswitch_unregister_vport_reps() on the same thread.
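The same-thread re-entry that forces esw_mode_change() to drop the mutex can be modeled in a few lines of userspace C. This is a minimal sketch, not driver code: the "lock" is a plain flag that detects the self-deadlock a non-recursive mutex would hit, and all names are illustrative.

```c
#include <assert.h>

static int reps_locked;          /* stands in for mutex_lock(&reps_lock) */
static int registered;

static void reps_lock_take(void)  { assert(!reps_locked); reps_locked = 1; }
static void reps_lock_drop(void)  { reps_locked = 0; }

static void register_vport_reps(void)
{
	reps_lock_take();         /* would self-deadlock if the caller held it */
	registered = 1;
	reps_lock_drop();
}

static void rescan_drivers(void)
{
	/* auxiliary probe/remove runs synchronously on this same thread */
	register_vport_reps();
}

static int mode_change(void)
{
	reps_lock_take();
	/* ... switch the E-Switch mode under the lock ... */
	reps_lock_drop();         /* drop around the re-entrant rescan */
	rescan_drivers();
	reps_lock_take();
	/* ... finish the reconfiguration ... */
	reps_lock_drop();
	return registered;
}
```

Keeping the assert in reps_lock_take() makes the alternative visible: if the drop/retake pair around rescan_drivers() is removed, the sketch aborts exactly where the kernel would deadlock.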
The shared-FDB peer IB registration path can hold one E-Switch
representor mutex and then register peer representor ops on another
E-Switch. The series annotates that case as nested locking so lockdep can
distinguish it from recursive locking on the same E-Switch.
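Why nesting two instances of the same lock class needs an annotation can also be shown with a small model. At runtime the two mutexes are distinct objects, so nothing deadlocks; lockdep, however, tracks lock classes rather than instances, which is why the series annotates the peer acquisition (mutex_lock_nested() with a nesting level in the kernel). The sketch below is illustrative only.

```c
#include <assert.h>

struct esw {
	int locked;               /* stands in for the per-E-Switch reps_lock */
	int peer_registered;
};

static void esw_reps_lock(struct esw *e)   { assert(!e->locked); e->locked = 1; }
static void esw_reps_unlock(struct esw *e) { e->locked = 0; }

static void register_peer_reps(struct esw *self, struct esw *peer)
{
	esw_reps_lock(self);
	/* Distinct instance, so no runtime deadlock. In the kernel both
	 * mutexes share one lock class, so this site needs a nested
	 * annotation to tell lockdep it is not recursion on one mutex. */
	esw_reps_lock(peer);
	peer->peer_registered = 1;
	esw_reps_unlock(peer);
	esw_reps_unlock(self);
}
```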
For the missing IB representors, mlx5_eswitch_register_vport_reps() queues
a work item that acquires the devlink lock and loads all relevant
representors. This is the change that actually fixes the long-standing
bug.
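The shape of that fix is: registration only stores the callbacks and queues work; the work item later takes the devlink lock and performs the load. A userspace sketch with a one-slot queue standing in for the kernel workqueue (all names illustrative, assuming a single pending item):

```c
#include <assert.h>

static int devlink_locked;
static int reps_loaded;

static void load_reps_work(void)
{
	assert(!devlink_locked);
	devlink_locked = 1;       /* devl_lock() in the kernel */
	reps_loaded = 1;          /* load all relevant representors */
	devlink_locked = 0;
}

static void (*pending_work)(void);

static void register_vport_reps(void)
{
	/* store the rep_ops callbacks ... then defer the actual load */
	pending_work = load_reps_work;
}

static void run_workqueue(void)
{
	if (pending_work) {
		pending_work();
		pending_work = 0;
	}
}
```

The deferral is the point: register_vport_reps() can be called from contexts where taking the devlink lock directly would be unsafe, and the representors still come up once the work runs.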
The reload path also learns to track which representor types were loaded by
the current attempt, so an error does not unload representors that were
already active before the retry.
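The unwind rule reduces to tracking a "loaded by this attempt" set and clearing only that set on failure. A compact sketch with a bitmask (types, helpers, and the failure injection are all illustrative, not the driver's API):

```c
#include <assert.h>

enum { REP_ETH, REP_IB, NUM_REP_TYPES };

/* Fake loader: fails for exactly one injected type. */
static int load_rep(int type, int fail_type)
{
	return type == fail_type ? -1 : 0;
}

static int reload_reps(unsigned int *active, int fail_type)
{
	unsigned int loaded_now = 0;

	for (int t = 0; t < NUM_REP_TYPES; t++) {
		if (*active & (1u << t))
			continue;             /* active before this attempt */
		if (load_rep(t, fail_type)) {
			*active &= ~loaded_now; /* unwind only our loads */
			return -1;
		}
		loaded_now |= 1u << t;
		*active |= 1u << t;
	}
	return 0;
}
```

Without loaded_now, the error path would have to clear everything, tearing down representor types that were already up before the retry.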
Patch 1 is cleanup. LAG and MPESW had the same representor reload
sequence duplicated in several places and the copies had started to
drift. This consolidates them into one helper.
Patch 2 adds the per-E-Switch representor lifecycle lock and helper APIs.
Patch 3 adjusts the LAG shared-FDB and multiport E-Switch transitions so
auxiliary device rescans and IB representor reloads run without
ldev->lock held while taking the representor lock.
Patch 4 protects the E-Switch reconfiguration, representor registration
and peer IB representor paths with the representor lock.
Patch 5 fixes representor load error unwind so only representor types
loaded by the current attempt are unloaded on failure.
Patch 6 moves the representor load triggered by
mlx5_eswitch_register_vport_reps() onto the work queue. This is the patch
that fixes IB representors not coming up when mlx5_ib is loaded while the
device is already in switchdev mode.
Patch 7 adds a driver profile that auto-enables switchdev at device init,
for deployments that always operate in switchdev mode and want to avoid a
manual devlink command after every probe.
V2:
Split v1 into two parts: the E-Switch workqueue deadlock fix and the
representor lifecycle changes. This is the second part; the first part
has already been accepted [1].
Patch 1: Add a cont_on_fail flag so callers can decide whether reload
should continue after a failure.
Patches 2, 3, 4: Replace the atomic-variable based scheme with a mutex,
per Jakub's feedback.
Patch 5: New patch that fixes the unwind on representor load failure.
Patch 7: Switch from profile 4 to profile 8. Since the profile mainly
targets E-Switch handling, keep it separate from the NIC profiles.
[1] https://lore.kernel.org/all/20260428051018.219093-1-tariqt@nvidia.com/
Link to V1:
https://lore.kernel.org/all/20260409115550.156419-1-tariqt@nvidia.com/
Mark Bloch (7):
net/mlx5: Lag: refactor representor reload handling
net/mlx5: E-Switch, add representor lifecycle lock
net/mlx5: Lag, avoid LAG and representor lock cycles
net/mlx5: E-Switch, serialize representor lifecycle
net/mlx5: E-Switch, unwind only newly loaded representor types
net/mlx5: E-switch, load reps via work queue after registration
net/mlx5: Add profile to auto-enable switchdev mode at device init
drivers/infiniband/hw/mlx5/ib_rep.c | 6 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 10 +
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 12 ++
.../mellanox/mlx5/core/eswitch_offloads.c | 186 ++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 171 ++++++++++++----
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 5 +
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 18 +-
.../ethernet/mellanox/mlx5/core/lib/devcom.c | 8 +
.../ethernet/mellanox/mlx5/core/lib/devcom.h | 1 +
.../net/ethernet/mellanox/mlx5/core/main.c | 43 +++-
.../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 +
include/linux/mlx5/driver.h | 2 +
include/linux/mlx5/eswitch.h | 6 +
13 files changed, 405 insertions(+), 68 deletions(-)
base-commit: 4e37987362bcac8909f2d4b4458f3aa645e41641
--
2.44.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
` (5 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
Representor reload during LAG/MPESW transitions is repeated in
several flows, and each open-coded loop was easy to get out of sync
when adding new flags or tweaking error handling. Move the sequencing
into a single helper so that all call sites share the same ordering
and checks.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 45 +++++++++++--------
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 2 +
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 12 ++---
3 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 449e4bd86c06..a474f970e056 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1093,6 +1093,27 @@ void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
}
}
+int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
+{
+ struct lag_func *pf;
+ int ret;
+ int i;
+
+ mlx5_ldev_for_each(i, 0, ldev) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (!(pf->dev->priv.flags & flags)) {
+ struct mlx5_eswitch *esw;
+
+ esw = pf->dev->priv.eswitch;
+ ret = mlx5_eswitch_reload_ib_reps(esw);
+ if (ret && !cont_on_fail)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
void mlx5_disable_lag(struct mlx5_lag *ldev)
{
bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
@@ -1130,9 +1151,8 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
mlx5_lag_add_devices(ldev);
if (shared_fdb)
- mlx5_ldev_for_each(i, 0, ldev)
- if (!(mlx5_lag_pf(ldev, i)->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV))
- mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
+ mlx5_lag_reload_ib_reps(ldev, MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV,
+ true);
}
bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
@@ -1388,10 +1408,8 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
if (err) {
if (shared_fdb || roce_lag)
mlx5_lag_add_devices(ldev);
- if (shared_fdb) {
- mlx5_ldev_for_each(i, 0, ldev)
- mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
- }
+ if (shared_fdb)
+ mlx5_lag_reload_ib_reps(ldev, 0, true);
return;
}
@@ -1409,24 +1427,15 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
mlx5_nic_vport_enable_roce(dev);
}
} else if (shared_fdb) {
- int i;
-
dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
mlx5_rescan_drivers_locked(dev0);
-
- mlx5_ldev_for_each(i, 0, ldev) {
- err = mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
- if (err)
- break;
- }
-
+ err = mlx5_lag_reload_ib_reps(ldev, 0, false);
if (err) {
dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
mlx5_rescan_drivers_locked(dev0);
mlx5_deactivate_lag(ldev);
mlx5_lag_add_devices(ldev);
- mlx5_ldev_for_each(i, 0, ldev)
- mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
+ mlx5_lag_reload_ib_reps(ldev, 0, true);
mlx5_core_err(dev0, "Failed to enable lag\n");
return;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 6c911374f409..daca8ebd5256 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -199,4 +199,6 @@ int mlx5_get_next_ldev_func(struct mlx5_lag *ldev, int start_idx);
int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq);
int mlx5_lag_num_devs(struct mlx5_lag *ldev);
int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
+int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags,
+ bool cont_on_fail);
#endif /* __MLX5_LAG_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
index 5eea12a6887a..edcd06f3be7a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
@@ -70,7 +70,6 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
int idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
struct mlx5_core_dev *dev0;
int err;
- int i;
if (ldev->mode == MLX5_LAG_MODE_MPESW)
return 0;
@@ -103,11 +102,9 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
mlx5_rescan_drivers_locked(dev0);
- mlx5_ldev_for_each(i, 0, ldev) {
- err = mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
- if (err)
- goto err_rescan_drivers;
- }
+ err = mlx5_lag_reload_ib_reps(ldev, 0, false);
+ if (err)
+ goto err_rescan_drivers;
mlx5_lag_set_vports_agg_speed(ldev);
@@ -119,8 +116,7 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
mlx5_deactivate_lag(ldev);
err_add_devices:
mlx5_lag_add_devices(ldev);
- mlx5_ldev_for_each(i, 0, ldev)
- mlx5_eswitch_reload_ib_reps(mlx5_lag_pf(ldev, i)->dev->priv.eswitch);
+ mlx5_lag_reload_ib_reps(ldev, 0, true);
mlx5_mpesw_metadata_cleanup(ldev);
return err;
}
--
2.44.0
* [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
` (4 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
Add a per-E-Switch mutex for serializing representor lifecycle work and
provide small helpers for taking and dropping it. Initialize and destroy
the mutex with the E-Switch offloads state.
Add the lock and helper API first. Follow-up patches will take the lock in
the individual representor lifecycle components. This keeps the functional
changes split by component and leaves this patch without intended behavior
change, making the series easier to review and keeping it bisectable.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 6 ++++++
.../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 12 ++++++++++++
2 files changed, 18 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 2fd601bd102f..3858690e09b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -316,6 +316,7 @@ struct mlx5_esw_offload {
DECLARE_HASHTABLE(termtbl_tbl, 8);
struct mutex termtbl_mutex; /* protects termtbl hash */
struct xarray vhca_map;
+ struct mutex reps_lock; /* protects representor load/unload/register */
const struct mlx5_eswitch_rep_ops *rep_ops[NUM_REP_TYPES];
u8 inline_mode;
atomic64_t num_flows;
@@ -951,6 +952,8 @@ mlx5_esw_lag_demux_fg_create(struct mlx5_eswitch *esw,
struct mlx5_flow_handle *
mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
struct mlx5_flow_table *lag_ft);
+void mlx5_esw_reps_block(struct mlx5_eswitch *esw);
+void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw);
#else /* CONFIG_MLX5_ESWITCH */
/* eswitch API stubs */
static inline int mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
@@ -1028,6 +1031,9 @@ mlx5_esw_host_functions_enabled(const struct mlx5_core_dev *dev)
return true;
}
+static inline void mlx5_esw_reps_block(struct mlx5_eswitch *esw) {}
+static inline void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw) {}
+
static inline bool
mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 69ddf56e2fc9..6a5143b63dfd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -2413,6 +2413,16 @@ static int esw_create_restore_table(struct mlx5_eswitch *esw)
return err;
}
+void mlx5_esw_reps_block(struct mlx5_eswitch *esw)
+{
+ mutex_lock(&esw->offloads.reps_lock);
+}
+
+void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
+{
+ mutex_unlock(&esw->offloads.reps_lock);
+}
+
static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
{
mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
@@ -2645,6 +2655,7 @@ static void esw_offloads_cleanup_reps(struct mlx5_eswitch *esw)
mlx5_esw_for_each_rep(esw, i, rep)
mlx5_esw_offloads_rep_cleanup(esw, rep);
xa_destroy(&esw->offloads.vport_reps);
+ mutex_destroy(&esw->offloads.reps_lock);
}
static int esw_offloads_init_reps(struct mlx5_eswitch *esw)
@@ -2654,6 +2665,7 @@ static int esw_offloads_init_reps(struct mlx5_eswitch *esw)
int err;
xa_init(&esw->offloads.vport_reps);
+ mutex_init(&esw->offloads.reps_lock);
mlx5_esw_for_each_vport(esw, i, vport) {
err = mlx5_esw_offloads_rep_add(esw, vport);
--
2.44.0
* [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-02 20:04 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
The LAG shared-FDB and multiport E-Switch transitions rescan auxiliary
devices and reload IB representors while holding ldev->lock. Driver
bind/unbind paths may register or unregister E-Switch representor ops, and
representor load paths may enter LAG code, so holding ldev->lock across
those calls creates lock-order cycles with the E-Switch representor lock.
Keep the devcom component locked for the transition, but drop ldev->lock
before rescanning auxiliary devices or reloading IB representors. Mark the
LAG transition as in progress while the lock is dropped and assert the
devcom lock where the helper relies on it. This preserves LAG serialization
while avoiding ldev->lock nesting under E-Switch representor registration.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 142 ++++++++++++++----
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 7 +-
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 10 +-
.../ethernet/mellanox/mlx5/core/lib/devcom.c | 8 +
.../ethernet/mellanox/mlx5/core/lib/devcom.h | 1 +
5 files changed, 134 insertions(+), 34 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index a474f970e056..e77f9931c39c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1063,37 +1063,99 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
return true;
}
-void mlx5_lag_add_devices(struct mlx5_lag *ldev)
+static void mlx5_lag_assert_locked_transition(struct mlx5_lag *ldev)
{
+ struct mlx5_devcom_comp_dev *devcom = NULL;
struct lag_func *pf;
int i;
- mlx5_ldev_for_each(i, 0, ldev) {
- pf = mlx5_lag_pf(ldev, i);
- if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
- continue;
+ lockdep_assert_held(&ldev->lock);
- pf->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(pf->dev);
+ i = mlx5_get_next_ldev_func(ldev, 0);
+ if (i < MLX5_MAX_PORTS) {
+ pf = mlx5_lag_pf(ldev, i);
+ devcom = pf->dev->priv.hca_devcom_comp;
}
+ mlx5_devcom_comp_assert_locked(devcom);
}
-void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
+static void mlx5_lag_drop_lock_for_reps(struct mlx5_lag *ldev)
+{
+ mlx5_lag_assert_locked_transition(ldev);
+
+ /* Keep PF membership stable while ldev->lock is dropped. Device add
+ * and remove paths observe mode_changes_in_progress and retry.
+ */
+ ldev->mode_changes_in_progress++;
+ mutex_unlock(&ldev->lock);
+}
+
+static void mlx5_lag_retake_lock_after_reps(struct mlx5_lag *ldev)
{
+ mutex_lock(&ldev->lock);
+ ldev->mode_changes_in_progress--;
+}
+
+void mlx5_lag_rescan_dev_locked(struct mlx5_lag *ldev,
+ struct mlx5_core_dev *dev,
+ bool enable)
+{
+ if (dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
+ return;
+
+ if (enable)
+ dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+ else
+ dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+
+ /* Auxiliary bus probe/remove can register or unregister representor
+ * callbacks and take reps_lock. Drop ldev->lock so the only ordering
+ * remains reps_lock -> ldev->lock from representor callbacks.
+ */
+ mlx5_lag_drop_lock_for_reps(ldev);
+ mlx5_rescan_drivers_locked(dev);
+ mlx5_lag_retake_lock_after_reps(ldev);
+}
+
+static void mlx5_lag_rescan_devices_locked(struct mlx5_lag *ldev, bool enable)
+{
+ struct mlx5_core_dev *devs[MLX5_MAX_PORTS];
struct lag_func *pf;
+ int num_devs = 0;
int i;
+ mlx5_lag_assert_locked_transition(ldev);
+
mlx5_ldev_for_each(i, 0, ldev) {
pf = mlx5_lag_pf(ldev, i);
if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
continue;
- pf->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(pf->dev);
+ if (enable)
+ pf->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+ else
+ pf->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+ devs[num_devs++] = pf->dev;
}
+
+ mlx5_lag_drop_lock_for_reps(ldev);
+ for (i = 0; i < num_devs; i++)
+ mlx5_rescan_drivers_locked(devs[i]);
+ mlx5_lag_retake_lock_after_reps(ldev);
}
-int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
+void mlx5_lag_add_devices(struct mlx5_lag *ldev)
+{
+ mlx5_lag_rescan_devices_locked(ldev, true);
+}
+
+void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
+{
+ mlx5_lag_rescan_devices_locked(ldev, false);
+}
+
+static int mlx5_lag_reload_ib_reps_unlocked(struct mlx5_lag *ldev, u32 flags,
+ bool cont_on_fail)
{
struct lag_func *pf;
int ret;
@@ -1105,7 +1167,9 @@ int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
struct mlx5_eswitch *esw;
esw = pf->dev->priv.eswitch;
+ mlx5_esw_reps_block(esw);
ret = mlx5_eswitch_reload_ib_reps(esw);
+ mlx5_esw_reps_unblock(esw);
if (ret && !cont_on_fail)
return ret;
}
@@ -1114,6 +1178,34 @@ int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
return 0;
}
+static int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags,
+ bool cont_on_fail)
+{
+ int ret;
+
+ /* The HCA devcom component lock serializes LAG mode transitions while
+ * ldev->lock is dropped here. Dropping ldev->lock is required because
+ * the reload takes the per-E-Switch reps_lock, and representor
+ * load/unload callbacks can re-enter LAG netdev add/remove and take
+ * ldev->lock. Keep the ordering reps_lock -> ldev->lock.
+ */
+ mlx5_lag_drop_lock_for_reps(ldev);
+ ret = mlx5_lag_reload_ib_reps_unlocked(ldev, flags, cont_on_fail);
+ mlx5_lag_retake_lock_after_reps(ldev);
+
+ return ret;
+}
+
+int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
+ bool cont_on_fail)
+{
+ int ret;
+
+ ret = mlx5_lag_reload_ib_reps(ldev, flags, cont_on_fail);
+
+ return ret;
+}
+
void mlx5_disable_lag(struct mlx5_lag *ldev)
{
bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
@@ -1132,10 +1224,7 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
if (shared_fdb) {
mlx5_lag_remove_devices(ldev);
} else if (roce_lag) {
- if (!(dev0->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)) {
- dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
- }
+ mlx5_lag_rescan_dev_locked(ldev, dev0, false);
mlx5_ldev_for_each(i, 0, ldev) {
if (i == idx)
continue;
@@ -1151,8 +1240,9 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
mlx5_lag_add_devices(ldev);
if (shared_fdb)
- mlx5_lag_reload_ib_reps(ldev, MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV,
- true);
+ mlx5_lag_reload_ib_reps_from_locked(ldev,
+ MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV,
+ true);
}
bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
@@ -1409,7 +1499,8 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
if (shared_fdb || roce_lag)
mlx5_lag_add_devices(ldev);
if (shared_fdb)
- mlx5_lag_reload_ib_reps(ldev, 0, true);
+ mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
+ true);
return;
}
@@ -1417,8 +1508,7 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
if (roce_lag) {
struct mlx5_core_dev *dev;
- dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
+ mlx5_lag_rescan_dev_locked(ldev, dev0, true);
mlx5_ldev_for_each(i, 0, ldev) {
if (i == idx)
continue;
@@ -1427,15 +1517,15 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
mlx5_nic_vport_enable_roce(dev);
}
} else if (shared_fdb) {
- dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
- err = mlx5_lag_reload_ib_reps(ldev, 0, false);
+ mlx5_lag_rescan_dev_locked(ldev, dev0, true);
+ err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
+ false);
if (err) {
- dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
+ mlx5_lag_rescan_dev_locked(ldev, dev0, false);
mlx5_deactivate_lag(ldev);
mlx5_lag_add_devices(ldev);
- mlx5_lag_reload_ib_reps(ldev, 0, true);
+ mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
+ true);
mlx5_core_err(dev0, "Failed to enable lag\n");
return;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index daca8ebd5256..6afe7707d076 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -164,6 +164,9 @@ void mlx5_disable_lag(struct mlx5_lag *ldev);
void mlx5_lag_remove_devices(struct mlx5_lag *ldev);
int mlx5_deactivate_lag(struct mlx5_lag *ldev);
void mlx5_lag_add_devices(struct mlx5_lag *ldev);
+void mlx5_lag_rescan_dev_locked(struct mlx5_lag *ldev,
+ struct mlx5_core_dev *dev,
+ bool enable);
struct mlx5_devcom_comp_dev *mlx5_lag_get_devcom_comp(struct mlx5_lag *ldev);
#ifdef CONFIG_MLX5_ESWITCH
@@ -199,6 +202,6 @@ int mlx5_get_next_ldev_func(struct mlx5_lag *ldev, int start_idx);
int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq);
int mlx5_lag_num_devs(struct mlx5_lag *ldev);
int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
-int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags,
- bool cont_on_fail);
+int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
+ bool cont_on_fail);
#endif /* __MLX5_LAG_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
index edcd06f3be7a..8a349f8fd823 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
@@ -100,9 +100,8 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
goto err_add_devices;
}
- dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
- err = mlx5_lag_reload_ib_reps(ldev, 0, false);
+ mlx5_lag_rescan_dev_locked(ldev, dev0, true);
+ err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0, false);
if (err)
goto err_rescan_drivers;
@@ -111,12 +110,11 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
return 0;
err_rescan_drivers:
- dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
- mlx5_rescan_drivers_locked(dev0);
+ mlx5_lag_rescan_dev_locked(ldev, dev0, false);
mlx5_deactivate_lag(ldev);
err_add_devices:
mlx5_lag_add_devices(ldev);
- mlx5_lag_reload_ib_reps(ldev, 0, true);
+ mlx5_lag_reload_ib_reps_from_locked(ldev, 0, true);
mlx5_mpesw_metadata_cleanup(ldev);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
index 4b5ac2db55ce..d40c53193ea8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
@@ -3,6 +3,7 @@
#include <linux/mlx5/vport.h>
#include <linux/list.h>
+#include <linux/lockdep.h>
#include "lib/devcom.h"
#include "lib/mlx5.h"
#include "mlx5_core.h"
@@ -438,3 +439,10 @@ int mlx5_devcom_comp_trylock(struct mlx5_devcom_comp_dev *devcom)
return 0;
return down_write_trylock(&devcom->comp->sem);
}
+
+void mlx5_devcom_comp_assert_locked(struct mlx5_devcom_comp_dev *devcom)
+{
+ if (!devcom)
+ return;
+ lockdep_assert_held_write(&devcom->comp->sem);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
index 91e5ae529d5c..316052a85ca5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
@@ -75,5 +75,6 @@ void *mlx5_devcom_get_next_peer_data_rcu(struct mlx5_devcom_comp_dev *devcom,
void mlx5_devcom_comp_lock(struct mlx5_devcom_comp_dev *devcom);
void mlx5_devcom_comp_unlock(struct mlx5_devcom_comp_dev *devcom);
int mlx5_devcom_comp_trylock(struct mlx5_devcom_comp_dev *devcom);
+void mlx5_devcom_comp_assert_locked(struct mlx5_devcom_comp_dev *devcom);
#endif /* __LIB_MLX5_DEVCOM_H__ */
--
2.44.0
* [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
` (2 preceding siblings ...)
2026-05-01 4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-02 20:05 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
2026-05-01 4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
` (2 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
Representor callbacks can be registered and unregistered while the
E-Switch is already in switchdev mode, and the same E-Switch may also be
reconfigured by devlink, VF changes and SF changes. Serialize these paths
with the per-E-Switch representor mutex instead of relying on ad-hoc bit
state and wait queues.
Take the representor lock around the mode transition, VF/SF representor
changes and representor ops registration. Keep mode_lock and the
representor lock unnested by using the operation flag while the mode lock
is dropped. During mode changes, drop the representor lock around the
auxiliary bus rescan because driver bind/unbind may register or unregister
representor ops.
Split representor ops registration into locked public wrappers and blocked
internal helpers, clear the ops pointer on unregister, and add nested
wrappers for the shared-FDB master IB path that registers peer
representor ops while another E-Switch representor lock is already held.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/infiniband/hw/mlx5/ib_rep.c | 6 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 10 ++
.../mellanox/mlx5/core/eswitch_offloads.c | 102 ++++++++++++++++--
.../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 +
include/linux/mlx5/eswitch.h | 6 ++
5 files changed, 119 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index 1709b628702e..65d8767d1830 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -262,9 +262,10 @@ mlx5_ib_vport_rep_unload(struct mlx5_eswitch_rep *rep)
struct mlx5_core_dev *peer_mdev;
struct mlx5_eswitch *esw;
+ /* Called while the master E-Switch reps_lock is held. */
mlx5_lag_for_each_peer_mdev(mdev, peer_mdev, i) {
esw = peer_mdev->priv.eswitch;
- mlx5_eswitch_unregister_vport_reps(esw, REP_IB);
+ mlx5_eswitch_unregister_vport_reps_nested(esw, REP_IB);
}
mlx5_ib_release_transport(mdev);
}
@@ -284,9 +285,10 @@ static void mlx5_ib_register_peer_vport_reps(struct mlx5_core_dev *mdev)
struct mlx5_eswitch *esw;
int i;
+ /* Called while the master E-Switch reps_lock is held. */
mlx5_lag_for_each_peer_mdev(mdev, peer_mdev, i) {
esw = peer_mdev->priv.eswitch;
- mlx5_eswitch_register_vport_reps(esw, &rep_ops, REP_IB);
+ mlx5_eswitch_register_vport_reps_nested(esw, &rep_ops, REP_IB);
}
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 66a773a99876..f70737437954 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1712,6 +1712,7 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
mlx5_lag_disable_change(esw->dev);
mlx5_eswitch_invalidate_wq(esw);
+ mlx5_esw_reps_block(esw);
if (!mlx5_esw_is_fdb_created(esw)) {
ret = mlx5_eswitch_enable_locked(esw, num_vfs);
@@ -1735,6 +1736,8 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
}
}
+ mlx5_esw_reps_unblock(esw);
+
if (toggle_lag)
mlx5_lag_enable_change(esw->dev);
@@ -1759,6 +1762,7 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
mlx5_eswitch_invalidate_wq(esw);
+ mlx5_esw_reps_block(esw);
if (!mlx5_core_is_ecpf(esw->dev)) {
mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
@@ -1770,6 +1774,8 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
mlx5_eswitch_clear_ec_vf_vports_info(esw);
}
+ mlx5_esw_reps_unblock(esw);
+
if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
struct devlink *devlink = priv_to_devlink(esw->dev);
@@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
devl_assert_locked(priv_to_devlink(esw->dev));
mlx5_lag_disable_change(esw->dev);
+
+ mlx5_esw_reps_block(esw);
mlx5_eswitch_disable_locked(esw);
+ mlx5_esw_reps_unblock(esw);
+
esw->mode = MLX5_ESWITCH_LEGACY;
mlx5_lag_enable_change(esw->dev);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 6a5143b63dfd..d4ac07c995b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -36,6 +36,7 @@
#include <linux/mlx5/mlx5_ifc.h>
#include <linux/mlx5/vport.h>
#include <linux/mlx5/fs.h>
+#include <linux/lockdep.h>
#include "mlx5_core.h"
#include "eswitch.h"
#include "esw/indir_table.h"
@@ -2413,11 +2414,21 @@ static int esw_create_restore_table(struct mlx5_eswitch *esw)
return err;
}
+static void mlx5_esw_assert_reps_locked(struct mlx5_eswitch *esw)
+{
+ lockdep_assert_held(&esw->offloads.reps_lock);
+}
+
void mlx5_esw_reps_block(struct mlx5_eswitch *esw)
{
mutex_lock(&esw->offloads.reps_lock);
}
+static void mlx5_esw_reps_block_nested(struct mlx5_eswitch *esw)
+{
+ mutex_lock_nested(&esw->offloads.reps_lock, SINGLE_DEPTH_NESTING);
+}
+
void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
{
mutex_unlock(&esw->offloads.reps_lock);
@@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
{
+ mlx5_esw_reps_unblock(esw);
mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
mlx5_core_mp_enabled(esw->dev)) {
esw->mode = mode;
- mlx5_rescan_drivers_locked(esw->dev);
- mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
- return;
+ goto out;
}
esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
mlx5_rescan_drivers_locked(esw->dev);
esw->mode = mode;
esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+out:
mlx5_rescan_drivers_locked(esw->dev);
mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
+ mlx5_esw_reps_block(esw);
}
static void mlx5_esw_fdb_drop_destroy(struct mlx5_eswitch *esw)
@@ -2776,6 +2788,8 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep, u8 rep_type)
{
+ mlx5_esw_assert_reps_locked(esw);
+
if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
REP_REGISTERED, REP_LOADED) == REP_REGISTERED)
return esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
@@ -2786,6 +2800,8 @@ static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep, u8 rep_type)
{
+ mlx5_esw_assert_reps_locked(esw);
+
if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
REP_LOADED, REP_REGISTERED) == REP_LOADED) {
if (rep_type == REP_ETH)
@@ -3691,6 +3707,7 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
if (new_num_vfs == esw->esw_funcs.num_vfs || host_pf_disabled)
goto free;
+ mlx5_esw_reps_block(esw);
/* Number of VFs can only change from "0 to x" or "x to 0". */
if (esw->esw_funcs.num_vfs > 0) {
mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
@@ -3700,9 +3717,11 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
err = mlx5_eswitch_load_vf_vports(esw, new_num_vfs,
MLX5_VPORT_UC_ADDR_CHANGE);
if (err)
- goto free;
+ goto unblock;
}
esw->esw_funcs.num_vfs = new_num_vfs;
+unblock:
+ mlx5_esw_reps_unblock(esw);
free:
kvfree(out);
}
@@ -4188,9 +4207,14 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
goto unlock;
}
+ /* Keep mode_lock and reps_lock unnested. The operation flag excludes
+ * mode users while mode_lock is dropped before taking reps_lock.
+ */
esw->eswitch_operation_in_progress = true;
up_write(&esw->mode_lock);
+ mlx5_esw_reps_block(esw);
+
if (mlx5_mode == MLX5_ESWITCH_OFFLOADS &&
!mlx5_devlink_netdev_netns_immutable_set(devlink, true)) {
NL_SET_ERR_MSG_MOD(extack,
@@ -4223,6 +4247,10 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
skip:
if (mlx5_mode == MLX5_ESWITCH_OFFLOADS && err)
mlx5_devlink_netdev_netns_immutable_set(devlink, false);
+ /* Reconfiguration is done; drop reps_lock before taking mode_lock again
+ * to clear the operation flag.
+ */
+ mlx5_esw_reps_unblock(esw);
down_write(&esw->mode_lock);
esw->eswitch_operation_in_progress = false;
unlock:
@@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
return true;
}
-void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
- const struct mlx5_eswitch_rep_ops *ops,
- u8 rep_type)
+static void
+mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type)
{
struct mlx5_eswitch_rep_data *rep_data;
struct mlx5_eswitch_rep *rep;
@@ -4513,9 +4542,40 @@ void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
}
}
}
+
+static void
+mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type, bool nested)
+{
+ if (nested)
+ mlx5_esw_reps_block_nested(esw);
+ else
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
+ mlx5_esw_reps_unblock(esw);
+}
+
+void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type)
+{
+ mlx5_eswitch_register_vport_reps_locked(esw, ops, rep_type, false);
+}
EXPORT_SYMBOL(mlx5_eswitch_register_vport_reps);
-void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
+void
+mlx5_eswitch_register_vport_reps_nested(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type)
+{
+ mlx5_eswitch_register_vport_reps_locked(esw, ops, rep_type, true);
+}
+EXPORT_SYMBOL(mlx5_eswitch_register_vport_reps_nested);
+
+static void
+mlx5_eswitch_unregister_vport_reps_blocked(struct mlx5_eswitch *esw,
+ u8 rep_type)
{
struct mlx5_eswitch_rep *rep;
unsigned long i;
@@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
mlx5_esw_for_each_rep(esw, i, rep)
atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
+
+ esw->offloads.rep_ops[rep_type] = NULL;
+}
+
+static void
+mlx5_eswitch_unregister_vport_reps_locked(struct mlx5_eswitch *esw,
+ u8 rep_type, bool nested)
+{
+ if (nested)
+ mlx5_esw_reps_block_nested(esw);
+ else
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_unregister_vport_reps_blocked(esw, rep_type);
+ mlx5_esw_reps_unblock(esw);
+}
+
+void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
+{
+ mlx5_eswitch_unregister_vport_reps_locked(esw, rep_type, false);
}
EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_reps);
+void mlx5_eswitch_unregister_vport_reps_nested(struct mlx5_eswitch *esw,
+ u8 rep_type)
+{
+ mlx5_eswitch_unregister_vport_reps_locked(esw, rep_type, true);
+}
+EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_reps_nested);
+
void *mlx5_eswitch_get_uplink_priv(struct mlx5_eswitch *esw, u8 rep_type)
{
struct mlx5_eswitch_rep *rep;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
index 8503e532f423..2fc69897e35b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -245,8 +245,10 @@ static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
if (IS_ERR(sf))
return PTR_ERR(sf);
+ mlx5_esw_reps_block(esw);
err = mlx5_eswitch_load_sf_vport(esw, sf->hw_fn_id, MLX5_VPORT_UC_ADDR_CHANGE,
&sf->dl_port, new_attr->controller, new_attr->sfnum);
+ mlx5_esw_reps_unblock(esw);
if (err)
goto esw_err;
*dl_port = &sf->dl_port.dl_port;
@@ -367,7 +369,10 @@ int mlx5_devlink_sf_port_del(struct devlink *devlink,
struct mlx5_sf_table *table = dev->priv.sf_table;
struct mlx5_sf *sf = mlx5_sf_by_dl_port(dl_port);
+ mlx5_esw_reps_block(dev->priv.eswitch);
mlx5_sf_del(table, sf);
+ mlx5_esw_reps_unblock(dev->priv.eswitch);
+
return 0;
}
diff --git a/include/linux/mlx5/eswitch.h b/include/linux/mlx5/eswitch.h
index 3b29a3c6794d..a0dd162baa78 100644
--- a/include/linux/mlx5/eswitch.h
+++ b/include/linux/mlx5/eswitch.h
@@ -63,7 +63,13 @@ struct mlx5_eswitch_rep {
void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
const struct mlx5_eswitch_rep_ops *ops,
u8 rep_type);
+void
+mlx5_eswitch_register_vport_reps_nested(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type);
void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type);
+void mlx5_eswitch_unregister_vport_reps_nested(struct mlx5_eswitch *esw,
+ u8 rep_type);
void *mlx5_eswitch_get_proto_dev(struct mlx5_eswitch *esw,
u16 vport_num,
u8 rep_type);
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
` (3 preceding siblings ...)
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-02 20:06 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
6 siblings, 1 reply; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
__esw_offloads_load_rep() may return success without invoking the
representor load callback when the representor type is already loaded.
On a later load failure, mlx5_esw_offloads_rep_load() unconditionally
unloads all previously iterated representor types, which can unload
representor types that were already loaded before this load attempt.
Track which representor types were actually loaded by the current call and
unwind only those on error. Also restore the representor state back to
REP_REGISTERED when the load callback itself fails.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../mellanox/mlx5/core/eswitch_offloads.c | 38 ++++++++++++++-----
1 file changed, 29 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index d4ac07c995b9..8f656253981b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -2786,13 +2786,28 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
}
static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
- struct mlx5_eswitch_rep *rep, u8 rep_type)
+ struct mlx5_eswitch_rep *rep,
+ u8 rep_type, bool *newly_loaded)
{
+ int err;
+
mlx5_esw_assert_reps_locked(esw);
+ if (newly_loaded)
+ *newly_loaded = false;
+
if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
- REP_REGISTERED, REP_LOADED) == REP_REGISTERED)
- return esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
+ REP_REGISTERED, REP_LOADED) != REP_REGISTERED)
+ return 0;
+
+ err = esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
+ if (err) {
+ atomic_set(&rep->rep_data[rep_type].state, REP_REGISTERED);
+ return err;
+ }
+
+ if (newly_loaded)
+ *newly_loaded = true;
return 0;
}
@@ -2822,22 +2837,27 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
{
struct mlx5_eswitch_rep *rep;
+ unsigned long loaded = 0;
+ bool newly_loaded;
int rep_type;
int err;
rep = mlx5_eswitch_get_rep(esw, vport_num);
for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) {
- err = __esw_offloads_load_rep(esw, rep, rep_type);
+ err = __esw_offloads_load_rep(esw, rep, rep_type,
+ &newly_loaded);
if (err)
goto err_reps;
+ if (newly_loaded)
+ loaded |= BIT(rep_type);
}
return 0;
err_reps:
- atomic_set(&rep->rep_data[rep_type].state, REP_REGISTERED);
- for (--rep_type; rep_type >= 0; rep_type--)
- __esw_offloads_unload_rep(esw, rep, rep_type);
+ while (--rep_type >= 0)
+ if (test_bit(rep_type, &loaded))
+ __esw_offloads_unload_rep(esw, rep, rep_type);
return err;
}
@@ -3591,13 +3611,13 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
if (atomic_read(&rep->rep_data[REP_ETH].state) != REP_LOADED)
return 0;
- ret = __esw_offloads_load_rep(esw, rep, REP_IB);
+ ret = __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
if (ret)
return ret;
mlx5_esw_for_each_rep(esw, i, rep) {
if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
- __esw_offloads_load_rep(esw, rep, REP_IB);
+ __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
}
return 0;
--
2.44.0
* [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
` (4 preceding siblings ...)
2026-05-01 4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-02 20:07 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
2026-05-01 4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
6 siblings, 2 replies; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
mlx5_eswitch_register_vport_reps() only installs representor callbacks and
marks the rep type as registered. If the E-Switch is already in switchdev
mode, the newly registered rep type must then be loaded for already enabled
vports.
That load path needs to run under the devlink lock, which is not held by
the auxiliary driver registration context. Queue the reload to the E-Switch
workqueue, whose handler acquires the devlink lock, and load the relevant
representors from there.
The unregister path is unchanged and still unloads representors
synchronously while tearing down the registered callbacks.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../mellanox/mlx5/core/eswitch_offloads.c | 34 +++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 8f656253981b..f26d1652dd05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -4563,6 +4563,38 @@ mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
}
}
+static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
+{
+ struct mlx5_vport *vport;
+ unsigned long i;
+
+ if (esw->mode != MLX5_ESWITCH_OFFLOADS)
+ return;
+
+ if (mlx5_esw_offloads_rep_load(esw, MLX5_VPORT_UPLINK))
+ return;
+
+ mlx5_esw_for_each_vport(esw, i, vport) {
+ if (!vport)
+ continue;
+ if (!vport->enabled)
+ continue;
+ if (vport->vport == MLX5_VPORT_UPLINK)
+ continue;
+ if (!mlx5_eswitch_vport_has_rep(esw, vport->vport))
+ continue;
+
+ mlx5_esw_offloads_rep_load(esw, vport->vport);
+ }
+}
+
+static void mlx5_eswitch_reload_reps(struct mlx5_eswitch *esw)
+{
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_reload_reps_blocked(esw);
+ mlx5_esw_reps_unblock(esw);
+}
+
static void
mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
const struct mlx5_eswitch_rep_ops *ops,
@@ -4574,6 +4606,8 @@ mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
mlx5_esw_reps_block(esw);
mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
mlx5_esw_reps_unblock(esw);
+
+ mlx5_esw_add_work(esw, mlx5_eswitch_reload_reps);
}
void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
--
2.44.0
* [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
` (5 preceding siblings ...)
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
@ 2026-05-01 4:16 ` Tariq Toukan
2026-05-02 20:08 ` Mark Bloch
6 siblings, 1 reply; 22+ messages in thread
From: Tariq Toukan @ 2026-05-01 4:16 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
From: Mark Bloch <mbloch@nvidia.com>
Deployments that always operate in switchdev mode currently require
manual devlink configuration after driver probe, which complicates
automated provisioning.
Introduce MLX5_PROF_MASK_DEF_SWITCHDEV, a new profile mask bit, and
profile index 8. When a device is initialized or reloaded with this
profile, the driver automatically switches the E-Switch to switchdev
mode by calling mlx5_devlink_eswitch_mode_set() immediately after
bringing the device online.
A no-op stub of mlx5_devlink_eswitch_mode_set() is added for builds
without CONFIG_MLX5_ESWITCH.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 6 +++
.../net/ethernet/mellanox/mlx5/core/main.c | 43 ++++++++++++++++++-
include/linux/mlx5/driver.h | 2 +
3 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 3858690e09b4..cfb9595f9de8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -1049,6 +1049,12 @@ mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
return ERR_PTR(-EOPNOTSUPP);
}
+static inline int
+mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
+ struct netlink_ext_ack *extack)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_MLX5_ESWITCH */
#endif /* __MLX5_ESWITCH_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 74827e8ca125..4cdda15ed7f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -86,7 +86,7 @@ MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec t
static unsigned int prof_sel = MLX5_DEFAULT_PROF;
module_param_named(prof_sel, prof_sel, uint, 0444);
-MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
+MODULE_PARM_DESC(prof_sel, "profile selector. Valid values 0 - 3 and 8");
static u32 sw_owner_id[4];
#define MAX_SW_VHCA_ID (BIT(__mlx5_bit_sz(cmd_hca_cap_2, sw_vhca_id)) - 1)
@@ -99,6 +99,8 @@ enum {
#define LOG_MAX_SUPPORTED_QPS 0xff
+#define MLX5_PROF_SEL_LAST_NIC 3
+#define MLX5_PROF_SEL_FIRST_ESW 8
static struct mlx5_profile profile[] = {
[0] = {
.mask = 0,
@@ -120,6 +122,11 @@ static struct mlx5_profile profile[] = {
.log_max_qp = LOG_MAX_SUPPORTED_QPS,
.num_cmd_caches = 0,
},
+ [8] = {
+ .mask = MLX5_PROF_MASK_DEF_SWITCHDEV | MLX5_PROF_MASK_QP_SIZE,
+ .log_max_qp = LOG_MAX_SUPPORTED_QPS,
+ .num_cmd_caches = MLX5_NUM_COMMAND_CACHES,
+ },
};
static int wait_fw_init(struct mlx5_core_dev *dev, u32 max_wait_mili,
@@ -1385,6 +1392,22 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
mlx5_free_bfreg(dev, &dev->priv.bfreg);
}
+static void mlx5_set_default_switchdev(struct mlx5_core_dev *dev)
+{
+ int err;
+
+ /* Default switchdev is best-effort; keep the device usable on
+ * failure.
+ */
+ err = mlx5_devlink_eswitch_mode_set(priv_to_devlink(dev),
+ DEVLINK_ESWITCH_MODE_SWITCHDEV,
+ NULL);
+ if (err && err != -EOPNOTSUPP)
+ mlx5_core_warn(dev,
+ "Failed to set switchdev as default, continuing in current mode, err(%d)\n",
+ err);
+}
+
int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
{
bool light_probe = mlx5_dev_is_lightweight(dev);
@@ -1431,6 +1454,10 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
mutex_unlock(&dev->intf_state_mutex);
+
+ if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
+ mlx5_set_default_switchdev(dev);
+
return 0;
err_register:
@@ -1532,6 +1559,10 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
goto err_attach;
mutex_unlock(&dev->intf_state_mutex);
+
+ if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
+ mlx5_set_default_switchdev(dev);
+
return 0;
err_attach:
@@ -2314,6 +2345,16 @@ static void mlx5_core_verify_params(void)
MLX5_DEFAULT_PROF);
prof_sel = MLX5_DEFAULT_PROF;
}
+
+ if (prof_sel > MLX5_PROF_SEL_LAST_NIC &&
+ prof_sel < MLX5_PROF_SEL_FIRST_ESW) {
+ pr_warn("mlx5_core: WARNING: Invalid module parameter prof_sel %d in unsupported range %d - %d, changing back to default (%d)\n",
+ prof_sel,
+ MLX5_PROF_SEL_LAST_NIC + 1,
+ MLX5_PROF_SEL_FIRST_ESW - 1,
+ MLX5_DEFAULT_PROF);
+ prof_sel = MLX5_DEFAULT_PROF;
+ }
}
static int __init mlx5_init(void)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 04b96c5abb57..65298c07df4d 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -705,6 +705,8 @@ struct mlx5_st;
enum {
MLX5_PROF_MASK_QP_SIZE = (u64)1 << 0,
+ MLX5_PROF_MASK_MR_CACHE = (u64)1 << 1,
+ MLX5_PROF_MASK_DEF_SWITCHDEV = (u64)1 << 2,
};
struct mlx5_profile {
--
2.44.0
* Re: [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles
2026-05-01 4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
@ 2026-05-02 20:04 ` Mark Bloch
0 siblings, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-02 20:04 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Shay Drory,
Or Har-Toov, Edward Srouji, Maher Sanalla, Simon Horman,
Gerd Bayer, Moshe Shemesh, Kees Cook, Patrisious Haddad,
Parav Pandit, Carolina Jubran, Cosmin Ratiu, linux-rdma,
linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> The LAG shared-FDB and multiport E-Switch transitions rescan auxiliary
> devices and reload IB representors while holding ldev->lock. Driver
> bind/unbind paths may register or unregister E-Switch representor ops, and
> representor load paths may enter LAG code, so holding ldev->lock across
> those calls creates lock-order cycles with the E-Switch representor lock.
>
> Keep the devcom component locked for the transition, but drop ldev->lock
> before rescanning auxiliary devices or reloading IB representors. Mark the
> LAG transition as in progress while the lock is dropped and assert the
> devcom lock where the helper relies on it. This preserves LAG serialization
> while avoiding ldev->lock nesting under E-Switch representor registration.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 142 ++++++++++++++----
> .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 7 +-
> .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 10 +-
> .../ethernet/mellanox/mlx5/core/lib/devcom.c | 8 +
> .../ethernet/mellanox/mlx5/core/lib/devcom.h | 1 +
> 5 files changed, 134 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index a474f970e056..e77f9931c39c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -1063,37 +1063,99 @@ bool mlx5_lag_check_prereq(struct mlx5_lag *ldev)
> return true;
> }
>
> -void mlx5_lag_add_devices(struct mlx5_lag *ldev)
> +static void mlx5_lag_assert_locked_transition(struct mlx5_lag *ldev)
> {
> + struct mlx5_devcom_comp_dev *devcom = NULL;
> struct lag_func *pf;
> int i;
>
> - mlx5_ldev_for_each(i, 0, ldev) {
> - pf = mlx5_lag_pf(ldev, i);
> - if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
> - continue;
> + lockdep_assert_held(&ldev->lock);
>
> - pf->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(pf->dev);
> + i = mlx5_get_next_ldev_func(ldev, 0);
> + if (i < MLX5_MAX_PORTS) {
> + pf = mlx5_lag_pf(ldev, i);
> + devcom = pf->dev->priv.hca_devcom_comp;
> }
> + mlx5_devcom_comp_assert_locked(devcom);
> }
>
> -void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
> +static void mlx5_lag_drop_lock_for_reps(struct mlx5_lag *ldev)
> +{
> + mlx5_lag_assert_locked_transition(ldev);
> +
> + /* Keep PF membership stable while ldev->lock is dropped. Device add
> + * and remove paths observe mode_changes_in_progress and retry.
> + */
> + ldev->mode_changes_in_progress++;
> + mutex_unlock(&ldev->lock);
> +}
> +
> +static void mlx5_lag_retake_lock_after_reps(struct mlx5_lag *ldev)
> {
> + mutex_lock(&ldev->lock);
> + ldev->mode_changes_in_progress--;
> +}
sashiko.dev says:
"
Is it possible this introduces an ad-hoc synchronization mechanism?
The networking subsystem guidelines suggest that using a flag or integer
counter to guard a section of code, particularly when concurrent paths
observe it and retry, can bypass lockdep deadlock detection. These patterns
also often lack proper fairness and memory ordering guarantees compared to
standard locking primitives.
Could a standard synchronization primitive like an rwsem be used here instead
of the mode_changes_in_progress counter?
"
The counter is pre-existing; this patch only reuses it while ldev->lock is
dropped, and all accesses to it remain protected by ldev->lock.
The transition itself is still serialized by the HCA devcom write lock,
which is lockdep-visible; the counter only makes affected paths retry.
I agree this gating can be cleaned up separately, but I'd prefer not to
fold that broader LAG locking rework into this patch series.
Mark
> +
> +void mlx5_lag_rescan_dev_locked(struct mlx5_lag *ldev,
> + struct mlx5_core_dev *dev,
> + bool enable)
> +{
> + if (dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
> + return;
> +
> + if (enable)
> + dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> + else
> + dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> +
> + /* Auxiliary bus probe/remove can register or unregister representor
> + * callbacks and take reps_lock. Drop ldev->lock so the only ordering
> + * remains reps_lock -> ldev->lock from representor callbacks.
> + */
> + mlx5_lag_drop_lock_for_reps(ldev);
> + mlx5_rescan_drivers_locked(dev);
> + mlx5_lag_retake_lock_after_reps(ldev);
> +}
> +
> +static void mlx5_lag_rescan_devices_locked(struct mlx5_lag *ldev, bool enable)
> +{
> + struct mlx5_core_dev *devs[MLX5_MAX_PORTS];
> struct lag_func *pf;
> + int num_devs = 0;
> int i;
>
> + mlx5_lag_assert_locked_transition(ldev);
> +
> mlx5_ldev_for_each(i, 0, ldev) {
> pf = mlx5_lag_pf(ldev, i);
> if (pf->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)
> continue;
>
> - pf->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(pf->dev);
> + if (enable)
> + pf->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> + else
> + pf->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> + devs[num_devs++] = pf->dev;
> }
> +
> + mlx5_lag_drop_lock_for_reps(ldev);
> + for (i = 0; i < num_devs; i++)
> + mlx5_rescan_drivers_locked(devs[i]);
> + mlx5_lag_retake_lock_after_reps(ldev);
> }
>
> -int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
> +void mlx5_lag_add_devices(struct mlx5_lag *ldev)
> +{
> + mlx5_lag_rescan_devices_locked(ldev, true);
> +}
> +
> +void mlx5_lag_remove_devices(struct mlx5_lag *ldev)
> +{
> + mlx5_lag_rescan_devices_locked(ldev, false);
> +}
> +
> +static int mlx5_lag_reload_ib_reps_unlocked(struct mlx5_lag *ldev, u32 flags,
> + bool cont_on_fail)
> {
> struct lag_func *pf;
> int ret;
> @@ -1105,7 +1167,9 @@ int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
> struct mlx5_eswitch *esw;
>
> esw = pf->dev->priv.eswitch;
> + mlx5_esw_reps_block(esw);
> ret = mlx5_eswitch_reload_ib_reps(esw);
> + mlx5_esw_reps_unblock(esw);
> if (ret && !cont_on_fail)
> return ret;
> }
> @@ -1114,6 +1178,34 @@ int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags, bool cont_on_fail)
> return 0;
> }
>
> +static int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags,
> + bool cont_on_fail)
> +{
> + int ret;
> +
> + /* The HCA devcom component lock serializes LAG mode transitions while
> + * ldev->lock is dropped here. Dropping ldev->lock is required because
> + * the reload takes the per-E-Switch reps_lock, and representor
> + * load/unload callbacks can re-enter LAG netdev add/remove and take
> + * ldev->lock. Keep the ordering reps_lock -> ldev->lock.
> + */
> + mlx5_lag_drop_lock_for_reps(ldev);
> + ret = mlx5_lag_reload_ib_reps_unlocked(ldev, flags, cont_on_fail);
> + mlx5_lag_retake_lock_after_reps(ldev);
> +
> + return ret;
> +}
> +
> +int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
> + bool cont_on_fail)
> +{
> + int ret;
> +
> + ret = mlx5_lag_reload_ib_reps(ldev, flags, cont_on_fail);
> +
> + return ret;
> +}
> +
> void mlx5_disable_lag(struct mlx5_lag *ldev)
> {
> bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
> @@ -1132,10 +1224,7 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
> if (shared_fdb) {
> mlx5_lag_remove_devices(ldev);
> } else if (roce_lag) {
> - if (!(dev0->priv.flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV)) {
> - dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> - }
> + mlx5_lag_rescan_dev_locked(ldev, dev0, false);
> mlx5_ldev_for_each(i, 0, ldev) {
> if (i == idx)
> continue;
> @@ -1151,8 +1240,9 @@ void mlx5_disable_lag(struct mlx5_lag *ldev)
> mlx5_lag_add_devices(ldev);
>
> if (shared_fdb)
> - mlx5_lag_reload_ib_reps(ldev, MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV,
> - true);
> + mlx5_lag_reload_ib_reps_from_locked(ldev,
> + MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV,
> + true);
> }
>
> bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev)
> @@ -1409,7 +1499,8 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
> if (shared_fdb || roce_lag)
> mlx5_lag_add_devices(ldev);
> if (shared_fdb)
> - mlx5_lag_reload_ib_reps(ldev, 0, true);
> + mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
> + true);
>
> return;
> }
> @@ -1417,8 +1508,7 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
> if (roce_lag) {
> struct mlx5_core_dev *dev;
>
> - dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> + mlx5_lag_rescan_dev_locked(ldev, dev0, true);
> mlx5_ldev_for_each(i, 0, ldev) {
> if (i == idx)
> continue;
> @@ -1427,15 +1517,15 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
> mlx5_nic_vport_enable_roce(dev);
> }
> } else if (shared_fdb) {
> - dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> - err = mlx5_lag_reload_ib_reps(ldev, 0, false);
> + mlx5_lag_rescan_dev_locked(ldev, dev0, true);
> + err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
> + false);
> if (err) {
> - dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> + mlx5_lag_rescan_dev_locked(ldev, dev0, false);
> mlx5_deactivate_lag(ldev);
> mlx5_lag_add_devices(ldev);
> - mlx5_lag_reload_ib_reps(ldev, 0, true);
> + mlx5_lag_reload_ib_reps_from_locked(ldev, 0,
> + true);
> mlx5_core_err(dev0, "Failed to enable lag\n");
> return;
> }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> index daca8ebd5256..6afe7707d076 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> @@ -164,6 +164,9 @@ void mlx5_disable_lag(struct mlx5_lag *ldev);
> void mlx5_lag_remove_devices(struct mlx5_lag *ldev);
> int mlx5_deactivate_lag(struct mlx5_lag *ldev);
> void mlx5_lag_add_devices(struct mlx5_lag *ldev);
> +void mlx5_lag_rescan_dev_locked(struct mlx5_lag *ldev,
> + struct mlx5_core_dev *dev,
> + bool enable);
> struct mlx5_devcom_comp_dev *mlx5_lag_get_devcom_comp(struct mlx5_lag *ldev);
>
> #ifdef CONFIG_MLX5_ESWITCH
> @@ -199,6 +202,6 @@ int mlx5_get_next_ldev_func(struct mlx5_lag *ldev, int start_idx);
> int mlx5_lag_get_dev_index_by_seq(struct mlx5_lag *ldev, int seq);
> int mlx5_lag_num_devs(struct mlx5_lag *ldev);
> int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
> -int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags,
> - bool cont_on_fail);
> +int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
> + bool cont_on_fail);
> #endif /* __MLX5_LAG_H__ */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
> index edcd06f3be7a..8a349f8fd823 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
> @@ -100,9 +100,8 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
> goto err_add_devices;
> }
>
> - dev0->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> - err = mlx5_lag_reload_ib_reps(ldev, 0, false);
> + mlx5_lag_rescan_dev_locked(ldev, dev0, true);
> + err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0, false);
> if (err)
> goto err_rescan_drivers;
>
> @@ -111,12 +110,11 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
> return 0;
>
> err_rescan_drivers:
> - dev0->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> - mlx5_rescan_drivers_locked(dev0);
> + mlx5_lag_rescan_dev_locked(ldev, dev0, false);
> mlx5_deactivate_lag(ldev);
> err_add_devices:
> mlx5_lag_add_devices(ldev);
> - mlx5_lag_reload_ib_reps(ldev, 0, true);
> + mlx5_lag_reload_ib_reps_from_locked(ldev, 0, true);
> mlx5_mpesw_metadata_cleanup(ldev);
> return err;
> }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
> index 4b5ac2db55ce..d40c53193ea8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
> @@ -3,6 +3,7 @@
>
> #include <linux/mlx5/vport.h>
> #include <linux/list.h>
> +#include <linux/lockdep.h>
> #include "lib/devcom.h"
> #include "lib/mlx5.h"
> #include "mlx5_core.h"
> @@ -438,3 +439,10 @@ int mlx5_devcom_comp_trylock(struct mlx5_devcom_comp_dev *devcom)
> return 0;
> return down_write_trylock(&devcom->comp->sem);
> }
> +
> +void mlx5_devcom_comp_assert_locked(struct mlx5_devcom_comp_dev *devcom)
> +{
> + if (!devcom)
> + return;
> + lockdep_assert_held_write(&devcom->comp->sem);
> +}
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
> index 91e5ae529d5c..316052a85ca5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
> @@ -75,5 +75,6 @@ void *mlx5_devcom_get_next_peer_data_rcu(struct mlx5_devcom_comp_dev *devcom,
> void mlx5_devcom_comp_lock(struct mlx5_devcom_comp_dev *devcom);
> void mlx5_devcom_comp_unlock(struct mlx5_devcom_comp_dev *devcom);
> int mlx5_devcom_comp_trylock(struct mlx5_devcom_comp_dev *devcom);
> +void mlx5_devcom_comp_assert_locked(struct mlx5_devcom_comp_dev *devcom);
>
> #endif /* __LIB_MLX5_DEVCOM_H__ */
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
@ 2026-05-02 20:05 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
1 sibling, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-02 20:05 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Shay Drory,
Or Har-Toov, Edward Srouji, Maher Sanalla, Simon Horman,
Gerd Bayer, Moshe Shemesh, Kees Cook, Patrisious Haddad,
Parav Pandit, Carolina Jubran, Cosmin Ratiu, linux-rdma,
linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> Representor callbacks can be registered and unregistered while the
> E-Switch is already in switchdev mode, and the same E-Switch may also be
> reconfigured by devlink, VF changes and SF changes. Serialize these paths
> with the per-E-Switch representor mutex instead of relying on ad-hoc bit
> state and wait queues.
>
> Take the representor lock around the mode transition, VF/SF representor
> changes and representor ops registration. Keep mode_lock and the
> representor lock unnested by using the operation flag while the mode lock
> is dropped. During mode changes, drop the representor lock around the
> auxiliary bus rescan because driver bind/unbind may register or unregister
> representor ops.
>
> Split representor ops registration into locked public wrappers and blocked
> internal helpers, clear the ops pointer on unregister, and add nested
> wrappers for the shared-FDB master IB path that registers peer
> representor ops while another E-Switch representor lock is already held.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> drivers/infiniband/hw/mlx5/ib_rep.c | 6 +-
> .../net/ethernet/mellanox/mlx5/core/eswitch.c | 10 ++
> .../mellanox/mlx5/core/eswitch_offloads.c | 102 ++++++++++++++++--
> .../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 +
> include/linux/mlx5/eswitch.h | 6 ++
> 5 files changed, 119 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
> index 1709b628702e..65d8767d1830 100644
> --- a/drivers/infiniband/hw/mlx5/ib_rep.c
> +++ b/drivers/infiniband/hw/mlx5/ib_rep.c
> @@ -262,9 +262,10 @@ mlx5_ib_vport_rep_unload(struct mlx5_eswitch_rep *rep)
> struct mlx5_core_dev *peer_mdev;
> struct mlx5_eswitch *esw;
>
> + /* Called while the master E-Switch reps_lock is held. */
> mlx5_lag_for_each_peer_mdev(mdev, peer_mdev, i) {
> esw = peer_mdev->priv.eswitch;
> - mlx5_eswitch_unregister_vport_reps(esw, REP_IB);
> + mlx5_eswitch_unregister_vport_reps_nested(esw, REP_IB);
> }
> mlx5_ib_release_transport(mdev);
> }
> @@ -284,9 +285,10 @@ static void mlx5_ib_register_peer_vport_reps(struct mlx5_core_dev *mdev)
> struct mlx5_eswitch *esw;
> int i;
>
> + /* Called while the master E-Switch reps_lock is held. */
> mlx5_lag_for_each_peer_mdev(mdev, peer_mdev, i) {
> esw = peer_mdev->priv.eswitch;
> - mlx5_eswitch_register_vport_reps(esw, &rep_ops, REP_IB);
> + mlx5_eswitch_register_vport_reps_nested(esw, &rep_ops, REP_IB);
> }
> }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 66a773a99876..f70737437954 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -1712,6 +1712,7 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
> mlx5_lag_disable_change(esw->dev);
>
> mlx5_eswitch_invalidate_wq(esw);
> + mlx5_esw_reps_block(esw);
>
> if (!mlx5_esw_is_fdb_created(esw)) {
> ret = mlx5_eswitch_enable_locked(esw, num_vfs);
> @@ -1735,6 +1736,8 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
> }
> }
>
> + mlx5_esw_reps_unblock(esw);
> +
> if (toggle_lag)
> mlx5_lag_enable_change(esw->dev);
>
> @@ -1759,6 +1762,7 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
> esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
>
> mlx5_eswitch_invalidate_wq(esw);
> + mlx5_esw_reps_block(esw);
>
> if (!mlx5_core_is_ecpf(esw->dev)) {
> mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
> @@ -1770,6 +1774,8 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
> mlx5_eswitch_clear_ec_vf_vports_info(esw);
> }
>
> + mlx5_esw_reps_unblock(esw);
> +
> if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
> struct devlink *devlink = priv_to_devlink(esw->dev);
>
> @@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
>
> devl_assert_locked(priv_to_devlink(esw->dev));
> mlx5_lag_disable_change(esw->dev);
> +
> + mlx5_esw_reps_block(esw);
> mlx5_eswitch_disable_locked(esw);
> + mlx5_esw_reps_unblock(esw);
> +
> esw->mode = MLX5_ESWITCH_LEGACY;
> mlx5_lag_enable_change(esw->dev);
> }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 6a5143b63dfd..d4ac07c995b9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -36,6 +36,7 @@
> #include <linux/mlx5/mlx5_ifc.h>
> #include <linux/mlx5/vport.h>
> #include <linux/mlx5/fs.h>
> +#include <linux/lockdep.h>
> #include "mlx5_core.h"
> #include "eswitch.h"
> #include "esw/indir_table.h"
> @@ -2413,11 +2414,21 @@ static int esw_create_restore_table(struct mlx5_eswitch *esw)
> return err;
> }
>
> +static void mlx5_esw_assert_reps_locked(struct mlx5_eswitch *esw)
> +{
> + lockdep_assert_held(&esw->offloads.reps_lock);
> +}
> +
> void mlx5_esw_reps_block(struct mlx5_eswitch *esw)
> {
> mutex_lock(&esw->offloads.reps_lock);
> }
>
> +static void mlx5_esw_reps_block_nested(struct mlx5_eswitch *esw)
> +{
> + mutex_lock_nested(&esw->offloads.reps_lock, SINGLE_DEPTH_NESTING);
> +}
> +
> void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
> {
> mutex_unlock(&esw->offloads.reps_lock);
> @@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
>
> static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
> {
> + mlx5_esw_reps_unblock(esw);
> mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
> if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
> mlx5_core_mp_enabled(esw->dev)) {
> esw->mode = mode;
> - mlx5_rescan_drivers_locked(esw->dev);
> - mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> - return;
> + goto out;
> }
>
> esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> mlx5_rescan_drivers_locked(esw->dev);
> esw->mode = mode;
> esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> +out:
> mlx5_rescan_drivers_locked(esw->dev);
> mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> + mlx5_esw_reps_block(esw);
> }
>
> static void mlx5_esw_fdb_drop_destroy(struct mlx5_eswitch *esw)
> @@ -2776,6 +2788,8 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
> static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
> struct mlx5_eswitch_rep *rep, u8 rep_type)
> {
> + mlx5_esw_assert_reps_locked(esw);
> +
> if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
> REP_REGISTERED, REP_LOADED) == REP_REGISTERED)
> return esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
> @@ -2786,6 +2800,8 @@ static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
> static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
> struct mlx5_eswitch_rep *rep, u8 rep_type)
> {
> + mlx5_esw_assert_reps_locked(esw);
> +
> if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
> REP_LOADED, REP_REGISTERED) == REP_LOADED) {
> if (rep_type == REP_ETH)
> @@ -3691,6 +3707,7 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
> if (new_num_vfs == esw->esw_funcs.num_vfs || host_pf_disabled)
> goto free;
>
> + mlx5_esw_reps_block(esw);
> /* Number of VFs can only change from "0 to x" or "x to 0". */
> if (esw->esw_funcs.num_vfs > 0) {
> mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
> @@ -3700,9 +3717,11 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
> err = mlx5_eswitch_load_vf_vports(esw, new_num_vfs,
> MLX5_VPORT_UC_ADDR_CHANGE);
> if (err)
> - goto free;
> + goto unblock;
> }
> esw->esw_funcs.num_vfs = new_num_vfs;
> +unblock:
> + mlx5_esw_reps_unblock(esw);
> free:
> kvfree(out);
> }
> @@ -4188,9 +4207,14 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
> goto unlock;
> }
>
> + /* Keep mode_lock and reps_lock unnested. The operation flag excludes
> + * mode users while mode_lock is dropped before taking reps_lock.
> + */
> esw->eswitch_operation_in_progress = true;
> up_write(&esw->mode_lock);
>
> + mlx5_esw_reps_block(esw);
> +
> if (mlx5_mode == MLX5_ESWITCH_OFFLOADS &&
> !mlx5_devlink_netdev_netns_immutable_set(devlink, true)) {
> NL_SET_ERR_MSG_MOD(extack,
> @@ -4223,6 +4247,10 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
> skip:
> if (mlx5_mode == MLX5_ESWITCH_OFFLOADS && err)
> mlx5_devlink_netdev_netns_immutable_set(devlink, false);
> + /* Reconfiguration is done; drop reps_lock before taking mode_lock again
> + * to clear the operation flag.
> + */
> + mlx5_esw_reps_unblock(esw);
> down_write(&esw->mode_lock);
> esw->eswitch_operation_in_progress = false;
> unlock:
> @@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
> return true;
> }
>
> -void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
> - const struct mlx5_eswitch_rep_ops *ops,
> - u8 rep_type)
> +static void
> +mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
> + const struct mlx5_eswitch_rep_ops *ops,
> + u8 rep_type)
> {
> struct mlx5_eswitch_rep_data *rep_data;
> struct mlx5_eswitch_rep *rep;
> @@ -4513,9 +4542,40 @@ void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
> }
> }
> }
> +
> +static void
> +mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
> + const struct mlx5_eswitch_rep_ops *ops,
> + u8 rep_type, bool nested)
> +{
> + if (nested)
> + mlx5_esw_reps_block_nested(esw);
> + else
> + mlx5_esw_reps_block(esw);
> + mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
> + mlx5_esw_reps_unblock(esw);
> +}
> +
> +void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
> + const struct mlx5_eswitch_rep_ops *ops,
> + u8 rep_type)
> +{
> + mlx5_eswitch_register_vport_reps_locked(esw, ops, rep_type, false);
> +}
> EXPORT_SYMBOL(mlx5_eswitch_register_vport_reps);
>
> -void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
> +void
> +mlx5_eswitch_register_vport_reps_nested(struct mlx5_eswitch *esw,
> + const struct mlx5_eswitch_rep_ops *ops,
> + u8 rep_type)
> +{
> + mlx5_eswitch_register_vport_reps_locked(esw, ops, rep_type, true);
> +}
> +EXPORT_SYMBOL(mlx5_eswitch_register_vport_reps_nested);
> +
> +static void
> +mlx5_eswitch_unregister_vport_reps_blocked(struct mlx5_eswitch *esw,
> + u8 rep_type)
> {
> struct mlx5_eswitch_rep *rep;
> unsigned long i;
> @@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
>
> mlx5_esw_for_each_rep(esw, i, rep)
> atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
> +
> + esw->offloads.rep_ops[rep_type] = NULL;
sashiko.dev says:
"
Could this assignment cause a NULL pointer dereference in concurrent readers?
In mlx5_eswitch_get_proto_dev(), the state is checked before accessing the ops
pointer without holding reps_lock:
if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
esw->offloads.rep_ops[rep_type]->get_proto_dev)
return esw->offloads.rep_ops[rep_type]->get_proto_dev(rep);
If a thread in mlx5_eswitch_get_proto_dev() evaluates the state check to true
and is then preempted, can the unregister path execute and set rep_ops to NULL?
When the preempted thread resumes, it might dereference the now-NULL pointer.
Also, since the ops pointer isn't fetched into a local variable using
READ_ONCE(), could the compiler emit multiple loads, further widening the
race window?
"
The REP_LOADED check is not the only protection here: get_proto_dev()
is only reached from representor-owned contexts, and unregister first
unloads all reps under reps_lock. That unload tears down the users
that can call into this helper before the state is set to REP_UNREGISTERED
and before rep_ops is cleared. So clearing rep_ops does not open a new
live-reader window; it only removes the stale ops pointer after the
representor lifecycle has already been quiesced.
Mark
> +}
> +
> +static void
> +mlx5_eswitch_unregister_vport_reps_locked(struct mlx5_eswitch *esw,
> + u8 rep_type, bool nested)
> +{
> + if (nested)
> + mlx5_esw_reps_block_nested(esw);
> + else
> + mlx5_esw_reps_block(esw);
> + mlx5_eswitch_unregister_vport_reps_blocked(esw, rep_type);
> + mlx5_esw_reps_unblock(esw);
> +}
> +
> +void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
> +{
> + mlx5_eswitch_unregister_vport_reps_locked(esw, rep_type, false);
> }
> EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_reps);
>
> +void mlx5_eswitch_unregister_vport_reps_nested(struct mlx5_eswitch *esw,
> + u8 rep_type)
> +{
> + mlx5_eswitch_unregister_vport_reps_locked(esw, rep_type, true);
> +}
> +EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_reps_nested);
> +
> void *mlx5_eswitch_get_uplink_priv(struct mlx5_eswitch *esw, u8 rep_type)
> {
> struct mlx5_eswitch_rep *rep;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
> index 8503e532f423..2fc69897e35b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
> @@ -245,8 +245,10 @@ static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
> if (IS_ERR(sf))
> return PTR_ERR(sf);
>
> + mlx5_esw_reps_block(esw);
> err = mlx5_eswitch_load_sf_vport(esw, sf->hw_fn_id, MLX5_VPORT_UC_ADDR_CHANGE,
> &sf->dl_port, new_attr->controller, new_attr->sfnum);
> + mlx5_esw_reps_unblock(esw);
> if (err)
> goto esw_err;
> *dl_port = &sf->dl_port.dl_port;
> @@ -367,7 +369,10 @@ int mlx5_devlink_sf_port_del(struct devlink *devlink,
> struct mlx5_sf_table *table = dev->priv.sf_table;
> struct mlx5_sf *sf = mlx5_sf_by_dl_port(dl_port);
>
> + mlx5_esw_reps_block(dev->priv.eswitch);
> mlx5_sf_del(table, sf);
> + mlx5_esw_reps_unblock(dev->priv.eswitch);
> +
> return 0;
> }
>
> diff --git a/include/linux/mlx5/eswitch.h b/include/linux/mlx5/eswitch.h
> index 3b29a3c6794d..a0dd162baa78 100644
> --- a/include/linux/mlx5/eswitch.h
> +++ b/include/linux/mlx5/eswitch.h
> @@ -63,7 +63,13 @@ struct mlx5_eswitch_rep {
> void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
> const struct mlx5_eswitch_rep_ops *ops,
> u8 rep_type);
> +void
> +mlx5_eswitch_register_vport_reps_nested(struct mlx5_eswitch *esw,
> + const struct mlx5_eswitch_rep_ops *ops,
> + u8 rep_type);
> void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type);
> +void mlx5_eswitch_unregister_vport_reps_nested(struct mlx5_eswitch *esw,
> + u8 rep_type);
> void *mlx5_eswitch_get_proto_dev(struct mlx5_eswitch *esw,
> u16 vport_num,
> u8 rep_type);
* Re: [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types
2026-05-01 4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
@ 2026-05-02 20:06 ` Mark Bloch
0 siblings, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-02 20:06 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Shay Drory,
Or Har-Toov, Edward Srouji, Maher Sanalla, Simon Horman,
Gerd Bayer, Moshe Shemesh, Kees Cook, Patrisious Haddad,
Parav Pandit, Carolina Jubran, Cosmin Ratiu, linux-rdma,
linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> __esw_offloads_load_rep() may return success without invoking the
> representor load callback when the representor type is already loaded.
>
> On a later load failure, mlx5_esw_offloads_rep_load() unconditionally
> unloaded all previously iterated representor types. This could unload
> representor types that were already loaded before this load attempt.
>
> Track which representor types were actually loaded by the current call and
> unwind only those on error. Also restore the representor state back to
> REP_REGISTERED when the load callback itself fails.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
sashiko.dev says:
"
Note: Due to recitation restrictions, the standard inline patch quoting format
has been altered to a free-form summary.
commit ca89c8e618c910a6810fe7127853404ce151ae6b
Author: Mark Bloch <mbloch@nvidia.com>
net/mlx5: E-Switch, unwind only newly loaded representor types
This commit updates the representor load error handling to track which
representor types were actually loaded by the current call and unwind only
those on error. It also restores the state to REP_REGISTERED if the load
callback fails.
In the modified __esw_offloads_load_rep() function, the representor state
is set to REP_LOADED via atomic_cmpxchg() before the actual load() callback
is invoked. Does this sequence expose a race condition with lockless readers?
If lockless readers, such as mlx5_eswitch_get_proto_dev() or
mlx5_esw_ipsec_restore_dest_uplink(), check the state and observe REP_LOADED
while the load callback is still executing, can they access uninitialized
private data resulting in a NULL pointer dereference?
Additionally, if the load() callback subsequently fails and frees the
allocated private data, could a lockless reader that already observed the
"
The REP_LOADED-before-load() ordering is pre-existing and orthogonal
to this unwind fix. I'll look into it and, if needed, address it in
a separate patch.
Mark
> .../mellanox/mlx5/core/eswitch_offloads.c | 38 ++++++++++++++-----
> 1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index d4ac07c995b9..8f656253981b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -2786,13 +2786,28 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
> }
>
> static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
> - struct mlx5_eswitch_rep *rep, u8 rep_type)
> + struct mlx5_eswitch_rep *rep,
> + u8 rep_type, bool *newly_loaded)
> {
> + int err;
> +
> mlx5_esw_assert_reps_locked(esw);
>
> + if (newly_loaded)
> + *newly_loaded = false;
> +
> if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
> - REP_REGISTERED, REP_LOADED) == REP_REGISTERED)
> - return esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
> + REP_REGISTERED, REP_LOADED) != REP_REGISTERED)
> + return 0;
> +
> + err = esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
> + if (err) {
> + atomic_set(&rep->rep_data[rep_type].state, REP_REGISTERED);
> + return err;
> + }
> +
> + if (newly_loaded)
> + *newly_loaded = true;
>
> return 0;
> }
> @@ -2822,22 +2837,27 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
> static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
> {
> struct mlx5_eswitch_rep *rep;
> + unsigned long loaded = 0;
> + bool newly_loaded;
> int rep_type;
> int err;
>
> rep = mlx5_eswitch_get_rep(esw, vport_num);
> for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) {
> - err = __esw_offloads_load_rep(esw, rep, rep_type);
> + err = __esw_offloads_load_rep(esw, rep, rep_type,
> + &newly_loaded);
> if (err)
> goto err_reps;
> + if (newly_loaded)
> + loaded |= BIT(rep_type);
> }
>
> return 0;
>
> err_reps:
> - atomic_set(&rep->rep_data[rep_type].state, REP_REGISTERED);
> - for (--rep_type; rep_type >= 0; rep_type--)
> - __esw_offloads_unload_rep(esw, rep, rep_type);
> + while (--rep_type >= 0)
> + if (test_bit(rep_type, &loaded))
> + __esw_offloads_unload_rep(esw, rep, rep_type);
> return err;
> }
>
> @@ -3591,13 +3611,13 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
> if (atomic_read(&rep->rep_data[REP_ETH].state) != REP_LOADED)
> return 0;
>
> - ret = __esw_offloads_load_rep(esw, rep, REP_IB);
> + ret = __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
> if (ret)
> return ret;
>
> mlx5_esw_for_each_rep(esw, i, rep) {
> if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
> - __esw_offloads_load_rep(esw, rep, REP_IB);
> + __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
> }
>
> return 0;
* Re: [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
@ 2026-05-02 20:07 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
1 sibling, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-02 20:07 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Shay Drory,
Or Har-Toov, Edward Srouji, Maher Sanalla, Simon Horman,
Gerd Bayer, Moshe Shemesh, Kees Cook, Patrisious Haddad,
Parav Pandit, Carolina Jubran, Cosmin Ratiu, linux-rdma,
linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> mlx5_eswitch_register_vport_reps() only installs representor callbacks and
> marks the rep type as registered. If the E-Switch is already in switchdev
> mode, the newly registered rep type must then be loaded for already enabled
> vports.
>
> That load path needs to run under the devlink lock, which is not held by
> the auxiliary driver registration context. Queue the reload to the E-Switch
> workqueue, whose handler acquires the devlink lock, and load the relevant
> representors from there.
>
> The unregister path is unchanged and still unloads representors
> synchronously while tearing down the registered callbacks.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
sashiko.dev says:
"
[Note: The format of this report is altered due to recitation restrictions. Code quoting and inline diffs have been omitted.]
commit f4f3e4872e599ef6d13a1f2bacffd52e6dbc8d37
Author: Mark Bloch <mbloch@nvidia.com>
net/mlx5: E-switch, load reps via work queue after registration
This commit defers the loading of representors for already enabled vports to
a workqueue when the E-Switch is in switchdev mode. This ensures the load
path runs under the devlink lock, which is not held during auxiliary driver
registration.
I have a couple of questions regarding error handling in this update.
Looking at mlx5_eswitch_register_vport_reps_locked(), the return value of
mlx5_esw_add_work() is not checked. Since mlx5_esw_add_work() uses
kzalloc_obj() with GFP_ATOMIC, it can fail under memory pressure and return
-ENOMEM. If this allocation fails, the background work to load the
representors is never queued, but the representors are already marked as
REP_REGISTERED. Could this result in a NULL pointer dereference if callers
attempt to access the representor data under the assumption that it was
successfully initialized?
Additionally, in mlx5_eswitch_reload_reps_blocked(), failures from
mlx5_esw_offloads_rep_load() appear to be silently ignored. If the call
fails for MLX5_VPORT_UPLINK, the function aborts early without loading the
other vports. Within the loop for the remaining vports, the return value is
also ignored. Does this leave the system in a partially loaded and
inconsistent state, where representors are registered but uninitialized for
some vports?
"
REP_REGISTERED does not imply initialized representor data. Users that need
private data only proceed on REP_LOADED, and a failed load leaves the state
at REP_REGISTERED. The queued reload covers late rep-ops registration while
the E-Switch is already in switchdev mode, where there is no synchronous
error path back to the auxiliary probe, so it is intentionally opportunistic.
The strict path is still the switchdev transition: there, representor load
errors are propagated and the mode change is aborted, so this does not
change the consistency model.
Mark
> .../mellanox/mlx5/core/eswitch_offloads.c | 34 +++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 8f656253981b..f26d1652dd05 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -4563,6 +4563,38 @@ mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
> }
> }
>
> +static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
> +{
> + struct mlx5_vport *vport;
> + unsigned long i;
> +
> + if (esw->mode != MLX5_ESWITCH_OFFLOADS)
> + return;
> +
> + if (mlx5_esw_offloads_rep_load(esw, MLX5_VPORT_UPLINK))
> + return;
> +
> + mlx5_esw_for_each_vport(esw, i, vport) {
> + if (!vport)
> + continue;
> + if (!vport->enabled)
> + continue;
> + if (vport->vport == MLX5_VPORT_UPLINK)
> + continue;
> + if (!mlx5_eswitch_vport_has_rep(esw, vport->vport))
> + continue;
> +
> + mlx5_esw_offloads_rep_load(esw, vport->vport);
> + }
> +}
> +
> +static void mlx5_eswitch_reload_reps(struct mlx5_eswitch *esw)
> +{
> + mlx5_esw_reps_block(esw);
> + mlx5_eswitch_reload_reps_blocked(esw);
> + mlx5_esw_reps_unblock(esw);
> +}
> +
> static void
> mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
> const struct mlx5_eswitch_rep_ops *ops,
> @@ -4574,6 +4606,8 @@ mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
> mlx5_esw_reps_block(esw);
> mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
> mlx5_esw_reps_unblock(esw);
> +
> + mlx5_esw_add_work(esw, mlx5_eswitch_reload_reps);
> }
>
> void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-01 4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
@ 2026-05-02 20:08 ` Mark Bloch
2026-05-03 1:41 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Mark Bloch @ 2026-05-02 20:08 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed, Shay Drory,
Or Har-Toov, Edward Srouji, Maher Sanalla, Simon Horman,
Gerd Bayer, Moshe Shemesh, Kees Cook, Patrisious Haddad,
Parav Pandit, Carolina Jubran, Cosmin Ratiu, linux-rdma,
linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> Deployments that always operate in switchdev mode currently require
> manual devlink configuration after driver probe, which complicates
> automated provisioning.
>
> Introduce MLX5_PROF_MASK_DEF_SWITCHDEV, a new profile mask bit, and
> profile index 8. When a device is initialized or reloaded with this
> profile, the driver automatically switches the e-switch to switchdev
> mode by calling mlx5_devlink_eswitch_mode_set() immediately after
> bringing the device online.
>
> A no-op stub of mlx5_devlink_eswitch_mode_set() is added for builds
> without CONFIG_MLX5_ESWITCH.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> .../net/ethernet/mellanox/mlx5/core/eswitch.h | 6 +++
> .../net/ethernet/mellanox/mlx5/core/main.c | 43 ++++++++++++++++++-
> include/linux/mlx5/driver.h | 2 +
> 3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index 3858690e09b4..cfb9595f9de8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -1049,6 +1049,12 @@ mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
> return ERR_PTR(-EOPNOTSUPP);
> }
>
> +static inline int
> +mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
> + struct netlink_ext_ack *extack)
> +{
> + return -EOPNOTSUPP;
> +}
> #endif /* CONFIG_MLX5_ESWITCH */
>
> #endif /* __MLX5_ESWITCH_H__ */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 74827e8ca125..4cdda15ed7f5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -86,7 +86,7 @@ MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec t
>
> static unsigned int prof_sel = MLX5_DEFAULT_PROF;
> module_param_named(prof_sel, prof_sel, uint, 0444);
> -MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
> +MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 3 and 8");
sashiko.dev says:
"
Does using a module parameter to configure e-switch mode bypass the standard
devlink uAPI?
The networking subsystem typically standardizes this configuration via
netlink/devlink interfaces, and relying on a module parameter for this might
fragment automated provisioning workflows.
"
The target here is DPU-side deployments where switchdev is the only valid
operating mode, and the goal is to boot directly into that mode instead
of relying on userspace scripts after probe. I agree devlink remains the
runtime uAPI. This profile is only an opt-in transition step so we can
avoid forcing switchdev in code before users have a way to reject that behavior.
It does not replace devlink provisioning; it provides a DPU-side default for
environments that explicitly choose this profile.
>
> static u32 sw_owner_id[4];
> #define MAX_SW_VHCA_ID (BIT(__mlx5_bit_sz(cmd_hca_cap_2, sw_vhca_id)) - 1)
> @@ -99,6 +99,8 @@ enum {
>
> #define LOG_MAX_SUPPORTED_QPS 0xff
>
> +#define MLX5_PROF_SEL_LAST_NIC 3
> +#define MLX5_PROF_SEL_FIRST_ESW 8
> static struct mlx5_profile profile[] = {
> [0] = {
> .mask = 0,
> @@ -120,6 +122,11 @@ static struct mlx5_profile profile[] = {
> .log_max_qp = LOG_MAX_SUPPORTED_QPS,
> .num_cmd_caches = 0,
> },
> + [8] = {
> + .mask = MLX5_PROF_MASK_DEF_SWITCHDEV | MLX5_PROF_MASK_QP_SIZE,
> + .log_max_qp = LOG_MAX_SUPPORTED_QPS,
> + .num_cmd_caches = MLX5_NUM_COMMAND_CACHES,
> + },
> };
>
> static int wait_fw_init(struct mlx5_core_dev *dev, u32 max_wait_mili,
> @@ -1385,6 +1392,22 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
> mlx5_free_bfreg(dev, &dev->priv.bfreg);
> }
>
> +static void mlx5_set_default_switchdev(struct mlx5_core_dev *dev)
> +{
> + int err;
> +
> + /* Default switchdev is best-effort; keep the device usable on
> + * failure.
> + */
> + err = mlx5_devlink_eswitch_mode_set(priv_to_devlink(dev),
> + DEVLINK_ESWITCH_MODE_SWITCHDEV,
> + NULL);
> + if (err && err != -EOPNOTSUPP)
> + mlx5_core_warn(dev,
> + "Failed to set switchdev as default, continuing in current mode, err(%d)\n",
> + err);
> +}
> +
> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> {
> bool light_probe = mlx5_dev_is_lightweight(dev);
> @@ -1431,6 +1454,10 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>
> mutex_unlock(&dev->intf_state_mutex);
> +
> + if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
> + mlx5_set_default_switchdev(dev);
> +
> return 0;
sashiko.dev says:
"
If a user explicitly sets the e-switch mode to legacy via devlink after
initialization, will this override their setting during driver reload or
firmware error recovery?
Since mlx5_set_default_switchdev() is called unconditionally here based on
the profile mask, it seems like it could silently revert the device back to
switchdev mode, discarding the active user configuration.
"
Yes, with this profile selected, switchdev is intentionally reapplied after
reload or recovery. A devlink change to legacy affects the current lifetime,
but the selected profile defines the default mode after the device is
reinitialized; users that want legacy to persist should not use this profile.
>
> err_register:
> @@ -1532,6 +1559,10 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
> goto err_attach;
>
> mutex_unlock(&dev->intf_state_mutex);
> +
> + if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
> + mlx5_set_default_switchdev(dev);
> +
> return 0;
>
> err_attach:
> @@ -2314,6 +2345,16 @@ static void mlx5_core_verify_params(void)
> MLX5_DEFAULT_PROF);
> prof_sel = MLX5_DEFAULT_PROF;
> }
> +
> + if (prof_sel > MLX5_PROF_SEL_LAST_NIC &&
> + prof_sel < MLX5_PROF_SEL_FIRST_ESW) {
> + pr_warn("mlx5_core: WARNING: Invalid module parameter prof_sel %d invalid range %d - %d, changing back to default (%d)\n",
> + prof_sel,
> + MLX5_PROF_SEL_LAST_NIC + 1,
> + MLX5_PROF_SEL_FIRST_ESW - 1,
> + MLX5_DEFAULT_PROF);
> + prof_sel = MLX5_DEFAULT_PROF;
> + }
> }
>
> static int __init mlx5_init(void)
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index 04b96c5abb57..65298c07df4d 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -705,6 +705,8 @@ struct mlx5_st;
>
> enum {
> MLX5_PROF_MASK_QP_SIZE = (u64)1 << 0,
> + MLX5_PROF_MASK_MR_CACHE = (u64)1 << 1,
> + MLX5_PROF_MASK_DEF_SWITCHDEV = (u64)1 << 2,
> };
sashiko.dev says:
"
This isn't a bug, but it looks like MLX5_PROF_MASK_MR_CACHE is introduced
here but never used in the driver code. Is this mask intended for a future
patch?
"
Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
whether the opt-in profile approach is acceptable at all. Regardless
of this last patch, the first 6 patches fix real representor/LAG locking
issues and are needed independently, so I’d like to keep those moving toward
acceptance as soon as possible.
Mark
>
> struct mlx5_profile {
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-02 20:08 ` Mark Bloch
@ 2026-05-03 1:41 ` Jakub Kicinski
2026-05-03 7:51 ` Mark Bloch
0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-03 1:41 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On Sat, 2 May 2026 23:08:43 +0300 Mark Bloch wrote:
> Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
> whether the opt-in profile approach is acceptable at all. Regardless
> of this last patch, the first 6 patches fix real representor/LAG locking
> issues and are needed independently, so I’d like to keep those moving toward
> acceptance as soon as possible.
For probe-time config a module param is probably our only option.
I'd obviously prefer to have a devlink-level knob for this, instead
of an mlx5-specific one. Can we come up with some format that'd apply
more broadly? devlink=[$bdf:]flag1 ? so devlink=[$bdf:]switchdev-mode ?
BTW looks like issues Sashiko/Claude finds are slightly different,
let me send them out.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
2026-05-02 20:05 ` Mark Bloch
@ 2026-05-03 1:42 ` Jakub Kicinski
2026-05-03 8:18 ` Mark Bloch
1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-03 1:42 UTC (permalink / raw)
To: tariqt
Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, leon, jgg,
saeedm, mbloch, shayd, ohartoov, edwards, msanalla, horms, gbayer,
moshe, kees, phaddad, parav, cjubran, cratiu, linux-rdma,
linux-kernel, netdev, gal, dtatulea
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: E-Switch, serialize representor lifecycle
This change replaces the ad-hoc bit/wait-queue coordination of E-Switch
representor lifecycle with the per-E-Switch reps_lock, adds nested
wrappers for the shared-FDB master IB path, and clears the rep_ops
pointer on unregister.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 66a773a99876..f70737437954 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
[ ... ]
> @@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
>
> devl_assert_locked(priv_to_devlink(esw->dev));
> mlx5_lag_disable_change(esw->dev);
> +
> + mlx5_esw_reps_block(esw);
> mlx5_eswitch_disable_locked(esw);
> + mlx5_esw_reps_unblock(esw);
> +
> esw->mode = MLX5_ESWITCH_LEGACY;
> mlx5_lag_enable_change(esw->dev);
> }
The commit message says "Take the representor lock around the mode
transition", but the assignment esw->mode = MLX5_ESWITCH_LEGACY happens
here after reps_lock has been dropped.
Meanwhile mlx5_eswitch_unregister_vport_reps_blocked() now reads
esw->mode under reps_lock to decide whether to call
__unload_reps_all_vport():
if (esw->mode == MLX5_ESWITCH_OFFLOADS)
__unload_reps_all_vport(esw, rep_type);
Should the mode store be moved inside the reps_block/reps_unblock
region so the reader's assumption about stability under reps_lock
holds?
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 6a5143b63dfd..d4ac07c995b9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
[ ... ]
> @@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
>
> static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
> {
> + mlx5_esw_reps_unblock(esw);
> mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
> if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
> mlx5_core_mp_enabled(esw->dev)) {
> esw->mode = mode;
> - mlx5_rescan_drivers_locked(esw->dev);
> - mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> - return;
> + goto out;
> }
>
> esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> mlx5_rescan_drivers_locked(esw->dev);
> esw->mode = mode;
> esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> +out:
> mlx5_rescan_drivers_locked(esw->dev);
> mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> + mlx5_esw_reps_block(esw);
> }
Both assignments to esw->mode in this function are reached after
reps_lock has been dropped at the top of the function.
mlx5_rescan_drivers_locked() can invoke driver bind/unbind which routes
through mlx5_eswitch_unregister_vport_reps(), and that path now takes
reps_lock and reads esw->mode to decide whether to call
__unload_reps_all_vport().
Can a rep-unregister that runs during the rescan observe a stale
esw->mode here and make the wrong unload decision?
[ ... ]
> @@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
[ ... ]
> @@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
>
> mlx5_esw_for_each_rep(esw, i, rep)
> atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
> +
> + esw->offloads.rep_ops[rep_type] = NULL;
> +}
This new NULL store on unregister appears to introduce a concurrent
NULL dereference in readers that do not hold reps_lock. Before this
patch, rep_ops[rep_type] was set once at register and never cleared,
so a REP_LOADED state check was sufficient to know the pointer was
valid.
mlx5_eswitch_get_proto_dev() is one such reader:
rep = mlx5_eswitch_get_rep(esw, vport);
if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
esw->offloads.rep_ops[rep_type]->get_proto_dev)
return esw->offloads.rep_ops[rep_type]->get_proto_dev(rep);
The state check and each pointer load here are independent reads. Can
a concurrent unregister on another CPU run between the state load and
the ->get_proto_dev dereference, transitioning the rep through
REP_REGISTERED/REP_UNREGISTERED and then storing NULL into rep_ops,
so the second load observes NULL?
mlx5_esw_offloads_rep_event_unpair() and mlx5_esw_offloads_pair() have
the same shape:
ops = esw->offloads.rep_ops[rep_type];
if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
ops->event)
ops->event(esw, rep, ...);
Neither path holds reps_lock; they run from the devcom PAIR/UNPAIR
event handler. Can ops be NULL here once an unregister has cleared
rep_ops?
If readers are expected to be safe, should they either take reps_lock,
be converted to RCU, or skip the NULL clear on unregister?
[ ... ]
--
pw-bot: cr
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-02 20:07 ` Mark Bloch
@ 2026-05-03 1:42 ` Jakub Kicinski
2026-05-03 8:01 ` Mark Bloch
1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-03 1:42 UTC (permalink / raw)
To: tariqt
Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, leon, jgg,
saeedm, mbloch, shayd, ohartoov, edwards, msanalla, horms, gbayer,
moshe, kees, phaddad, parav, cjubran, cratiu, linux-rdma,
linux-kernel, netdev, gal, dtatulea
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: E-switch, load reps via work queue after registration
This change defers loading of newly registered representors to the
E-Switch work queue so the handler can acquire the devlink lock that is
not held by the auxiliary driver registration context. The unregister
path continues to unload representors synchronously.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 8f656253981b..f26d1652dd05 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -4563,6 +4563,38 @@ mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
[ ... ]
> +static void mlx5_eswitch_reload_reps(struct mlx5_eswitch *esw)
> +{
> + mlx5_esw_reps_block(esw);
> + mlx5_eswitch_reload_reps_blocked(esw);
> + mlx5_esw_reps_unblock(esw);
> +}
> +
> static void
> mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
> const struct mlx5_eswitch_rep_ops *ops,
> @@ -4574,6 +4606,8 @@ mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
> mlx5_esw_reps_block(esw);
> mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
> mlx5_esw_reps_unblock(esw);
> +
> + mlx5_esw_add_work(esw, mlx5_eswitch_reload_reps);
The return value of mlx5_esw_add_work() is discarded here.
Looking at mlx5_esw_add_work():
static int mlx5_esw_add_work(struct mlx5_eswitch *esw,
void (*func)(struct mlx5_eswitch *esw))
{
struct mlx5_host_work *host_work;
host_work = kzalloc_obj(*host_work, GFP_ATOMIC);
if (!host_work)
return -ENOMEM;
...
}
the allocation is GFP_ATOMIC and can fail, in which case no work is
queued.
If mlx5_esw_add_work() returns -ENOMEM while the E-Switch is already in
MLX5_ESWITCH_OFFLOADS mode, the reload that mlx5_eswitch_reload_reps()
was meant to perform never runs, so the newly registered rep type will
not be loaded for the already-enabled vports — which is the exact case
this commit is introducing infrastructure to handle.
Registration still returns success to the auxiliary driver
(mlx5e_rep_probe / mlx5r_rep_probe), with no log message, no retry, and
no way for the caller to notice the dropped reload.
The other existing caller, mlx5_esw_funcs_changed_handler(), does check
the return value of mlx5_esw_add_work(), suggesting allocation failure
is expected to be handled.
Should this call check the return value and at least emit a warning or
retry, so that a GFP_ATOMIC allocation failure during switchdev-mode
registration does not silently leave the representors unloaded?
> }
>
> void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-03 1:41 ` Jakub Kicinski
@ 2026-05-03 7:51 ` Mark Bloch
2026-05-05 1:21 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Mark Bloch @ 2026-05-03 7:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 03/05/2026 4:41, Jakub Kicinski wrote:
> On Sat, 2 May 2026 23:08:43 +0300 Mark Bloch wrote:
>> Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
>> whether the opt-in profile approach is acceptable at all. Regardless
>> of this last patch, the first 6 patches fix real representor/LAG locking
>> issues and are needed independently, so I’d like to keep those moving toward
>> acceptance as soon as possible.
>
> For probe-time config module param is probably our only option.
> I'd obviously prefer to have a devlink-level knob for this, instead
> of a mlx5 specific one. Can we come up with some format that'd apply
> more broadly? devlink=[$bfd:]flag1 ? so devlink=[$bdf:]switchdev-mode ?
I’m not convinced this is really a generic devlink knob problem.
A device should probe in its selected/default configuration. For DPU
deployments switchdev is the expected operating mode. mlx5 just made the
wrong default choice historically, and this profile is a way to move away
from that without forcing it on everyone at once. I expect/hope to move
quickly from this flag to simply making switchdev the driver default for
all DPU configs.
A generic cmdline format also gets complicated quickly: vendor-specific
flags, ordering/dependencies between flags, hotplug timing, and whether a
BDF rule should apply when a device is passed into a VM after boot.
Userspace scripts are probably better for that kind of policy because
they can carry real site-specific logic.
I’ll drop this last patch from the series for now so the representor/LAG
locking fixes can move independently and we can continue the default
switchdev discussion separately. I can always submit that as a standalone
patch later in the cycle if needed.
>
> BTW looks like issues Sashiko/Claude finds are slightly different,
> let me send them out.
Right, we saw that as well. That is expected, since the
comments depend on the model being used, and can even differ between
runs of the same model.
I saw on patchwork that Sashiko/NIPA run had timed out, so I did not
have those comments when I replied. I’ll go over the additional
comments you've sent, thanks!
Mark
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration
2026-05-03 1:42 ` Jakub Kicinski
@ 2026-05-03 8:01 ` Mark Bloch
0 siblings, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-03 8:01 UTC (permalink / raw)
To: Jakub Kicinski, tariqt
Cc: edumazet, pabeni, andrew+netdev, davem, leon, jgg, saeedm, shayd,
ohartoov, edwards, msanalla, horms, gbayer, moshe, kees, phaddad,
parav, cjubran, cratiu, linux-rdma, linux-kernel, netdev, gal,
dtatulea
On 03/05/2026 4:42, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: E-switch, load reps via work queue after registration
>
> This change defers loading of newly registered representors to the
> E-Switch work queue so the handler can acquire the devlink lock that is
> not held by the auxiliary driver registration context. The unregister
> path continues to unload representors synchronously.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> index 8f656253981b..f26d1652dd05 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> @@ -4563,6 +4563,38 @@ mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
>
> [ ... ]
>
>> +static void mlx5_eswitch_reload_reps(struct mlx5_eswitch *esw)
>> +{
>> + mlx5_esw_reps_block(esw);
>> + mlx5_eswitch_reload_reps_blocked(esw);
>> + mlx5_esw_reps_unblock(esw);
>> +}
>> +
>> static void
>> mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
>> const struct mlx5_eswitch_rep_ops *ops,
>> @@ -4574,6 +4606,8 @@ mlx5_eswitch_register_vport_reps_locked(struct mlx5_eswitch *esw,
>> mlx5_esw_reps_block(esw);
>> mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
>> mlx5_esw_reps_unblock(esw);
>> +
>> + mlx5_esw_add_work(esw, mlx5_eswitch_reload_reps);
>
> The return value of mlx5_esw_add_work() is discarded here.
>
> Looking at mlx5_esw_add_work():
>
> static int mlx5_esw_add_work(struct mlx5_eswitch *esw,
> void (*func)(struct mlx5_eswitch *esw))
> {
> struct mlx5_host_work *host_work;
>
> host_work = kzalloc_obj(*host_work, GFP_ATOMIC);
> if (!host_work)
> return -ENOMEM;
> ...
> }
>
> the allocation is GFP_ATOMIC and can fail, in which case no work is
> queued.
>
> If mlx5_esw_add_work() returns -ENOMEM while the E-Switch is already in
> MLX5_ESWITCH_OFFLOADS mode, the reload that mlx5_eswitch_reload_reps()
> was meant to perform never runs, so the newly registered rep type will
> not be loaded for the already-enabled vports — which is the exact case
> this commit is introducing infrastructure to handle.
>
> Registration still returns success to the auxiliary driver
> (mlx5e_rep_probe / mlx5r_rep_probe), with no log message, no retry, and
> no way for the caller to notice the dropped reload.
>
> The other existing caller, mlx5_esw_funcs_changed_handler(), does check
> the return value of mlx5_esw_add_work(), suggesting allocation failure
> is expected to be handled.
>
> Should this call check the return value and at least emit a warning or
> retry, so that a GFP_ATOMIC allocation failure during switchdev-mode
> registration does not silently leave the representors unloaded?
>
>> }
I don’t think this needs a warning. This reload is a best-effort path
for late rep_ops registration. REP_REGISTERED only means the ops are
installed; users still key off REP_LOADED before using rep private
data.
The work can also legitimately do nothing because the E-Switch
mode changed or the generation check made it stale. When registration
happens during an E-Switch rescan, the normal E-Switch path will try
to load the reps and report any real failure back to the user.
I’ll still make mlx5_esw_add_work() take a gfp_t, so this path can
use GFP_KERNEL instead of forcing GFP_ATOMIC, to reduce the risk of
allocation failure.
Mark
>>
>> void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
2026-05-03 1:42 ` Jakub Kicinski
@ 2026-05-03 8:18 ` Mark Bloch
0 siblings, 0 replies; 22+ messages in thread
From: Mark Bloch @ 2026-05-03 8:18 UTC (permalink / raw)
To: Jakub Kicinski, tariqt
Cc: edumazet, pabeni, andrew+netdev, davem, leon, jgg, saeedm, shayd,
ohartoov, edwards, msanalla, horms, gbayer, moshe, kees, phaddad,
parav, cjubran, cratiu, linux-rdma, linux-kernel, netdev, gal,
dtatulea
On 03/05/2026 4:42, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: E-Switch, serialize representor lifecycle
>
> This change replaces the ad-hoc bit/wait-queue coordination of E-Switch
> representor lifecycle with the per-E-Switch reps_lock, adds nested
> wrappers for the shared-FDB master IB path, and clears the rep_ops
> pointer on unregister.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index 66a773a99876..f70737437954 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>
> [ ... ]
>
>> @@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
>>
>> devl_assert_locked(priv_to_devlink(esw->dev));
>> mlx5_lag_disable_change(esw->dev);
>> +
>> + mlx5_esw_reps_block(esw);
>> mlx5_eswitch_disable_locked(esw);
>> + mlx5_esw_reps_unblock(esw);
>> +
>> esw->mode = MLX5_ESWITCH_LEGACY;
>> mlx5_lag_enable_change(esw->dev);
>> }
>
> The commit message says "Take the representor lock around the mode
> transition", but the assignment esw->mode = MLX5_ESWITCH_LEGACY happens
> here after reps_lock has been dropped.
>
> Meanwhile mlx5_eswitch_unregister_vport_reps_blocked() now reads
> esw->mode under reps_lock to decide whether to call
> __unload_reps_all_vport():
>
> if (esw->mode == MLX5_ESWITCH_OFFLOADS)
> __unload_reps_all_vport(esw, rep_type);
>
> Should the mode store be moved inside the reps_block/reps_unblock
> region so the reader's assumption about stability under reps_lock
> holds?
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> index 6a5143b63dfd..d4ac07c995b9 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>
> [ ... ]
>
>> @@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
>>
>> static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
>> {
>> + mlx5_esw_reps_unblock(esw);
>> mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
>> if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
>> mlx5_core_mp_enabled(esw->dev)) {
>> esw->mode = mode;
>> - mlx5_rescan_drivers_locked(esw->dev);
>> - mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
>> - return;
>> + goto out;
>> }
>>
>> esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
>> mlx5_rescan_drivers_locked(esw->dev);
>> esw->mode = mode;
>> esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
>> +out:
>> mlx5_rescan_drivers_locked(esw->dev);
>> mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
>> + mlx5_esw_reps_block(esw);
>> }
>
> Both assignments to esw->mode in this function are reached after
> reps_lock has been dropped at the top of the function.
>
> mlx5_rescan_drivers_locked() can invoke driver bind/unbind which routes
> through mlx5_eswitch_unregister_vport_reps(), and that path now takes
> reps_lock and reads esw->mode to decide whether to call
> __unload_reps_all_vport().
>
> Can a rep-unregister that runs during the rescan observe a stale
> esw->mode here and make the wrong unload decision?
>
> [ ... ]
>
>> @@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
>
> [ ... ]
>
>> @@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
>>
>> mlx5_esw_for_each_rep(esw, i, rep)
>> atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
>> +
>> + esw->offloads.rep_ops[rep_type] = NULL;
>> +}
>
> This new NULL store on unregister appears to introduce a concurrent
> NULL dereference in readers that do not hold reps_lock. Before this
> patch, rep_ops[rep_type] was set once at register and never cleared,
> so a REP_LOADED state check was sufficient to know the pointer was
> valid.
>
> mlx5_eswitch_get_proto_dev() is one such reader:
>
> rep = mlx5_eswitch_get_rep(esw, vport);
> if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
> esw->offloads.rep_ops[rep_type]->get_proto_dev)
> return esw->offloads.rep_ops[rep_type]->get_proto_dev(rep);
>
> The state check and each pointer load here are independent reads. Can
> a concurrent unregister on another CPU run between the state load and
> the ->get_proto_dev dereference, transitioning the rep through
> REP_REGISTERED/REP_UNREGISTERED and then storing NULL into rep_ops,
> so the second load observes NULL?
>
> mlx5_esw_offloads_rep_event_unpair() and mlx5_esw_offloads_pair() have
> the same shape:
>
> ops = esw->offloads.rep_ops[rep_type];
> if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
> ops->event)
> ops->event(esw, rep, ...);
>
> Neither path holds reps_lock; they run from the devcom PAIR/UNPAIR
> event handler. Can ops be NULL here once an unregister has cleared
> rep_ops?
>
> If readers are expected to be safe, should they take reps_lock, be
> converted to RCU, or should the NULL clear on unregister be skipped?
>
> [ ... ]
Good point. The unregister path should not key the unload decision off
esw->mode. esw->mode is not the lifetime predicate for representor ops,
especially because mode transitions intentionally drop reps_lock around
auxiliary bus rescans. The actual invariant we need is that unregister
unloads any currently loaded reps before the ops pointer can be cleared
or the ops owner can go away.
I will change mlx5_eswitch_unregister_vport_reps_blocked() to always
call __unload_reps_all_vport(). That helper is already state-gated,
so this is a no-op when no reps are loaded.
With that change, the lockless readers do not get a new live
rep_ops == NULL window. unregister first synchronously unloads the
loaded reps, and those unload callbacks tear down the representor-owned
contexts that can call get_proto_dev() / rep events. Clearing rep_ops
after that removes a stale pointer; keeping it around would be wrong
because the ops memory may be going away.
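To make the intended ordering concrete, here is a compile-time sketch in
plain userspace C. The names (rep state values, unload_rep, unregister_rep)
are hypothetical stand-ins modeled on the mlx5 ones, not the kernel code;
the point is only the ordering: unload synchronously first, clear the ops
pointer last, so no reader can observe state == REP_LOADED together with
ops == NULL.

```c
/* Hypothetical model of the unregister ordering described above.
 * Not the mlx5 code; names are illustrative only. */
#include <stdatomic.h>
#include <stddef.h>

enum rep_state { REP_UNREGISTERED, REP_REGISTERED, REP_LOADED };

struct rep_ops { void (*unload)(void); };

struct rep {
    _Atomic int state;
    const struct rep_ops *ops;
};

static int unload_calls;
static void demo_unload(void) { unload_calls++; }
static const struct rep_ops demo_ops = { .unload = demo_unload };

/* State-gated unload: a no-op unless the rep is actually loaded,
 * mirroring the "already state-gated" helper mentioned above. */
static void unload_rep(struct rep *rep)
{
    if (atomic_load(&rep->state) != REP_LOADED)
        return;
    rep->ops->unload();
    atomic_store(&rep->state, REP_REGISTERED);
}

/* Unregister: always attempt the unload first (gated inside), and only
 * then clear the ops pointer. After unload_rep() returns, the contexts
 * that could dereference ops have been torn down. */
static void unregister_rep(struct rep *rep)
{
    unload_rep(rep);
    atomic_store(&rep->state, REP_UNREGISTERED);
    rep->ops = NULL;
}
```

The invariant this sketch encodes is exactly the one argued for: clearing
ops is safe because it happens strictly after the synchronous unload.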
Mark
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-03 7:51 ` Mark Bloch
@ 2026-05-05 1:21 ` Jakub Kicinski
2026-05-05 2:00 ` Mark Bloch
0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-05 1:21 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On Sun, 3 May 2026 10:51:06 +0300 Mark Bloch wrote:
> On 03/05/2026 4:41, Jakub Kicinski wrote:
> > On Sat, 2 May 2026 23:08:43 +0300 Mark Bloch wrote:
> >> Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
> >> whether the opt-in profile approach is acceptable at all. Regardless
> >> of this last patch, the first 6 patches fix real representor/LAG locking
> >> issues and are needed independently, so I’d like to keep those moving toward
> >> acceptance as soon as possible.
> >
> > For probe-time config module param is probably our only option.
> > I'd obviously prefer to have a devlink-level knob for this, instead
> > of a mlx5 specific one. Can we come up with some format that'd apply
> > more broadly? devlink=[$bdf:]flag1 ? so devlink=[$bdf:]switchdev-mode ?
>
> I’m not convinced this is really a generic devlink knob problem.
I'm surprised you say that. Anyone using switchdev mode could benefit.
Probing in one mode and then switching to another adds to boot time.
Whether it's a DPU or not is quite secondary.
Unless there's a deeper reason that makes the DPU incapable of running
in non-switchdev mode. But AFAICT that doesn't square with the code you
posted.
> A device should probe in its selected/default configuration. For DPU
> deployments switchdev is the expected operating mode. mlx5 just made the
> wrong default choice historically, and this profile is a way to move away
> from that without forcing it on everyone at once. I expect/hope to move
> quickly from this flag to simply making switchdev the driver default for
> all DPU configs.
>
> A generic cmdline format also gets complicated quickly: vendor-specific
> flags, ordering/dependencies between flags, hotplug timing, and whether a
> BDF rule should apply when a device is passed into a VM after boot.
> Userspace scripts are probably better for that kind of policy because
> they can carry real site-specific logic.
>
> I’ll drop this last patch from the series for now so the representor/LAG
> locking fixes can move independently and we can continue the default
> switchdev discussion separately. I can always submit that as a standalone
> patch later in the cycle if needed.
SG
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-05 1:21 ` Jakub Kicinski
@ 2026-05-05 2:00 ` Mark Bloch
2026-05-05 2:19 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Mark Bloch @ 2026-05-05 2:00 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On 05/05/2026 4:21, Jakub Kicinski wrote:
> On Sun, 3 May 2026 10:51:06 +0300 Mark Bloch wrote:
>> On 03/05/2026 4:41, Jakub Kicinski wrote:
>>> On Sat, 2 May 2026 23:08:43 +0300 Mark Bloch wrote:
>>>> Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
>>>> whether the opt-in profile approach is acceptable at all. Regardless
>>>> of this last patch, the first 6 patches fix real representor/LAG locking
>>>> issues and are needed independently, so I’d like to keep those moving toward
>>>> acceptance as soon as possible.
>>>
>>> For probe-time config module param is probably our only option.
>>> I'd obviously prefer to have a devlink-level knob for this, instead
>>> of a mlx5 specific one. Can we come up with some format that'd apply
>>> more broadly? devlink=[$bdf:]flag1 ? so devlink=[$bdf:]switchdev-mode ?
>>
>> I’m not convinced this is really a generic devlink knob problem.
>
> I'm surprised you say that. Anyone using switchdev mode could benefit.
> Probing in one mode and then switching to another adds to boot time.
> Whether it's a DPU or not is quite secondary.
>
> Unless there's a deeper reason that makes the DPU incapable of running
> in non-switchdev mode. But AFAICT that doesn't square with the code you
> posted.
No, there is no deeper DPU limitation. The device can probe in
non-switchdev mode; this is only about the desired default for those
deployments, and about avoiding the extra boot-time cost of probing in
one mode and then switching to another.
What I meant is that I am wary of putting too much policy into the kernel
command line. A generic devlink-level switchdev probe-mode knob sounds
reasonable to me if we keep the scope narrow. More complex policy, such as
changing multiple defaults, still seems better handled by userspace.
Would adding only switchdev/switchdev_inactive for now be acceptable?
I will try to keep the code generic enough so it can be extended later if
we want.
Let's continue with v3 as posted and please give me a few days to put
together an RFC for the devlink part.
Mark
>
>> A device should probe in its selected/default configuration. For DPU
>> deployments switchdev is the expected operating mode. mlx5 just made the
>> wrong default choice historically, and this profile is a way to move away
>> from that without forcing it on everyone at once. I expect/hope to move
>> quickly from this flag to simply making switchdev the driver default for
>> all DPU configs.
>>
>> A generic cmdline format also gets complicated quickly: vendor-specific
>> flags, ordering/dependencies between flags, hotplug timing, and whether a
>> BDF rule should apply when a device is passed into a VM after boot.
>> Userspace scripts are probably better for that kind of policy because
>> they can carry real site-specific logic.
>>
>> I’ll drop this last patch from the series for now so the representor/LAG
>> locking fixes can move independently and we can continue the default
>> switchdev discussion separately. I can always submit that as a standalone
>> patch later in the cycle if needed.
>
> SG
* Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
2026-05-05 2:00 ` Mark Bloch
@ 2026-05-05 2:19 ` Jakub Kicinski
0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-05 2:19 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Leon Romanovsky, Jason Gunthorpe, Saeed Mahameed,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Moshe Shemesh, Kees Cook,
Patrisious Haddad, Parav Pandit, Carolina Jubran, Cosmin Ratiu,
linux-rdma, linux-kernel, netdev, Gal Pressman, Dragos Tatulea
On Tue, 5 May 2026 05:00:15 +0300 Mark Bloch wrote:
> What I meant is that I am wary of putting too much policy into the kernel
> command line. A generic devlink level switchdev probe mode knob sounds
> reasonable to me if we keep the scope narrow. More complex policy, such as
> changing multiple defaults still seems better handled by userspace.
>
> Would adding only switchdev/switchdev_inactive for now be acceptable?
> I will try to keep the code generic enough so it can be extended later if
> we want.
I wanted the format to be reasonably generic, but yes, just for making
future extensions possible. I don't expect us to add anything beyond
the switchdev flag at this point. We don't have to implement the device
matching either. Just a comment in the code describing what we expect the
full format to look like would be enough, for whoever needs it in the future.
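Purely as a hypothetical sketch of the format under discussion
(devlink=[$bdf:]switchdev) and what such a comment-plus-parser might look
like: none of these identifiers exist in devlink or mlx5 today, and the
real implementation would live in kernel parameter handling, not userspace C.

```c
/* Hypothetical boot-parameter format:
 *   devlink=[<bdf>:]switchdev
 * e.g. "switchdev" (applies to every device) or
 *      "0000:08:00.0:switchdev" (applies to one PCI device).
 * Only the "switchdev" flag is defined; the BDF prefix is reserved
 * for future device matching. All names here are illustrative. */
#include <stdbool.h>
#include <string.h>

struct devlink_boot_opt {
    char bdf[16];    /* empty string: applies to every device */
    bool switchdev;
};

/* Parse "[<bdf>:]switchdev". A BDF ("dddd:bb:dd.f") itself contains
 * colons, so the flag is whatever follows the last ':', if any. */
static bool parse_devlink_opt(const char *arg, struct devlink_boot_opt *opt)
{
    const char *flag = strrchr(arg, ':');

    memset(opt, 0, sizeof(*opt));
    if (!flag) {
        flag = arg;                     /* no BDF prefix */
    } else {
        size_t len = (size_t)(flag - arg);
        if (len >= sizeof(opt->bdf))
            return false;
        memcpy(opt->bdf, arg, len);
        opt->bdf[len] = '\0';
        flag++;
    }
    if (strcmp(flag, "switchdev") != 0)
        return false;                   /* only flag defined so far */
    opt->switchdev = true;
    return true;
}
```

The sketch keeps the narrow scope discussed above: one flag now, with the
BDF prefix documented so future extensions don't need a format change.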
end of thread, other threads:[~2026-05-05 2:20 UTC | newest]
Thread overview: 22+ messages
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
2026-05-02 20:04 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
2026-05-02 20:05 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
2026-05-03 8:18 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
2026-05-02 20:06 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-02 20:07 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
2026-05-03 8:01 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-05-02 20:08 ` Mark Bloch
2026-05-03 1:41 ` Jakub Kicinski
2026-05-03 7:51 ` Mark Bloch
2026-05-05 1:21 ` Jakub Kicinski
2026-05-05 2:00 ` Mark Bloch
2026-05-05 2:19 ` Jakub Kicinski