From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
Edward Srouji <edwards@nvidia.com>,
Maher Sanalla <msanalla@nvidia.com>,
Simon Horman <horms@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
Kees Cook <kees@kernel.org>,
Patrisious Haddad <phaddad@nvidia.com>,
Gerd Bayer <gbayer@linux.ibm.com>,
Parav Pandit <parav@nvidia.com>, Cosmin Ratiu <cratiu@nvidia.com>,
Carolina Jubran <cjubran@nvidia.com>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>
Subject: [PATCH net-next 5/7] net/mlx5: E-Switch, block representors during reconfiguration
Date: Thu, 9 Apr 2026 14:55:48 +0300
Message-ID: <20260409115550.156419-6-tariqt@nvidia.com>
In-Reply-To: <20260409115550.156419-1-tariqt@nvidia.com>
From: Mark Bloch <mbloch@nvidia.com>
Introduce a simple atomic block state via mlx5_esw_reps_block() and
mlx5_esw_reps_unblock(). Internally, mlx5_esw_mark_reps() spins on a
cmpxchg until it transitions the state between UNBLOCKED and BLOCKED.
All E-Switch reconfiguration paths (mode set, enable, disable, VF/SF
add/del, LAG reload) now bracket their work with this guard, so
representor changes cannot race with an in-flight E-Switch update,
without introducing any new lock.
A spinlock is not an option because the protected work can sleep (RDMA
ops, devcom, netdev callbacks). A mutex does not work either:
esw_mode_change() must drop the guard mid-operation so that
mlx5_rescan_drivers_locked() can reload mlx5_ib, which calls back into
mlx5_eswitch_register_vport_reps() on the same thread. Beyond that, any
conventional lock would create an ABBA cycle: the LAG code holds the
LAG lock when it calls reps_block(), while mlx5_ib holds RDMA locks
when it calls register_vport_reps(), and the two subsystems call into
each other. The atomic CAS loop avoids all of this: no lock ordering,
no sleep restrictions, and the owner can drop the guard, let a nested
caller win the next transition, and reclaim it afterwards.
With this infrastructure in place, downstream patches can safely tie
representor load/unload to the mlx5_ib module's lifecycle. Loading
mlx5_ib while the device is in switchdev mode has failed to bring up
the IB representors for years; those patches will finally fix that.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 13 ++++
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 6 ++
.../mellanox/mlx5/core/eswitch_offloads.c | 77 +++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 2 +
.../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 ++
include/linux/mlx5/eswitch.h | 5 ++
6 files changed, 100 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d315484390c8..a7701c9d776a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1700,6 +1700,7 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
mlx5_lag_disable_change(esw->dev);
atomic_inc(&esw->generation);
+ mlx5_esw_reps_block(esw);
if (!mlx5_esw_is_fdb_created(esw)) {
ret = mlx5_eswitch_enable_locked(esw, num_vfs);
@@ -1723,6 +1724,8 @@ int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int num_vfs)
}
}
+ mlx5_esw_reps_unblock(esw);
+
if (toggle_lag)
mlx5_lag_enable_change(esw->dev);
@@ -1747,6 +1750,8 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
atomic_inc(&esw->generation);
+ mlx5_esw_reps_block(esw);
+
if (!mlx5_core_is_ecpf(esw->dev)) {
mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
if (clear_vf)
@@ -1757,6 +1762,8 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
mlx5_eswitch_clear_ec_vf_vports_info(esw);
}
+ mlx5_esw_reps_unblock(esw);
+
if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
struct devlink *devlink = priv_to_devlink(esw->dev);
@@ -1812,7 +1819,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
devl_assert_locked(priv_to_devlink(esw->dev));
atomic_inc(&esw->generation);
mlx5_lag_disable_change(esw->dev);
+
+ mlx5_esw_reps_block(esw);
mlx5_eswitch_disable_locked(esw);
+ mlx5_esw_reps_unblock(esw);
+
esw->mode = MLX5_ESWITCH_LEGACY;
mlx5_lag_enable_change(esw->dev);
}
@@ -2075,6 +2086,8 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
init_rwsem(&esw->mode_lock);
refcount_set(&esw->qos.refcnt, 0);
atomic_set(&esw->generation, 0);
+ atomic_set(&esw->offloads.reps_conf_state,
+ MLX5_ESW_OFFLOADS_REP_TYPE_UNBLOCKED);
esw->enabled_vports = 0;
esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index e3ab8a30c174..256ac3ad37bc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -315,6 +315,7 @@ struct mlx5_esw_offload {
DECLARE_HASHTABLE(termtbl_tbl, 8);
struct mutex termtbl_mutex; /* protects termtbl hash */
struct xarray vhca_map;
+ atomic_t reps_conf_state;
const struct mlx5_eswitch_rep_ops *rep_ops[NUM_REP_TYPES];
u8 inline_mode;
atomic64_t num_flows;
@@ -949,6 +950,8 @@ mlx5_esw_lag_demux_fg_create(struct mlx5_eswitch *esw,
struct mlx5_flow_handle *
mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
struct mlx5_flow_table *lag_ft);
+void mlx5_esw_reps_block(struct mlx5_eswitch *esw);
+void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw);
#else /* CONFIG_MLX5_ESWITCH */
/* eswitch API stubs */
static inline int mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
@@ -1026,6 +1029,9 @@ mlx5_esw_host_functions_enabled(const struct mlx5_core_dev *dev)
return true;
}
+static inline void mlx5_esw_reps_block(struct mlx5_eswitch *esw) {}
+static inline void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw) {}
+
static inline bool
mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 988595e1b425..4b626ffcfa8e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -2410,23 +2410,56 @@ static int esw_create_restore_table(struct mlx5_eswitch *esw)
return err;
}
+static void mlx5_esw_assert_reps_blocked(struct mlx5_eswitch *esw)
+{
+ if (atomic_read(&esw->offloads.reps_conf_state) ==
+ MLX5_ESW_OFFLOADS_REP_TYPE_BLOCKED)
+ return;
+
+ esw_warn(esw->dev, "reps state machine violated: expected BLOCKED\n");
+}
+
+static void mlx5_esw_mark_reps(struct mlx5_eswitch *esw,
+ enum mlx5_esw_offloads_rep_type_state old,
+ enum mlx5_esw_offloads_rep_type_state new)
+{
+ atomic_t *reps_conf_state = &esw->offloads.reps_conf_state;
+
+ do {
+ atomic_cond_read_relaxed(reps_conf_state, VAL == old);
+ } while (atomic_cmpxchg(reps_conf_state, old, new) != old);
+}
+
+void mlx5_esw_reps_block(struct mlx5_eswitch *esw)
+{
+ mlx5_esw_mark_reps(esw, MLX5_ESW_OFFLOADS_REP_TYPE_UNBLOCKED,
+ MLX5_ESW_OFFLOADS_REP_TYPE_BLOCKED);
+}
+
+void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
+{
+ mlx5_esw_mark_reps(esw, MLX5_ESW_OFFLOADS_REP_TYPE_BLOCKED,
+ MLX5_ESW_OFFLOADS_REP_TYPE_UNBLOCKED);
+}
+
static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
{
+ mlx5_esw_reps_unblock(esw);
mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
mlx5_core_mp_enabled(esw->dev)) {
esw->mode = mode;
- mlx5_rescan_drivers_locked(esw->dev);
- mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
- return;
+ goto out;
}
esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
mlx5_rescan_drivers_locked(esw->dev);
esw->mode = mode;
esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
+out:
mlx5_rescan_drivers_locked(esw->dev);
mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
+ mlx5_esw_reps_block(esw);
}
static void mlx5_esw_fdb_drop_destroy(struct mlx5_eswitch *esw)
@@ -2761,6 +2794,8 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep, u8 rep_type)
{
+ mlx5_esw_assert_reps_blocked(esw);
+
if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
REP_REGISTERED, REP_LOADED) == REP_REGISTERED)
return esw->offloads.rep_ops[rep_type]->load(esw->dev, rep);
@@ -2771,6 +2806,8 @@ static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep, u8 rep_type)
{
+ mlx5_esw_assert_reps_blocked(esw);
+
if (atomic_cmpxchg(&rep->rep_data[rep_type].state,
REP_LOADED, REP_REGISTERED) == REP_LOADED) {
if (rep_type == REP_ETH)
@@ -3673,6 +3710,7 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
if (new_num_vfs == esw->esw_funcs.num_vfs || host_pf_disabled)
goto free;
+ mlx5_esw_reps_block(esw);
/* Number of VFs can only change from "0 to x" or "x to 0". */
if (esw->esw_funcs.num_vfs > 0) {
mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
@@ -3682,9 +3720,11 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
err = mlx5_eswitch_load_vf_vports(esw, new_num_vfs,
MLX5_VPORT_UC_ADDR_CHANGE);
if (err)
- goto free;
+ goto unblock;
}
esw->esw_funcs.num_vfs = new_num_vfs;
+unblock:
+ mlx5_esw_reps_unblock(esw);
free:
kvfree(out);
}
@@ -4164,6 +4204,7 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
goto unlock;
}
+ mlx5_esw_reps_block(esw);
esw->eswitch_operation_in_progress = true;
up_write(&esw->mode_lock);
@@ -4203,6 +4244,7 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
mlx5_devlink_netdev_netns_immutable_set(devlink, false);
down_write(&esw->mode_lock);
esw->eswitch_operation_in_progress = false;
+ mlx5_esw_reps_unblock(esw);
unlock:
mlx5_esw_unlock(esw);
enable_lag:
@@ -4474,9 +4516,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
return true;
}
-void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
- const struct mlx5_eswitch_rep_ops *ops,
- u8 rep_type)
+static void
+mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type)
{
struct mlx5_eswitch_rep_data *rep_data;
struct mlx5_eswitch_rep *rep;
@@ -4491,9 +4534,20 @@ void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
}
}
}
+
+void mlx5_eswitch_register_vport_reps(struct mlx5_eswitch *esw,
+ const struct mlx5_eswitch_rep_ops *ops,
+ u8 rep_type)
+{
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_register_vport_reps_blocked(esw, ops, rep_type);
+ mlx5_esw_reps_unblock(esw);
+}
EXPORT_SYMBOL(mlx5_eswitch_register_vport_reps);
-void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
+static void
+mlx5_eswitch_unregister_vport_reps_blocked(struct mlx5_eswitch *esw,
+ u8 rep_type)
{
struct mlx5_eswitch_rep *rep;
unsigned long i;
@@ -4504,6 +4558,13 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
mlx5_esw_for_each_rep(esw, i, rep)
atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
}
+
+void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
+{
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_unregister_vport_reps_blocked(esw, rep_type);
+ mlx5_esw_reps_unblock(esw);
+}
EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_reps);
void *mlx5_eswitch_get_uplink_priv(struct mlx5_eswitch *esw, u8 rep_type)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index c402a8463081..ff2e6f6caa0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1105,7 +1105,9 @@ int mlx5_lag_reload_ib_reps(struct mlx5_lag *ldev, u32 flags)
struct mlx5_eswitch *esw;
esw = pf->dev->priv.eswitch;
+ mlx5_esw_reps_block(esw);
ret = mlx5_eswitch_reload_ib_reps(esw);
+ mlx5_esw_reps_unblock(esw);
if (ret)
return ret;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
index 8503e532f423..2fc69897e35b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -245,8 +245,10 @@ static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
if (IS_ERR(sf))
return PTR_ERR(sf);
+ mlx5_esw_reps_block(esw);
err = mlx5_eswitch_load_sf_vport(esw, sf->hw_fn_id, MLX5_VPORT_UC_ADDR_CHANGE,
&sf->dl_port, new_attr->controller, new_attr->sfnum);
+ mlx5_esw_reps_unblock(esw);
if (err)
goto esw_err;
*dl_port = &sf->dl_port.dl_port;
@@ -367,7 +369,10 @@ int mlx5_devlink_sf_port_del(struct devlink *devlink,
struct mlx5_sf_table *table = dev->priv.sf_table;
struct mlx5_sf *sf = mlx5_sf_by_dl_port(dl_port);
+ mlx5_esw_reps_block(dev->priv.eswitch);
mlx5_sf_del(table, sf);
+ mlx5_esw_reps_unblock(dev->priv.eswitch);
+
return 0;
}
diff --git a/include/linux/mlx5/eswitch.h b/include/linux/mlx5/eswitch.h
index 67256e776566..786b1ea83843 100644
--- a/include/linux/mlx5/eswitch.h
+++ b/include/linux/mlx5/eswitch.h
@@ -29,6 +29,11 @@ enum {
REP_LOADED,
};
+enum mlx5_esw_offloads_rep_type_state {
+ MLX5_ESW_OFFLOADS_REP_TYPE_UNBLOCKED,
+ MLX5_ESW_OFFLOADS_REP_TYPE_BLOCKED,
+};
+
enum mlx5_switchdev_event {
MLX5_SWITCHDEV_EVENT_PAIR,
MLX5_SWITCHDEV_EVENT_UNPAIR,
--
2.44.0
2026-04-09 11:55 [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-04-09 17:57 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter Tariq Toukan
2026-04-09 17:58 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 3/7] net/mlx5: E-Switch, introduce generic work queue dispatch helper Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq Tariq Toukan
2026-04-09 18:01 ` Mark Bloch
2026-04-09 11:55 ` Tariq Toukan [this message]
2026-04-09 18:02 ` [PATCH net-next 5/7] net/mlx5: E-Switch, block representors during reconfiguration Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-04-09 18:02 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-04-09 18:02 ` Mark Bloch
2026-04-09 18:20 ` [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Mark Bloch