From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
Edward Srouji <edwards@nvidia.com>,
Maher Sanalla <msanalla@nvidia.com>,
Simon Horman <horms@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
Kees Cook <kees@kernel.org>,
Patrisious Haddad <phaddad@nvidia.com>,
Gerd Bayer <gbayer@linux.ibm.com>,
Parav Pandit <parav@nvidia.com>, Cosmin Ratiu <cratiu@nvidia.com>,
Carolina Jubran <cjubran@nvidia.com>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>
Subject: [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock
Date: Thu, 9 Apr 2026 14:55:43 +0300 [thread overview]
Message-ID: <20260409115550.156419-1-tariqt@nvidia.com> (raw)
Hi,
See detailed description by Mark below [1].
Regards,
Tariq
[1]
This series addresses three problems that have been present for years.
First, there is no coordination between E-Switch reconfiguration and
representor registration. The E-Switch can be mid-way through a mode
change or VF count update while mlx5_ib walks in and registers or
unregisters representors. Nothing stops them. The race window is small
and there is no field report, but it is clearly wrong.
A mutex is not the answer. The representor callbacks reach into RDMA,
netdev, and LAG layers that already hold their own locks, making a
new mutex in the E-Switch layer a deadlock waiting to happen.
Second, the E-Switch work queue has a deadlock of its own.
mlx5_eswitch_cleanup() drains the work queue while holding the devlink
lock. Workers on that queue acquire devlink lock before checking whether
their work is still relevant. They block. The cleanup path waits for
them to finish. Deadlock.
Third, loading mlx5_ib while the device is already in switchdev mode
does not bring up the IB representors. This has been broken for years.
mlx5_eswitch_register_vport_reps() only stores callbacks; nobody
triggers the actual load after registration.
For the work queue deadlock: introduce a generation counter in the
top-level mlx5_eswitch struct (moved from mlx5_esw_functions,
which only covered function-change events) and a generic dispatch helper
mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter
before touching the devlink lock using devl_trylock() in a loop. Stale
work exits immediately without ever contending. The counter is
incremented at every E-Switch operation boundary: cleanup, disable,
mode-set, enable, disable_sriov.
For the registration race: a simple atomic block state guards all
reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock()
spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration
path (mode set, enable, disable, VF/SF add/del, LAG reload, and the
register/unregister calls themselves) brackets its work with this guard.
No new locks, no deadlock risk.
For the missing IB representors: now that the work queue infrastructure
is in place, mlx5_eswitch_register_vport_reps() queues a work item that
acquires the devlink lock and loads all relevant representors. This is
the change that actually fixes the long-standing bug.
One thing worth calling out: the block guard is non-reentrant. A caller
that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already
BLOCKED will spin forever. All call sites were audited:
- mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around
low-level vport helpers that do not call register/unregister.
- Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run
while BLOCKED is held. The one callback that calls unregister
(mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on
peer E-Switch instances, each with its own independent atomic.
- mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls
esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases
BLOCKED before calling rescan_drivers so that the probe/remove
callbacks that trigger register/unregister see UNBLOCKED.
esw_mode_change re-acquires before returning, and mode_set releases
at the end. This is an explicit hand-off of the guard across the
rescan window.
- mlx5_eswitch_register_vport_reps holds BLOCKED only while storing
callbacks and queuing the load work. The actual rep loading runs from
the work queue after the guard is released.
Patch 1 is cleanup. LAG and MPESW had the same representor reload
sequence duplicated in several places and the copies had started to
drift. This consolidates them into one helper.
Patches 2-4 fix the work queue deadlock in three steps: first move the
generation counter from mlx5_esw_functions to mlx5_eswitch;
then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch
infrastructure; then apply the actual fix by switching to devl_trylock
and adding generation increments at all operation boundaries.
Patch 5 adds the atomic block guard for representor registration,
protecting all reconfiguration paths.
Patch 6 moves the representor load triggered by
mlx5_eswitch_register_vport_reps() onto the work queue. This is the
patch that fixes IB representors not coming up when mlx5_ib is loaded
while the device is already in switchdev mode.
Patch 7 adds a driver profile that auto-enables switchdev at device
init, for deployments that always operate in switchdev mode and want
to avoid a manual devlink command after every probe.
Mark Bloch (7):
net/mlx5: Lag: refactor representor reload handling
net/mlx5: E-Switch, move work queue generation counter
net/mlx5: E-Switch, introduce generic work queue dispatch helper
net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
net/mlx5: E-Switch, block representors during reconfiguration
net/mlx5: E-switch, load reps via work queue after registration
net/mlx5: Add profile to auto-enable switchdev mode at device init
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 25 ++-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 15 +-
.../mellanox/mlx5/core/eswitch_offloads.c | 204 ++++++++++++++----
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 46 ++--
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 1 +
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 12 +-
.../net/ethernet/mellanox/mlx5/core/main.c | 26 ++-
.../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 +
include/linux/mlx5/driver.h | 1 +
include/linux/mlx5/eswitch.h | 5 +
10 files changed, 267 insertions(+), 73 deletions(-)
base-commit: 9700282a7ec721e285771d995ccfe33845e776dc
--
2.44.0
next reply other threads:[~2026-04-09 11:56 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-09 11:55 Tariq Toukan [this message]
2026-04-09 11:55 ` [PATCH net-next 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-04-09 17:57 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter Tariq Toukan
2026-04-09 17:58 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 3/7] net/mlx5: E-Switch, introduce generic work queue dispatch helper Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq Tariq Toukan
2026-04-09 18:01 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 5/7] net/mlx5: E-Switch, block representors during reconfiguration Tariq Toukan
2026-04-09 18:02 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-04-09 18:02 ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-04-09 18:02 ` Mark Bloch
2026-04-09 18:20 ` [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Mark Bloch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260409115550.156419-1-tariqt@nvidia.com \
--to=tariqt@nvidia.com \
--cc=andrew+netdev@lunn.ch \
--cc=cjubran@nvidia.com \
--cc=cratiu@nvidia.com \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=edumazet@google.com \
--cc=edwards@nvidia.com \
--cc=gal@nvidia.com \
--cc=gbayer@linux.ibm.com \
--cc=horms@kernel.org \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=moshe@nvidia.com \
--cc=msanalla@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=ohartoov@nvidia.com \
--cc=pabeni@redhat.com \
--cc=parav@nvidia.com \
--cc=phaddad@nvidia.com \
--cc=saeedm@nvidia.com \
--cc=shayd@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox