From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
<netdev@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>
Subject: [PATCH net-next V2 0/3] net/mlx5: Fix E-Switch work queue deadlock with devlink lock
Date: Tue, 28 Apr 2026 08:10:14 +0300 [thread overview]
Message-ID: <20260428051018.219093-1-tariqt@nvidia.com> (raw)
Hi,
See detailed description by Mark below [1].
Regards,
Tariq
[1]
mlx5_eswitch_cleanup() calls destroy_workqueue() while holding the
devlink lock through mlx5_uninit_one(). E-Switch workqueue workers also
need the devlink lock, but previously took it before checking whether
their work item was stale. Cleanup can therefore wait for a worker that
is blocked on the same devlink lock.
Mode changes have the same ordering hazard: the mode-change path holds
devlink lock while tearing down the current mode, and old work may still
be pending on the E-Switch workqueue.
Fix this by making esw_wq_handler() check the generation counter before
attempting to take devlink lock. The worker uses devl_trylock(); if the
lock is busy and the work is still current, it sleeps on an E-Switch wait
queue with a short timeout. Invalidation increments the generation
counter and wakes the wait queue, so stale workers exit without spinning
or blocking cleanup.
The generation counter already existed but was buried in
mlx5_esw_functions and only covered function-change events. The three
patches get from there to the fix in small steps.
Patch 1 moves the counter up to mlx5_eswitch. Pure refactor,
no behavior change.
Patch 2 cleans up the work queue plumbing: factors out the repeated
lock/check/dispatch boilerplate into a single esw_wq_handler() and
adds mlx5_esw_add_work() as the one place to enqueue work.
Patch 3 is the actual fix: check the generation before the lock, use
devl_trylock() instead of devl_lock(), add a wait queue so lock retries
do not spin, and invalidate pending work at the earliest safe operation
boundary. Cleanup invalidates before destroy_workqueue(), and mode
teardown unregisters the work-producing notifiers before invalidating so
new notifier work cannot capture the new generation.
V2:
Split out from a larger series. The representor lifecycle improvements
to be sent separately.
Patch 3:
- Move generation invalidation after notifier unregister but before
teardown, so old work is discarded early without allowing new notifier
work to use the new generation.
- Replace cond_resched() polling with a wait queue to avoid CPU spinning
while devlink lock is held by a long operation.
Link to V1:
https://lore.kernel.org/all/20260409115550.156419-1-tariqt@nvidia.com/
Mark Bloch (3):
net/mlx5: E-Switch, move work queue generation counter
net/mlx5: E-Switch, introduce generic work queue dispatch helper
net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 20 +++-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 5 +-
.../mellanox/mlx5/core/eswitch_offloads.c | 96 ++++++++++++-------
3 files changed, 82 insertions(+), 39 deletions(-)
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.44.0
next reply other threads:[~2026-04-28 5:11 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-28 5:10 Tariq Toukan [this message]
2026-04-28 5:10 ` [PATCH net-next V2 1/3] net/mlx5: E-Switch, move work queue generation counter Tariq Toukan
2026-04-28 5:10 ` [PATCH net-next V2 2/3] net/mlx5: E-Switch, introduce generic work queue dispatch helper Tariq Toukan
2026-04-28 5:10 ` [PATCH net-next V2 3/3] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq Tariq Toukan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260428051018.219093-1-tariqt@nvidia.com \
--to=tariqt@nvidia.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=saeedm@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox