public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Mark Bloch <mbloch@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Shay Drory <shayd@nvidia.com>,
	Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	Simon Horman <horms@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
	Kees Cook <kees@kernel.org>,
	Patrisious Haddad <phaddad@nvidia.com>,
	Gerd Bayer <gbayer@linux.ibm.com>,
	Parav Pandit <parav@nvidia.com>, Cosmin Ratiu <cratiu@nvidia.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Gal Pressman <gal@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock
Date: Thu, 9 Apr 2026 21:20:45 +0300	[thread overview]
Message-ID: <e4708c37-ccd3-466f-b08c-653d241260dd@nvidia.com> (raw)
In-Reply-To: <20260409115550.156419-1-tariqt@nvidia.com>



On 09/04/2026 14:55, Tariq Toukan wrote:
> Hi,
> 
> See detailed description by Mark below [1].
> 

Sashiko flagged a few issues in the series. The main ones are in patch 4
([PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq).
These are known, and I currently consider them acceptable trade-offs rather than bugs.

That said, reviewers/maintainers may reasonably see this differently.
Before sending a v2 focused on patch 4, I’d appreciate feedback on the overall
approach and direction of the series, please see sashiko's comments here:
https://sashiko.dev/#/patchset/20260409115550.156419-1-tariqt%40nvidia.com

I've provided my commetns on each patch on the mailing list and
included what was found by sashiko.

the patch series can be broken into 3:

[patches 2/3/4] – workqueue deadlock:

During teardown we must ensure all pending work is completed.
However, since the teardown path already holds the devlink lock,
we cannot have work items blocking on that same lock without
risking a deadlock.

[patches 5/6] – reps block/unblock state
The interaction between mlx5_core and mlx5_ib (load/unload),
eswitch mode transitions, and auxiliary device handling makes
this particularly tricky.

Several conventional locking/synchronization approaches were
explored, but they either reintroduced deadlocks or created
even more complex issues. The current approach is admittedly
not the cleanest, but it has proven to behave correctly in practice.

[patch 7] – switchdev by default
The long-term goal is to have switchdev as the default mode
for DPU environments. Flipping that behavior globally in one
step is risky.

This patch provides a controlled, incremental way to move in
that direction, allowing validation in real deployments before
making it the default.

Mark

> Regards,
> Tariq
> 
> [1]
> This series addresses three problems that have been present for years.
> First, there is no coordination between E-Switch reconfiguration and
> representor registration. The E-Switch can be mid-way through a mode
> change or VF count update while mlx5_ib walks in and registers or
> unregisters representors. Nothing stops them. The race window is small
> and there is no field report, but it is clearly wrong.
> 
> A mutex is not the answer. The representor callbacks reach into RDMA,
> netdev, and LAG layers that already hold their own locks, making a
> new mutex in the E-Switch layer a deadlock waiting to happen.
> 
> Second, the E-Switch work queue has a deadlock of its own.
> mlx5_eswitch_cleanup() drains the work queue while holding the devlink
> lock. Workers on that queue acquire devlink lock before checking whether
> their work is still relevant. They block. The cleanup path waits for
> them to finish. Deadlock.
> 
> Third, loading mlx5_ib while the device is already in switchdev mode
> does not bring up the IB representors. This has been broken for years.
> mlx5_eswitch_register_vport_reps() only stores callbacks; nobody
> triggers the actual load after registration.
> 
> For the work queue deadlock: introduce a generation counter in the
> top-level mlx5_eswitch struct (moved from mlx5_esw_functions,
> which only covered function-change events) and a generic dispatch helper
> mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter
> before touching the devlink lock using devl_trylock() in a loop. Stale
> work exits immediately without ever contending. The counter is
> incremented at every E-Switch operation boundary: cleanup, disable,
> mode-set, enable, disable_sriov.
> 
> For the registration race: a simple atomic block state guards all
> reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock()
> spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration
> path (mode set, enable, disable, VF/SF add/del, LAG reload, and the
> register/unregister calls themselves) brackets its work with this guard.
> No new locks, no deadlock risk.
> 
> For the missing IB representors: now that the work queue infrastructure
> is in place, mlx5_eswitch_register_vport_reps() queues a work item that
> acquires the devlink lock and loads all relevant representors. This is
> the change that actually fixes the long-standing bug.
> 
> One thing worth calling out: the block guard is non-reentrant. A caller
> that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already
> BLOCKED will spin forever. All call sites were audited:
> 
>  - mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around
>    low-level vport helpers that do not call register/unregister.
> 
>  - Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run
>    while BLOCKED is held. The one callback that calls unregister
>    (mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on
>    peer E-Switch instances, each with its own independent atomic.
> 
>  - mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls
>    esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases
>    BLOCKED before calling rescan_drivers so that the probe/remove
>    callbacks that trigger register/unregister see UNBLOCKED.
>    esw_mode_change re-acquires before returning, and mode_set releases
>    at the end. This is an explicit hand-off of the guard across the
>    rescan window.
> 
>  - mlx5_eswitch_register_vport_reps holds BLOCKED only while storing
>    callbacks and queuing the load work. The actual rep loading runs from
>    the work queue after the guard is released.
> 
> Patch 1 is cleanup. LAG and MPESW had the same representor reload
> sequence duplicated in several places and the copies had started to
> drift. This consolidates them into one helper.
> 
> Patches 2-4 fix the work queue deadlock in three steps: first move the
> generation counter from mlx5_esw_functions to mlx5_eswitch;
> then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch
> infrastructure; then apply the actual fix by switching to devl_trylock
> and adding generation increments at all operation boundaries.
> 
> Patch 5 adds the atomic block guard for representor registration,
> protecting all reconfiguration paths.
> 
> Patch 6 moves the representor load triggered by
> mlx5_eswitch_register_vport_reps() onto the work queue. This is the
> patch that fixes IB representors not coming up when mlx5_ib is loaded
> while the device is already in switchdev mode.
> 
> Patch 7 adds a driver profile that auto-enables switchdev at device
> init, for deployments that always operate in switchdev mode and want
> to avoid a manual devlink command after every probe.
> 
> Mark Bloch (7):
>   net/mlx5: Lag: refactor representor reload handling
>   net/mlx5: E-Switch, move work queue generation counter
>   net/mlx5: E-Switch, introduce generic work queue dispatch helper
>   net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
>   net/mlx5: E-Switch, block representors during reconfiguration
>   net/mlx5: E-switch, load reps via work queue after registration
>   net/mlx5: Add profile to auto-enable switchdev mode at device init
> 
>  .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++-
>  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  15 +-
>  .../mellanox/mlx5/core/eswitch_offloads.c     | 204 ++++++++++++++----
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.c |  46 ++--
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.h |   1 +
>  .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  12 +-
>  .../net/ethernet/mellanox/mlx5/core/main.c    |  26 ++-
>  .../ethernet/mellanox/mlx5/core/sf/devlink.c  |   5 +
>  include/linux/mlx5/driver.h                   |   1 +
>  include/linux/mlx5/eswitch.h                  |   5 +
>  10 files changed, 267 insertions(+), 73 deletions(-)
> 
> 
> base-commit: 9700282a7ec721e285771d995ccfe33845e776dc


      parent reply	other threads:[~2026-04-09 18:20 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 11:55 [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-04-09 17:57   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter Tariq Toukan
2026-04-09 17:58   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 3/7] net/mlx5: E-Switch, introduce generic work queue dispatch helper Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq Tariq Toukan
2026-04-09 18:01   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 5/7] net/mlx5: E-Switch, block representors during reconfiguration Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 18:20 ` Mark Bloch [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e4708c37-ccd3-466f-b08c-653d241260dd@nvidia.com \
    --to=mbloch@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=cjubran@nvidia.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=gbayer@linux.ibm.com \
    --cc=horms@kernel.org \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox