public inbox for netdev@vger.kernel.org
From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Leon Romanovsky <leon@kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>,
	"Saeed Mahameed" <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Simon Horman <horms@kernel.org>,
	Maher Sanalla <msanalla@nvidia.com>,
	Parav Pandit <parav@nvidia.com>,
	Patrisious Haddad <phaddad@nvidia.com>,
	Kees Cook <kees@kernel.org>, Gerd Bayer <gbayer@linux.ibm.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	Cosmin Ratiu <cratiu@nvidia.com>, <linux-rdma@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>,
	Gal Pressman <gal@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: [PATCH net-next V3 0/7] net/mlx5: Improve representor lifecycle and late IB representor loading
Date: Sun, 3 May 2026 23:27:19 +0300
Message-ID: <20260503202726.266415-1-tariqt@nvidia.com>

Hi,

Please find Mark's detailed description below.

Regards,
Tariq


This series addresses two problems that have been present for years, and
fixes one representor reload error-unwind case exposed while making the
reload path reusable.

First, there is no coordination between E-Switch reconfiguration and
representor registration. The E-Switch can be mid-way through a mode
change or VF count update while mlx5_ib walks in and registers or
unregisters representors. Nothing serializes the two. The race window is
small and there are no field reports, but it is clearly wrong.

Second, loading mlx5_ib while the device is already in switchdev mode
does not bring up the IB representors. mlx5_eswitch_register_vport_reps()
only stores callbacks; nobody triggers the actual load after registration.

The series fixes the registration race with a per-E-Switch representor
mutex. The lock is introduced first, then LAG shared-FDB and multiport
E-Switch transitions are adjusted so auxiliary device rescans and IB
representor reloads do not hold ldev->lock while taking the representor
lock. This keeps the intermediate commits bisectable before the stricter
E-Switch serialization and lock assertions are enabled.

After the LAG ordering is fixed, all E-Switch reconfiguration paths that
create, destroy, load, or unload representors take the representor mutex.
esw_mode_change() deliberately drops the mutex around
mlx5_rescan_drivers_locked(), because auxiliary probe and remove paths
re-enter mlx5_eswitch_register_vport_reps() and
mlx5_eswitch_unregister_vport_reps() on the same thread.
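
To illustrate why the mutex has to be dropped there, here is a minimal
userspace sketch, not the driver code. All names (toy_mutex, register_reps,
rescan_drivers, mode_change) are hypothetical stand-ins, and the sketch
assumes the mode change is entered and left with the lock held.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock standing in for the representor mutex. */
struct toy_mutex {
	bool held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);	/* a real mutex would self-deadlock here */
	m->held = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

struct esw {
	struct toy_mutex reps_lock;
	int registered;		/* number of register calls seen */
};

/* Stand-in for the register path: it takes the representor lock itself. */
static void register_reps(struct esw *esw)
{
	toy_lock(&esw->reps_lock);
	esw->registered++;
	toy_unlock(&esw->reps_lock);
}

/* Stand-in for the rescan: driver probe registers reps on this thread. */
static void rescan_drivers(struct esw *esw)
{
	register_reps(esw);
}

/* Assumption: called with reps_lock held, returns with it held. */
static void mode_change(struct esw *esw)
{
	assert(esw->reps_lock.held);
	/* ... reconfiguration that needs the lock ... */

	toy_unlock(&esw->reps_lock);	/* drop around the re-entrant call */
	rescan_drivers(esw);
	toy_lock(&esw->reps_lock);

	/* ... remaining reconfiguration under the lock ... */
}
```

If mode_change() kept the lock across rescan_drivers(), the assert in
toy_lock() would fire, which is the toy equivalent of the same-thread
deadlock.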

The shared-FDB peer IB registration path can hold one E-Switch
representor mutex and then register peer representor ops on another
E-Switch. The series annotates that case as nested locking so lockdep can
distinguish it from recursive locking on the same E-Switch.
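
For readers less familiar with lockdep: it reasons about lock classes, not
lock instances, so two representor mutexes on different E-Switches look
like the same lock unless the inner acquisition uses a different subclass.
In the kernel this is usually expressed with mutex_lock_nested(); whether
the series uses exactly that call is an assumption here. The toy model
below (hypothetical names, not the lockdep implementation) only shows the
class/subclass idea.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CLASSES	4
#define MAX_SUBCLASS	2

/* Every lock initialized from the same site shares one class_id. */
struct toy_lock {
	int class_id;
};

/* held[class][subclass] is true while such a lock is held. */
static bool held[MAX_CLASSES][MAX_SUBCLASS];

/* Returns false where lockdep would report possible recursive locking. */
static bool toy_acquire(const struct toy_lock *l, int subclass)
{
	assert(l->class_id < MAX_CLASSES && subclass < MAX_SUBCLASS);
	if (held[l->class_id][subclass])
		return false;
	held[l->class_id][subclass] = true;
	return true;
}

static void toy_release(const struct toy_lock *l, int subclass)
{
	held[l->class_id][subclass] = false;
}
```

With two locks of the same class, taking the second one with subclass 0 is
flagged while subclass 1 is accepted, even though the two locks are
different objects.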

For the missing IB representors, mlx5_eswitch_register_vport_reps() queues
a work item that acquires the devlink lock and loads all relevant
representors. This is the change that actually fixes the long-standing
bug.
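
The deferred load can be pictured with the following userspace sketch. It
is only a model: the queue, the struct fields and the function names are
hypothetical, and the devlink lock the real work item takes is reduced to
a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

struct esw;
typedef void (*work_fn)(struct esw *esw);

struct esw {
	bool switchdev;			/* device already in switchdev mode */
	bool ib_ops_registered;		/* callbacks stored by registration */
	bool ib_reps_loaded;
	work_fn queue[MAX_WORK];	/* toy work queue */
	size_t nr_queued;
};

static int queue_work_item(struct esw *esw, work_fn fn)
{
	if (esw->nr_queued == MAX_WORK)
		return -1;
	esw->queue[esw->nr_queued++] = fn;
	return 0;
}

/* Work item: the real one would take the devlink lock before loading. */
static void load_reps_work(struct esw *esw)
{
	if (esw->switchdev && esw->ib_ops_registered)
		esw->ib_reps_loaded = true;
}

/* Registration stores the callbacks and queues the load, nothing more. */
static int register_ib_reps(struct esw *esw)
{
	esw->ib_ops_registered = true;
	return queue_work_item(esw, load_reps_work);
}

/* Runs later, outside the registration call chain. */
static void run_queued_work(struct esw *esw)
{
	size_t i;

	for (i = 0; i < esw->nr_queued; i++) {
		assert(esw->queue[i] != NULL);
		esw->queue[i](esw);
	}
	esw->nr_queued = 0;
}
```

Registration returns before anything is loaded; the representors come up
only once the queued item runs, so the load does not execute in the
context of the registering thread.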

The reload path also learns to track which representor types were loaded by
the current attempt, so an error does not unload representors that were
already active before the retry.
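
A minimal sketch of that unwind rule, under the assumption that loaded
types can be tracked in a bitmask. The types, struct and helpers below are
hypothetical and only mirror the idea, not the mlx5 data structures.

```c
#include <assert.h>

/* Hypothetical representor types, for illustration only. */
enum rep_type { REP_ETH, REP_IB, NUM_REP_TYPES };

struct rep_state {
	unsigned int loaded;	/* bitmask of currently loaded types */
	int fail_type;		/* type whose load fails, or -1 (test hook) */
};

static int rep_load_one(struct rep_state *st, int t)
{
	if (t == st->fail_type)
		return -1;
	st->loaded |= 1u << t;
	return 0;
}

static void rep_unload_one(struct rep_state *st, int t)
{
	assert(st->loaded & (1u << t));
	st->loaded &= ~(1u << t);
}

/*
 * Load every type that is not loaded yet. On failure, unload only the
 * types loaded by this call; types active before the call stay loaded.
 */
static int rep_load_all(struct rep_state *st)
{
	unsigned int newly = 0;	/* types loaded by this attempt */
	int t, err;

	for (t = 0; t < NUM_REP_TYPES; t++) {
		if (st->loaded & (1u << t))
			continue;	/* already active, not ours to unwind */
		err = rep_load_one(st, t);
		if (err)
			goto unwind;
		newly |= 1u << t;
	}
	return 0;

unwind:
	while (--t >= 0)
		if (newly & (1u << t))
			rep_unload_one(st, t);
	return err;
}
```

If REP_ETH was already loaded and REP_IB fails, REP_ETH stays loaded; if
both were loaded by the failing attempt, both are unwound.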

Patch 1 is cleanup. LAG and MPESW had the same representor reload
sequence duplicated in several places and the copies had started to
drift. This consolidates them into one helper.

Patch 2 lets E-Switch workqueue callers choose GFP allocation flags.

Patch 3 adds the per-E-Switch representor lifecycle lock and helper APIs.

Patch 4 adjusts the LAG shared-FDB and multiport E-Switch transitions so
auxiliary device rescans and IB representor reloads run without
ldev->lock held while taking the representor lock.

Patch 5 protects the E-Switch reconfiguration, representor registration
and peer IB representor paths with the representor lock.

Patch 6 fixes representor load error unwind so only representor types
loaded by the current attempt are unloaded on failure.

Patch 7 moves the representor load triggered by
mlx5_eswitch_register_vport_reps() onto the work queue. This is the patch
that fixes IB representors not coming up when mlx5_ib is loaded while the
device is already in switchdev mode.

Changes:

v2 -> v3:

Drop the default switchdev module parameter patch. The proper user facing
interface is still under discussion, and this may be better handled by
devlink core infrastructure.

Patch 2: Add a new patch, per Sashiko's feedback, that lets E-Switch
workqueue callers pass GFP allocation flags to mlx5_esw_add_work(). The
functions-change notifier keeps using GFP_ATOMIC, while sleepable callers
can use GFP_KERNEL.

Patch 5: The unregister path now always unloads the selected representor
type before marking it unregistered and clearing rep_ops. It no longer
depends on esw->mode == MLX5_ESWITCH_OFFLOADS.

Patch 7: The queued late representor reload now calls mlx5_esw_add_work()
with GFP_KERNEL instead of relying on the helper's previous hardcoded
GFP_ATOMIC allocation.

v1 -> v2 (patch numbers as in V2):

Split v1 into two parts: the E-Switch workqueue deadlock fix and the
representor lifecycle changes. This is the second part; the first part
has already been accepted [1].

Patch 1: Add a cont_on_fail flag so callers can decide whether reload
should continue after a failure.

Patches 2, 3, 4: Replace the atomic-variable based scheme with a mutex,
per Jakub's feedback.

Patch 5: New patch that fixes the unwind on representor load failure.

Patch 7: Switch from profile 4 to profile 8. Since the profile mainly
targets E-Switch handling, keep it separate from the NIC profiles.

V2:
https://lore.kernel.org/all/20260501041633.231662-1-tariqt@nvidia.com/

V1:
https://lore.kernel.org/all/20260409115550.156419-1-tariqt@nvidia.com/

[1] https://lore.kernel.org/all/20260428051018.219093-1-tariqt@nvidia.com/


Mark Bloch (7):
  net/mlx5: Lag: refactor representor reload handling
  net/mlx5: E-Switch, let esw work callers choose GFP flags
  net/mlx5: E-Switch, add representor lifecycle lock
  net/mlx5: Lag, avoid LAG and representor lock cycles
  net/mlx5: E-Switch, serialize representor lifecycle
  net/mlx5: E-Switch, unwind only newly loaded representor types
  net/mlx5: E-Switch, load reps via work queue after registration

 drivers/infiniband/hw/mlx5/ib_rep.c           |   6 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  10 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   6 +
 .../mellanox/mlx5/core/eswitch_offloads.c     | 197 ++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 171 +++++++++++----
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |   5 +
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  18 +-
 .../ethernet/mellanox/mlx5/core/lib/devcom.c  |   8 +
 .../ethernet/mellanox/mlx5/core/lib/devcom.h  |   1 +
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  |   5 +
 include/linux/mlx5/eswitch.h                  |   6 +
 11 files changed, 361 insertions(+), 72 deletions(-)


base-commit: 98878ed91b68a3150126fccef125ee7b1bb86ab2
-- 
2.44.0

