Netdev List
 help / color / mirror / Atom feed
From: Shay Drori <shayd@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Mark Bloch <mbloch@nvidia.com>,
	Or Har-Toov <ohartoov@nvidia.com>,
	"Edward Srouji" <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	"Simon Horman" <horms@kernel.org>,
	Gerd Bayer <gbayer@linux.ibm.com>, Kees Cook <kees@kernel.org>,
	Moshe Shemesh <moshe@nvidia.com>, Parav Pandit <parav@nvidia.com>,
	Patrisious Haddad <phaddad@nvidia.com>, <netdev@vger.kernel.org>,
	<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	Gal Pressman <gal@nvidia.com>
Subject: Re: [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state
Date: Sun, 14 Jun 2026 19:44:23 +0300	[thread overview]
Message-ID: <8e8434f7-6120-4fe7-8a8a-045e71bbfd2f@nvidia.com> (raw)
In-Reply-To: <20260612113904.537595-14-tariqt@nvidia.com>



On 12/06/2026 14:39, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> On an SD device, vport representors are not functional until the SD
> group is combined and shared FDB is active. Skip the initial load and
> the reload paths in that window; reps are loaded as part of the SD LAG
> activation flow once it becomes active.
> 
> In addition, explicitly unload representors when SD LAG is destroyed.
> 
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx5/core/eswitch.h |  4 +++
>   .../mellanox/mlx5/core/eswitch_offloads.c     | 26 +++++++++++++++++++
>   .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 26 +++++++++++++++++++
>   .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  1 +
>   .../mellanox/mlx5/core/lag/shared_fdb.c       |  1 +
>   5 files changed, 58 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index a5f0774834fe..b2b3150f1f04 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -959,6 +959,7 @@ int mlx5_eswitch_offloads_single_fdb_add_one(struct mlx5_eswitch *master_esw,
>   void mlx5_eswitch_offloads_single_fdb_del_one(struct mlx5_eswitch *master_esw,
>   					      struct mlx5_eswitch *slave_esw);
>   int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw);
> +void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw);
>   bool mlx5_eswitch_is_peer(struct mlx5_eswitch *esw,
>   			  struct mlx5_eswitch *peer_esw);
>   
> @@ -1063,6 +1064,9 @@ mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   	return 0;
>   }
>   
> +static inline void
> +mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw) {}
> +
>   static inline bool
>   mlx5_eswitch_block_encap(struct mlx5_core_dev *dev, bool from_fdb)
>   {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index a24719cfba34..4dc190a4e7b2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -2863,6 +2863,10 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
>   	int rep_type;
>   	int err;
>   
> +	if (vport_num != MLX5_VPORT_UPLINK &&
> +	    mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +		return 0;

sashiko.dev says:
If we intentionally skip the initial load of the Ethernet representors
(REP_ETH) here, will secondary devices permanently fail to load them later?
Looking at mlx5_lag_shared_fdb_create(), the code triggers an asynchronous
rescan exclusively for the primary device (dev0):
     mlx5_lag_rescan_dev_locked(ldev, dev0, true);
     err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0, filter, false);
Because secondary devices are not rescanned, their auxiliary eth-rep drivers
are never re-probed, and their asynchronous reload is never queued. 
Since the
synchronous reload depends on REP_ETH already being loaded, skipping it here
seems to leave secondary devices permanently without network interfaces.

[SD] primary will invoke mlx5_esw_add_work() for secondary devices as
well via mlx5_eswitch_register_vport_reps_nested().

> +
>   	rep = mlx5_eswitch_get_rep(esw, vport_num);
>   	for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) {
>   		err = __esw_offloads_load_rep(esw, rep, rep_type,
> @@ -3779,6 +3783,21 @@ static void esw_destroy_offloads_acl_tables(struct mlx5_eswitch *esw)
>   		esw_vport_destroy_offloads_acl_tables(esw, vport);
>   }
>   
> +void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw)
> +{
> +	struct mlx5_eswitch_rep *rep;
> +	unsigned long i;
> +
> +	if (!esw || esw->mode != MLX5_ESWITCH_OFFLOADS)
> +		return;
> +
> +	mlx5_esw_for_each_rep(esw, i, rep) {
> +		if (rep->vport == MLX5_VPORT_UPLINK)
> +			continue;
> +		mlx5_esw_offloads_rep_unload(esw, rep->vport);
> +	}
> +}
> +
>   int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   {
>   	struct mlx5_eswitch_rep *rep;
> @@ -3805,6 +3824,10 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   		if (!mlx5_sd_is_primary(esw->dev) &&
>   		    rep->vport == MLX5_VPORT_UPLINK)
>   			continue;
> +		if (rep->vport != MLX5_VPORT_UPLINK &&
> +		    mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +			continue;
> +

Is there a race condition here during SD LAG activation that bypasses the
synchronous load of primary device representors?
In mlx5_lag_shared_fdb_create(), the unbind/rebind of auxiliary drivers for
the primary device queues an asynchronous work item to load REP_ETH.
Immediately following this, mlx5_lag_reload_ib_reps_from_locked() executes
synchronously.
Because the asynchronous work hasn't run yet, REP_ETH is not loaded.
Consequently, this synchronous loop will evaluate the REP_LOADED check as
false and silently skip loading REP_IB.

[SD] The async reload loads both REP_ETH and REP_IB for VF/SF. The
synchronous reload_ib_reps only re-adds IB for reps whose ETH is already
loaded; skipping IB when ETH isn't up yet is not a loss — the async path
loads both. No race.


>   		if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
>   			__esw_offloads_load_rep(esw, rep, REP_IB, NULL);
>   	}
> @@ -4764,6 +4787,9 @@ static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
>   		return;
>   	}
>   
> +	if (mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +		return;
> +
>   	mlx5_esw_for_each_vport(esw, i, vport) {
>   		if (!vport)
>   			continue;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index 424478e649ef..28d16fdc3f06 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -1312,6 +1312,32 @@ int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
>   	return mlx5_lag_reload_ib_reps(ldev, flags, filter, cont_on_fail);
>   }
>   
> +static void mlx5_lag_unload_reps_unlocked(struct mlx5_lag *ldev, u32 filter)
> +{
> +	struct lag_func *pf;
> +	int i;
> +
> +	mlx5_lag_for_each(i, 0, ldev, filter) {
> +		struct mlx5_eswitch *esw;
> +
> +		pf = mlx5_lag_pf(ldev, i);
> +		esw = pf->dev->priv.eswitch;
> +		mlx5_esw_reps_block(esw);
> +		mlx5_eswitch_unload_reps(esw);
> +		mlx5_esw_reps_unblock(esw);
> +	}
> +}
> +
> +void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter)
> +{
> +	/* Same lock dance as mlx5_lag_reload_ib_reps: drop ldev->lock around
> +	 * the per-eswitch reps_lock to keep the reps_lock -> ldev->lock order.
> +	 */
> +	mlx5_lag_drop_lock_for_reps(ldev, filter);
> +	mlx5_lag_unload_reps_unlocked(ldev, filter);
> +	mlx5_lag_retake_lock_after_reps(ldev);
> +}
> +
>   void mlx5_disable_lag(struct mlx5_lag *ldev)
>   {
>   	bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> index 8481ce55c10a..e9f0ef83ce1d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> @@ -310,6 +310,7 @@ int mlx5_lag_num_devs(struct mlx5_lag *ldev);
>   int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
>   int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
>   					u32 filter, bool cont_on_fail);
> +void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter);
>   int mlx5_ldev_add_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev,
>   		       u32 group_id);
>   void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> index 8d4f2903a101..113866494d16 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> @@ -296,6 +296,7 @@ void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev, u32 group_id)
>   			pf->sd_fdb_active = false;
>   		}
>   		mlx5_lag_destroy_single_fdb_filter(ldev, group_id);
> +		mlx5_lag_unload_reps_from_locked(ldev, filter);

sashiko.dev says:
Does explicitly unloading all representors here render the subsequent IB 
reload
a dead code no-op?
Immediately after mlx5_lag_unload_reps_from_locked() forcefully unloads 
REP_ETH
and other representors, this function calls
mlx5_lag_reload_ib_reps_from_locked().
Because REP_ETH was just unloaded, the condition checking if the state is
REP_LOADED inside mlx5_eswitch_reload_ib_reps() will evaluate to false,
silently skipping all IB representors.

[SD] this is intended

>   	}
>   
>   	mlx5_lag_add_devices_filter(ldev, filter);


  reply	other threads:[~2026-06-14 16:44 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 01/15] net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 02/15] net/mlx5: devcom, expose locked variant of send_event Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 03/15] net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 04/15] net/mlx5: SD, make primary/secondary role determination more robust Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 05/15] net/mlx5: SD, add L2 table silent mode query support Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 06/15] net/mlx5: SD, expend vport metadata for SD secondary devices Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 07/15] net/mlx5: SD, support switchdev mode transition with shared FDB Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 08/15] net/mlx5: E-Switch, notify SD on eswitch disable Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 09/15] net/mlx5: LAG, store demux resources per master lag_func Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change Tariq Toukan
2026-06-14 16:43   ` Shay Drori
2026-06-12 11:39 ` [PATCH net-next V3 11/15] net/mlx5: LAG, introduce software vport LAG implementation Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 12/15] net/mlx5: LAG, add MPESW over SD LAG support Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state Tariq Toukan
2026-06-14 16:44   ` Shay Drori [this message]
2026-06-12 11:39 ` [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready Tariq Toukan
2026-06-14 16:44   ` Shay Drori
2026-06-12 11:39 ` [PATCH net-next V3 15/15] net/mlx5: SD, enable SD over ECPF and allow switchdev transition Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8e8434f7-6120-4fe7-8a8a-045e71bbfd2f@nvidia.com \
    --to=shayd@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=gbayer@linux.ibm.com \
    --cc=horms@kernel.org \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox