public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mark Bloch <mbloch@nvidia.com>
To: Jakub Kicinski <kuba@kernel.org>, tariqt@nvidia.com
Cc: edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, leon@kernel.org, jgg@ziepe.ca,
	saeedm@nvidia.com, shayd@nvidia.com, ohartoov@nvidia.com,
	edwards@nvidia.com, msanalla@nvidia.com, horms@kernel.org,
	gbayer@linux.ibm.com, moshe@nvidia.com, kees@kernel.org,
	phaddad@nvidia.com, parav@nvidia.com, cjubran@nvidia.com,
	cratiu@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	gal@nvidia.com, dtatulea@nvidia.com
Subject: Re: [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
Date: Sun, 3 May 2026 11:18:52 +0300	[thread overview]
Message-ID: <391365ee-df9a-43b6-b152-00e30ae695b1@nvidia.com> (raw)
In-Reply-To: <20260503014224.4096089-1-kuba@kernel.org>



On 03/05/2026 4:42, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: E-Switch, serialize representor lifecycle
> 
> This change replaces the ad-hoc bit/wait-queue coordination of E-Switch
> representor lifecycle with the per-E-Switch reps_lock, adds nested
> wrappers for the shared-FDB master IB path, and clears the rep_ops
> pointer on unregister.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index 66a773a99876..f70737437954 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> 
> [ ... ]
> 
>> @@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
>>  
>>  	devl_assert_locked(priv_to_devlink(esw->dev));
>>  	mlx5_lag_disable_change(esw->dev);
>> +
>> +	mlx5_esw_reps_block(esw);
>>  	mlx5_eswitch_disable_locked(esw);
>> +	mlx5_esw_reps_unblock(esw);
>> +
>>  	esw->mode = MLX5_ESWITCH_LEGACY;
>>  	mlx5_lag_enable_change(esw->dev);
>>  }
> 
> The commit message says "Take the representor lock around the mode
> transition", but the assignment esw->mode = MLX5_ESWITCH_LEGACY happens
> here after reps_lock has been dropped.
> 
> Meanwhile mlx5_eswitch_unregister_vport_reps_blocked() now reads
> esw->mode under reps_lock to decide whether to call
> __unload_reps_all_vport():
> 
>     if (esw->mode == MLX5_ESWITCH_OFFLOADS)
>             __unload_reps_all_vport(esw, rep_type);
> 
> Should the mode store be moved inside the reps_block/reps_unblock
> region so the reader's assumption about stability under reps_lock
> holds?
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> index 6a5143b63dfd..d4ac07c995b9 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> 
> [ ... ]
> 
>> @@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
>>  
>>  static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
>>  {
>> +	mlx5_esw_reps_unblock(esw);
>>  	mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
>>  	if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
>>  	    mlx5_core_mp_enabled(esw->dev)) {
>>  		esw->mode = mode;
>> -		mlx5_rescan_drivers_locked(esw->dev);
>> -		mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
>> -		return;
>> +		goto out;
>>  	}
>>  
>>  	esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
>>  	mlx5_rescan_drivers_locked(esw->dev);
>>  	esw->mode = mode;
>>  	esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
>> +out:
>>  	mlx5_rescan_drivers_locked(esw->dev);
>>  	mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
>> +	mlx5_esw_reps_block(esw);
>>  }
> 
> Both assignments to esw->mode in this function are reached after
> reps_lock has been dropped at the top of the function.
> 
> mlx5_rescan_drivers_locked() can invoke driver bind/unbind which routes
> through mlx5_eswitch_unregister_vport_reps(), and that path now takes
> reps_lock and reads esw->mode to decide whether to call
> __unload_reps_all_vport().
> 
> Can a rep-unregister that runs during the rescan observe a stale
> esw->mode here and make the wrong unload decision?
> 
> [ ... ]
> 
>> @@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
> 
> [ ... ]
> 
>> @@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
>>  
>>  	mlx5_esw_for_each_rep(esw, i, rep)
>>  		atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
>> +
>> +	esw->offloads.rep_ops[rep_type] = NULL;
>> +}
> 
> This new NULL store on unregister appears to introduce a concurrent
> NULL dereference in readers that do not hold reps_lock. Before this
> patch, rep_ops[rep_type] was set once at register and never cleared,
> so a REP_LOADED state check was sufficient to know the pointer was
> valid.
> 
> mlx5_eswitch_get_proto_dev() is one such reader:
> 
>     rep = mlx5_eswitch_get_rep(esw, vport);
>     if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
>         esw->offloads.rep_ops[rep_type]->get_proto_dev)
>             return esw->offloads.rep_ops[rep_type]->get_proto_dev(rep);
> 
> The state check and each pointer load here are independent reads. Can
> a concurrent unregister on another CPU run between the state load and
> the ->get_proto_dev dereference, transitioning the rep through
> REP_REGISTERED/REP_UNREGISTERED and then storing NULL into rep_ops,
> so the second load observes NULL?
> 
> mlx5_esw_offloads_rep_event_unpair() and mlx5_esw_offloads_pair() have
> the same shape:
> 
>     ops = esw->offloads.rep_ops[rep_type];
>     if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
>         ops->event)
>             ops->event(esw, rep, ...);
> 
> Neither path holds reps_lock; they run from the devcom PAIR/UNPAIR
> event handler. Can ops be NULL here once an unregister has cleared
> rep_ops?
> 
> If readers are expected to be safe, should they either take reps_lock,
> be converted to RCU, or skip the NULL clear on unregister?
> 
> [ ... ]

Good point. The unregister path should not key the unload decision off
esw->mode. esw->mode is not the lifetime predicate for representor ops,
especially because mode transitions intentionally drop reps_lock around
auxiliary bus rescans. The actual invariant we need is that unregister
unloads any currently loaded reps before the ops pointer can be cleared
or the ops owner can go away.

I will change mlx5_eswitch_unregister_vport_reps_blocked() to always
call __unload_reps_all_vport(). That helper is already state-gated,
so this is a no-op when no reps are loaded.

With that change, the lockless readers do not get a new live
rep_ops == NULL window. unregister first synchronously unloads the
loaded reps, and those unload callbacks tear down the representor-owned
contexts that can call get_proto_dev() / rep events. Clearing rep_op
after that removes a stale pointer keeping it would be wrong because
the ops memory may be going away.

Mark


  reply	other threads:[~2026-05-03  8:19 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-01  4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
2026-05-02 20:04   ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
2026-05-02 20:05   ` Mark Bloch
2026-05-03  1:42   ` Jakub Kicinski
2026-05-03  8:18     ` Mark Bloch [this message]
2026-05-01  4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
2026-05-02 20:06   ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-02 20:07   ` Mark Bloch
2026-05-03  1:42   ` Jakub Kicinski
2026-05-03  8:01     ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-05-02 20:08   ` Mark Bloch
2026-05-03  1:41     ` Jakub Kicinski
2026-05-03  7:51       ` Mark Bloch
2026-05-05  1:21         ` Jakub Kicinski
2026-05-05  2:00           ` Mark Bloch
2026-05-05  2:19             ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=391365ee-df9a-43b6-b152-00e30ae695b1@nvidia.com \
    --to=mbloch@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=cjubran@nvidia.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=gbayer@linux.ibm.com \
    --cc=horms@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox