From: Jakub Kicinski <kuba@kernel.org>
To: tariqt@nvidia.com
Cc: Jakub Kicinski <kuba@kernel.org>,
edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
davem@davemloft.net, leon@kernel.org, jgg@ziepe.ca,
saeedm@nvidia.com, mbloch@nvidia.com, shayd@nvidia.com,
ohartoov@nvidia.com, edwards@nvidia.com, msanalla@nvidia.com,
horms@kernel.org, gbayer@linux.ibm.com, moshe@nvidia.com,
kees@kernel.org, phaddad@nvidia.com, parav@nvidia.com,
cjubran@nvidia.com, cratiu@nvidia.com,
linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, gal@nvidia.com, dtatulea@nvidia.com
Subject: Re: [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle
Date: Sat, 2 May 2026 18:42:24 -0700 [thread overview]
Message-ID: <20260503014224.4096089-1-kuba@kernel.org> (raw)
In-Reply-To: <20260501041633.231662-5-tariqt@nvidia.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: E-Switch, serialize representor lifecycle
This change replaces the ad-hoc bit/wait-queue coordination of E-Switch
representor lifecycle with the per-E-Switch reps_lock, adds nested
wrappers for the shared-FDB master IB path, and clears the rep_ops
pointer on unregister.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 66a773a99876..f70737437954 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
[ ... ]
> @@ -1825,7 +1831,11 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
>
> devl_assert_locked(priv_to_devlink(esw->dev));
> mlx5_lag_disable_change(esw->dev);
> +
> + mlx5_esw_reps_block(esw);
> mlx5_eswitch_disable_locked(esw);
> + mlx5_esw_reps_unblock(esw);
> +
> esw->mode = MLX5_ESWITCH_LEGACY;
> mlx5_lag_enable_change(esw->dev);
> }
The commit message says "Take the representor lock around the mode
transition", but the assignment esw->mode = MLX5_ESWITCH_LEGACY happens
here after reps_lock has been dropped.
Meanwhile mlx5_eswitch_unregister_vport_reps_blocked() now reads
esw->mode under reps_lock to decide whether to call
__unload_reps_all_vport():
if (esw->mode == MLX5_ESWITCH_OFFLOADS)
__unload_reps_all_vport(esw, rep_type);
Should the mode store be moved inside the reps_block/reps_unblock
region so the reader's assumption about stability under reps_lock
holds?
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 6a5143b63dfd..d4ac07c995b9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
[ ... ]
> @@ -2425,21 +2436,22 @@ void mlx5_esw_reps_unblock(struct mlx5_eswitch *esw)
>
> static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
> {
> + mlx5_esw_reps_unblock(esw);
> mlx5_devcom_comp_lock(esw->dev->priv.hca_devcom_comp);
> if (esw->dev->priv.flags & MLX5_PRIV_FLAGS_DISABLE_IB_ADEV ||
> mlx5_core_mp_enabled(esw->dev)) {
> esw->mode = mode;
> - mlx5_rescan_drivers_locked(esw->dev);
> - mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> - return;
> + goto out;
> }
>
> esw->dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> mlx5_rescan_drivers_locked(esw->dev);
> esw->mode = mode;
> esw->dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_IB_ADEV;
> +out:
> mlx5_rescan_drivers_locked(esw->dev);
> mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
> + mlx5_esw_reps_block(esw);
> }
Both assignments to esw->mode in this function are reached after
reps_lock has been dropped at the top of the function.
mlx5_rescan_drivers_locked() can invoke driver bind/unbind which routes
through mlx5_eswitch_unregister_vport_reps(), and that path now takes
reps_lock and reads esw->mode to decide whether to call
__unload_reps_all_vport().
Can a rep-unregister that runs during the rescan observe a stale
esw->mode here and make the wrong unload decision?
[ ... ]
> @@ -4496,9 +4524,10 @@ mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
[ ... ]
> @@ -4525,9 +4585,35 @@ void mlx5_eswitch_unregister_vport_reps(struct mlx5_eswitch *esw, u8 rep_type)
>
> mlx5_esw_for_each_rep(esw, i, rep)
> atomic_set(&rep->rep_data[rep_type].state, REP_UNREGISTERED);
> +
> + esw->offloads.rep_ops[rep_type] = NULL;
> +}
This new NULL store on unregister appears to introduce a concurrent
NULL dereference in readers that do not hold reps_lock. Before this
patch, rep_ops[rep_type] was set once at register and never cleared,
so a REP_LOADED state check was sufficient to know the pointer was
valid.
mlx5_eswitch_get_proto_dev() is one such reader:
rep = mlx5_eswitch_get_rep(esw, vport);
if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
esw->offloads.rep_ops[rep_type]->get_proto_dev)
return esw->offloads.rep_ops[rep_type]->get_proto_dev(rep);
The state check and each pointer load here are independent reads. Can
a concurrent unregister on another CPU run between the state load and
the ->get_proto_dev dereference, transitioning the rep through
REP_REGISTERED/REP_UNREGISTERED and then storing NULL into rep_ops,
so the second load observes NULL?
mlx5_esw_offloads_rep_event_unpair() and mlx5_esw_offloads_pair() have
the same shape:
ops = esw->offloads.rep_ops[rep_type];
if (atomic_read(&rep->rep_data[rep_type].state) == REP_LOADED &&
ops->event)
ops->event(esw, rep, ...);
Neither path holds reps_lock; they run from the devcom PAIR/UNPAIR
event handler. Can ops be NULL here once an unregister has cleared
rep_ops?
If readers are expected to be safe, should they either take reps_lock,
be converted to RCU, or skip the NULL clear on unregister?
[ ... ]
--
pw-bot: cr
next prev parent reply other threads:[~2026-05-03 1:42 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-01 4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
2026-05-01 4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
2026-05-02 20:04 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
2026-05-02 20:05 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski [this message]
2026-05-03 8:18 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
2026-05-02 20:06 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-02 20:07 ` Mark Bloch
2026-05-03 1:42 ` Jakub Kicinski
2026-05-03 8:01 ` Mark Bloch
2026-05-01 4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-05-02 20:08 ` Mark Bloch
2026-05-03 1:41 ` Jakub Kicinski
2026-05-03 7:51 ` Mark Bloch
2026-05-05 1:21 ` Jakub Kicinski
2026-05-05 2:00 ` Mark Bloch
2026-05-05 2:19 ` Jakub Kicinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260503014224.4096089-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=andrew+netdev@lunn.ch \
--cc=cjubran@nvidia.com \
--cc=cratiu@nvidia.com \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=edumazet@google.com \
--cc=edwards@nvidia.com \
--cc=gal@nvidia.com \
--cc=gbayer@linux.ibm.com \
--cc=horms@kernel.org \
--cc=jgg@ziepe.ca \
--cc=kees@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=moshe@nvidia.com \
--cc=msanalla@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=ohartoov@nvidia.com \
--cc=pabeni@redhat.com \
--cc=parav@nvidia.com \
--cc=phaddad@nvidia.com \
--cc=saeedm@nvidia.com \
--cc=shayd@nvidia.com \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.