From: Mark Bloch <mbloch@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>, Shay Drory <shayd@nvidia.com>,
Or Har-Toov <ohartoov@nvidia.com>,
Edward Srouji <edwards@nvidia.com>,
Maher Sanalla <msanalla@nvidia.com>,
Simon Horman <horms@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
Kees Cook <kees@kernel.org>,
Patrisious Haddad <phaddad@nvidia.com>,
Gerd Bayer <gbayer@linux.ibm.com>,
Parav Pandit <parav@nvidia.com>, Cosmin Ratiu <cratiu@nvidia.com>,
Carolina Jubran <cjubran@nvidia.com>,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-kernel@vger.kernel.org, Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter
Date: Thu, 9 Apr 2026 20:58:26 +0300
Message-ID: <bcd32076-5f52-4d8c-81da-7a2aed3990b0@nvidia.com>
In-Reply-To: <20260409115550.156419-3-tariqt@nvidia.com>
On 09/04/2026 14:55, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> The generation counter in mlx5_esw_functions is used to detect stale
> work items on the E-Switch work queue. Move it from mlx5_esw_functions
> to the top-level mlx5_eswitch struct so it can guard all work types,
> not just function-change events.
>
> This is a mechanical refactor: no behavioral change.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 3 ++-
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 ++--
> 3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 123c96716a54..1986d4d0e886 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -1075,7 +1075,7 @@ static void mlx5_eswitch_event_handler_unregister(struct mlx5_eswitch *esw)
> if (esw->mode == MLX5_ESWITCH_OFFLOADS &&
> mlx5_eswitch_is_funcs_handler(esw->dev)) {
> mlx5_eq_notifier_unregister(esw->dev, &esw->esw_funcs.nb);
> - atomic_inc(&esw->esw_funcs.generation);
> + atomic_inc(&esw->generation);
> }
> }
>
> @@ -2072,6 +2072,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
> mutex_init(&esw->state_lock);
> init_rwsem(&esw->mode_lock);
> refcount_set(&esw->qos.refcnt, 0);
> + atomic_set(&esw->generation, 0);
>
> esw->enabled_vports = 0;
> esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index 5128f5020dae..0c3d2bdebf8c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -340,7 +340,6 @@ struct mlx5_host_work {
>
> struct mlx5_esw_functions {
> struct mlx5_nb nb;
> - atomic_t generation;
> bool host_funcs_disabled;
> u16 num_vfs;
> u16 num_ec_vfs;
> @@ -410,6 +409,7 @@ struct mlx5_eswitch {
> struct mlx5_devcom_comp_dev *devcom;
> u16 enabled_ipsec_vf_count;
> bool eswitch_operation_in_progress;
> + atomic_t generation;
> };
>
> void esw_offloads_disable(struct mlx5_eswitch *esw);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index a078d06f4567..b2e7294d3a5c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -3667,7 +3667,7 @@ esw_vfs_changed_event_handler(struct mlx5_eswitch *esw, int work_gen,
> devl_lock(devlink);
>
> /* Stale work from one or more mode changes ago. Bail out. */
> - if (work_gen != atomic_read(&esw->esw_funcs.generation))
> + if (work_gen != atomic_read(&esw->generation))
> goto unlock;
Sashiko writes:
> Does this generation counter pattern allow stale work to bypass the guard?
> While this issue seems to have existed before this patch, it appears the
> notifier can execute concurrently with the teardown path, creating a race
> condition.
> If a concurrent hardware event triggers mlx5_esw_funcs_changed_handler() on
> another CPU during mode teardown:
> 1. The teardown path calls mlx5_eswitch_event_handler_unregister(), which
> unregisters the notifier and increments esw->generation.
> 2. mlx5_eq_notifier_unregister() removes the callback but does not
> synchronize with currently executing RCU readers.
> 3. The concurrently running notifier reads the newly incremented generation
> number via atomic_read(&esw->generation) and queues host_work with it.
> 4. The worker thread (esw_vfs_changed_event_handler) blocks waiting for
> devl_lock(devlink), which is currently held by the teardown thread.
> 5. Once teardown finishes transitioning to legacy mode and drops the lock,
> the worker resumes and checks the generation.
> 6. Since both the work_gen and esw->generation match the newly incremented
> value, the guard is bypassed.
> Because teardown set esw_funcs.num_vfs = 0 but firmware might still report
> new_num_vfs > 0 (as VFs remain active), the worker calls
> mlx5_eswitch_load_vf_vports(). Could this execute offloads-specific
> initialization while the eswitch is in legacy mode and lead to state
> corruption?
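False positive. atomic_notifier_call_chain() runs under the RCU
read lock, and atomic_notifier_chain_unregister() calls
synchronize_rcu() before returning. By the time
mlx5_eq_notifier_unregister() returns, any notifier that was
already running has finished queueing its work with the old
generation, so the atomic_inc() that follows invalidates it.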
Mark
>
> new_num_vfs = MLX5_GET(query_esw_functions_out, out,
> @@ -3729,7 +3729,7 @@ int mlx5_esw_funcs_changed_handler(struct notifier_block *nb, unsigned long type
> esw = container_of(esw_funcs, struct mlx5_eswitch, esw_funcs);
>
> host_work->esw = esw;
> - host_work->work_gen = atomic_read(&esw_funcs->generation);
> + host_work->work_gen = atomic_read(&esw->generation);
>
> INIT_WORK(&host_work->work, esw_functions_changed_event_handler);
> queue_work(esw->work_queue, &host_work->work);
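A minimal sketch of the generation-counter pattern this patch
prepares for, assuming a single counter on the top-level struct
guards every work type. The model_* names and the
MODEL_WORK_REP_LOAD type are hypothetical illustrations, not mlx5
code; in the driver the comparison is done after devl_lock(), as in
esw_vfs_changed_event_handler().

```c
#include <assert.h>     /* used by the usage checks */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical work types sharing one guard. */
enum model_work_type { MODEL_WORK_FUNCS_CHANGED, MODEL_WORK_REP_LOAD };

struct model_eswitch {
	atomic_int generation;  /* one counter for all work types */
};

struct model_work {
	enum model_work_type type;
	int work_gen;           /* snapshot taken at queue time */
};

/* Queue side: every work type records the current generation. */
static struct model_work model_queue(struct model_eswitch *esw,
				     enum model_work_type type)
{
	struct model_work w = {
		.type = type,
		.work_gen = atomic_load(&esw->generation),
	};
	return w;
}

/* A mode change invalidates everything queued before it. */
static void model_mode_change(struct model_eswitch *esw)
{
	atomic_fetch_add(&esw->generation, 1);
}

/* Handler side: run only if no mode change happened since queueing. */
static bool model_should_run(struct model_eswitch *esw,
			     const struct model_work *w)
{
	return w->work_gen == atomic_load(&esw->generation);
}
```

Because the counter lives on the eswitch rather than in
mlx5_esw_functions, one model_mode_change() invalidates pending work
of every type at once.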