public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Mark Bloch <mbloch@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Shay Drory <shayd@nvidia.com>,
	Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	Simon Horman <horms@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
	Kees Cook <kees@kernel.org>,
	Patrisious Haddad <phaddad@nvidia.com>,
	Gerd Bayer <gbayer@linux.ibm.com>,
	Parav Pandit <parav@nvidia.com>, Cosmin Ratiu <cratiu@nvidia.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Gal Pressman <gal@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter
Date: Thu, 9 Apr 2026 20:58:26 +0300	[thread overview]
Message-ID: <bcd32076-5f52-4d8c-81da-7a2aed3990b0@nvidia.com> (raw)
In-Reply-To: <20260409115550.156419-3-tariqt@nvidia.com>



On 09/04/2026 14:55, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
> 
> The generation counter in mlx5_esw_functions is used to detect stale
> work items on the E-Switch work queue. Move it from mlx5_esw_functions
> to the top-level mlx5_eswitch struct so it can guard all work types,
> not just function-change events.
> 
> This is a mechanical refactor: no behavioral change.
> 
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.c          | 3 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.h          | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 ++--
>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 123c96716a54..1986d4d0e886 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -1075,7 +1075,7 @@ static void mlx5_eswitch_event_handler_unregister(struct mlx5_eswitch *esw)
>  	if (esw->mode == MLX5_ESWITCH_OFFLOADS &&
>  	    mlx5_eswitch_is_funcs_handler(esw->dev)) {
>  		mlx5_eq_notifier_unregister(esw->dev, &esw->esw_funcs.nb);
> -		atomic_inc(&esw->esw_funcs.generation);
> +		atomic_inc(&esw->generation);
>  	}
>  }
>  
> @@ -2072,6 +2072,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
>  	mutex_init(&esw->state_lock);
>  	init_rwsem(&esw->mode_lock);
>  	refcount_set(&esw->qos.refcnt, 0);
> +	atomic_set(&esw->generation, 0);
>  
>  	esw->enabled_vports = 0;
>  	esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index 5128f5020dae..0c3d2bdebf8c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -340,7 +340,6 @@ struct mlx5_host_work {
>  
>  struct mlx5_esw_functions {
>  	struct mlx5_nb		nb;
> -	atomic_t		generation;
>  	bool			host_funcs_disabled;
>  	u16			num_vfs;
>  	u16			num_ec_vfs;
> @@ -410,6 +409,7 @@ struct mlx5_eswitch {
>  	struct mlx5_devcom_comp_dev *devcom;
>  	u16 enabled_ipsec_vf_count;
>  	bool eswitch_operation_in_progress;
> +	atomic_t generation;
>  };
>  
>  void esw_offloads_disable(struct mlx5_eswitch *esw);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index a078d06f4567..b2e7294d3a5c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -3667,7 +3667,7 @@ esw_vfs_changed_event_handler(struct mlx5_eswitch *esw, int work_gen,
>  	devl_lock(devlink);
>  
>  	/* Stale work from one or more mode changes ago. Bail out. */
> -	if (work_gen != atomic_read(&esw->esw_funcs.generation))
> +	if (work_gen != atomic_read(&esw->generation))
>  		goto unlock;

Sashiko writes:

> Does this generation counter pattern allow stale work to bypass the guard?
> While this issue seems to have existed before this patch, it appears the
> notifier can execute concurrently with the teardown path, creating a race
> condition.
> If a concurrent hardware event triggers mlx5_esw_funcs_changed_handler() on
> another CPU during mode teardown:
> 1. The teardown path calls mlx5_eswitch_event_handler_unregister(), which
>    unregisters the notifier and increments esw->generation.
> 2. mlx5_eq_notifier_unregister() removes the callback but does not
>    synchronize with currently executing RCU readers.
> 3. The concurrently running notifier reads the newly incremented generation
>    number via atomic_read(&esw->generation) and queues host_work with it.
> 4. The worker thread (esw_vfs_changed_event_handler) blocks waiting for
>    devl_lock(devlink), which is currently held by the teardown thread.
> 5. Once teardown finishes transitioning to legacy mode and drops the lock,
>    the worker resumes and checks the generation.
> 6. Since both the work_gen and esw->generation match the newly incremented
>    value, the guard is bypassed.
> Because teardown set esw_funcs.num_vfs = 0 but firmware might still report
> new_num_vfs > 0 (as VFs remain active), the worker calls
> mlx5_eswitch_load_vf_vports(). Could this execute offloads-specific
> initialization while the eswitch is in legacy mode and lead to state
> corruption?

False positive, atomic_notifier_call_chain() runs under rcu
read lock, while atomic_notifier_chain_unregister()
performs a synchronize_rcu() before returning.

Mark
>  
>  	new_num_vfs = MLX5_GET(query_esw_functions_out, out,
> @@ -3729,7 +3729,7 @@ int mlx5_esw_funcs_changed_handler(struct notifier_block *nb, unsigned long type
>  	esw = container_of(esw_funcs, struct mlx5_eswitch, esw_funcs);
>  
>  	host_work->esw = esw;
> -	host_work->work_gen = atomic_read(&esw_funcs->generation);
> +	host_work->work_gen = atomic_read(&esw->generation);
>  
>  	INIT_WORK(&host_work->work, esw_functions_changed_event_handler);
>  	queue_work(esw->work_queue, &host_work->work);


  reply	other threads:[~2026-04-09 17:58 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 11:55 [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-04-09 17:57   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 2/7] net/mlx5: E-Switch, move work queue generation counter Tariq Toukan
2026-04-09 17:58   ` Mark Bloch [this message]
2026-04-09 11:55 ` [PATCH net-next 3/7] net/mlx5: E-Switch, introduce generic work queue dispatch helper Tariq Toukan
2026-04-09 11:55 ` [PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq Tariq Toukan
2026-04-09 18:01   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 5/7] net/mlx5: E-Switch, block representors during reconfiguration Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 11:55 ` [PATCH net-next 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-04-09 18:02   ` Mark Bloch
2026-04-09 18:20 ` [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock Mark Bloch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bcd32076-5f52-4d8c-81da-7a2aed3990b0@nvidia.com \
    --to=mbloch@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=cjubran@nvidia.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=gbayer@linux.ibm.com \
    --cc=horms@kernel.org \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox