public inbox for linux-kernel@vger.kernel.org
From: Mark Bloch <mbloch@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Leon Romanovsky <leon@kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>,
	Saeed Mahameed <saeedm@nvidia.com>, Shay Drory <shayd@nvidia.com>,
	Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	Simon Horman <horms@kernel.org>,
	Gerd Bayer <gbayer@linux.ibm.com>,
	Moshe Shemesh <moshe@nvidia.com>, Kees Cook <kees@kernel.org>,
	Patrisious Haddad <phaddad@nvidia.com>,
	Parav Pandit <parav@nvidia.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	Cosmin Ratiu <cratiu@nvidia.com>,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, Gal Pressman <gal@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init
Date: Sat, 2 May 2026 23:08:43 +0300
Message-ID: <421e8885-5849-4390-8956-9bc344fa0bf0@nvidia.com>
In-Reply-To: <20260501041633.231662-8-tariqt@nvidia.com>



On 01/05/2026 7:16, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
> 
> Deployments that always operate in switchdev mode currently require
> manual devlink configuration after driver probe, which complicates
> automated provisioning.
> 
> Introduce MLX5_PROF_MASK_DEF_SWITCHDEV, a new profile mask bit, and
> profile index 8. When a device is initialized or reloaded with this
> profile, the driver automatically switches the e-switch to switchdev
> mode by calling mlx5_devlink_eswitch_mode_set() immediately after
> bringing the device online.
> 
> A no-op stub of mlx5_devlink_eswitch_mode_set() is added for builds
> without CONFIG_MLX5_ESWITCH.
> 
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  6 +++
>  .../net/ethernet/mellanox/mlx5/core/main.c    | 43 ++++++++++++++++++-
>  include/linux/mlx5/driver.h                   |  2 +
>  3 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index 3858690e09b4..cfb9595f9de8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -1049,6 +1049,12 @@ mlx5_esw_lag_demux_rule_create(struct mlx5_eswitch *esw, u16 vport_num,
>  	return ERR_PTR(-EOPNOTSUPP);
>  }
>  
> +static inline int
> +mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
> +			      struct netlink_ext_ack *extack)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif /* CONFIG_MLX5_ESWITCH */
>  
>  #endif /* __MLX5_ESWITCH_H__ */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 74827e8ca125..4cdda15ed7f5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -86,7 +86,7 @@ MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec t
>  
>  static unsigned int prof_sel = MLX5_DEFAULT_PROF;
>  module_param_named(prof_sel, prof_sel, uint, 0444);
> -MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
> +MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 3 and 8");

sashiko.dev says:
"
Does using a module parameter to configure e-switch mode bypass the standard
devlink uAPI? 
The networking subsystem typically standardizes this configuration via
netlink/devlink interfaces, and relying on a module parameter for this might
fragment automated provisioning workflows.
"

The target here is DPU-side deployments, where switchdev is the only valid
operating mode; the goal is to boot directly into that mode instead of
relying on userspace scripts after probe. I agree that devlink remains the
runtime uAPI. This profile is only an opt-in transition step, so that we
avoid forcing switchdev in code before users have a way to reject that
behavior. It does not replace devlink provisioning; it provides a DPU-side
default for environments that explicitly choose this profile.
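To make the accepted prof_sel values concrete, here is a minimal userspace
sketch of the two range checks in mlx5_core_verify_params(). Only
MLX5_PROF_SEL_LAST_NIC and MLX5_PROF_SEL_FIRST_ESW come from the patch;
model_verify_prof_sel() is a hypothetical name, and MLX5_DEFAULT_PROF = 2,
the nine-slot table size and the shape of the pre-existing out-of-table
check are assumptions for illustration.

```c
#include <assert.h>

/* Constants taken from the quoted patch. */
#define MLX5_PROF_SEL_LAST_NIC   3
#define MLX5_PROF_SEL_FIRST_ESW  8

/* Assumptions, not from the patch: the default profile index and the
 * table size (ARRAY_SIZE(profile) once entry [8] exists). */
#define MLX5_DEFAULT_PROF        2
#define PROFILE_TABLE_SIZE       9

/* Hypothetical userspace model of the prof_sel validation: returns the
 * profile index that would actually be used for a given parameter. */
static unsigned int model_verify_prof_sel(unsigned int prof_sel)
{
	unsigned int sel = prof_sel;

	/* Assumed pre-existing check: indices past the table fall back. */
	if (sel >= PROFILE_TABLE_SIZE)
		sel = MLX5_DEFAULT_PROF;

	/* Check added by the patch: 4..7 are the unused slots between the
	 * NIC profiles and the first e-switch profile. */
	if (sel > MLX5_PROF_SEL_LAST_NIC && sel < MLX5_PROF_SEL_FIRST_ESW)
		sel = MLX5_DEFAULT_PROF;

	/* Whatever survives must be one of the advertised values. */
	assert(sel <= MLX5_PROF_SEL_LAST_NIC || sel == MLX5_PROF_SEL_FIRST_ESW);
	return sel;
}
```

Under these assumptions, loading with "modprobe mlx5_core prof_sel=8" keeps
profile 8 (switchdev by default), while prof_sel=5 or prof_sel=9 falls back
to the default profile.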


>  
>  static u32 sw_owner_id[4];
>  #define MAX_SW_VHCA_ID (BIT(__mlx5_bit_sz(cmd_hca_cap_2, sw_vhca_id)) - 1)
> @@ -99,6 +99,8 @@ enum {
>  
>  #define LOG_MAX_SUPPORTED_QPS 0xff
>  
> +#define MLX5_PROF_SEL_LAST_NIC 3
> +#define MLX5_PROF_SEL_FIRST_ESW 8
>  static struct mlx5_profile profile[] = {
>  	[0] = {
>  		.mask           = 0,
> @@ -120,6 +122,11 @@ static struct mlx5_profile profile[] = {
>  		.log_max_qp	= LOG_MAX_SUPPORTED_QPS,
>  		.num_cmd_caches = 0,
>  	},
> +	[8] = {
> +		.mask = MLX5_PROF_MASK_DEF_SWITCHDEV | MLX5_PROF_MASK_QP_SIZE,
> +		.log_max_qp = LOG_MAX_SUPPORTED_QPS,
> +		.num_cmd_caches = MLX5_NUM_COMMAND_CACHES,
> +	},
>  };
>  
>  static int wait_fw_init(struct mlx5_core_dev *dev, u32 max_wait_mili,
> @@ -1385,6 +1392,22 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>  	mlx5_free_bfreg(dev, &dev->priv.bfreg);
>  }
>  
> +static void mlx5_set_default_switchdev(struct mlx5_core_dev *dev)
> +{
> +	int err;
> +
> +	/* Default switchdev is best-effort; keep the device usable on
> +	 * failure.
> +	 */
> +	err = mlx5_devlink_eswitch_mode_set(priv_to_devlink(dev),
> +					    DEVLINK_ESWITCH_MODE_SWITCHDEV,
> +					    NULL);
> +	if (err && err != -EOPNOTSUPP)
> +		mlx5_core_warn(dev,
> +			       "Failed to set switchdev as default, continuing in current mode, err(%d)\n",
> +			       err);
> +}
> +
>  int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>  {
>  	bool light_probe = mlx5_dev_is_lightweight(dev);
> @@ -1431,6 +1454,10 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>  		mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>  
>  	mutex_unlock(&dev->intf_state_mutex);
> +
> +	if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
> +		mlx5_set_default_switchdev(dev);
> +
>  	return 0;

sashiko.dev says:
"
If a user explicitly sets the e-switch mode to legacy via devlink after
initialization, will this override their setting during driver reload or
firmware error recovery?
Since mlx5_set_default_switchdev() is called unconditionally here based on
the profile mask, it seems like it could silently revert the device back to
switchdev mode, discarding the active user configuration.
"

Yes. With this profile selected, switchdev is intentionally reapplied after
reload or recovery. A devlink change to legacy lasts only for the current
lifetime of the device; the selected profile defines the default mode each
time the device is reinitialized. Users who want legacy to persist should
not use this profile.
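The reload semantics above can be modeled in a few lines of userspace C.
This is a sketch only: struct model_dev and the model_* functions are
hypothetical stand-ins for mlx5_core_dev, mlx5_devlink_eswitch_mode_set()
and mlx5_set_default_switchdev(), the inject_err field is an invented hook
for simulating a failure, and everything else that reinitialization does to
the e-switch is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit value from the quoted driver.h hunk. */
#define MLX5_PROF_MASK_DEF_SWITCHDEV ((uint64_t)1 << 2)

enum model_eswitch_mode { MODEL_MODE_LEGACY, MODEL_MODE_SWITCHDEV };

/* Hypothetical stand-in for the device; not a driver structure. */
struct model_dev {
	uint64_t profile_mask;
	enum model_eswitch_mode mode;
	bool eswitch_supported;	/* false models the !CONFIG_MLX5_ESWITCH stub */
	int inject_err;		/* nonzero: simulate a failing mode change */
	int warnings;		/* counts mlx5_core_warn()-style messages */
};

/* Models mlx5_devlink_eswitch_mode_set(); the stub returns -EOPNOTSUPP. */
static int model_mode_set(struct model_dev *dev, enum model_eswitch_mode mode)
{
	assert(mode == MODEL_MODE_LEGACY || mode == MODEL_MODE_SWITCHDEV);
	if (!dev->eswitch_supported)
		return -EOPNOTSUPP;
	if (dev->inject_err)
		return dev->inject_err;
	dev->mode = mode;
	return 0;
}

/* Models mlx5_set_default_switchdev(): best effort, warn only on errors
 * other than -EOPNOTSUPP and keep the current mode. */
static void model_set_default_switchdev(struct model_dev *dev)
{
	int err = model_mode_set(dev, MODEL_MODE_SWITCHDEV);

	if (err && err != -EOPNOTSUPP)
		dev->warnings++;
}

/* Models only the step the patch adds after init/load completes, i.e. on
 * first probe, devlink reload and firmware recovery alike. */
static void model_load_tail(struct model_dev *dev)
{
	if (dev->profile_mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
		model_set_default_switchdev(dev);
}
```

In the model, loading with the profile bit set, switching to legacy through
model_mode_set(), and loading again ends in switchdev mode, which is the
behavior described above. With the stub the call fails quietly with
-EOPNOTSUPP, and any other failure is warned about while the device stays in
its current mode.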

>  
>  err_register:
> @@ -1532,6 +1559,10 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>  		goto err_attach;
>  
>  	mutex_unlock(&dev->intf_state_mutex);
> +
> +	if (dev->profile.mask & MLX5_PROF_MASK_DEF_SWITCHDEV)
> +		mlx5_set_default_switchdev(dev);
> +
>  	return 0;
>  
>  err_attach:
> @@ -2314,6 +2345,16 @@ static void mlx5_core_verify_params(void)
>  			MLX5_DEFAULT_PROF);
>  		prof_sel = MLX5_DEFAULT_PROF;
>  	}
> +
> +	if (prof_sel > MLX5_PROF_SEL_LAST_NIC &&
> +	    prof_sel < MLX5_PROF_SEL_FIRST_ESW) {
> +		pr_warn("mlx5_core: WARNING: Invalid module parameter prof_sel %d invalid range %d - %d, changing back to default (%d)\n",
> +			prof_sel,
> +			MLX5_PROF_SEL_LAST_NIC + 1,
> +			MLX5_PROF_SEL_FIRST_ESW - 1,
> +			MLX5_DEFAULT_PROF);
> +		prof_sel = MLX5_DEFAULT_PROF;
> +	}
>  }
>  
>  static int __init mlx5_init(void)
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index 04b96c5abb57..65298c07df4d 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -705,6 +705,8 @@ struct mlx5_st;
>  
>  enum {
>  	MLX5_PROF_MASK_QP_SIZE		= (u64)1 << 0,
> +	MLX5_PROF_MASK_MR_CACHE		= (u64)1 << 1,
> +	MLX5_PROF_MASK_DEF_SWITCHDEV    = (u64)1 << 2,
>  };

sashiko.dev says:
"
This isn't a bug, but it looks like MLX5_PROF_MASK_MR_CACHE is introduced
here but never used in the driver code. Is this mask intended for a future
patch?
"

Before I respin for the unrelated MR_CACHE cleanup, I’d like to confirm
whether the opt-in profile approach is acceptable at all. Regardless
of this last patch, the first 6 patches fix real representor/LAG locking
issues and are needed independently, so I’d like to keep those moving toward
acceptance as soon as possible.

Mark


>  
>  struct mlx5_profile {



Thread overview: 22+ messages
2026-05-01  4:16 [PATCH net-next V2 0/7] net/mlx5: Improve representor lifecycle and allow switchdev by default Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 1/7] net/mlx5: Lag: refactor representor reload handling Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 2/7] net/mlx5: E-Switch, add representor lifecycle lock Tariq Toukan
2026-05-01  4:16 ` [PATCH net-next V2 3/7] net/mlx5: Lag, avoid LAG and representor lock cycles Tariq Toukan
2026-05-02 20:04   ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 4/7] net/mlx5: E-Switch, serialize representor lifecycle Tariq Toukan
2026-05-02 20:05   ` Mark Bloch
2026-05-03  1:42   ` Jakub Kicinski
2026-05-03  8:18     ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 5/7] net/mlx5: E-Switch, unwind only newly loaded representor types Tariq Toukan
2026-05-02 20:06   ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 6/7] net/mlx5: E-switch, load reps via work queue after registration Tariq Toukan
2026-05-02 20:07   ` Mark Bloch
2026-05-03  1:42   ` Jakub Kicinski
2026-05-03  8:01     ` Mark Bloch
2026-05-01  4:16 ` [PATCH net-next V2 7/7] net/mlx5: Add profile to auto-enable switchdev mode at device init Tariq Toukan
2026-05-02 20:08   ` Mark Bloch [this message]
2026-05-03  1:41     ` Jakub Kicinski
2026-05-03  7:51       ` Mark Bloch
2026-05-05  1:21         ` Jakub Kicinski
2026-05-05  2:00           ` Mark Bloch
2026-05-05  2:19             ` Jakub Kicinski
