Netdev List
 help / color / mirror / Atom feed
From: Mark Bloch <mbloch@nvidia.com>
To: Jiri Pirko <jiri@resnulli.us>
Cc: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Jonathan Corbet <corbet@lwn.net>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Simon Horman <horms@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	"Borislav Petkov (AMD)" <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Thomas Gleixner <tglx@kernel.org>, Petr Mladek <pmladek@suse.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Tejun Heo <tj@kernel.org>, Vlastimil Babka <vbabka@kernel.org>,
	Feng Tang <feng.tang@linux.alibaba.com>,
	Christian Brauner <brauner@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dapeng Mi <dapeng1.mi@linux.intel.com>,
	Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
	Li RongQing <lirongqing@baidu.com>,
	Eric Biggers <ebiggers@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	Gal Pressman <gal@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	Jiri Pirko <jiri@nvidia.com>, Shay Drori <shayd@nvidia.com>,
	Moshe Shemesh <moshe@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
Date: Tue, 26 May 2026 18:03:57 +0300	[thread overview]
Message-ID: <9aa7c295-35cb-428b-9031-13a2f507ae4b@nvidia.com> (raw)
In-Reply-To: <ahWm4NXph9gdazV_@FV6GYCPJ69>



On 26/05/2026 17:07, Jiri Pirko wrote:
> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>
>>
>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>
>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>> device initialization while holding the devlink instance lock.
>>>>
>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>> operations are available, so the default eswitch mode can be applied to
>>>> the matching PCI devlink handle.
>>>>
>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>> ---
>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>> 1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>> 	mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>> }
>>>>
>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>> +{
>>>> +	struct devlink *devlink = priv_to_devlink(dev);
>>>> +	int err;
>>>> +
>>>> +	if (!MLX5_ESWITCH_MANAGER(dev))
>>>> +		return;
>>>> +
>>>> +	devl_assert_locked(devlink);
>>>> +	err = devl_apply_default_esw_mode(devlink);
>>>> +	if (err)
>>>> +		mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>> +			       err);
>>>> +}
>>>> +
>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> {
>>>> 	bool light_probe = mlx5_dev_is_lightweight(dev);
>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> 		mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>
>>>> 	mutex_unlock(&dev->intf_state_mutex);
>>>> +	mlx5_devl_apply_default_esw_mode(dev);
>>>
>>> I wonder how we can make this work for all. I mean, other driver would
>>> silently ignore this command like arg, right? Any idea how to make all
>>> drivers follow the arg from very beginning?
>>>
>>
>> I have a follow-up series that adds the call to all drivers which support
>> setting eswitch mode. When going over the other drivers, what I found is
>> that the right point to apply the default is driver specific, drivers
>> I have patch for:
>>
>> 46e16c6d9836 net: Apply devlink esw mode defaults
>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>
>> I don't think doing this generically from devlink is realistic. devlink
>> doesn't really know when a given driver is ready to change eswitch mode.
>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>> be ready first, and the locking is not identical across drivers either.
> 
> 
> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
> of register. Multiple reasons:
> 
> 1) end of devl_register is exactly the point userspace is free to issue
>    the eswitch mode set. Driver should be ready to handle it.
> 2) all drivers would transparently get this functionality, without
>    actually knowing this kernel command line arg ever existed, without
>    odd wiring call of related exported function. I prefer that stongly.
> 3) you should add a there warning for the case this arg is passed yet
>    the driver does not implement eswitch_mode_set. User should
>    get a feedback like this, not silent ignore.
> 
> The only loose end is see it the void return of devl_register().
> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
> probably just go for pr_warn, seems to be the most correct.
> 
> Make sense?

I see the point, but I don't think devl_register() (at least not the only place)
is the right place.

There is a small but important difference between userspace doing
"devlink eswitch set" after register is done, and devlink core calling
eswitch_mode_set() from inside the register flow.

Some drivers call devlink_register() while holding the device lock.
liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
there, we may start the full eswitch mode change while holding that lock.
That mode change can create representors, register netdevs, take rtnl,
allocate resources, etc. I don't think we want this to become an implicit
side effect of devlink registration.

For mlx5, the placement after intf_state_mutex is also intentional:

mutex_unlock(&dev->intf_state_mutex);
mlx5_devl_apply_default_esw_mode(dev);

We can't call it while holding intf_state_mutex because the mode set path
takes it internally, and switchdev mode may also create IB representors.

Also, devl_register() only covers the first registration. The mlx5 call in
mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
In those flows devlink is already registered, so devl_register() is not
called again, but the driver state was rebuilt and we may need to apply the
default again.

Same for reload, fw reset and pci recovery in general. If the driver tears
down and rebuilds eswitch related state, the place to apply the default is
in that driver's reinit flow, not in devl_register().

When I went over the other drivers, the right place was not always the same
as devlink registration. I'm not an expert in any of them, so I hope I got
the details right, but for example octeontx2 AF needs sr-iov and the
representor switch state to be initialized first. nfp can do it after
app/vNIC init while the devlink lock is already held. liquidio should do it
only after dropping the PCI device lock.

Mark

> 
> 
>>
>> Also, since this knob is only about eswitch mode, I don't think we need to
>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>> would just ignore it anyway. The follow-up only wires the default into
>> drivers that actually support changing eswitch mode.
>>
>> Mark
>>
>>>
>>>> 	return 0;
>>>>
>>>> err_register:
>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>> 		goto err_attach;
>>>>
>>>> 	mutex_unlock(&dev->intf_state_mutex);
>>>> +	mlx5_devl_apply_default_esw_mode(dev);
>>>> 	return 0;
>>>>
>>>> err_attach:
>>>> -- 
>>>> 2.44.0
>>>>
>>


  reply	other threads:[~2026-05-26 15:04 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-21  7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-21 13:16   ` Mark Bloch
2026-05-21 13:41     ` Thomas Weißschuh
2026-05-21 21:02       ` Mark Bloch
2026-05-26  7:44   ` Jiri Pirko
2026-05-26  9:44     ` Mark Bloch
2026-05-26 14:07       ` Jiri Pirko
2026-05-26 15:03         ` Mark Bloch [this message]
2026-05-26 16:23           ` Jiri Pirko
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
2026-05-26  7:41   ` Jiri Pirko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9aa7c295-35cb-428b-9031-13a2f507ae4b@nvidia.com \
    --to=mbloch@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=bp@alien8.de \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dapeng1.mi@linux.intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=ebiggers@kernel.org \
    --cc=edumazet@google.com \
    --cc=elver@google.com \
    --cc=feng.tang@linux.alibaba.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jiri@nvidia.com \
    --cc=jiri@resnulli.us \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=lirongqing@baidu.com \
    --cc=moshe@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=rdunlap@infradead.org \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=skhan@linuxfoundation.org \
    --cc=tariqt@nvidia.com \
    --cc=tglx@kernel.org \
    --cc=tj@kernel.org \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox