From: Jiri Pirko <jiri@resnulli.us>
To: Mark Bloch <mbloch@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Simon Horman <horms@kernel.org>,
Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
"Borislav Petkov (AMD)" <bp@alien8.de>,
Andrew Morton <akpm@linux-foundation.org>,
Randy Dunlap <rdunlap@infradead.org>,
Thomas Gleixner <tglx@kernel.org>,
Petr Mladek <pmladek@suse.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
Tejun Heo <tj@kernel.org>, Vlastimil Babka <vbabka@kernel.org>,
Feng Tang <feng.tang@linux.alibaba.com>,
Christian Brauner <brauner@kernel.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Dapeng Mi <dapeng1.mi@linux.intel.com>,
Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
Li RongQing <lirongqing@baidu.com>,
Eric Biggers <ebiggers@kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
Jiri Pirko <jiri@nvidia.com>, Shay Drori <shayd@nvidia.com>,
Moshe Shemesh <moshe@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
Date: Tue, 26 May 2026 18:23:29 +0200 [thread overview]
Message-ID: <ahXF2aQZNOwHdCG_@FV6GYCPJ69> (raw)
In-Reply-To: <9aa7c295-35cb-428b-9031-13a2f507ae4b@nvidia.com>
Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:
>
>
>On 26/05/2026 17:07, Jiri Pirko wrote:
>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>>
>>>
>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>
>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>> device initialization while holding the devlink instance lock.
>>>>>
>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>> operations are available, so the default eswitch mode can be applied to
>>>>> the matching PCI devlink handle.
>>>>>
>>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>>> ---
>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>> 1 file changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>> }
>>>>>
>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>> +{
>>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>>> + int err;
>>>>> +
>>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>>> + return;
>>>>> +
>>>>> + devl_assert_locked(devlink);
>>>>> + err = devl_apply_default_esw_mode(devlink);
>>>>> + if (err)
>>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>> + err);
>>>>> +}
>>>>> +
>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>> {
>>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>
>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>
>>>> I wonder how we can make this work for all. I mean, other driver would
>>>> silently ignore this command like arg, right? Any idea how to make all
>>>> drivers follow the arg from very beginning?
>>>>
>>>
>>> I have a follow-up series that adds the call to all drivers which support
>>> setting eswitch mode. When going over the other drivers, what I found is
>>> that the right point to apply the default is driver specific, drivers
>>> I have patch for:
>>>
>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>
>>> I don't think doing this generically from devlink is realistic. devlink
>>> doesn't really know when a given driver is ready to change eswitch mode.
>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>> be ready first, and the locking is not identical across drivers either.
>>
>>
>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>> of register. Multiple reasons:
>>
>> 1) end of devl_register is exactly the point userspace is free to issue
>> the eswitch mode set. Driver should be ready to handle it.
>> 2) all drivers would transparently get this functionality, without
>> actually knowing this kernel command line arg ever existed, without
>> odd wiring call of related exported function. I prefer that stongly.
>> 3) you should add a there warning for the case this arg is passed yet
>> the driver does not implement eswitch_mode_set. User should
>> get a feedback like this, not silent ignore.
>>
>> The only loose end is see it the void return of devl_register().
>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>> probably just go for pr_warn, seems to be the most correct.
>>
>> Make sense?
>
>I see the point, but I don't think devl_register() (at least not the only place)
>is the right place.
>
>There is a small but important difference between userspace doing
>"devlink eswitch set" after register is done, and devlink core calling
>eswitch_mode_set() from inside the register flow.
>
>Some drivers call devlink_register() while holding the device lock.
>liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>there, we may start the full eswitch mode change while holding that lock.
>That mode change can create representors, register netdevs, take rtnl,
>allocate resources, etc. I don't think we want this to become an implicit
>side effect of devlink registration.
I believe your AI may untagle liquidio locking :)
>
>For mlx5, the placement after intf_state_mutex is also intentional:
>
>mutex_unlock(&dev->intf_state_mutex);
>mlx5_devl_apply_default_esw_mode(dev);
>
>We can't call it while holding intf_state_mutex because the mode set path
>takes it internally, and switchdev mode may also create IB representors.
>
>Also, devl_register() only covers the first registration. The mlx5 call in
>mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>In those flows devlink is already registered, so devl_register() is not
>called again, but the driver state was rebuilt and we may need to apply the
>default again.
Call it from reload too, right?
>
>Same for reload, fw reset and pci recovery in general. If the driver tears
>down and rebuilds eswitch related state, the place to apply the default is
>in that driver's reinit flow, not in devl_register().
>
>When I went over the other drivers, the right place was not always the same
>as devlink registration. I'm not an expert in any of them, so I hope I got
>the details right, but for example octeontx2 AF needs sr-iov and the
>representor switch state to be initialized first. nfp can do it after
>app/vNIC init while the devlink lock is already held. liquidio should do it
>only after dropping the PCI device lock.
Idk, perhaps do it from devlink_post_register_work of some kind? That
would allow you to have the same locking ordering as a userspace call.
>
>Mark
>
>>
>>
>>>
>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>> would just ignore it anyway. The follow-up only wires the default into
>>> drivers that actually support changing eswitch mode.
>>>
>>> Mark
>>>
>>>>
>>>>> return 0;
>>>>>
>>>>> err_register:
>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>> goto err_attach;
>>>>>
>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>> return 0;
>>>>>
>>>>> err_attach:
>>>>> --
>>>>> 2.44.0
>>>>>
>>>
>
next prev parent reply other threads:[~2026-05-26 16:23 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-21 13:16 ` Mark Bloch
2026-05-21 13:41 ` Thomas Weißschuh
2026-05-21 21:02 ` Mark Bloch
2026-05-26 7:44 ` Jiri Pirko
2026-05-26 9:44 ` Mark Bloch
2026-05-26 14:07 ` Jiri Pirko
2026-05-26 15:03 ` Mark Bloch
2026-05-26 16:23 ` Jiri Pirko [this message]
2026-05-26 17:13 ` Mark Bloch
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
2026-05-26 7:41 ` Jiri Pirko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ahXF2aQZNOwHdCG_@FV6GYCPJ69 \
--to=jiri@resnulli.us \
--cc=akpm@linux-foundation.org \
--cc=andrew+netdev@lunn.ch \
--cc=bp@alien8.de \
--cc=brauner@kernel.org \
--cc=corbet@lwn.net \
--cc=dapeng1.mi@linux.intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=ebiggers@kernel.org \
--cc=edumazet@google.com \
--cc=elver@google.com \
--cc=feng.tang@linux.alibaba.com \
--cc=gal@nvidia.com \
--cc=horms@kernel.org \
--cc=jiri@nvidia.com \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=lirongqing@baidu.com \
--cc=mbloch@nvidia.com \
--cc=moshe@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rdunlap@infradead.org \
--cc=saeedm@nvidia.com \
--cc=shayd@nvidia.com \
--cc=skhan@linuxfoundation.org \
--cc=tariqt@nvidia.com \
--cc=tglx@kernel.org \
--cc=tj@kernel.org \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox