From: Jiri Pirko <jiri@resnulli.us>
To: Mark Bloch <mbloch@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Simon Horman <horms@kernel.org>,
Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
"Borislav Petkov (AMD)" <bp@alien8.de>,
Andrew Morton <akpm@linux-foundation.org>,
Randy Dunlap <rdunlap@infradead.org>,
Thomas Gleixner <tglx@kernel.org>,
Petr Mladek <pmladek@suse.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
Tejun Heo <tj@kernel.org>, Vlastimil Babka <vbabka@kernel.org>,
Feng Tang <feng.tang@linux.alibaba.com>,
Christian Brauner <brauner@kernel.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Dapeng Mi <dapeng1.mi@linux.intel.com>,
Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
Li RongQing <lirongqing@baidu.com>,
Eric Biggers <ebiggers@kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
Gal Pressman <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
Jiri Pirko <jiri@nvidia.com>, Shay Drori <shayd@nvidia.com>,
Moshe Shemesh <moshe@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
Date: Fri, 29 May 2026 12:25:28 +0200 [thread overview]
Message-ID: <ahlpjJ4CCZAwqFVi@FV6GYCPJ69> (raw)
In-Reply-To: <e4f4a6a5-9be0-462b-b4d7-8bbf57001cb4@nvidia.com>
Thu, May 28, 2026 at 10:15:44AM +0200, mbloch@nvidia.com wrote:
>
>
>On 27/05/2026 14:18, Jiri Pirko wrote:
>> Wed, May 27, 2026 at 09:03:26AM +0200, mbloch@nvidia.com wrote:
>>>
>>>
>>> On 27/05/2026 8:14, Jiri Pirko wrote:
>>>> Tue, May 26, 2026 at 07:13:46PM +0200, mbloch@nvidia.com wrote:
>>>>>
>>>>>
>>>>> On 26/05/2026 19:23, Jiri Pirko wrote:
>>>>>> Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26/05/2026 17:07, Jiri Pirko wrote:
>>>>>>>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>>>>>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>>>>>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>>>>>>>
>>>>>>>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>>>>>>>> device initialization while holding the devlink instance lock.
>>>>>>>>>>>
>>>>>>>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>>>>>>>> operations are available, so the default eswitch mode can be applied to
>>>>>>>>>>> the matching PCI devlink handle.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>>>>>>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>>>>>>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>>>>>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>>>>>>>> 1 file changed, 17 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>>>>>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>>>>>>>> +{
>>>>>>>>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>>>>>>>>> + int err;
>>>>>>>>>>> +
>>>>>>>>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>>>>>>>>> + return;
>>>>>>>>>>> +
>>>>>>>>>>> + devl_assert_locked(devlink);
>>>>>>>>>>> + err = devl_apply_default_esw_mode(devlink);
>>>>>>>>>>> + if (err)
>>>>>>>>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>>>>>>>> + err);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>>>> {
>>>>>>>>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>>>>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>>>>>>>
>>>>>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>>>
>>>>>>>>>> I wonder how we can make this work for all. I mean, other driver would
>>>>>>>>>> silently ignore this command like arg, right? Any idea how to make all
>>>>>>>>>> drivers follow the arg from very beginning?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have a follow-up series that adds the call to all drivers which support
>>>>>>>>> setting eswitch mode. When going over the other drivers, what I found is
>>>>>>>>> that the right point to apply the default is driver specific, drivers
>>>>>>>>> I have patch for:
>>>>>>>>>
>>>>>>>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>>>>>>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>>>>>>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>>>>>>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>>>>>>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>>>>>>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>>>>>>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>>>>>>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>>>>>>>
>>>>>>>>> I don't think doing this generically from devlink is realistic. devlink
>>>>>>>>> doesn't really know when a given driver is ready to change eswitch mode.
>>>>>>>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>>>>>>>> be ready first, and the locking is not identical across drivers either.
>>>>>>>>
>>>>>>>>
>>>>>>>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>>>>>>>> of register. Multiple reasons:
>>>>>>>>
>>>>>>>> 1) end of devl_register is exactly the point userspace is free to issue
>>>>>>>> the eswitch mode set. Driver should be ready to handle it.
>>>>>>>> 2) all drivers would transparently get this functionality, without
>>>>>>>> actually knowing this kernel command line arg ever existed, without
>>>>>>>> odd wiring call of related exported function. I prefer that stongly.
>>>>>>>> 3) you should add a there warning for the case this arg is passed yet
>>>>>>>> the driver does not implement eswitch_mode_set. User should
>>>>>>>> get a feedback like this, not silent ignore.
>>>>>>>>
>>>>>>>> The only loose end is see it the void return of devl_register().
>>>>>>>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>>>>>>>> probably just go for pr_warn, seems to be the most correct.
>>>>>>>>
>>>>>>>> Make sense?
>>>>>>>
>>>>>>> I see the point, but I don't think devl_register() (at least not the only place)
>>>>>>> is the right place.
>>>>>>>
>>>>>>> There is a small but important difference between userspace doing
>>>>>>> "devlink eswitch set" after register is done, and devlink core calling
>>>>>>> eswitch_mode_set() from inside the register flow.
>>>>>>>
>>>>>>> Some drivers call devlink_register() while holding the device lock.
>>>>>>> liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>>>>>>> there, we may start the full eswitch mode change while holding that lock.
>>>>>>> That mode change can create representors, register netdevs, take rtnl,
>>>>>>> allocate resources, etc. I don't think we want this to become an implicit
>>>>>>> side effect of devlink registration.
>>>>>>
>>>>>> I believe your AI may untagle liquidio locking :)
>>>>>
>>>>> I didn't try to solve that one with ai. Most drivers were fairly simple
>>>>> so I didn't use ai at all. bnxt was the one where I needed a bit of help :)
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> For mlx5, the placement after intf_state_mutex is also intentional:
>>>>>>>
>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>> mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>
>>>>>>> We can't call it while holding intf_state_mutex because the mode set path
>>>>>>> takes it internally, and switchdev mode may also create IB representors.
>>>>>>>
>>>>>>> Also, devl_register() only covers the first registration. The mlx5 call in
>>>>>>> mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>>>>>>> In those flows devlink is already registered, so devl_register() is not
>>>>>>> called again, but the driver state was rebuilt and we may need to apply the
>>>>>>> default again.
>>>>>>
>>>>>> Call it from reload too, right?
>>>>>
>>>>> Yes, that was my first thought: apply it from devl_register() for the first
>>>>> registration and from devlink_reload() after a successful DRIVER_REINIT.
>>>>>
>>>>> That covers the clean devlink reload path but....(see bellow)
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Same for reload, fw reset and pci recovery in general. If the driver tears
>>>>>>> down and rebuilds eswitch related state, the place to apply the default is
>>>>>>> in that driver's reinit flow, not in devl_register().
>>>>>>>
>>>>>>> When I went over the other drivers, the right place was not always the same
>>>>>>> as devlink registration. I'm not an expert in any of them, so I hope I got
>>>>>>> the details right, but for example octeontx2 AF needs sr-iov and the
>>>>>>> representor switch state to be initialized first. nfp can do it after
>>>>>>> app/vNIC init while the devlink lock is already held. liquidio should do it
>>>>>>> only after dropping the PCI device lock.
>>>>>>
>>>>>> Idk, perhaps do it from devlink_post_register_work of some kind? That
>>>>>> would allow you to have the same locking ordering as a userspace cal
>>>>> l.
>>>>>
>>>>> I thought about a workqueue too, it was actually my first idea.
>>>>>
>>>>> The problem is that then we race with userspace. In the mlx5 version here the
>>>>> default is applied while the devlink lock is still held, before userspace can
>>>>> come in and issue its own eswitch set. If we defer it to post-register work,
>>>>> the devlink instance is already visible and userspace can get there first
>>>>> and then we might change the user configuration.
>>>>
>>>> Figure that out and expose to user by setting xa_mark only after the
>>>> work is done? This is doable.
>>>
>>> I agree that if devlink can keep the instance hidden/unavailable until the
>>> post register work is done, that solves the initial userspace race.
>>>
>>> The other part is the reinit/recovery case. For that I think devlink core
>>> needs some explicit indication from the driver that the device is now in
>>> reinit. Something like (at least that's the code I had initially, but something
>>> along those lines):
>>>
>>> void devl_dev_reinit_begin(struct devlink *devlink);
>>> void devl_dev_reinit_end(struct devlink *devlink);
>>> void devl_dev_reinit_abort(struct devlink *devlink);
>>>
>>> The core can then mark the instance as temporarily unavailable/in reinit
>>> between begin/end, and the relevant lookup/dump paths, for example
>>> devlink_get_from_attrs_lock() and devlink_nl_inst_iter_dumpit(), can reject
>>> or skip it while reinit is in progress. devlink_reload() can probably mark
>>> this state by itself around DRIVER_REINIT.
>>
>> I believe this is orthogonal to the problem you are trying to solve in
>> this patchset. Not sure why you bring it in to the conversation...
>>
>
>I brought it up because I was also thinking about reinit/recovery flows, but
>I guess I can tackle that later.
>
>For now I can focus on the generic devlink path, move drivers to register
>devlink only after the device is ready. Then devlink core can apply the default
>before exposing the instance to userspace.
>
>I think it is better to fix the ordering for all devlink drivers, not only the
>ones that support eswitch mode set. That gives us a consistent model and makes
>future defaults easier.
>
>Reload can be handled from devlink after successful DRIVER_REINIT.
>
>Does this sound ok?
Yes. Thanks!
>
>Mark
>
>>
>>>
>>> Then mlx5 would look more or less like:
>>> devl_lock(devlink);
>>> devl_dev_reinit_begin(devlink);
>>> ret = mlx5_load_one_devl_locked(dev, recovery);
>>> if (!ret)
>>> devl_dev_reinit_end(devlink);
>>> else
>>> devl_dev_reinit_abort(devlink);
>>> devl_unlock(devlink);
>>>
>>> This gives devlink core a way to know that the devlink instance is registered,
>>> but should not be used by userspace at the moment. It also allows keeping the
>>> default/config apply logic in devlink instead of adding driver specific calls
>>> to apply it in each init path.
>>>
>>> But this still means the generic solution needs some driver help. Drivers need
>>> to register devlink at a point where the post-register default apply is safe,
>>> and full reinit paths need to be marked with this begin/end API.
>>>
>>>>
>>>>
>>>>>
>>>>> Also, the bigger issue for mlx5 is not only initial registration or devlink
>>>>> reload. Some recovery paths, pci resume, and fw reset flows rebuild the driver
>>>>> state without going through devlink at all. I did not find a clean way for
>>>>> devlink core to infer all those points by itself.
>>>>
>>>> If you don't obey current configuration for example in pci resume, it is
>>>> bug and you should fix it. All these flows should obey current eswitch
>>>> mode configuration.
>>>>
>>>
>>> I agree that the device should come back according
>>> to the intended high level policy. But I don't think full reinit can be treated
>>> as restoring the whole previous runtime state. There may be user created
>>> steering rules and other objects which the driver cannot keep or replay. Today
>>> full reinit brings the device back to a clean initialized state, and that is
>>> intentional.
>>>
>>> So the split I have in mind is:
>>>
>>> - full runtime state is not preserved across full reinit;
>>> - high level devlink policy/configuration should be applied when the device is
>>> initialized again;
>>> - the command line default should not blindly override a later explicit
>>> userspace eswitch mode selection.
>>>
>>> I am not against moving this into devlink core, and I am willing to work on it.
>>>
>>> But before I rework the series, I want to make sure we agree on the direction.
>>> As I see it, doing this cleanly needs a devlink state like "registered but
>>> unavailable/in reinit", plus driver annotations for the reinit paths.
>>>
>>> If this is not the direction you want, I prefer to know now rather than spend
>>> time on a version that will be rejected anyway.
>>>
>>> Mark
>>>
>>>>
>>>>>
>>>>> To handle that from devlink I would still need to add some api for the driver
>>>>> to tell devlink "I just reinitialized, apply the default now". but nce I had
>>>>> that driver call , it felt simpler and clearer to let the driver call
>>>>> the helper directly at the points where it knows eswitch mode is safe.
>>>>>
>>>>> I agree that handling all of this inside devlink would be the better option.
>>>>> I just couldn't make it work in a clean way.
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>>>>>>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>>>>>>>> would just ignore it anyway. The follow-up only wires the default into
>>>>>>>>> drivers that actually support changing eswitch mode.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> return 0;
>>>>>>>>>>>
>>>>>>>>>>> err_register:
>>>>>>>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>>>>>>>> goto err_attach;
>>>>>>>>>>>
>>>>>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>>>> return 0;
>>>>>>>>>>>
>>>>>>>>>>> err_attach:
>>>>>>>>>>> --
>>>>>>>>>>> 2.44.0
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
next prev parent reply other threads:[~2026-05-29 10:25 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-21 13:16 ` Mark Bloch
2026-05-21 13:41 ` Thomas Weißschuh
2026-05-21 21:02 ` Mark Bloch
2026-05-26 7:44 ` Jiri Pirko
2026-05-26 9:44 ` Mark Bloch
2026-05-26 14:07 ` Jiri Pirko
2026-05-26 15:03 ` Mark Bloch
2026-05-26 16:23 ` Jiri Pirko
2026-05-26 17:13 ` Mark Bloch
2026-05-27 5:14 ` Jiri Pirko
2026-05-27 7:03 ` Mark Bloch
2026-05-27 11:18 ` Jiri Pirko
2026-05-28 8:15 ` Mark Bloch
2026-05-29 10:25 ` Jiri Pirko [this message]
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
2026-05-26 7:41 ` Jiri Pirko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ahlpjJ4CCZAwqFVi@FV6GYCPJ69 \
--to=jiri@resnulli.us \
--cc=akpm@linux-foundation.org \
--cc=andrew+netdev@lunn.ch \
--cc=bp@alien8.de \
--cc=brauner@kernel.org \
--cc=corbet@lwn.net \
--cc=dapeng1.mi@linux.intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=ebiggers@kernel.org \
--cc=edumazet@google.com \
--cc=elver@google.com \
--cc=feng.tang@linux.alibaba.com \
--cc=gal@nvidia.com \
--cc=horms@kernel.org \
--cc=jiri@nvidia.com \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=lirongqing@baidu.com \
--cc=mbloch@nvidia.com \
--cc=moshe@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rdunlap@infradead.org \
--cc=saeedm@nvidia.com \
--cc=shayd@nvidia.com \
--cc=skhan@linuxfoundation.org \
--cc=tariqt@nvidia.com \
--cc=tglx@kernel.org \
--cc=tj@kernel.org \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.