Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
From: Jiri Pirko <jiri@resnulli.us>
To: Mark Bloch <mbloch@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	 Jakub Kicinski <kuba@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>,
	 Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	 Jonathan Corbet <corbet@lwn.net>,
	Shuah Khan <skhan@linuxfoundation.org>,
	 Simon Horman <horms@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	 Leon Romanovsky <leon@kernel.org>,
	"Borislav Petkov (AMD)" <bp@alien8.de>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	 Thomas Gleixner <tglx@kernel.org>,
	Petr Mladek <pmladek@suse.com>,
	 "Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Tejun Heo <tj@kernel.org>, Vlastimil Babka <vbabka@kernel.org>,
	 Feng Tang <feng.tang@linux.alibaba.com>,
	Christian Brauner <brauner@kernel.org>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Dapeng Mi <dapeng1.mi@linux.intel.com>,
	 Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
	 Li RongQing <lirongqing@baidu.com>,
	Eric Biggers <ebiggers@kernel.org>,
	 "Paul E. McKenney" <paulmck@kernel.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	 netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	Gal Pressman <gal@nvidia.com>,
	 Dragos Tatulea <dtatulea@nvidia.com>,
	Jiri Pirko <jiri@nvidia.com>, Shay Drori <shayd@nvidia.com>,
	 Moshe Shemesh <moshe@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
Date: Fri, 29 May 2026 12:25:28 +0200	[thread overview]
Message-ID: <ahlpjJ4CCZAwqFVi@FV6GYCPJ69> (raw)
In-Reply-To: <e4f4a6a5-9be0-462b-b4d7-8bbf57001cb4@nvidia.com>

Thu, May 28, 2026 at 10:15:44AM +0200, mbloch@nvidia.com wrote:
>
>
>On 27/05/2026 14:18, Jiri Pirko wrote:
>> Wed, May 27, 2026 at 09:03:26AM +0200, mbloch@nvidia.com wrote:
>>>
>>>
>>> On 27/05/2026 8:14, Jiri Pirko wrote:
>>>> Tue, May 26, 2026 at 07:13:46PM +0200, mbloch@nvidia.com wrote:
>>>>>
>>>>>
>>>>> On 26/05/2026 19:23, Jiri Pirko wrote:
>>>>>> Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26/05/2026 17:07, Jiri Pirko wrote:
>>>>>>>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>>>>>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>>>>>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>>>>>>>
>>>>>>>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>>>>>>>> device initialization while holding the devlink instance lock.
>>>>>>>>>>>
>>>>>>>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>>>>>>>> operations are available, so the default eswitch mode can be applied to
>>>>>>>>>>> the matching PCI devlink handle.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>>>>>>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>>>>>>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>>>>>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>>>>>>>> 1 file changed, 17 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>>>>>>>> 	mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct devlink *devlink = priv_to_devlink(dev);
>>>>>>>>>>> +	int err;
>>>>>>>>>>> +
>>>>>>>>>>> +	if (!MLX5_ESWITCH_MANAGER(dev))
>>>>>>>>>>> +		return;
>>>>>>>>>>> +
>>>>>>>>>>> +	devl_assert_locked(devlink);
>>>>>>>>>>> +	err = devl_apply_default_esw_mode(devlink);
>>>>>>>>>>> +	if (err)
>>>>>>>>>>> +		mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>>>>>>>> +			       err);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>>>> {
>>>>>>>>>>> 	bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>>>>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>>>> 		mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>>>>>>>
>>>>>>>>>>> 	mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>>>> +	mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>>>
>>>>>>>>>> I wonder how we can make this work for all. I mean, other driver would
>>>>>>>>>> silently ignore this command like arg, right? Any idea how to make all
>>>>>>>>>> drivers follow the arg from very beginning?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have a follow-up series that adds the call to all drivers which support
>>>>>>>>> setting eswitch mode. When going over the other drivers, what I found is
>>>>>>>>> that the right point to apply the default is driver specific, drivers
>>>>>>>>> I have patch for:
>>>>>>>>>
>>>>>>>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>>>>>>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>>>>>>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>>>>>>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>>>>>>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>>>>>>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>>>>>>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>>>>>>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>>>>>>>
>>>>>>>>> I don't think doing this generically from devlink is realistic. devlink
>>>>>>>>> doesn't really know when a given driver is ready to change eswitch mode.
>>>>>>>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>>>>>>>> be ready first, and the locking is not identical across drivers either.
>>>>>>>>
>>>>>>>>
>>>>>>>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>>>>>>>> of register. Multiple reasons:
>>>>>>>>
>>>>>>>> 1) end of devl_register is exactly the point userspace is free to issue
>>>>>>>>    the eswitch mode set. Driver should be ready to handle it.
>>>>>>>> 2) all drivers would transparently get this functionality, without
>>>>>>>>    actually knowing this kernel command line arg ever existed, without
>>>>>>>>    odd wiring call of related exported function. I prefer that stongly.
>>>>>>>> 3) you should add a there warning for the case this arg is passed yet
>>>>>>>>    the driver does not implement eswitch_mode_set. User should
>>>>>>>>    get a feedback like this, not silent ignore.
>>>>>>>>
>>>>>>>> The only loose end is see it the void return of devl_register().
>>>>>>>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>>>>>>>> probably just go for pr_warn, seems to be the most correct.
>>>>>>>>
>>>>>>>> Make sense?
>>>>>>>
>>>>>>> I see the point, but I don't think devl_register() (at least not the only place)
>>>>>>> is the right place.
>>>>>>>
>>>>>>> There is a small but important difference between userspace doing
>>>>>>> "devlink eswitch set" after register is done, and devlink core calling
>>>>>>> eswitch_mode_set() from inside the register flow.
>>>>>>>
>>>>>>> Some drivers call devlink_register() while holding the device lock.
>>>>>>> liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>>>>>>> there, we may start the full eswitch mode change while holding that lock.
>>>>>>> That mode change can create representors, register netdevs, take rtnl,
>>>>>>> allocate resources, etc. I don't think we want this to become an implicit
>>>>>>> side effect of devlink registration.
>>>>>>
>>>>>> I believe your AI may untagle liquidio locking :)
>>>>>
>>>>> I didn't try to solve that one with ai. Most drivers were fairly simple 
>>>>> so I didn't use ai at all. bnxt was the one where I needed a bit of help :)
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> For mlx5, the placement after intf_state_mutex is also intentional:
>>>>>>>
>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>> mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>
>>>>>>> We can't call it while holding intf_state_mutex because the mode set path
>>>>>>> takes it internally, and switchdev mode may also create IB representors.
>>>>>>>
>>>>>>> Also, devl_register() only covers the first registration. The mlx5 call in
>>>>>>> mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>>>>>>> In those flows devlink is already registered, so devl_register() is not
>>>>>>> called again, but the driver state was rebuilt and we may need to apply the
>>>>>>> default again.
>>>>>>
>>>>>> Call it from reload too, right?
>>>>>
>>>>> Yes, that was my first thought: apply it from devl_register() for the first
>>>>> registration and from devlink_reload() after a successful DRIVER_REINIT.
>>>>>
>>>>> That covers the clean devlink reload path but....(see bellow)
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Same for reload, fw reset and pci recovery in general. If the driver tears
>>>>>>> down and rebuilds eswitch related state, the place to apply the default is
>>>>>>> in that driver's reinit flow, not in devl_register().
>>>>>>>
>>>>>>> When I went over the other drivers, the right place was not always the same
>>>>>>> as devlink registration. I'm not an expert in any of them, so I hope I got
>>>>>>> the details right, but for example octeontx2 AF needs sr-iov and the
>>>>>>> representor switch state to be initialized first. nfp can do it after
>>>>>>> app/vNIC init while the devlink lock is already held. liquidio should do it
>>>>>>> only after dropping the PCI device lock.
>>>>>>
>>>>>> Idk, perhaps do it from devlink_post_register_work of some kind? That
>>>>>> would allow you to have the same locking ordering as a userspace cal
>>>>> l.
>>>>>
>>>>> I thought about a workqueue too, it was actually my first idea.
>>>>>
>>>>> The problem is that then we race with userspace. In the mlx5 version here the
>>>>> default is applied while the devlink lock is still held, before userspace can
>>>>> come in and issue its own eswitch set. If we defer it to post-register work,
>>>>> the devlink instance is already visible and userspace can get there first
>>>>> and then we might change the user configuration.
>>>>
>>>> Figure that out and expose to user by setting xa_mark only after the
>>>> work is done? This is doable.
>>>
>>> I agree that if devlink can keep the instance hidden/unavailable until the
>>> post register work is done, that solves the initial userspace race.
>>>
>>> The other part is the reinit/recovery case. For that I think devlink core
>>> needs some explicit indication from the driver that the device is now in
>>> reinit. Something like (at least that's the code I had initially, but something
>>> along those lines):
>>>
>>> void devl_dev_reinit_begin(struct devlink *devlink);
>>> void devl_dev_reinit_end(struct devlink *devlink);
>>> void devl_dev_reinit_abort(struct devlink *devlink);
>>>
>>> The core can then mark the instance as temporarily unavailable/in reinit
>>> between begin/end, and the relevant lookup/dump paths, for example
>>> devlink_get_from_attrs_lock() and devlink_nl_inst_iter_dumpit(), can reject
>>> or skip it while reinit is in progress. devlink_reload() can probably mark
>>> this state by itself around DRIVER_REINIT.
>> 
>> I believe this is orthogonal to the problem you are trying to solve in
>> this patchset. Not sure why you bring it in to the conversation...
>> 
>
>I brought it up because I was also thinking about reinit/recovery flows, but
>I guess I can tackle that later.
>
>For now I can focus on the generic devlink path, move drivers to register
>devlink only after the device is ready. Then devlink core can apply the default
>before exposing the instance to userspace.
>
>I think it is better to fix the ordering for all devlink drivers, not only the
>ones that support eswitch mode set. That gives us a consistent model and makes
>future defaults easier.
>
>Reload can be handled from devlink after successful DRIVER_REINIT.
>
>Does this sound ok?

Yes. Thanks!


>
>Mark
>
>> 
>>>
>>> Then mlx5 would look more or less like:
>>> 	devl_lock(devlink);
>>> 	devl_dev_reinit_begin(devlink);
>>> 	ret = mlx5_load_one_devl_locked(dev, recovery);
>>> 	if (!ret)
>>> 		devl_dev_reinit_end(devlink);
>>> 	else
>>> 		devl_dev_reinit_abort(devlink);
>>> 	devl_unlock(devlink);
>>>
>>> This gives devlink core a way to know that the devlink instance is registered,
>>> but should not be used by userspace at the moment. It also allows keeping the
>>> default/config apply logic in devlink instead of adding driver specific calls
>>> to apply it in each init path.
>>>
>>> But this still means the generic solution needs some driver help. Drivers need
>>> to register devlink at a point where the post-register default apply is safe,
>>> and full reinit paths need to be marked with this begin/end API.
>>>
>>>>
>>>>
>>>>>
>>>>> Also, the bigger issue for mlx5 is not only initial registration or devlink
>>>>> reload. Some recovery paths, pci resume, and fw reset flows rebuild the driver
>>>>> state without going through devlink at all. I did not find a clean way for
>>>>> devlink core to infer all those points by itself.
>>>>
>>>> If you don't obey current configuration for example in pci resume, it is
>>>> bug and you should fix it. All these flows should obey current eswitch
>>>> mode configuration.
>>>>
>>>
>>> I agree that the device should come back according
>>> to the intended high level policy. But I don't think full reinit can be treated
>>> as restoring the whole previous runtime state. There may be user created
>>> steering rules and other objects which the driver cannot keep or replay. Today
>>> full reinit brings the device back to a clean initialized state, and that is
>>> intentional.
>>>
>>> So the split I have in mind is:
>>>
>>> - full runtime state is not preserved across full reinit;
>>> - high level devlink policy/configuration should be applied when the device is
>>>  initialized again;
>>> - the command line default should not blindly override a later explicit
>>>  userspace eswitch mode selection.
>>>
>>> I am not against moving this into devlink core, and I am willing to work on it.
>>>
>>> But before I rework the series, I want to make sure we agree on the direction.
>>> As I see it, doing this cleanly needs a devlink state like "registered but
>>> unavailable/in reinit", plus driver annotations for the reinit paths.
>>>
>>> If this is not the direction you want, I prefer to know now rather than spend
>>> time on a version that will be rejected anyway.
>>>
>>> Mark
>>>
>>>>
>>>>>
>>>>> To handle that from devlink I would still need to add some api for the driver
>>>>> to tell devlink "I just reinitialized, apply the default now". but nce I had
>>>>> that driver call , it felt simpler and clearer to let the driver call
>>>>> the helper directly at the points where it knows eswitch mode is safe.
>>>>>
>>>>> I agree that handling all of this inside devlink would be the better option.
>>>>> I just couldn't make it work in a clean way.
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>>>>>>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>>>>>>>> would just ignore it anyway. The follow-up only wires the default into
>>>>>>>>> drivers that actually support changing eswitch mode.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 	return 0;
>>>>>>>>>>>
>>>>>>>>>>> err_register:
>>>>>>>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>>>>>>>> 		goto err_attach;
>>>>>>>>>>>
>>>>>>>>>>> 	mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>>>> +	mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>>>> 	return 0;
>>>>>>>>>>>
>>>>>>>>>>> err_attach:
>>>>>>>>>>> -- 
>>>>>>>>>>> 2.44.0
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>

  reply	other threads:[~2026-05-29 10:25 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-21  7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
2026-05-21  7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-21 13:16   ` Mark Bloch
2026-05-21 13:41     ` Thomas Weißschuh
2026-05-21 21:02       ` Mark Bloch
2026-05-26  7:44   ` Jiri Pirko
2026-05-26  9:44     ` Mark Bloch
2026-05-26 14:07       ` Jiri Pirko
2026-05-26 15:03         ` Mark Bloch
2026-05-26 16:23           ` Jiri Pirko
2026-05-26 17:13             ` Mark Bloch
2026-05-27  5:14               ` Jiri Pirko
2026-05-27  7:03                 ` Mark Bloch
2026-05-27 11:18                   ` Jiri Pirko
2026-05-28  8:15                     ` Mark Bloch
2026-05-29 10:25                       ` Jiri Pirko [this message]
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
2026-05-26  7:41   ` Jiri Pirko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahlpjJ4CCZAwqFVi@FV6GYCPJ69 \
    --to=jiri@resnulli.us \
    --cc=akpm@linux-foundation.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=bp@alien8.de \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dapeng1.mi@linux.intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=ebiggers@kernel.org \
    --cc=edumazet@google.com \
    --cc=elver@google.com \
    --cc=feng.tang@linux.alibaba.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jiri@nvidia.com \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=lirongqing@baidu.com \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=rdunlap@infradead.org \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=skhan@linuxfoundation.org \
    --cc=tariqt@nvidia.com \
    --cc=tglx@kernel.org \
    --cc=tj@kernel.org \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox