Re: [RFC net-next 0/4] devlink: Add boot-time defaults

Linux RDMA and InfiniBand development

From: Jiri Pirko <jiri@resnulli.us>
To: Mark Bloch <mbloch@nvidia.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	 Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	 "David S. Miller" <davem@davemloft.net>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	Simon Horman <horms@kernel.org>,
	 Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	 Tariq Toukan <tariqt@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 "Borislav Petkov (AMD)" <bp@alien8.de>,
	Randy Dunlap <rdunlap@infradead.org>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Christian Brauner <brauner@kernel.org>,
	 Petr Mladek <pmladek@suse.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	 Thomas Gleixner <tglx@kernel.org>,
	Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
	 Dapeng Mi <dapeng1.mi@linux.intel.com>,
	Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
	 Eric Biggers <ebiggers@kernel.org>,
	Li RongQing <lirongqing@baidu.com>,
	 "Paul E. McKenney" <paulmck@kernel.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	 netdev@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [RFC net-next 0/4] devlink: Add boot-time defaults
Date: Mon, 11 May 2026 10:07:57 +0200
Message-ID: <agGNVmN9tpHh0K1P@FV6GYCPJ69>
In-Reply-To: <580a774b-ba9e-4523-b43a-476f75dd5b12@nvidia.com>

Sun, May 10, 2026 at 02:31:35PM +0200, mbloch@nvidia.com wrote:
>
>
>On 09/05/2026 10:01, Jiri Pirko wrote:
>> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>>> change.  
>>>>
>>>> We can't travel through time, but if the default had been switchdev
>>>> from CX5 onwards, nobody would feel broken in terms of compatibility.
>>>> That is my point. Having "legacy" as default is simply wrong for newer
>>>> NIC generations. That is why it is called "legacy", and it should have
>>>> been phased out since CX4 times.
>>>
>>> legacy vs switchdev only describes the eswitch configuration.
>>> As a non-SR-IOV user I really don't want to see the extra representors
>>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>>> limitations around the uplink representor. Maybe that's the disconnect.
>>> But for real, fully featured switchdev eswitches, having the
>>> PHY and PF representors on boot, always, will not make sense.
>> 
>> As "a non-SR-IOV user", what extra representors are you talking about?
>> When you have PFs only, you don't have anything extra. Just one netdev
>> per PF, one devlink port per PF. What's extra about it? When you don't
>> have VFs/SFs, everything is the same:
>
>The netdev list looking similar is a bit misleading. What matters here is
>not only how many netdevs show up, but what that netdev actually is.
>
>In legacy mode, a PF-only user can just use the PF netdev as a regular NIC
>and use RoCE on it directly.

I don't see why we have this limitation. Sounds more like a bug to me.
The netdev is still the same, capable of the same things no matter
which mode it is in. RoCE should work on it in both modes.


>
>In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
>the switchdev model and the visible netdev is the uplink representor. That is
>not the same thing from a user point of view. The uplink representor is not a
>RoCE capable endpoint. So a user who used to boot the machine and use RoCE on
>the PF now has to create a VF or SF, use that as the RoCE endpoint, and also
>set up the switchdev forwarding path with tc, bridge or OVS so traffic from
>that function actually reaches the wire.
>
>That is why I don't think this is only a card generation question. It changes
>the deployment model. It may be the right default for BlueField/ECPF style
>systems, where the host is expected to sit behind a switchdev control plane,
>but it is not a safe default for every regular host NIC setup.

Yeah, the point is not to change the deployment model. The legacy/switchdev
mode should only change behaviour for the SR-IOV/eswitch use case. The rest
(PF/uplink netdev and related objects) should stay the same.


>
>> 
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
>> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
>> c-220-136-220-218:~$ devlink dev
>> pci/0000:08:00.0: index 0
>>   nested_devlink:
>>     auxiliary/mlx5_core.eth.0
>> devlink_index/1: index 1
>>   nested_devlink:
>>     pci/0000:08:00.0
>>     pci/0000:08:00.1
>> auxiliary/mlx5_core.eth.0: index 2
>> pci/0000:08:00.1: index 3
>>   nested_devlink:
>>     auxiliary/mlx5_core.eth.1
>> auxiliary/mlx5_core.eth.1: index 4
>> c-220-136-220-218:~$ devlink port
>> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
>> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
>> c-220-136-220-218:~$ ip link
>> ...
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>     link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
>>     altname enp8s0f0np0
>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>     link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
>>     altname enp8s0f1np1
>> 
>> 
>>>
>>> IOW it's not a question of the generation of the card but of
>>> the deployment type / use case.
>> 
>> I don't think so, not in the case of mlx5. The only difference is that
>> when you work with SR-IOV, you use either the legacy way (ip vf) or the
>> new one. Same use case.
>> 
>> 
>>>
>>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>>> surprising way.  
>>>>
>>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>>> about it as "unbreak-my-not-legacy-device" bit.
>>>
>>> For most devices the switchdev mode does not change anything
>>> substantial about the device. It's purely a kernel / driver config. 
>>> It changes what objects and default rules kernel / driver installs. 
>>> So I don't get why it would make sense to flash into the device
>>> nvmem a Linux SW stack specific config.
>> 
>> I look at it from the perspective that from some CX generation,
>> switchdev mode should be the default. So that is a device-based decision.
>> I believe as such it can optionally be permanently configured (nv config)
>> on older devices. Why not?
>
>This is a deployment policy decision, not a permanent property of the card.
>The same adapter can be used in a regular host/RDMA setup or in a
>switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
>follows the device across hosts, kernels and use cases, and can surprise the
>next deployment that just expects a normal NIC.

Yeah, from my perspective, there should be no surprise or behaviour change
for a non-SR-IOV/eswitch user. Then switchdev can be the default and
everyone is happy. Why complicate things?


>
>I'll send an RFC v2 with support limited to:
>devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
>and let's see where we land with that.
>
>I still think a small kernel command line knob is the cleanest way to get to
>"switchdev by default" without making the interface too broad. For more
>complex boot-time configuration, I agree that a devlinkd or similar userspace
>path is probably the better direction.
>
>The "pause probing until userspace configures devlink" idea feels less clear
>to me. It is not quite the simple boot policy knob, and not quite the full
>userspace policy manager either. It would add a new probe state and require
>early userspace orchestration before the device is fully materialized. At
>least for now, I would prefer either the small cmdline option for the simple
>global/default case, or a proper devlinkd-like solution for more complex
>policy. Between those, I still prefer the cmdline option for this specific
>early eswitch mode default.
>
>Mark
>
>> 
>> [...]
>
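
For readers following the thread, the reduced option proposed for the RFC v2 above (devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }) can be sketched as a small parser. This is only an illustration under stated assumptions: the option is an RFC proposal and not a merged kernel interface, the selector written as "[...]" in the thread is left elided there and is treated here as an opaque string, and parse_devlink_option() and EswitchDefault are hypothetical names that do not come from the patches.

```python
# Illustrative sketch only: the option format is taken from the RFC
# discussion and is NOT a merged kernel interface. The device selector
# shown as "[...]" in the thread is treated as an opaque string.
from typing import NamedTuple

# The three modes named in the proposed RFC v2 syntax.
VALID_MODES = ("switchdev", "switchdev_inactive", "legacy")
_MARKER = ":esw:mode:"


class EswitchDefault(NamedTuple):
    selector: str  # opaque; its grammar is elided in the thread
    mode: str


def parse_devlink_option(arg: str) -> EswitchDefault:
    """Parse 'devlink=<selector>:esw:mode:<mode>' (hypothetical helper)."""
    key, sep, value = arg.partition("=")
    if key != "devlink" or not sep:
        raise ValueError(f"not a devlink= option: {arg!r}")
    # Split on the last marker, since a selector (a PCI address, for
    # example) may itself contain colons.
    selector, found, mode = value.rpartition(_MARKER)
    if not found:
        raise ValueError(f"missing {_MARKER!r} in {arg!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown eswitch mode: {mode!r}")
    return EswitchDefault(selector, mode)
```

As a usage example, and assuming purely for illustration that a selector could look like a devlink handle, parse_devlink_option("devlink=pci/0000:08:00.0:esw:mode:switchdev") yields the selector "pci/0000:08:00.0" and the mode "switchdev". A mode outside the three listed values raises ValueError.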
