From: Mark Bloch <mbloch@nvidia.com>
To: Jiri Pirko <jiri@resnulli.us>, Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Simon Horman <horms@kernel.org>,
Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Borislav Petkov (AMD)" <bp@alien8.de>,
Randy Dunlap <rdunlap@infradead.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Christian Brauner <brauner@kernel.org>,
Petr Mladek <pmladek@suse.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
Thomas Gleixner <tglx@kernel.org>,
Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
Dapeng Mi <dapeng1.mi@linux.intel.com>,
Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
Eric Biggers <ebiggers@kernel.org>,
Li RongQing <lirongqing@baidu.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [RFC net-next 0/4] devlink: Add boot-time defaults
Date: Sun, 10 May 2026 15:31:35 +0300
Message-ID: <580a774b-ba9e-4523-b43a-476f75dd5b12@nvidia.com>
In-Reply-To: <af7Y4AYv-XDCbK_8@FV6GYCPJ69>
On 09/05/2026 10:01, Jiri Pirko wrote:
> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>> change.
>>>
>>> We can't travel through time, but if from CX5 onwards the default would
>>> be switchdev, nobody would feel broken in terms of compatibility. That
>>> is my point. Having "legacy" as default is simply wrong for newer NIC
>>> generations. That is why it is called "legacy" and it should have been
>>> rotten through and out since CX4 times.
>>
>> legacy vs switchdev only describes the eswitch configuration.
>> As a non-SR-IOV user I really don't want to see the extra representors
>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>> limitations around the uplink representor. Maybe that's the disconnect.
>> But for real, fully featured switchdev eswitches, having the
>> PHY and PF representors on boot, always, will not make sense.
>
> As "a non-SR-IOV user", what extra representors are you talking about? When
> you have PFs only, you don't have anything extra. Just 1 netdev per PF, one
> devlink port per PF. What's extra about it? When you don't have VFs/SFs,
> everything is the same:
The netdev list looking similar is a bit misleading. What matters here is
not only how many netdevs show up, but what each netdev actually is.

In legacy mode, a PF-only user can just use the PF netdev as a regular NIC
and use RoCE on it directly.

In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
the switchdev model and the visible netdev is the uplink representor. That is
not the same thing from a user point of view. The uplink representor is not a
RoCE-capable endpoint. So a user who used to boot the machine and use RoCE on
the PF now has to create a VF or SF, use that as the RoCE endpoint, and also
set up the switchdev forwarding path with tc, bridge or OVS so traffic from
that function actually reaches the wire.

That is why I don't think this is only a card generation question. It changes
the deployment model. It may be the right default for BlueField/ECPF style
systems, where the host is expected to sit behind a switchdev control plane,
but it is not a safe default for every regular host NIC setup.
>
> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
> c-220-136-220-218:~$ devlink dev
> pci/0000:08:00.0: index 0
> nested_devlink:
> auxiliary/mlx5_core.eth.0
> devlink_index/1: index 1
> nested_devlink:
> pci/0000:08:00.0
> pci/0000:08:00.1
> auxiliary/mlx5_core.eth.0: index 2
> pci/0000:08:00.1: index 3
> nested_devlink:
> auxiliary/mlx5_core.eth.1
> auxiliary/mlx5_core.eth.1: index 4
> c-220-136-220-218:~$ devlink port
> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
> c-220-136-220-218:~$ ip link
> ...
> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
> altname enp8s0f0np0
> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
> altname enp8s0f1np1
>
>
>>
>> IOW it's not a question of the generation of the card but of
>> the deployment type / use case.
>
> I don't think so, not in the case of mlx5. The difference is only when
> you work with sr-iov, you either use legacy way (ip vf) or the new one.
> Same usecase.
>
>
>>
>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>> surprising way.
>>>
>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>> about it as "unbreak-my-not-legacy-device" bit.
>>
>> For most devices the switchdev mode does not change anything
>> substantial about the device. It's purely a kernel / driver config.
>> It changes what objects and default rules kernel / driver installs.
>> So I don't get why it would make sense to flash into the device
>> nvmem a Linux SW stack specific config.
>
> I look at it from the perspective that from some CX generation,
> switchdev mode should be default. So that is a device-based decision.
> I believe as such it can optionally be permanently configured (nv config)
> on older devices. Why not?
This is a deployment policy decision, not a permanent property of the card.
The same adapter can be used in a regular host/RDMA setup or in a
switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
follows the device across hosts, kernels and use cases, and can surprise the
next deployment that just expects a normal NIC.

I'll send an RFC v2 with support limited to:

    devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }

and let's see where we land with that.
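
For illustration only, here is a minimal userspace sketch of how the mode
token of such an option could be parsed. The enum and function names are
hypothetical, the selector in front of ":esw:mode:" is treated as opaque
because its grammar is not spelled out here, and none of this is taken from
the RFC patches.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the three modes named in the option above. */
enum esw_boot_mode {
	ESW_BOOT_MODE_LEGACY,
	ESW_BOOT_MODE_SWITCHDEV,
	ESW_BOOT_MODE_SWITCHDEV_INACTIVE,
};

/*
 * Parse "<selector>:esw:mode:<mode>". The selector is not interpreted.
 * Returns 0 and stores the mode on success, -1 if the ":esw:mode:" key
 * is missing or the mode string is not one of the three known values.
 */
int esw_boot_mode_parse(const char *arg, enum esw_boot_mode *mode)
{
	static const char key[] = ":esw:mode:";
	const char *p;

	if (!arg || !mode)
		return -1;

	/* The first occurrence of the key ends the selector. */
	p = strstr(arg, key);
	if (!p)
		return -1;
	p += sizeof(key) - 1;

	if (!strcmp(p, "legacy"))
		*mode = ESW_BOOT_MODE_LEGACY;
	else if (!strcmp(p, "switchdev"))
		*mode = ESW_BOOT_MODE_SWITCHDEV;
	else if (!strcmp(p, "switchdev_inactive"))
		*mode = ESW_BOOT_MODE_SWITCHDEV_INACTIVE;
	else
		return -1;

	return 0;
}
```

Unknown mode strings are rejected rather than mapped to a fallback, so a typo
on the command line would leave the driver default untouched instead of
silently picking a mode.
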
I still think a small kernel command line knob is the cleanest way to get to
"switchdev by default" without making the interface too broad. For more
complex boot-time configuration, I agree that a devlinkd or similar userspace
path is probably the better direction.

The "pause probing until userspace configures devlink" idea feels less clear
to me. It is not quite the simple boot policy knob, and not quite the full
userspace policy manager either. It would add a new probe state and require
early userspace orchestration before the device is fully materialized. At
least for now, I would prefer either the small cmdline option for the simple
global/default case, or a proper devlinkd-like solution for more complex
policy. Between those, I still prefer the cmdline option for this specific
early eswitch mode default.

Mark
>
> [...]