From: Jiri Pirko <jiri@resnulli.us>
To: Mark Bloch <mbloch@nvidia.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Simon Horman <horms@kernel.org>,
Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Borislav Petkov (AMD)" <bp@alien8.de>,
Randy Dunlap <rdunlap@infradead.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Christian Brauner <brauner@kernel.org>,
Petr Mladek <pmladek@suse.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
Thomas Gleixner <tglx@kernel.org>,
Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
Dapeng Mi <dapeng1.mi@linux.intel.com>,
Kees Cook <kees@kernel.org>, Marco Elver <elver@google.com>,
Eric Biggers <ebiggers@kernel.org>,
Li RongQing <lirongqing@baidu.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [RFC net-next 0/4] devlink: Add boot-time defaults
Date: Mon, 11 May 2026 10:07:57 +0200 [thread overview]
Message-ID: <agGNVmN9tpHh0K1P@FV6GYCPJ69> (raw)
In-Reply-To: <580a774b-ba9e-4523-b43a-476f75dd5b12@nvidia.com>
Sun, May 10, 2026 at 02:31:35PM +0200, mbloch@nvidia.com wrote:
>
>
>On 09/05/2026 10:01, Jiri Pirko wrote:
>> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>>> change.
>>>>
>>>> We can't travel throught time, but if from CX5 onwards the default would
>>>> be switchdev, nobody would feel broken in terms of compatibility. That
>>>> is my point. Having "legacy" as default is simply wrong for never NIC
>>>> generations. That is why it is called "legacy" and it should have been
>>>> rotten through and out since CX4 times.
>>>
>>> legacy vs switchdev only describes the eswitch configuration.
>>> As a non-SR-IOV user I really don't want to see the extra representors
>>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>>> limitations around the uplink representor. Maybe that's the disconnect.
>>> But for a real, fully featured switchdev eswitches having the
>>> PHY and PF representors on boot, always, will not make sense.
>>
>> As "a non-SR-IOV user", what extra representors you talk about? When you
>> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
>> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
>> Everyhing is the same:
>
>The netdev list looking similar is a bit misleading. What matters here is
>not only how many netdevs show up, but what that netdev actually is.
>
>In legacy mode, a PF only user can just use the PF netdev as a regular NIC
>and use ROCE on it directly.
I don't see why we have this limitation. Sounds more like a bug to me.
The netdev is still the same, capable of the same things no matter in
which mode you have it. RoCE should work on it in both modes.
>
>In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
>the switchdev model and the visible netdev is the uplink representor. That is
>not the same thing from a user point of view. The uplink representor is not a
>ROCE capable endpoint. So a user who used to boot the machine and use ROCE on
>the PF now has to create a VF or SF, use that as the roce endpoint, and also
>set up the switchdev forwarding path with tc, bridge or OVS so traffic from
>that function actually reaches the wire.
>
>That is why I don't think this is only a card generation question. It changes
>the deployment model. It may be the right default for BlueField/ECPF style
>systems, where the host is expected to sit behind a switchdev control plane,
>but it is not a safe default for every regular host NIC setup.
Yeah, the point is, not to change deployment model. The legacy/switchdev
should only change behaviour for sriov/eswitch usecase. The rest
(PF/uplink netdev and related objects) should stay the same.
>
>>
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
>> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
>> c-220-136-220-218:~$ devlink dev
>> pci/0000:08:00.0: index 0
>> nested_devlink:
>> auxiliary/mlx5_core.eth.0
>> devlink_index/1: index 1
>> nested_devlink:
>> pci/0000:08:00.0
>> pci/0000:08:00.1
>> auxiliary/mlx5_core.eth.0: index 2
>> pci/0000:08:00.1: index 3
>> nested_devlink:
>> auxiliary/mlx5_core.eth.1
>> auxiliary/mlx5_core.eth.1: index 4
>> c-220-136-220-218:~$ devlink port
>> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
>> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
>> c-220-136-220-218:~$ ip link
>> ...
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f0np0
>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f1np1
>>
>>
>>>
>>> IOW it's not a question of the generation of the card but of
>>> the deployment type / use case.
>>
>> I don't think so, not in the case of mlx5. The difference is only when
>> you work with sr-iov, you either use legacy way (ip vf) or the new one.
>> Same usecase.
>>
>>
>>>
>>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>>> surprising way.
>>>>
>>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>>> about it as "unbreak-my-not-legacy-device" bit.
>>>
>>> For most devices the switchdev mode does not change anything
>>> substantial about the device. It's purely a kernel / driver config.
>>> It changes what objects and default rules kernel / driver installs.
>>> So I don't get why it would make sense to flash into the device
>>> nvmem a Linux SW stack specific config.
>>
>> I look at it from the perspective that from some CX generation,
>> switchdev mode should be default. So that is a device-based decision.
>> I believe as such it can optionally be permanenty configured (nv config)
>> on older device. Why not?
>
>This is a deployment policy decision, not a permanent property of the card.
>The same adapter can be used in a regular host/RDMA setup or in a
>switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
>follows the device across hosts, kernels and use cases, and can surprise the
>next deployment that just expects a normal NIC.
Yeah, from my perspective, there should be not surprise/behaviour_change
for non-sriov/eswitch user. Then switchdev can be default and everyone
is happy. Why to complicate things?
>
>I'll send another RFC v2 with support limited to:
>devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
>and let's see where we land with that.
>
>I still think a small kernel command line knob is the cleanest way to get to
>"switchdev by default" without making the interface too broad. For more
>complex boot-time configuration, I agree that a devlinkd or similar userspace
>path is probably the better direction.
>
>The "pause probing until userspace configures devlink" idea feels less clear
>to me. It is not quite the simple boot policy knob, and not quite the full
>userspace policy manager either. It would add a new probe state and require
>early userspace orchestration before the device is fully materialized. At
>least for now, I would prefer either the small cmdline option for the simple
>global/default case, or a proper devlinkd-like solution for more complex
>policy. Between those, I still prefer the cmdline option for this specific
>early eswitch mode default.
>
>Mark
>
>>
>> [...]
>
next prev parent reply other threads:[~2026-05-11 8:08 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-06 12:37 [RFC net-next 0/4] devlink: Add boot-time defaults Mark Bloch
2026-05-06 12:37 ` [RFC net-next 1/4] devlink: Add infrastructure for " Mark Bloch
2026-05-06 12:37 ` [RFC net-next 2/4] devlink: Add eswitch mode boot default Mark Bloch
2026-05-06 12:37 ` [RFC net-next 3/4] devlink: Add runtime parameter boot defaults Mark Bloch
2026-05-06 12:37 ` [RFC net-next 4/4] net/mlx5: Apply devlink boot defaults during init Mark Bloch
2026-05-06 15:22 ` [RFC net-next 0/4] devlink: Add boot-time defaults Jiri Pirko
2026-05-06 17:35 ` Mark Bloch
2026-05-07 11:03 ` Jiri Pirko
2026-05-08 17:59 ` Mark Bloch
2026-05-08 18:07 ` Jiri Pirko
2026-05-09 0:52 ` Jakub Kicinski
2026-05-09 7:01 ` Jiri Pirko
2026-05-10 12:31 ` Mark Bloch
2026-05-11 8:07 ` Jiri Pirko [this message]
2026-05-10 16:37 ` Jakub Kicinski
2026-05-11 8:42 ` Jiri Pirko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agGNVmN9tpHh0K1P@FV6GYCPJ69 \
--to=jiri@resnulli.us \
--cc=akpm@linux-foundation.org \
--cc=andrew+netdev@lunn.ch \
--cc=bp@alien8.de \
--cc=brauner@kernel.org \
--cc=corbet@lwn.net \
--cc=dapeng1.mi@linux.intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=davem@davemloft.net \
--cc=ebiggers@kernel.org \
--cc=edumazet@google.com \
--cc=elver@google.com \
--cc=horms@kernel.org \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=lirongqing@baidu.com \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=paulmck@kernel.org \
--cc=pawan.kumar.gupta@linux.intel.com \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rdunlap@infradead.org \
--cc=saeedm@nvidia.com \
--cc=skhan@linuxfoundation.org \
--cc=tariqt@nvidia.com \
--cc=tglx@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox