From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com [209.85.128.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 09CD23B52EF for ; Mon, 11 May 2026 08:08:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.47 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778486896; cv=none; b=cDD9MvNkY17DPUoF9DYOL5uP9WRcRcM0t8+QUf7dUAQlcFXXN2dPo3niKKEkSXFepc1EYEcQLFYVyw+BNaTTJACnqpzli/eMrv8Nl1tgCFuD69b72WcQn+v7kRx+VhqoGPr03mNqW7FlvTP9z2sCDUx0JFHEStocWQYdAU9Q4hU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778486896; c=relaxed/simple; bh=NcjgQHe/X9C+SBudTsEkTDVrZL8ey9FSJwoVi7+nhtM=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Pqj400rFHtlKrqbVLDuVJZvwlWRz+02jCro+S8wXMau5LIhC5VOzTAToHxKszTihTc1zPVc8ljNkx5yRHsifCkyUBYulJsxcqiyJP2N7hR38iEoqLXeRRRD2rVkib6MIZj7eZczIYIYA4TsIdVmeJhnURuJJ35uXTlbFV+XtvAU= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=resnulli.us; spf=none smtp.mailfrom=resnulli.us; dkim=pass (2048-bit key) header.d=resnulli-us.20251104.gappssmtp.com header.i=@resnulli-us.20251104.gappssmtp.com header.b=a++eeVQU; arc=none smtp.client-ip=209.85.128.47 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=resnulli.us Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=resnulli.us Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=resnulli-us.20251104.gappssmtp.com header.i=@resnulli-us.20251104.gappssmtp.com header.b="a++eeVQU" Received: by mail-wm1-f47.google.com with SMTP id 5b1f17b1804b1-4891c0620bcso27493145e9.1 for ; Mon, 11 May 2026 01:08:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=resnulli-us.20251104.gappssmtp.com; s=20251104; t=1778486881; x=1779091681; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=MdfA7aGOGSRQ1CILOc6Le7Ks8CyAnmg3KhzdGxgc+LI=; b=a++eeVQUmQC2uhkhik86EDQ3+IQGTEIrbONafIIO/TsC1d+8CACT4Ow2ImXauiXmEy iT6Z8dRk9jsko6OyvcVs3AllFEIzibqBGSrW5sVE77Zg9/XUA7JixvH6XspnXGO2NHRe ChywALP9BUOB9fXsyBydSIkyJRq8oQMIngY4APhu48oxHhRoVawmNP2MvTrGPiCRpFXz jXjRTp1B+d1MXBK2E1OHZCcB+hfX65rXRrJVbnNf4YZj1enyge5AbNrgFWRSrnlyCaQR CkSAZQvG1cWTQI2PMsKjfAvC0axtBsDdSjjANZjjRY7JsCfdZ+2TKsPOkjZIBzDus2sS PEhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778486881; x=1779091681; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=MdfA7aGOGSRQ1CILOc6Le7Ks8CyAnmg3KhzdGxgc+LI=; b=sdfoN+cEryLaDqqJrcusEeXyj7fltO+vttJPDIDr48/+m7iy1+bbqUnE2s8uDyQDAL nGWlvMmF3DHEyTuhDDFEwql3GrmQtZU2lY/b9fz92SRJMJX+UzIlNYwXFe4IHAF0AGAc LPdH7VkBPNdE3Ny4+BH0V3ZhS/uqqfov5ZHIgPxopvCzftqAtyFimttQ5h5FiyjokEz6 KLgyfPFxqHfg6hj97oaPeQhIeIw9udj4/1HCnAS2h0bZD7DGUNAa0b3zRNyGMLcIvtyz PxoPdDkPisuAo1p6SMG1dVoif2t92YYvtN1jyBIHcbBzekBdAXvztFb0V+2EA8ICl23V y/lw== X-Forwarded-Encrypted: i=1; AFNElJ9rSWVYP10yJJoxEYaRRTMN6OYU5QLmekKE/o6OwGtbA6ztUPIgEy0GqMDm0wa8rlE4VmhDkFw4MdQ=@vger.kernel.org X-Gm-Message-State: AOJu0Yy0aaj5zj6asXIJ42q6roMxZx1/M0IbJBgcquC7wNIaL2hbioa9 iPpnGW8ODgrY6EpkVhN3hoZyao5IZ1iREyrJQsWuBUhDW28H68ifOEa34qElqdTj1jw= X-Gm-Gg: Acq92OEwJujQMImSEFoGl0OsV4iA3vjwGlrvFLPwYPsVrJaEAzgY1mg/t3fd20xYS6D dxLGvKNZkQ0DfAxMh++fizw3yZa0SP0VNO5W7exwdsd56hm+t+VJpz591e7JwyiPCeg+p2DivKX jZh0TZ7eueQDTVrxuZh4W1Zn1gyBp8fMABUUzTQ+gYkxzH2M5iCwQ/N2Xc3jqvP2aXqcNP0aSjl ctDhoXMztPTIn+tt2FQ3DcJWb065lvAbOWrwB7Q3d3vXytkCOcEzqTvySJmwmPxljKvNRIeVo5h MZ9VHDaAl7xB9VEOssngJb2+bPAu5X2ULirxAbU16y7gm6Pq1QECsjCJEXm+nwfsekOMmqh7O0Y 98Sy1MPWeYIi80Sl/9CyvqpB9s7WNKV14Du8wYVi4AFak2oDPcppA/wLhoBQqxMD4ipoE+PL5yB GAIYzGE6vMiX70MSXr2NhI31qea/b/cekjPF4= X-Received: by 2002:a05:600d:8496:20b0:48e:86e6:c2f1 with SMTP id 5b1f17b1804b1-48e86e6c38fmr7387965e9.2.1778486881151; Mon, 11 May 2026 01:08:01 -0700 (PDT) Received: from FV6GYCPJ69 ([140.209.217.212]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-454917d57aesm24147511f8f.26.2026.05.11.01.07.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 May 2026 01:08:00 -0700 (PDT) Date: Mon, 11 May 2026 10:07:57 +0200 From: Jiri Pirko To: Mark Bloch Cc: Jakub Kicinski , Eric Dumazet , Paolo Abeni , Andrew Lunn , "David S. Miller" , Jonathan Corbet , Shuah Khan , Simon Horman , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Andrew Morton , "Borislav Petkov (AMD)" , Randy Dunlap , Dave Hansen , Christian Brauner , Petr Mladek , "Peter Zijlstra (Intel)" , Thomas Gleixner , Pawan Gupta , Dapeng Mi , Kees Cook , Marco Elver , Eric Biggers , Li RongQing , "Paul E. McKenney" , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-rdma@vger.kernel.org Subject: Re: [RFC net-next 0/4] devlink: Add boot-time defaults Message-ID: References: <20260506123739.1959770-1-mbloch@nvidia.com> <3f9215c4-7c84-46d9-ba74-30dabe24db09@nvidia.com> <20260508175213.1952097f@kernel.org> <580a774b-ba9e-4523-b43a-476f75dd5b12@nvidia.com> Precedence: bulk X-Mailing-List: linux-doc@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <580a774b-ba9e-4523-b43a-476f75dd5b12@nvidia.com> Sun, May 10, 2026 at 02:31:35PM +0200, mbloch@nvidia.com wrote: > > >On 09/05/2026 10:01, Jiri Pirko wrote: >> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote: >>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote: >>>>> I don't think switchdev by default should mean CX4+ in general. If we get >>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where >>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the >>>>> default for regular host NIC deployments feels like a much larger compatibility >>>>> change. >>>> >>>> We can't travel throught time, but if from CX5 onwards the default would >>>> be switchdev, nobody would feel broken in terms of compatibility. That >>>> is my point. Having "legacy" as default is simply wrong for never NIC >>>> generations. That is why it is called "legacy" and it should have been >>>> rotten through and out since CX4 times. >>> >>> legacy vs switchdev only describes the eswitch configuration. >>> As a non-SR-IOV user I really don't want to see the extra representors >>> hanging around my systems, confusing all daemons. IIRC mlx5 had some >>> limitations around the uplink representor. Maybe that's the disconnect. >>> But for a real, fully featured switchdev eswitches having the >>> PHY and PF representors on boot, always, will not make sense. >> >> As "a non-SR-IOV user", what extra representors you talk about? When you >> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one >> devlink port per-pf. What's extra about it? When you don't have VFs/SFs. >> Everyhing is the same: > >The netdev list looking similar is a bit misleading. What matters here is >not only how many netdevs show up, but what that netdev actually is. > >In legacy mode, a PF only user can just use the PF netdev as a regular NIC >and use ROCE on it directly. I don't see why we have this limitation. Sounds more like a bug to me. The netdev is still the same, capable of the same things no matter in which mode you have it. RoCE should work on it in both modes. > >In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into >the switchdev model and the visible netdev is the uplink representor. That is >not the same thing from a user point of view. The uplink representor is not a >ROCE capable endpoint. So a user who used to boot the machine and use ROCE on >the PF now has to create a VF or SF, use that as the roce endpoint, and also >set up the switchdev forwarding path with tc, bridge or OVS so traffic from >that function actually reaches the wire. > >That is why I don't think this is only a card generation question. It changes >the deployment model. It may be the right default for BlueField/ECPF style >systems, where the host is expected to sit behind a switchdev control plane, >but it is not a safe default for every regular host NIC setup. Yeah, the point is, not to change deployment model. The legacy/switchdev should only change behaviour for sriov/eswitch usecase. The rest (PF/uplink netdev and related objects) should stay the same. > >> >> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0 >> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic >> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1 >> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic >> c-220-136-220-218:~$ devlink dev >> pci/0000:08:00.0: index 0 >> nested_devlink: >> auxiliary/mlx5_core.eth.0 >> devlink_index/1: index 1 >> nested_devlink: >> pci/0000:08:00.0 >> pci/0000:08:00.1 >> auxiliary/mlx5_core.eth.0: index 2 >> pci/0000:08:00.1: index 3 >> nested_devlink: >> auxiliary/mlx5_core.eth.1 >> auxiliary/mlx5_core.eth.1: index 4 >> c-220-136-220-218:~$ devlink port >> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false >> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false >> c-220-136-220-218:~$ ip link >> ... >> 4: eth2: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 >> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff >> altname enp8s0f0np0 >> 5: eth3: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 >> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff >> altname enp8s0f1np1 >> >> >>> >>> IOW it's not a question of the generation of the card but of >>> the deployment type / use case. >> >> I don't think so, not in the case of mlx5. The difference is only when >> you work with sr-iov, you either use legacy way (ip vf) or the new one. >> Same usecase. >> >> >>> >>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong >>>>> layer. This is boot/deployment policy, not a persistent hardware property, and >>>>> storing it in NV memory would make the state persist across kernels/hosts in a >>>>> surprising way. >>>> >>>> Well, as any other nv config, it persists across kernels/hosts. Think >>>> about it as "unbreak-my-not-legacy-device" bit. >>> >>> For most devices the switchdev mode does not change anything >>> substantial about the device. It's purely a kernel / driver config. >>> It changes what objects and default rules kernel / driver installs. >>> So I don't get why it would make sense to flash into the device >>> nvmem a Linux SW stack specific config. >> >> I look at it from the perspective that from some CX generation, >> switchdev mode should be default. So that is a device-based decision. >> I believe as such it can optionally be permanenty configured (nv config) >> on older device. Why not? > >This is a deployment policy decision, not a permanent property of the card. >The same adapter can be used in a regular host/RDMA setup or in a >switchdev/offload setup. If we store this in NVM, that Linux switchdev policy >follows the device across hosts, kernels and use cases, and can surprise the >next deployment that just expects a normal NIC. Yeah, from my perspective, there should be not surprise/behaviour_change for non-sriov/eswitch user. Then switchdev can be default and everyone is happy. Why to complicate things? > >I'll send another RFC v2 with support limited to: >devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy } >and let's see where we land with that. > >I still think a small kernel command line knob is the cleanest way to get to >"switchdev by default" without making the interface too broad. For more >complex boot-time configuration, I agree that a devlinkd or similar userspace >path is probably the better direction. > >The "pause probing until userspace configures devlink" idea feels less clear >to me. It is not quite the simple boot policy knob, and not quite the full >userspace policy manager either. It would add a new probe state and require >early userspace orchestration before the device is fully materialized. At >least for now, I would prefer either the small cmdline option for the simple >global/default case, or a proper devlinkd-like solution for more complex >policy. Between those, I still prefer the cmdline option for this specific >early eswitch mode default. > >Mark > >> >> [...] >