* [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults
@ 2026-06-05 18:10 Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 1/7] devlink: Skip health recover notifications before register Mark Bloch
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Mark Bloch, Borislav Petkov (AMD),
	Andrew Morton, Randy Dunlap, Thomas Gleixner, Petr Mladek,
	Peter Zijlstra (Intel), Dave Hansen, Vlastimil Babka,
	Christian Brauner, Tejun Heo, Feng Tang, Dapeng Mi, Kees Cook,
	Marco Elver, Eric Biggers, Li RongQing, Paul E. McKenney,
	Ethan Nelson-Moore, linux-doc, linux-kernel, netdev, linux-rdma

This series adds a devlink_eswitch_mode= kernel command line parameter for
applying a default devlink eswitch mode during device initialization.

Following the discussion with Jakub[1] and the feedback on the RFC
postings, this version keeps the scope limited to a boot-time devlink
eswitch mode default only.

The option selects either all devlink handles or an explicit
comma-separated handle list:

devlink_eswitch_mode=*=switchdev
devlink_eswitch_mode=pci/0000:08:00.0,pci/0000:09:00.1=switchdev_inactive

The supported modes are legacy, switchdev and switchdev_inactive. The
selected mode is applied through the existing eswitch_mode_set() devlink
operation, the same operation used by the devlink eswitch mode command.
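
As an illustration of the syntax above, a minimal userspace sketch of
splitting a <selector>=<mode> value might look like the following. It is not
the parser from patch 7; every name in it (esw_default, esw_default_parse,
the ESW_MODE_* values, the 128-byte selector buffer) is hypothetical and
chosen only for this example. It splits on the last '=' and accepts only the
three documented mode strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the code from the series. */
enum esw_mode {
	ESW_MODE_LEGACY,
	ESW_MODE_SWITCHDEV,
	ESW_MODE_SWITCHDEV_INACTIVE,
};

struct esw_default {
	bool match_all;		/* selector was "*" */
	char selector[128];	/* comma-separated handle list otherwise */
	enum esw_mode mode;
};

/* Map a mode string to its enum value; returns -1 for unknown strings. */
static int esw_mode_from_str(const char *s, enum esw_mode *mode)
{
	if (!strcmp(s, "legacy"))
		*mode = ESW_MODE_LEGACY;
	else if (!strcmp(s, "switchdev"))
		*mode = ESW_MODE_SWITCHDEV;
	else if (!strcmp(s, "switchdev_inactive"))
		*mode = ESW_MODE_SWITCHDEV_INACTIVE;
	else
		return -1;
	return 0;
}

/* Parse "<selector>=<mode>"; returns 0 on success, -1 on malformed input. */
static int esw_default_parse(const char *value, struct esw_default *out)
{
	/* The mode follows the last '=' in the value. */
	const char *eq = strrchr(value, '=');
	size_t sel_len;

	if (!eq || eq == value)
		return -1;	/* no '=' at all, or an empty selector */

	sel_len = (size_t)(eq - value);
	if (sel_len >= sizeof(out->selector))
		return -1;	/* selector does not fit the sketch's buffer */

	if (esw_mode_from_str(eq + 1, &out->mode))
		return -1;	/* unknown or empty mode */

	memcpy(out->selector, value, sel_len);
	out->selector[sel_len] = '\0';
	out->match_all = !strcmp(out->selector, "*");
	return 0;
}
```

With this sketch, "*=switchdev" yields match_all set and the switchdev mode,
while a value with no '=' or with an unknown mode is rejected.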

The preparatory patches move registration points that expose the devlink
instance before the driver is ready for a registration-time eswitch mode
change. Where registration is moved later, the matching unregister path is
moved earlier so unregister notifications are sent from devl_unregister()
before object teardown. The final patch adds the parser and applies the
default from devlink core when a matching instance is registered and after
a successful devlink reload that performed DRIVER_REINIT.

Patch 1 skips devlink health recovery notifications while a devlink
instance is not registered. Health state and counters are still updated,
but there is no registered instance for userspace to observe or receive
notifications from yet. This lets drivers move registration later without
hitting health notification registration assertions during early
initialization.

Patch 2 moves netdevsim devlink registration after device initialization,
so registration-time defaults can call eswitch_mode_set() after simulator
state is ready. It also unregisters devlink before netdevsim tears down
the objects that were registered before devlink became visible.

Patch 3 clears the mlx5 FW reset-in-progress bit before reloading after a
firmware reset.

Patch 4 moves mlx5 devlink registration after device initialization,
including the lightweight init path, and moves unregister before the
matching teardown.

Patch 5 moves octeontx2 AF devlink registration after SR-IOV setup and
switch lock initialization.

Patch 6 moves octeontx2 PF devlink registration after PF SR-IOV state
setup.

Patch 7 adds the devlink_eswitch_mode= parser, documentation,
registration-time default application and successful reload default
application.

Changelog:

v2 -> v3:

- Change the devlink_eswitch_mode= API syntax to use <selector>=<mode>
  instead of [<selector>]:<mode>, following a comment from Randy Dunlap.

v1 -> v2:

- Move default eswitch mode application into devlink core. The default is
  now applied during devlink registration and after a successful devlink
  reload that performed DRIVER_REINIT.

- Remove the exported devl_apply_default_esw_mode() driver API and the mlx5
  driver-side call to it.

- Skip devlink health recovery notifications while the devlink instance is
  not registered, so drivers can move registration later without early
  health work hitting registration assertions.

- Move mlx5 devlink registration after device initialization, including the
  lightweight init path, so the core can apply the default through the
  normal registration flow.

- Move the matching netdevsim and mlx5 unregister paths before object
  teardown, so unregister notifications come from devl_unregister() and the
  later object teardown paths run while the devlink instance is no longer
  registered.

- Add registration-ordering preparation patches for netdevsim and octeontx2
  AF/PF, so their eswitch state is ready before registration-time defaults
  may call eswitch_mode_set().

[1] https://lore.kernel.org/all/20260502184153.4fd8d06f@kernel.org/
RFC V1 : https://lore.kernel.org/all/20260506123739.1959770-1-mbloch@nvidia.com/
RFC V2 : https://lore.kernel.org/all/20260510185424.2041415-1-mbloch@nvidia.com/
v1     : https://lore.kernel.org/all/20260521072434.362624-1-tariqt@nvidia.com/
v2     : https://lore.kernel.org/all/20260603193259.3412464-1-mbloch@nvidia.com/

Signed-off-by: Mark Bloch <mbloch@nvidia.com>

Mark Bloch (7):
  devlink: Skip health recover notifications before register
  netdevsim: Register devlink after device init
  net/mlx5: Clear FW reset-in-progress bit before reload
  net/mlx5: Register devlink after device init
  octeontx2-af: Register devlink after SR-IOV init
  octeontx2-pf: Register devlink after SR-IOV state init
  devlink: Add eswitch mode boot defaults

 .../admin-guide/kernel-parameters.txt         |  25 ++
 .../networking/devlink/devlink-defaults.rst   |  78 +++++
 Documentation/networking/devlink/index.rst    |   1 +
 .../net/ethernet/marvell/octeontx2/af/rvu.c   |  24 +-
 .../ethernet/marvell/octeontx2/nic/otx2_pf.c  |  17 +-
 .../ethernet/mellanox/mlx5/core/fw_reset.c    |  28 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  34 ++-
 drivers/net/netdevsim/dev.c                   |  15 +-
 net/devlink/core.c                            | 271 ++++++++++++++++++
 net/devlink/dev.c                             |   3 +
 net/devlink/devl_internal.h                   |   1 +
 net/devlink/health.c                          |   3 +-
 12 files changed, 451 insertions(+), 49 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-defaults.rst


base-commit: bfa3d89cc15c09f7d1581c834a5ed725189ec19f
-- 
2.34.1



* [PATCH net-next V3 1/7] devlink: Skip health recover notifications before register
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 2/7] netdevsim: Register devlink after device init Mark Bloch
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)

devlink health reports can be generated before the devlink instance is
registered. This can happen during driver initialization when a driver
creates health reporters early and its health polling detects an error
before devlink_register() is reached.

devlink health still records the report state and counters in that case,
but userspace cannot observe the devlink instance yet and there is no
registered handle to notify through. Skip the netlink notification while
the devlink instance is not registered instead of asserting registration.

This keeps later userspace queries useful after registration while avoiding
a warning from early health reports.
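
The behaviour described above can be modelled in a few lines of userspace C.
This is only a toy model under assumed names (toy_devlink, toy_reporter,
toy_recover_done are not kernel symbols): the recovery counter is always
updated, and the notification is silently skipped while the instance is not
registered.

```c
#include <stdbool.h>

/* Toy stand-ins for the devlink instance and a health reporter. */
struct toy_devlink {
	bool registered;
	unsigned int notifications_sent;
};

struct toy_reporter {
	struct toy_devlink *devlink;
	unsigned int recovery_count;
};

/* Send a notification only when userspace can see the instance. */
static void toy_recover_notify(struct toy_reporter *reporter)
{
	if (!reporter->devlink->registered)
		return;		/* nothing to notify through yet */

	reporter->devlink->notifications_sent++;
}

/* Record a finished recovery, then try to notify. */
static void toy_recover_done(struct toy_reporter *reporter)
{
	reporter->recovery_count++;	/* state is kept in both cases */
	toy_recover_notify(reporter);
}
```

A recovery that completes before registration bumps recovery_count without
sending anything; one that completes afterwards does both, so a later query
still sees the full count.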

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 net/devlink/health.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/devlink/health.c b/net/devlink/health.c
index ea7a334e939b..376e79497771 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -513,9 +513,8 @@ static void devlink_recover_notify(struct devlink_health_reporter *reporter,
 	int err;
 
 	WARN_ON(cmd != DEVLINK_CMD_HEALTH_REPORTER_RECOVER);
-	ASSERT_DEVLINK_REGISTERED(devlink);
 
-	if (!devlink_nl_notify_need(devlink))
+	if (!__devl_is_registered(devlink) || !devlink_nl_notify_need(devlink))
 		return;
 
 	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
-- 
2.34.1



* [PATCH net-next V3 2/7] netdevsim: Register devlink after device init
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 1/7] devlink: Skip health recover notifications before register Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-10 23:50   ` Jakub Kicinski
  2026-06-05 18:10 ` [PATCH net-next V3 3/7] net/mlx5: Clear FW reset-in-progress bit before reload Mark Bloch
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)

devl_register() makes the devlink instance visible to userspace. A later
patch also makes registration the point where devlink core may call
eswitch_mode_set() to apply a boot-time default eswitch mode.

Move netdevsim registration after all objects (resources, params, regions,
traps, debugfs, etc.) are initialized, and after the initial eswitch mode is
set to legacy.

Move devl_unregister() to the beginning of nsim_drv_remove(), before those
devlink objects are torn down. This keeps devlink register/unregister as
the notification barrier and makes the later object teardown paths run
after devlink is no longer registered, so they do not emit their own
netlink DEL notifications.
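
The ordering rule above (register last on probe, unregister first on remove,
unwind in reverse on failure) can be sketched with a toy device. The names
(toy_dev, toy_probe, toy_remove, fail_at) are invented for this example and
do not correspond to netdevsim functions; fail_at only exists to inject a
failure at a given step.

```c
#include <stdbool.h>

/* Toy device: two objects plus the "visible to userspace" flag. */
struct toy_dev {
	bool resources;
	bool params;
	bool registered;
	int fail_at;	/* step number that should fail, 0 for none */
};

/* Pretend to run one init step; fails only when told to. */
static int toy_step(const struct toy_dev *dev, int step)
{
	return dev->fail_at == step ? -1 : 0;
}

static int toy_probe(struct toy_dev *dev)
{
	int err;

	err = toy_step(dev, 1);
	if (err)
		goto err_out;
	dev->resources = true;

	err = toy_step(dev, 2);
	if (err)
		goto err_resources;
	dev->params = true;

	/* Registration comes last, once every object is ready. */
	err = toy_step(dev, 3);
	if (err)
		goto err_params;
	dev->registered = true;
	return 0;

err_params:
	dev->params = false;
err_resources:
	dev->resources = false;
err_out:
	return err;
}

static void toy_remove(struct toy_dev *dev)
{
	/* Unregister first, so teardown runs on an invisible instance. */
	dev->registered = false;
	dev->params = false;
	dev->resources = false;
}
```

Because registration is the final step, a registration failure unwinds only
objects that were never visible to userspace, and the error path needs no
unregister call of its own.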

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 drivers/net/netdevsim/dev.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index aed9ad5f1b43..7cf4102b049e 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -1680,13 +1680,9 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 		goto err_devlink_unlock;
 	}
 
-	err = devl_register(devlink);
-	if (err)
-		goto err_vfc_free;
-
 	err = nsim_dev_resources_register(devlink);
 	if (err)
-		goto err_dl_unregister;
+		goto err_vfc_free;
 
 	err = devl_params_register(devlink, nsim_devlink_params,
 				   ARRAY_SIZE(nsim_devlink_params));
@@ -1733,9 +1729,14 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 		goto err_hwstats_exit;
 
 	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
+	err = devl_register(devlink);
+	if (err)
+		goto err_port_del_all;
 	devl_unlock(devlink);
 	return 0;
 
+err_port_del_all:
+	nsim_dev_port_del_all(nsim_dev);
 err_hwstats_exit:
 	nsim_dev_hwstats_exit(nsim_dev);
 err_psample_exit:
@@ -1757,8 +1758,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 			       ARRAY_SIZE(nsim_devlink_params));
 err_resource_unregister:
 	devl_resources_unregister(devlink);
-err_dl_unregister:
-	devl_unregister(devlink);
 err_vfc_free:
 	kfree(nsim_dev->vfconfigs);
 err_devlink_unlock:
@@ -1797,6 +1796,7 @@ void nsim_drv_remove(struct nsim_bus_dev *nsim_bus_dev)
 	struct devlink *devlink = priv_to_devlink(nsim_dev);
 
 	devl_lock(devlink);
+	devl_unregister(devlink);
 	nsim_dev_reload_destroy(nsim_dev);
 
 	nsim_bpf_dev_exit(nsim_dev);
@@ -1804,7 +1804,6 @@ void nsim_drv_remove(struct nsim_bus_dev *nsim_bus_dev)
 	devl_params_unregister(devlink, nsim_devlink_params,
 			       ARRAY_SIZE(nsim_devlink_params));
 	devl_resources_unregister(devlink);
-	devl_unregister(devlink);
 	kfree(nsim_dev->vfconfigs);
 	kfree(nsim_dev->fa_cookie);
 	mutex_destroy(&nsim_dev->progs_list_lock);
-- 
2.34.1



* [PATCH net-next V3 3/7] net/mlx5: Clear FW reset-in-progress bit before reload
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 1/7] devlink: Skip health recover notifications before register Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 2/7] netdevsim: Register devlink after device init Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 4/7] net/mlx5: Register devlink after device init Mark Bloch
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Mark Bloch, Borislav Petkov (AMD),
	Andrew Morton, Randy Dunlap, Thomas Gleixner, Petr Mladek,
	Peter Zijlstra (Intel), Dave Hansen, Vlastimil Babka,
	Christian Brauner, Tejun Heo, Feng Tang, Dapeng Mi, Kees Cook,
	Marco Elver, Eric Biggers, Li RongQing, Paul E. McKenney,
	Ethan Nelson-Moore, linux-doc, linux-kernel, netdev, linux-rdma,
	Shay Drori, Moshe Shemesh

mlx5 sets MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS when acknowledging a sync
reset request. This bit blocks devlink reload and other devlink operations
while the firmware reset is running, but it was kept set until after the
driver reload finished.

Clear the reset-in-progress bit once the reset unload flow is done and PCI
access is back, before reloading the device. For a reset initiated through
devlink, clear it before completing the reload waiter. For a reset reported
through an asynchronous firmware event, keep the unload flow outside
devl_lock, then take devl_lock before clearing the bit and reloading
through the devl-locked load helper.

Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/fw_reset.c    | 28 +++++++++++--------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
index 07440c58713a..7283e5b49eed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -238,24 +238,30 @@ static void mlx5_fw_reset_complete_reload(struct mlx5_core_dev *dev)
 {
 	struct mlx5_fw_reset *fw_reset = dev->priv.fw_reset;
 	struct devlink *devlink = priv_to_devlink(dev);
+	int err;
 
 	/* if this is the driver that initiated the fw reset, devlink completed the reload */
 	if (test_bit(MLX5_FW_RESET_FLAGS_PENDING_COMP, &fw_reset->reset_flags)) {
+		clear_bit(MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS,
+			  &fw_reset->reset_flags);
 		complete(&fw_reset->done);
-	} else {
-		mlx5_sync_reset_unload_flow(dev, false);
-		if (mlx5_health_wait_pci_up(dev))
-			mlx5_core_err(dev, "reset reload flow aborted, PCI reads still not working\n");
-		else
-			mlx5_load_one(dev, true);
-		devl_lock(devlink);
-		devlink_remote_reload_actions_performed(devlink, 0,
-							BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT) |
-							BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE));
-		devl_unlock(devlink);
+		return;
 	}
 
+	mlx5_sync_reset_unload_flow(dev, false);
+	err = mlx5_health_wait_pci_up(dev);
+
+	devl_lock(devlink);
 	clear_bit(MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS, &fw_reset->reset_flags);
+	if (err)
+		mlx5_core_err(dev, "reset reload flow aborted, PCI reads still not working\n");
+	else
+		mlx5_load_one_devl_locked(dev, true);
+
+	devlink_remote_reload_actions_performed(devlink, 0,
+						BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT) |
+						BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE));
+	devl_unlock(devlink);
 }
 
 static void mlx5_stop_sync_reset_poll(struct mlx5_core_dev *dev)
-- 
2.34.1



* [PATCH net-next V3 4/7] net/mlx5: Register devlink after device init
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
                   ` (2 preceding siblings ...)
  2026-06-05 18:10 ` [PATCH net-next V3 3/7] net/mlx5: Clear FW reset-in-progress bit before reload Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 5/7] octeontx2-af: Register devlink after SR-IOV init Mark Bloch
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Mark Bloch, Borislav Petkov (AMD),
	Andrew Morton, Randy Dunlap, Thomas Gleixner, Petr Mladek,
	Peter Zijlstra (Intel), Dave Hansen, Vlastimil Babka,
	Christian Brauner, Tejun Heo, Feng Tang, Dapeng Mi, Kees Cook,
	Marco Elver, Eric Biggers, Li RongQing, Paul E. McKenney,
	Ethan Nelson-Moore, linux-doc, linux-kernel, netdev, linux-rdma,
	Shay Drori

devl_register() makes the devlink instance visible to userspace. A later
patch also makes registration the point where devlink core may call
eswitch_mode_set() to apply a boot-time default eswitch mode.

Move mlx5 devlink registration after mlx5 device initialization completes,
including the lightweight init path, so registration-time devlink
operations see initialized driver state.

Move devl_unregister() before the matching teardown paths, so unregister
notifications are emitted from devl_unregister() before mlx5 removes the
devlink objects.

Add a devl-locked uninit helper so failed nested devlink setup can unwind
the initialized device before the instance is registered.

Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 34 ++++++++++++++-----
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index fd285aeb9630..4e3cb6ec8630 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1454,31 +1454,40 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
 	return err;
 }
 
+static void mlx5_uninit_one_devl_locked(struct mlx5_core_dev *dev);
+
 int mlx5_init_one(struct mlx5_core_dev *dev)
 {
 	struct devlink *devlink = priv_to_devlink(dev);
 	int err;
 
 	devl_lock(devlink);
+	err = mlx5_init_one_devl_locked(dev);
+	if (err)
+		goto unlock;
+
 	if (dev->shd) {
 		err = devl_nested_devlink_set(dev->shd, devlink);
 		if (err)
-			goto unlock;
+			goto err_uninit;
 	}
+
 	devl_register(devlink);
-	err = mlx5_init_one_devl_locked(dev);
-	if (err)
-		devl_unregister(devlink);
+	devl_unlock(devlink);
+	return 0;
+
+err_uninit:
+	mlx5_uninit_one_devl_locked(dev);
 unlock:
 	devl_unlock(devlink);
 	return err;
 }
 
-void mlx5_uninit_one(struct mlx5_core_dev *dev)
+static void mlx5_uninit_one_devl_locked(struct mlx5_core_dev *dev)
 {
 	struct devlink *devlink = priv_to_devlink(dev);
 
-	devl_lock(devlink);
+	devl_assert_locked(devlink);
 	mutex_lock(&dev->intf_state_mutex);
 
 	mlx5_hwmon_dev_unregister(dev);
@@ -1501,7 +1510,15 @@ void mlx5_uninit_one(struct mlx5_core_dev *dev)
 	mlx5_function_teardown(dev, true);
 out:
 	mutex_unlock(&dev->intf_state_mutex);
+}
+
+void mlx5_uninit_one(struct mlx5_core_dev *dev)
+{
+	struct devlink *devlink = priv_to_devlink(dev);
+
+	devl_lock(devlink);
 	devl_unregister(devlink);
+	mlx5_uninit_one_devl_locked(dev);
 	devl_unlock(devlink);
 }
 
@@ -1636,7 +1653,6 @@ int mlx5_init_one_light(struct mlx5_core_dev *dev)
 	int err;
 
 	devl_lock(devlink);
-	devl_register(devlink);
 	dev->state = MLX5_DEVICE_STATE_UP;
 	err = mlx5_function_enable(dev, true, mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT));
 	if (err) {
@@ -1656,6 +1672,7 @@ int mlx5_init_one_light(struct mlx5_core_dev *dev)
 		goto query_hca_caps_err;
 	}
 
+	devl_register(devlink);
 	devl_unlock(devlink);
 	return 0;
 
@@ -1663,7 +1680,6 @@ int mlx5_init_one_light(struct mlx5_core_dev *dev)
 	mlx5_function_disable(dev, true);
 out:
 	dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
-	devl_unregister(devlink);
 	devl_unlock(devlink);
 	return err;
 }
@@ -1673,8 +1689,8 @@ void mlx5_uninit_one_light(struct mlx5_core_dev *dev)
 	struct devlink *devlink = priv_to_devlink(dev);
 
 	devl_lock(devlink);
-	mlx5_devlink_params_unregister(priv_to_devlink(dev));
 	devl_unregister(devlink);
+	mlx5_devlink_params_unregister(priv_to_devlink(dev));
 	devl_unlock(devlink);
 	if (dev->state != MLX5_DEVICE_STATE_UP)
 		return;
-- 
2.34.1



* [PATCH net-next V3 5/7] octeontx2-af: Register devlink after SR-IOV init
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
                   ` (3 preceding siblings ...)
  2026-06-05 18:10 ` [PATCH net-next V3 4/7] net/mlx5: Register devlink after device init Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 6/7] octeontx2-pf: Register devlink after SR-IOV state init Mark Bloch
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)

A later patch makes devlink registration the point where devlink core may
call eswitch_mode_set() to apply a boot-time default eswitch mode.

Move octeontx2 AF devlink registration after SR-IOV is enabled and the
representor switch lock is initialized, so the AF eswitch mode set path
sees the state it depends on.

If devlink registration fails after SR-IOV setup, unregister interrupts
before disabling SR-IOV. This keeps the AF-VF mailbox IRQ handlers
synchronized before the AF-VF mailbox workqueue is destroyed.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../net/ethernet/marvell/octeontx2/af/rvu.c   | 24 ++++++++++---------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 3cf131508ecf..c2b52eb4ffab 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -3545,6 +3545,7 @@ static void rvu_update_module_params(struct rvu *rvu)
 static int rvu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct device *dev = &pdev->dev;
+	bool sriov_done = false;
 	struct rvu *rvu;
 	int    err;
 
@@ -3634,26 +3635,27 @@ static int rvu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_flr;
 	}
 
-	err = rvu_register_dl(rvu);
-	if (err) {
-		dev_err(dev, "%s: Failed to register devlink\n", __func__);
-		goto err_irq;
-	}
-
 	rvu_setup_rvum_blk_revid(rvu);
 
 	/* Enable AF's VFs (if any) */
 	err = rvu_enable_sriov(rvu);
 	if (err) {
 		dev_err(dev, "%s: Failed to enable sriov\n", __func__);
-		goto err_dl;
+		goto err_irq;
+	}
+	sriov_done = true;
+
+	mutex_init(&rvu->rswitch.switch_lock);
+
+	err = rvu_register_dl(rvu);
+	if (err) {
+		dev_err(dev, "%s: Failed to register devlink\n", __func__);
+		goto err_irq;
 	}
 
 	/* Initialize debugfs */
 	rvu_dbg_init(rvu);
 
-	mutex_init(&rvu->rswitch.switch_lock);
-
 	if (rvu->fwdata)
 		ptp_start(rvu, rvu->fwdata->sclk, rvu->fwdata->ptp_ext_clk_rate,
 			  rvu->fwdata->ptp_ext_tstamp);
@@ -3662,10 +3664,10 @@ static int rvu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	rvu_alloc_cint_qint_mem(rvu, &rvu->pf[RVU_AFPF], BLKADDR_NIX0,
 				(rvu->hw->block[BLKADDR_NIX0].lf.max));
 	return 0;
-err_dl:
-	rvu_unregister_dl(rvu);
 err_irq:
 	rvu_unregister_interrupts(rvu);
+	if (sriov_done)
+		rvu_disable_sriov(rvu);
 err_flr:
 	rvu_flr_wq_destroy(rvu);
 err_mbox:
-- 
2.34.1



* [PATCH net-next V3 6/7] octeontx2-pf: Register devlink after SR-IOV state init
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
                   ` (4 preceding siblings ...)
  2026-06-05 18:10 ` [PATCH net-next V3 5/7] octeontx2-af: Register devlink after SR-IOV init Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 18:10 ` [PATCH net-next V3 7/7] devlink: Add eswitch mode boot defaults Mark Bloch
  2026-06-05 19:37 ` [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Borislav Petkov
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)

A later patch makes devlink registration the point where devlink core may
call eswitch_mode_set() to apply a boot-time default eswitch mode.

Move octeontx2 PF devlink registration after PF SR-IOV configuration state
is initialized, so representor creation has the state it needs.

Add a separate unwind label so failures after devlink registration
unregister devlink before cleaning up SR-IOV state.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../ethernet/marvell/octeontx2/nic/otx2_pf.c    | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index f9fbf0c17648..47d2c6a24636 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -3278,14 +3278,14 @@ static int otx2_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (err)
 		goto err_mcam_flow_del;
 
-	err = otx2_register_dl(pf);
-	if (err)
-		goto err_mcam_flow_del;
-
 	/* Initialize SR-IOV resources */
 	err = otx2_sriov_vfcfg_init(pf);
 	if (err)
-		goto err_pf_sriov_init;
+		goto err_shutdown_tc;
+
+	err = otx2_register_dl(pf);
+	if (err)
+		goto err_sriov_cleannup;
 
 	/* Enable link notifications */
 	otx2_cgx_config_linkevents(pf, true);
@@ -3293,7 +3293,7 @@ static int otx2_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	pf->af_xdp_zc_qidx = bitmap_zalloc(qcount, GFP_KERNEL);
 	if (!pf->af_xdp_zc_qidx) {
 		err = -ENOMEM;
-		goto err_sriov_cleannup;
+		goto err_dl_unregister;
 	}
 
 #ifdef CONFIG_DCB
@@ -3310,10 +3310,11 @@ static int otx2_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 err_free_zc_bmap:
 	bitmap_free(pf->af_xdp_zc_qidx);
 #endif
+err_dl_unregister:
+	otx2_unregister_dl(pf);
 err_sriov_cleannup:
 	otx2_sriov_vfcfg_cleanup(pf);
-err_pf_sriov_init:
-	otx2_unregister_dl(pf);
+err_shutdown_tc:
 	otx2_shutdown_tc(pf);
 err_mcam_flow_del:
 	otx2_mcam_flow_del(pf);
-- 
2.34.1



* [PATCH net-next V3 7/7] devlink: Add eswitch mode boot defaults
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
                   ` (5 preceding siblings ...)
  2026-06-05 18:10 ` [PATCH net-next V3 6/7] octeontx2-pf: Register devlink after SR-IOV state init Mark Bloch
@ 2026-06-05 18:10 ` Mark Bloch
  2026-06-05 19:37 ` [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Borislav Petkov
  7 siblings, 0 replies; 10+ messages in thread
From: Mark Bloch @ 2026-06-05 18:10 UTC (permalink / raw)

Add devlink_eswitch_mode= command line support for setting a default
eswitch mode during device initialization.

The supported syntax selects either all devlink handles or one
explicit comma-separated handle list:

  devlink_eswitch_mode=*=<mode>
  devlink_eswitch_mode=<handle>[,<handle>...]=<mode>

where <mode> is one of legacy, switchdev or switchdev_inactive. All
selected handles receive the same mode. Assigning different modes to
different handle lists in the same parameter value is not supported.

The default is applied through the existing eswitch_mode_set() devlink
operation, matching the userspace devlink eswitch mode command. devlink
core applies it when a matching devlink instance is registered and after a
successful devlink reload that performed DRIVER_REINIT, so rebuilt device
state returns to the requested boot default.
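
To make "matching devlink instance" concrete, here is a small userspace
sketch of testing one <bus-name>/<dev-name> handle against a selector. It is
an assumption about how such a check could be written, not the
implementation in net/devlink/core.c, and esw_selector_matches is a
hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when the selector is "*" or when "bus_name/dev_name" is one
 * of the entries in the comma-separated selector list.
 */
static bool esw_selector_matches(const char *selector, const char *bus_name,
				 const char *dev_name)
{
	size_t bus_len = strlen(bus_name);
	size_t dev_len = strlen(dev_name);
	const char *p = selector;

	if (!strcmp(selector, "*"))
		return true;

	while (*p) {
		/* Length of the current entry, up to ',' or end of string. */
		size_t len = strcspn(p, ",");

		/* Require an exact match, not just a shared prefix. */
		if (len == bus_len + 1 + dev_len &&
		    !strncmp(p, bus_name, bus_len) &&
		    p[bus_len] == '/' &&
		    !strncmp(p + bus_len + 1, dev_name, dev_len))
			return true;

		p += len;
		if (*p == ',')
			p++;
	}

	return false;
}
```

The length check keeps pci/0000:08:00.0 from matching a device named
0000:08:00, which a plain prefix comparison would accept.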

Document the devlink_eswitch_mode= syntax and how duplicate handles are
handled.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../admin-guide/kernel-parameters.txt         |  25 ++
 .../networking/devlink/devlink-defaults.rst   |  78 +++++
 Documentation/networking/devlink/index.rst    |   1 +
 net/devlink/core.c                            | 271 ++++++++++++++++++
 net/devlink/dev.c                             |   3 +
 net/devlink/devl_internal.h                   |   1 +
 6 files changed, 379 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-defaults.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b3fdbbe3b3cc..b4fcc7f81166 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1246,6 +1246,31 @@ Kernel parameters
 	dell_smm_hwmon.fan_max=
 			[HW] Maximum configurable fan speed.
 
+	devlink_eswitch_mode=
+			[NET]
+			Format:
+			<selector>=<mode>
+
+			<selector>:
+			* | <handle>[,<handle>...]
+
+			<handle>:
+			<bus-name>/<dev-name>
+
+			Configure default devlink eswitch mode for matching
+			devlink instances during device initialization.
+
+			<mode>:
+			legacy | switchdev | switchdev_inactive
+
+			Examples:
+			devlink_eswitch_mode=*=switchdev
+			devlink_eswitch_mode=pci/0000:08:00.0=switchdev
+			devlink_eswitch_mode=pci/0000:08:00.0,pci/0000:09:00.1=switchdev_inactive
+
+			See Documentation/networking/devlink/devlink-defaults.rst
+			for the full syntax.
+
 	dfltcc=		[HW,S390]
 			Format: { on | off | def_only | inf_only | always }
 			on:       s390 zlib hardware support for compression on
diff --git a/Documentation/networking/devlink/devlink-defaults.rst b/Documentation/networking/devlink/devlink-defaults.rst
new file mode 100644
index 000000000000..380c9e99210e
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-defaults.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Devlink Eswitch Mode Defaults
+==============================
+
+Devlink eswitch mode defaults allow the eswitch mode to be provided on the
+kernel command line and applied to matching devlink instances during device
+initialization.
+
+The devlink device is selected by its devlink handle. For PCI devices this is
+the same handle shown by ``devlink dev show``, for example
+``pci/0000:08:00.0``.
+
+Kernel command line syntax
+==========================
+
+Defaults are specified with the ``devlink_eswitch_mode=`` kernel command line
+parameter.
+
+The general syntax is::
+
+  devlink_eswitch_mode=<selector>=<mode>
+
+``<selector>`` is either ``*`` or one or more devlink handles::
+
+  * | <bus-name>/<dev-name>[,<bus-name>/<dev-name>...]
+
+``*`` applies the mode to every devlink instance. All handles in the same
+selector receive the same eswitch mode.
+
+``<mode>`` is one of ``legacy``, ``switchdev`` or ``switchdev_inactive``.
+
+Syntax rules
+------------
+
+The following syntax rules apply:
+
+* Specify the default in one ``devlink_eswitch_mode=`` parameter. Repeated
+  ``devlink_eswitch_mode=`` parameters are not accumulated; the last one wins.
+* The ``devlink_eswitch_mode=`` value is limited by the kernel command line
+  size.
+* Whitespace is not allowed within the parameter value.
+* ``<selector>`` must be either ``*`` or a handle list. ``*`` cannot be
+  combined with explicit handles.
+* ``<bus-name>`` and ``<dev-name>`` must not be empty.
+* ``<dev-name>`` may contain ``:``. This allows PCI names such as
+  ``0000:08:00.0``.
+* Handles must not contain whitespace, ``*``, ``=`` or more than one ``/``.
+* A comma separates handles.
+* Comma-separated default assignments are not supported.
+* Duplicate handles are rejected and the devlink eswitch mode default is
+  ignored.
+
+The eswitch mode default corresponds to the userspace command::
+
+  devlink dev eswitch set <handle> mode <value>
+
+
+Examples
+========
+
+Set all devlink instances to switchdev mode::
+
+  devlink_eswitch_mode=*=switchdev
+
+Set one PCI devlink instance to switchdev mode::
+
+  devlink_eswitch_mode=pci/0000:08:00.0=switchdev
+
+Set two PCI devlink instances to switchdev inactive mode::
+
+  devlink_eswitch_mode=pci/0000:08:00.0,pci/0000:09:00.1=switchdev_inactive
+
+The following is invalid because comma-separated default assignments are not
+supported::
+
+  devlink_eswitch_mode=pci/0000:08:00.0=switchdev,pci/0000:09:00.0=switchdev_inactive
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index f7ba7dcf477d..0d27a7008b14 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -56,6 +56,7 @@ general.
    :maxdepth: 1
 
    devlink-dpipe
+   devlink-defaults
    devlink-eswitch-attr
    devlink-flash
    devlink-health
diff --git a/net/devlink/core.c b/net/devlink/core.c
index fe9f6a0a67d5..2111bffb628f 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -4,6 +4,10 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/string.h>
 #include <net/genetlink.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/devlink.h>
@@ -16,6 +20,230 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
 
 DEFINE_XARRAY_FLAGS(devlinks, XA_FLAGS_ALLOC);
 
+static char *devlink_default_esw_mode_param;
+static bool devlink_default_esw_mode_match_all;
+static enum devlink_eswitch_mode devlink_default_esw_mode;
+static LIST_HEAD(devlink_default_esw_mode_nodes);
+
+struct devlink_default_esw_mode_node {
+	struct list_head list;
+	char *bus_name;
+	char *dev_name;
+};
+
+static int __init
+devlink_default_esw_mode_to_value(const char *str,
+				  enum devlink_eswitch_mode *mode)
+{
+	if (!strcmp(str, "legacy")) {
+		*mode = DEVLINK_ESWITCH_MODE_LEGACY;
+		return 0;
+	}
+	if (!strcmp(str, "switchdev")) {
+		*mode = DEVLINK_ESWITCH_MODE_SWITCHDEV;
+		return 0;
+	}
+	if (!strcmp(str, "switchdev_inactive")) {
+		*mode = DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static int __init
+devlink_default_esw_mode_handle_parse(char *handle, char **bus_name,
+				      char **dev_name)
+{
+	char *slash;
+	char *p;
+
+	if (!*handle)
+		return -EINVAL;
+
+	for (p = handle; *p; p++) {
+		if (*p == '*' || *p == '=')
+			return -EINVAL;
+	}
+
+	slash = strchr(handle, '/');
+	if (!slash || slash == handle || !slash[1])
+		return -EINVAL;
+	if (strchr(slash + 1, '/'))
+		return -EINVAL;
+
+	*slash = '\0';
+
+	*bus_name = handle;
+	*dev_name = slash + 1;
+	return 0;
+}
+
+static struct devlink_default_esw_mode_node *
+devlink_default_esw_mode_node_find(const char *bus_name, const char *dev_name)
+{
+	struct devlink_default_esw_mode_node *node;
+
+	list_for_each_entry(node, &devlink_default_esw_mode_nodes, list) {
+		if (!strcmp(node->bus_name, bus_name) &&
+		    !strcmp(node->dev_name, dev_name))
+			return node;
+	}
+
+	return NULL;
+}
+
+static int __init
+devlink_default_esw_mode_node_add(const char *bus_name, const char *dev_name)
+{
+	struct devlink_default_esw_mode_node *node;
+
+	if (devlink_default_esw_mode_node_find(bus_name, dev_name))
+		return -EEXIST;
+
+	node = kzalloc_obj(*node);
+	if (!node)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&node->list);
+	node->bus_name = kstrdup(bus_name, GFP_KERNEL);
+	node->dev_name = kstrdup(dev_name, GFP_KERNEL);
+	if (!node->bus_name || !node->dev_name) {
+		kfree(node->bus_name);
+		kfree(node->dev_name);
+		kfree(node);
+		return -ENOMEM;
+	}
+
+	list_add_tail(&node->list, &devlink_default_esw_mode_nodes);
+	return 0;
+}
+
+static int __init devlink_default_esw_mode_handles_parse(char *handles)
+{
+	char *handle;
+	int err;
+
+	if (!strcmp(handles, "*")) {
+		devlink_default_esw_mode_match_all = true;
+		return 0;
+	}
+
+	while ((handle = strsep(&handles, ",")) != NULL) {
+		char *bus_name;
+		char *dev_name;
+
+		err = devlink_default_esw_mode_handle_parse(handle, &bus_name,
+							    &dev_name);
+		if (err)
+			return err;
+
+		err = devlink_default_esw_mode_node_add(bus_name, dev_name);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static void __init
+devlink_default_esw_mode_node_free(struct devlink_default_esw_mode_node *node)
+{
+	kfree(node->bus_name);
+	kfree(node->dev_name);
+	kfree(node);
+}
+
+static void __init devlink_default_esw_mode_nodes_clear(void)
+{
+	struct devlink_default_esw_mode_node *node;
+	struct devlink_default_esw_mode_node *node_tmp;
+
+	list_for_each_entry_safe(node, node_tmp,
+				 &devlink_default_esw_mode_nodes, list) {
+		list_del(&node->list);
+		devlink_default_esw_mode_node_free(node);
+	}
+
+	devlink_default_esw_mode_match_all = false;
+}
+
+static int __init devlink_default_esw_mode_parse(char *str)
+{
+	char *handles;
+	char *separator;
+	char *mode;
+	enum devlink_eswitch_mode esw_mode;
+	int err;
+
+	if (!*str)
+		return -EINVAL;
+
+	separator = strrchr(str, '=');
+	if (!separator || separator == str || !separator[1])
+		return -EINVAL;
+
+	*separator = '\0';
+	handles = str;
+	mode = separator + 1;
+
+	err = devlink_default_esw_mode_to_value(mode, &esw_mode);
+	if (err)
+		return err;
+
+	err = devlink_default_esw_mode_handles_parse(handles);
+	if (err)
+		devlink_default_esw_mode_nodes_clear();
+	else
+		devlink_default_esw_mode = esw_mode;
+
+	return err;
+}
+
+static bool devlink_default_esw_mode_match(struct devlink *devlink)
+{
+	const char *bus_name = devlink_bus_name(devlink);
+	const char *dev_name = devlink_dev_name(devlink);
+	struct devlink_default_esw_mode_node *node;
+
+	if (devlink_default_esw_mode_match_all)
+		return true;
+
+	node = devlink_default_esw_mode_node_find(bus_name, dev_name);
+	return !!node;
+}
+
+void devlink_apply_default_esw_mode(struct devlink *devlink)
+{
+	const struct devlink_ops *ops = devlink->ops;
+	int err;
+
+	devl_assert_locked(devlink);
+
+	if (!devlink_default_esw_mode_match(devlink))
+		return;
+
+	if (!ops->eswitch_mode_set) {
+		if (!devlink_default_esw_mode_match_all)
+			devl_warn(devlink,
+				  "devlink_eswitch_mode= selected this device but eswitch mode setting is not supported\n");
+		return;
+	}
+
+	err = ops->eswitch_mode_set(devlink, devlink_default_esw_mode, NULL);
+	if (err)
+		devl_warn(devlink,
+			  "Couldn't apply default eswitch mode, err %d\n",
+			  err);
+}
+
+static int __init devlink_default_esw_mode_setup(char *str)
+{
+	devlink_default_esw_mode_param = str;
+	return 1;
+}
+__setup("devlink_eswitch_mode=", devlink_default_esw_mode_setup);
+
 static struct devlink *devlinks_xa_get(unsigned long index)
 {
 	struct devlink *devlink;
@@ -382,6 +610,20 @@ struct devlink *devlinks_xa_lookup_get(struct net *net, unsigned long index)
 /**
  * devl_register - Register devlink instance
  * @devlink: devlink
+ *
+ * Make @devlink visible to userspace. Drivers must call this only after the
+ * instance is fully initialized and its devlink operations can be called.
+ *
+ * If a matching devlink_eswitch_mode= default was provided on the kernel
+ * command line, devlink core applies it before devl_register() returns.
+ * Drivers implementing eswitch_mode_set() must therefore be ready to perform
+ * the same work as a userspace eswitch mode set request from this point,
+ * including creation of representors and other eswitch state.
+ *
+ * Context: Caller must hold the devlink instance lock. Use devlink_register()
+ * when the lock is not already held.
+ *
+ * Return: 0 on success.
  */
 int devl_register(struct devlink *devlink)
 {
@@ -391,6 +633,7 @@ int devl_register(struct devlink *devlink)
 	xa_set_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
 	devlink_notify_register(devlink);
 	devlink_rel_nested_in_notify(devlink);
+	devlink_apply_default_esw_mode(devlink);
 
 	return 0;
 }
@@ -580,6 +823,31 @@ static int __init devlink_init(void)
 {
 	int err;
 
+	if (devlink_default_esw_mode_param) {
+		char *def;
+
+		def = kstrdup(devlink_default_esw_mode_param, GFP_KERNEL);
+		if (!def) {
+			devlink_default_esw_mode_param = NULL;
+			pr_warn("devlink: devlink_eswitch_mode parameter ignored, failed to allocate memory\n");
+		} else {
+			err = devlink_default_esw_mode_parse(def);
+			kfree(def);
+			if (err == -EEXIST) {
+				devlink_default_esw_mode_param = NULL;
+				pr_warn("devlink: duplicate eswitch mode handles ignored\n");
+			} else if (err == -EINVAL) {
+				devlink_default_esw_mode_param = NULL;
+				pr_warn("devlink: invalid devlink_eswitch_mode parameter ignored\n");
+			} else if (err == -ENOMEM) {
+				devlink_default_esw_mode_param = NULL;
+				pr_warn("devlink: devlink_eswitch_mode parameter ignored, failed to allocate memory\n");
+			} else if (err) {
+				goto out;
+			}
+		}
+	}
+
 	err = register_pernet_subsys(&devlink_pernet_ops);
 	if (err)
 		goto out;
@@ -595,7 +863,10 @@ static int __init devlink_init(void)
 out_unreg_pernet_subsys:
 	unregister_pernet_subsys(&devlink_pernet_ops);
 out:
+	if (err)
+		devlink_default_esw_mode_nodes_clear();
 	WARN_ON(err);
+
 	return err;
 }
 
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index 57b2b8f03543..0b4a831465e8 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -478,6 +478,9 @@ int devlink_reload(struct devlink *devlink, struct net *dest_net,
 		return err;
 
 	WARN_ON(!(*actions_performed & BIT(action)));
+	if (*actions_performed & BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT))
+		devlink_apply_default_esw_mode(devlink);
+
 	/* Catch driver on updating the remote action within devlink reload */
 	WARN_ON(memcmp(remote_reload_stats, devlink->stats.remote_reload_stats,
 		       sizeof(remote_reload_stats)));
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index e4e48ee2da5a..12557b65248d 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -71,6 +71,7 @@ extern struct genl_family devlink_nl_family;
 struct devlink *__devlink_alloc(const struct devlink_ops *ops, size_t priv_size,
 				struct net *net, struct device *dev,
 				const struct device_driver *dev_driver);
+void devlink_apply_default_esw_mode(struct devlink *devlink);
 
 #define devl_warn(devlink, format, args...)				\
 	do {								\
-- 
2.34.1



* Re: [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults
  2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
                   ` (6 preceding siblings ...)
  2026-06-05 18:10 ` [PATCH net-next V3 7/7] devlink: Add eswitch mode boot defaults Mark Bloch
@ 2026-06-05 19:37 ` Borislav Petkov
  7 siblings, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2026-06-05 19:37 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jonathan Corbet, Shuah Khan, Jiri Pirko,
	Simon Horman, Sunil Goutham, Linu Cherian, Geetha sowjanya,
	hariprasad, Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Andrew Morton, Randy Dunlap,
	Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Dave Hansen,
	Vlastimil Babka, Christian Brauner, Tejun Heo, Feng Tang,
	Dapeng Mi, Kees Cook, Marco Elver, Eric Biggers, Li RongQing,
	Paul E. McKenney, Ethan Nelson-Moore, linux-doc, linux-kernel,
	netdev, linux-rdma

On Fri, Jun 05, 2026 at 09:10:23PM +0300, Mark Bloch wrote:
> This series adds a devlink_eswitch_mode= kernel command line parameter for
> applying a default devlink eswitch mode during device initialization.

Please trim your Cc list. There's nothing I can do about networking. You're
CCing too many people unnecessarily.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH net-next V3 2/7] netdevsim: Register devlink after device init
  2026-06-05 18:10 ` [PATCH net-next V3 2/7] netdevsim: Register devlink after device init Mark Bloch
@ 2026-06-10 23:50   ` Jakub Kicinski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2026-06-10 23:50 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Ethan Nelson-Moore, linux-doc,
	netdev, linux-rdma

On Fri, 5 Jun 2026 21:10:25 +0300 Mark Bloch wrote:
> devl_register() makes the devlink instance visible to userspace. A later
> patch also makes registration the point where devlink core may call
> eswitch_mode_set() to apply a boot-time default eswitch mode.
> 
> Move netdevsim registration after all objects (resources, params, regions,
> traps, debugfs etc) are initialized, and after the initial eswitch mode is
> set to legacy.
> 
> Move devl_unregister() to the beginning of nsim_drv_remove(), before those
> devlink objects are torn down. This keeps devlink register/unregister as
> the notification barrier and makes the later object teardown paths run
> after devlink is no longer registered, so they do not emit their own
> netlink DEL notifications.

This is going backwards. At some point someone from nVidia thought that
we can order our way out of locking, so mlx5 is likely ordered this way,
but this must not be required, or in any way normalized.
We (syzbot) quickly discovered that it doesn't cover all corner cases.
devl_lock() is exposed specifically to allow the driver to finish
whatever init it needs without letting user space invoke callbacks, yet.
Almost (?) all driver callbacks hold devl_lock(), so maybe the devlink
instance is "visible" to user space but that should not matter.


Thread overview: 10+ messages
2026-06-05 18:10 [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 1/7] devlink: Skip health recover notifications before register Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 2/7] netdevsim: Register devlink after device init Mark Bloch
2026-06-10 23:50   ` Jakub Kicinski
2026-06-05 18:10 ` [PATCH net-next V3 3/7] net/mlx5: Clear FW reset-in-progress bit before reload Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 4/7] net/mlx5: Register devlink after device init Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 5/7] octeontx2-af: Register devlink after SR-IOV init Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 6/7] octeontx2-pf: Register devlink after SR-IOV state init Mark Bloch
2026-06-05 18:10 ` [PATCH net-next V3 7/7] devlink: Add eswitch mode boot defaults Mark Bloch
2026-06-05 19:37 ` [PATCH net-next V3 0/7] devlink: Add boot-time eswitch mode defaults Borislav Petkov
