* [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
@ 2026-05-21 7:24 ` Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-05-21 7:24 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
From: Mark Bloch <mbloch@nvidia.com>

mlx5 sets MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS when acknowledging a
sync reset request. This bit blocks devlink reload and other devlink
operations while the firmware reset is running, but it was kept set
until after the driver reload finished.

Clear the reset-in-progress bit once the reset unload flow is done and
PCI access is back, before reloading the device. For a reset initiated
through devlink, clear it before completing the reload waiter. For a
reset reported through an asynchronous firmware event, keep the unload
flow outside devl_lock, then take devl_lock before clearing the bit and
reloading through the devl-locked load helper.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../ethernet/mellanox/mlx5/core/fw_reset.c | 28 +++++++++++--------
1 file changed, 17 insertions(+), 11 deletions(-)
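
Illustration only, not part of the patch: a toy userspace model of the
step ordering described in the commit message above. Every name here
(toy_dev, toy_complete_reload, step) is a hypothetical stand-in rather
than an mlx5 symbol, and the trace string only records the order in
which the steps would run. The struct is assumed to be zero-initialized
by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model: every name here is a stand-in, not an mlx5 symbol. */
struct toy_dev {
	bool pending_comp;      /* this host initiated the reset via devlink */
	bool reset_in_progress; /* models the reset-in-progress flag bit */
	bool pci_up;            /* outcome of waiting for PCI access */
	bool devl_locked;       /* models holding the devlink instance lock */
	char trace[128];        /* ordered record of the steps taken */
};

/* Append "<name>;" to the trace without overflowing it. */
static void step(struct toy_dev *d, const char *name)
{
	strncat(d->trace, name, sizeof(d->trace) - strlen(d->trace) - 1);
	strncat(d->trace, ";", sizeof(d->trace) - strlen(d->trace) - 1);
}

static void toy_complete_reload(struct toy_dev *d)
{
	if (d->pending_comp) {
		/* devlink-initiated reset: clear the bit, then wake the waiter */
		d->reset_in_progress = false;
		step(d, "clear");
		step(d, "complete");
		return;
	}

	/* async firmware event: unload and PCI wait run outside the lock */
	assert(!d->devl_locked);
	step(d, "unload");
	step(d, "wait_pci");

	d->devl_locked = true;
	step(d, "lock");
	d->reset_in_progress = false;
	step(d, "clear");
	/* the bit is already clear by the time the device is reloaded */
	step(d, d->pci_up ? "load" : "abort");
	step(d, "report");
	d->devl_locked = false;
	step(d, "unlock");
}
```

In this model the "clear" step always precedes "load", which is the
ordering the commit message asks for; the real driver of course has
locking and error handling that a trace string cannot capture.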
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
index 07440c58713a..7283e5b49eed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -238,24 +238,30 @@ static void mlx5_fw_reset_complete_reload(struct mlx5_core_dev *dev)
{
struct mlx5_fw_reset *fw_reset = dev->priv.fw_reset;
struct devlink *devlink = priv_to_devlink(dev);
+ int err;
/* if this is the driver that initiated the fw reset, devlink completed the reload */
if (test_bit(MLX5_FW_RESET_FLAGS_PENDING_COMP, &fw_reset->reset_flags)) {
+ clear_bit(MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS,
+ &fw_reset->reset_flags);
complete(&fw_reset->done);
- } else {
- mlx5_sync_reset_unload_flow(dev, false);
- if (mlx5_health_wait_pci_up(dev))
- mlx5_core_err(dev, "reset reload flow aborted, PCI reads still not working\n");
- else
- mlx5_load_one(dev, true);
- devl_lock(devlink);
- devlink_remote_reload_actions_performed(devlink, 0,
- BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT) |
- BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE));
- devl_unlock(devlink);
+ return;
}
+ mlx5_sync_reset_unload_flow(dev, false);
+ err = mlx5_health_wait_pci_up(dev);
+
+ devl_lock(devlink);
clear_bit(MLX5_FW_RESET_FLAGS_RESET_IN_PROGRESS, &fw_reset->reset_flags);
+ if (err)
+ mlx5_core_err(dev, "reset reload flow aborted, PCI reads still not working\n");
+ else
+ mlx5_load_one_devl_locked(dev, true);
+
+ devlink_remote_reload_actions_performed(devlink, 0,
+ BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT) |
+ BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE));
+ devl_unlock(devlink);
}
static void mlx5_stop_sync_reset_poll(struct mlx5_core_dev *dev)
--
2.44.0
* [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
@ 2026-05-21 7:24 ` Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
3 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-05-21 7:24 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko
From: Mark Bloch <mbloch@nvidia.com>

Add devlink_eswitch_mode= command line support for setting an eswitch
mode during device initialization.

The supported syntax selects either all devlink handles or one explicit
comma-separated handle list:

  devlink_eswitch_mode=[*]:<mode>
  devlink_eswitch_mode=[<handle>[,<handle>...]]:<mode>

where <mode> is one of legacy, switchdev or switchdev_inactive. All
selected handles receive the same mode. Assigning different modes to
different handle lists in the same parameter value is not supported.

The default is applied through the existing eswitch_mode_set() devlink
operation, matching the userspace devlink eswitch set command.

Expose devl_apply_default_esw_mode() so drivers can apply the default at
the point where their devlink instance and eswitch operations are ready.

Document the devlink_eswitch_mode= syntax and duplicate handle handling.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../admin-guide/kernel-parameters.txt | 25 ++
.../networking/devlink/devlink-defaults.rst | 80 ++++++
Documentation/networking/devlink/index.rst | 1 +
include/net/devlink.h | 1 +
net/devlink/core.c | 255 ++++++++++++++++++
5 files changed, 362 insertions(+)
create mode 100644 Documentation/networking/devlink/devlink-defaults.rst
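
Illustration only, not part of the patch: a minimal userspace sketch of
the [<selector>]:<mode> syntax described in the commit message above.
The names (toy_parse, toy_esw_default, the TOY_* limits) are
hypothetical and the fixed-size buffers are a simplification; the
kernel code in this patch uses a linked list and different helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_HANDLES 8   /* arbitrary limits for this sketch only */
#define TOY_HANDLE_LEN  64

enum toy_mode { TOY_LEGACY, TOY_SWITCHDEV, TOY_SWITCHDEV_INACTIVE };

struct toy_esw_default {
	bool match_all;      /* selector was "*" */
	enum toy_mode mode;
	int nr_handles;
	char handles[TOY_MAX_HANDLES][TOY_HANDLE_LEN];
};

static int toy_parse_mode(const char *s, enum toy_mode *mode)
{
	if (!strcmp(s, "legacy"))
		*mode = TOY_LEGACY;
	else if (!strcmp(s, "switchdev"))
		*mode = TOY_SWITCHDEV;
	else if (!strcmp(s, "switchdev_inactive"))
		*mode = TOY_SWITCHDEV_INACTIVE;
	else
		return -1;
	return 0;
}

/* One handle: "<bus-name>/<dev-name>", exactly one '/', no ':' in bus name. */
static int toy_check_handle(const char *h)
{
	const char *slash = strchr(h, '/');
	const char *colon = strchr(h, ':');

	if (!slash || slash == h || !slash[1] || strchr(slash + 1, '/'))
		return -1;
	if (strpbrk(h, "[]* \t"))
		return -1;
	if (colon && colon < slash)
		return -1;
	return 0;
}

/* Returns 0 on success, -1 on any syntax error (out is zeroed on error). */
static int toy_parse(const char *value, struct toy_esw_default *out)
{
	char buf[256];
	char *handles, *close, *tok, *next;
	int i;

	assert(out);
	memset(out, 0, sizeof(*out));
	if (!value || value[0] != '[' || strlen(value) >= sizeof(buf))
		return -1;
	strcpy(buf, value);

	handles = buf + 1;
	close = strchr(handles, ']');
	if (!close || close == handles || close[1] != ':' || !close[2])
		return -1;
	*close = '\0';
	if (toy_parse_mode(close + 2, &out->mode))
		return -1;

	if (!strcmp(handles, "*")) {
		out->match_all = true;
		return 0;
	}

	/* Split on ',' by hand so that empty entries are rejected. */
	for (tok = handles; tok; tok = next) {
		next = strchr(tok, ',');
		if (next)
			*next++ = '\0';
		if (toy_check_handle(tok) || strlen(tok) >= TOY_HANDLE_LEN ||
		    out->nr_handles == TOY_MAX_HANDLES)
			goto fail;
		for (i = 0; i < out->nr_handles; i++)
			if (!strcmp(out->handles[i], tok))
				goto fail;  /* duplicate handle */
		strcpy(out->handles[out->nr_handles++], tok);
	}
	return 0;
fail:
	memset(out, 0, sizeof(*out));
	return -1;
}
```

With this sketch, a value that tries to chain two groups is rejected
because everything after the first "]:" is treated as the mode string,
which then fails to match any known mode.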
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7834ee927310..f87ae561c0dc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1278,6 +1278,31 @@ Kernel parameters
dell_smm_hwmon.fan_max=
[HW] Maximum configurable fan speed.
+ devlink_eswitch_mode=
+ [NET]
+ Format:
+ [<selector>]:<mode>
+
+ <selector>:
+ * | <handle>[,<handle>...]
+
+ <handle>:
+ <bus-name>/<dev-name>
+
+ Configure default devlink eswitch mode for matching
+ devlink instances during device initialization.
+
+ <mode>:
+ legacy | switchdev | switchdev_inactive
+
+ Examples:
+ devlink_eswitch_mode=[*]:switchdev
+ devlink_eswitch_mode=[pci/0000:08:00.0]:switchdev
+ devlink_eswitch_mode=[pci/0000:08:00.0,pci/0000:09:00.1]:legacy
+
+ See Documentation/networking/devlink/devlink-defaults.rst
+ for the full syntax.
+
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
diff --git a/Documentation/networking/devlink/devlink-defaults.rst b/Documentation/networking/devlink/devlink-defaults.rst
new file mode 100644
index 000000000000..b554e75eeeea
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-defaults.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Devlink Eswitch Mode Defaults
+==============================
+
+Devlink eswitch mode defaults allow the eswitch mode to be provided on the
+kernel command line and applied to matching devlink instances during device
+initialization.
+
+The devlink device is selected by its devlink handle. For PCI devices this is
+the same handle shown by ``devlink dev show``, for example
+``pci/0000:08:00.0``.
+
+Kernel command line syntax
+==========================
+
+Defaults are specified with the ``devlink_eswitch_mode=`` kernel command line
+parameter.
+
+The general syntax is::
+
+ devlink_eswitch_mode=[<selector>]:<mode>
+
+``<selector>`` is either ``*`` or one or more devlink handles::
+
+ * | <bus-name>/<dev-name>[,<bus-name>/<dev-name>...]
+
+``*`` applies the mode to every devlink instance. All handles in the same
+``[]`` list receive the same eswitch mode.
+
+``<mode>`` is one of ``legacy``, ``switchdev`` or ``switchdev_inactive``.
+
+Syntax rules
+------------
+
+The following syntax rules apply:
+
+* Specify the default in one ``devlink_eswitch_mode=`` parameter. Repeated
+ ``devlink_eswitch_mode=`` parameters are not accumulated.
+* The ``devlink_eswitch_mode=`` value is limited by the kernel command line
+ size.
+* Whitespace is not allowed within the parameter value.
+* ``<selector>`` must be either ``*`` or a handle list. ``*`` cannot be
+ combined with explicit handles.
+* ``<bus-name>`` and ``<dev-name>`` must not be empty.
+* ``<bus-name>`` must not contain ``:``.
+* ``<dev-name>`` may contain ``:``. This allows PCI names such as
+ ``0000:08:00.0``.
+* Handles must not contain whitespace, ``[``, ``]``, ``*`` or more than one
+ ``/``.
+* A comma inside ``[]`` separates handles.
+* Comma-separated default groups are not supported.
+* Duplicate handles are rejected and the devlink eswitch mode default is
+ ignored.
+
+The eswitch mode default corresponds to the userspace command::
+
+ devlink dev eswitch set <handle> mode <value>
+
+
+Examples
+========
+
+Set all devlink instances to switchdev mode::
+
+ devlink_eswitch_mode=[*]:switchdev
+
+Set one PCI devlink instance to switchdev mode::
+
+ devlink_eswitch_mode=[pci/0000:08:00.0]:switchdev
+
+Set two PCI devlink instances to legacy mode::
+
+ devlink_eswitch_mode=[pci/0000:08:00.0,pci/0000:09:00.1]:legacy
+
+The following is invalid because comma-separated default groups are not
+supported::
+
+ devlink_eswitch_mode=[pci/0000:08:00.0]:switchdev,[pci/0000:09:00.0]:switchdev_inactive
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index f7ba7dcf477d..0d27a7008b14 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -56,6 +56,7 @@ general.
:maxdepth: 1
devlink-dpipe
+ devlink-defaults
devlink-eswitch-attr
devlink-flash
devlink-health
diff --git a/include/net/devlink.h b/include/net/devlink.h
index bcd31de1f890..98885f7c6c10 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1622,6 +1622,7 @@ int devl_trylock(struct devlink *devlink);
void devl_unlock(struct devlink *devlink);
void devl_assert_locked(struct devlink *devlink);
bool devl_lock_is_held(struct devlink *devlink);
+int devl_apply_default_esw_mode(struct devlink *devlink);
DEFINE_GUARD(devl, struct devlink *, devl_lock(_T), devl_unlock(_T));
struct ib_device;
diff --git a/net/devlink/core.c b/net/devlink/core.c
index eeb6a71f5f56..4bc1734878d1 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -4,6 +4,10 @@
* Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
*/
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/string.h>
#include <net/genetlink.h>
#define CREATE_TRACE_POINTS
#include <trace/events/devlink.h>
@@ -16,6 +20,233 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
DEFINE_XARRAY_FLAGS(devlinks, XA_FLAGS_ALLOC);
+static char *devlink_default_esw_mode_param;
+static bool devlink_default_esw_mode_match_all;
+static enum devlink_eswitch_mode devlink_default_esw_mode;
+static LIST_HEAD(devlink_default_esw_mode_nodes);
+
+struct devlink_default_esw_mode_node {
+ struct list_head list;
+ char *bus_name;
+ char *dev_name;
+};
+
+static int __init
+devlink_default_esw_mode_to_value(const char *str,
+ enum devlink_eswitch_mode *mode)
+{
+ if (!strcmp(str, "legacy")) {
+ *mode = DEVLINK_ESWITCH_MODE_LEGACY;
+ return 0;
+ }
+ if (!strcmp(str, "switchdev")) {
+ *mode = DEVLINK_ESWITCH_MODE_SWITCHDEV;
+ return 0;
+ }
+ if (!strcmp(str, "switchdev_inactive")) {
+ *mode = DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static int devlink_default_esw_mode_apply(struct devlink *devlink)
+{
+ const struct devlink_ops *ops = devlink->ops;
+
+ if (!ops->eswitch_mode_set)
+ return -EOPNOTSUPP;
+
+ return ops->eswitch_mode_set(devlink, devlink_default_esw_mode,
+ NULL);
+}
+
+static int __init
+devlink_default_esw_mode_handle_parse(char *handle, char **bus_name,
+ char **dev_name)
+{
+ char *slash;
+ char *p;
+
+ if (!handle || !*handle)
+ return -EINVAL;
+
+ for (p = handle; *p; p++) {
+ if (*p == '[' || *p == ']' || *p == '*')
+ return -EINVAL;
+ }
+
+ slash = strchr(handle, '/');
+ if (!slash || slash == handle || !slash[1])
+ return -EINVAL;
+ if (strchr(slash + 1, '/'))
+ return -EINVAL;
+
+ *slash = '\0';
+ if (strchr(handle, ':'))
+ return -EINVAL;
+
+ *bus_name = handle;
+ *dev_name = slash + 1;
+ return 0;
+}
+
+static struct devlink_default_esw_mode_node *
+devlink_default_esw_mode_node_find(const char *bus_name, const char *dev_name)
+{
+ struct devlink_default_esw_mode_node *node;
+
+ list_for_each_entry(node, &devlink_default_esw_mode_nodes, list) {
+ if (!strcmp(node->bus_name, bus_name) &&
+ !strcmp(node->dev_name, dev_name))
+ return node;
+ }
+
+ return NULL;
+}
+
+static int __init
+devlink_default_esw_mode_node_add(const char *bus_name, const char *dev_name)
+{
+ struct devlink_default_esw_mode_node *node;
+
+ if (devlink_default_esw_mode_node_find(bus_name, dev_name))
+ return -EEXIST;
+
+ node = kzalloc_obj(*node);
+ if (!node)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&node->list);
+ node->bus_name = kstrdup(bus_name, GFP_KERNEL);
+ node->dev_name = kstrdup(dev_name, GFP_KERNEL);
+ if (!node->bus_name || !node->dev_name) {
+ kfree(node->bus_name);
+ kfree(node->dev_name);
+ kfree(node);
+ return -ENOMEM;
+ }
+
+ list_add_tail(&node->list, &devlink_default_esw_mode_nodes);
+ return 0;
+}
+
+static int __init devlink_default_esw_mode_handles_parse(char *handles)
+{
+ char *handle;
+ int err;
+
+ if (!strcmp(handles, "*")) {
+ devlink_default_esw_mode_match_all = true;
+ return 0;
+ }
+
+ while ((handle = strsep(&handles, ",")) != NULL) {
+ char *bus_name;
+ char *dev_name;
+
+ err = devlink_default_esw_mode_handle_parse(handle, &bus_name,
+ &dev_name);
+ if (err)
+ return err;
+
+ err = devlink_default_esw_mode_node_add(bus_name, dev_name);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static void __init
+devlink_default_esw_mode_node_free(struct devlink_default_esw_mode_node *node)
+{
+ kfree(node->bus_name);
+ kfree(node->dev_name);
+ kfree(node);
+}
+
+static void __init devlink_default_esw_mode_nodes_clear(void)
+{
+ struct devlink_default_esw_mode_node *node;
+ struct devlink_default_esw_mode_node *node_tmp;
+
+ list_for_each_entry_safe(node, node_tmp,
+ &devlink_default_esw_mode_nodes, list) {
+ list_del(&node->list);
+ devlink_default_esw_mode_node_free(node);
+ }
+
+ devlink_default_esw_mode_match_all = false;
+}
+
+static int __init devlink_default_esw_mode_parse(char *str)
+{
+ char *handles_end;
+ char *handles;
+ char *mode;
+ int err;
+
+ if (!str || *str != '[')
+ return -EINVAL;
+
+ handles = str + 1;
+ handles_end = strchr(handles, ']');
+ if (!handles_end || handles_end[1] != ':' || !handles_end[2])
+ return -EINVAL;
+
+ *handles_end = '\0';
+ mode = handles_end + 2;
+ if (!*handles)
+ return -EINVAL;
+
+ err = devlink_default_esw_mode_to_value(mode,
+ &devlink_default_esw_mode);
+ if (err)
+ return err;
+
+ err = devlink_default_esw_mode_handles_parse(handles);
+ if (err)
+ devlink_default_esw_mode_nodes_clear();
+
+ return err;
+}
+
+/**
+ * devl_apply_default_esw_mode - Apply default eswitch mode to devlink instance
+ * @devlink: devlink
+ *
+ * The caller must hold the devlink instance lock.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int devl_apply_default_esw_mode(struct devlink *devlink)
+{
+ const char *bus_name = devlink_bus_name(devlink);
+ const char *dev_name = devlink_dev_name(devlink);
+ struct devlink_default_esw_mode_node *node;
+
+ devl_assert_locked(devlink);
+
+ if (devlink_default_esw_mode_match_all)
+ return devlink_default_esw_mode_apply(devlink);
+
+ node = devlink_default_esw_mode_node_find(bus_name, dev_name);
+ if (node)
+ return devlink_default_esw_mode_apply(devlink);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(devl_apply_default_esw_mode);
+
+static int __init devlink_default_esw_mode_setup(char *str)
+{
+ devlink_default_esw_mode_param = str;
+ return 1;
+}
+__setup("devlink_eswitch_mode=", devlink_default_esw_mode_setup);
+
static struct devlink *devlinks_xa_get(unsigned long index)
{
struct devlink *devlink;
@@ -578,6 +809,27 @@ static int __init devlink_init(void)
{
int err;
+ if (devlink_default_esw_mode_param) {
+ char *def;
+
+ def = kstrdup(devlink_default_esw_mode_param, GFP_KERNEL);
+ if (!def) {
+ err = -ENOMEM;
+ goto out;
+ }
+ err = devlink_default_esw_mode_parse(def);
+ kfree(def);
+ if (err == -EEXIST) {
+ devlink_default_esw_mode_param = NULL;
+ pr_warn("devlink: duplicate eswitch mode handles ignored\n");
+ } else if (err == -EINVAL) {
+ devlink_default_esw_mode_param = NULL;
+ pr_warn("devlink: invalid devlink_eswitch_mode parameter ignored\n");
+ } else if (err) {
+ goto out;
+ }
+ }
+
err = register_pernet_subsys(&devlink_pernet_ops);
if (err)
goto out;
@@ -593,7 +845,10 @@ static int __init devlink_init(void)
out_unreg_pernet_subsys:
unregister_pernet_subsys(&devlink_pernet_ops);
out:
+ if (err)
+ devlink_default_esw_mode_nodes_clear();
WARN_ON(err);
+
return err;
}
--
2.44.0
* [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 1/3] net/mlx5: Clear FW reset-in-progress bit before reload Tariq Toukan
2026-05-21 7:24 ` [PATCH net-next 2/3] devlink: Add eswitch mode boot defaults Tariq Toukan
@ 2026-05-21 7:24 ` Tariq Toukan
2026-05-21 13:16 ` Mark Bloch
2026-05-26 7:44 ` Jiri Pirko
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
3 siblings, 2 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-05-21 7:24 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
From: Mark Bloch <mbloch@nvidia.com>

Apply devlink default eswitch mode for mlx5 devices after successful
device initialization while holding the devlink instance lock.

At this point the devlink instance is registered and the mlx5 devlink
operations are available, so the default eswitch mode can be applied to
the matching PCI devlink handle.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
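
Illustration only, not part of the patch: a toy userspace model of how
a boot-time default could be matched against one devlink handle and
applied at the end of driver init. All names (toy_devlink, toy_default,
toy_apply_default, toy_driver_init_tail) are hypothetical; the sketch
assumes a single explicit handle for brevity and, like the commit
message above, treats a failed apply as a warning rather than an init
failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are kernel symbols. */
struct toy_devlink {
	const char *bus_name;
	const char *dev_name;
	int mode;                                          /* current mode */
	int (*mode_set)(struct toy_devlink *dl, int mode); /* may be NULL */
};

struct toy_default {
	bool match_all;        /* selector was "*" */
	int mode;              /* mode requested on the command line */
	const char *bus_name;  /* one explicit handle, or NULL if none */
	const char *dev_name;
};

static int toy_mode_set(struct toy_devlink *dl, int mode)
{
	dl->mode = mode;
	return 0;
}

/* Returns 0 when nothing matches or the mode was set, negative errno otherwise. */
static int toy_apply_default(const struct toy_default *def,
			     struct toy_devlink *dl)
{
	bool match;

	assert(def && dl);
	match = def->match_all ||
		(def->bus_name && def->dev_name &&
		 !strcmp(def->bus_name, dl->bus_name) &&
		 !strcmp(def->dev_name, dl->dev_name));
	if (!match)
		return 0;            /* no default for this instance */
	if (!dl->mode_set)
		return -EOPNOTSUPP;  /* instance has no eswitch mode op */
	return dl->mode_set(dl, def->mode);
}

/* Driver-side tail of init: warn on failure, but still report success. */
static bool toy_driver_init_tail(const struct toy_default *def,
				 struct toy_devlink *dl)
{
	int err = toy_apply_default(def, dl);

	if (err)
		fprintf(stderr, "could not apply default eswitch mode, err %d\n",
			err);
	return true;
}
```

The point of the split is that the matching policy lives in one helper
while the driver decides when it is safe to call it and how to react to
an error.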
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 0c6e4efe38c8..4528097f3d84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
mlx5_free_bfreg(dev, &dev->priv.bfreg);
}
+static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
+{
+ struct devlink *devlink = priv_to_devlink(dev);
+ int err;
+
+ if (!MLX5_ESWITCH_MANAGER(dev))
+ return;
+
+ devl_assert_locked(devlink);
+ err = devl_apply_default_esw_mode(devlink);
+ if (err)
+ mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
+ err);
+}
+
int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
{
bool light_probe = mlx5_dev_is_lightweight(dev);
@@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
mutex_unlock(&dev->intf_state_mutex);
+ mlx5_devl_apply_default_esw_mode(dev);
return 0;
err_register:
@@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
goto err_attach;
mutex_unlock(&dev->intf_state_mutex);
+ mlx5_devl_apply_default_esw_mode(dev);
return 0;
err_attach:
--
2.44.0
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
@ 2026-05-21 13:16 ` Mark Bloch
2026-05-21 13:41 ` Thomas Weißschuh
2026-05-26 7:44 ` Jiri Pirko
1 sibling, 1 reply; 15+ messages in thread
From: Mark Bloch @ 2026-05-21 13:16 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Thomas Weißschuh,
Thomas Gleixner, Arnd Bergmann
Cc: Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Borislav Petkov (AMD),
Andrew Morton, Randy Dunlap, Thomas Gleixner, Petr Mladek,
Peter Zijlstra (Intel), Tejun Heo, Vlastimil Babka, Feng Tang,
Christian Brauner, Dave Hansen, Dapeng Mi, Kees Cook, Marco Elver,
Li RongQing, Eric Biggers, Paul E. McKenney, linux-doc,
linux-kernel, netdev, linux-rdma, Gal Pressman, Dragos Tatulea,
Jiri Pirko, Shay Drori, Moshe Shemesh
On 21/05/2026 10:24, Tariq Toukan wrote:
> From: Mark Bloch <mbloch@nvidia.com>
>
> Apply devlink default eswitch mode for mlx5 devices after successful
> device initialization while holding the devlink instance lock.
>
> At this point the devlink instance is registered and the mlx5 devlink
> operations are available, so the default eswitch mode can be applied to
> the matching PCI devlink handle.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Reviewed-by: Shay Drori <shayd@nvidia.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 0c6e4efe38c8..4528097f3d84 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
> mlx5_free_bfreg(dev, &dev->priv.bfreg);
> }
>
> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
> +{
> + struct devlink *devlink = priv_to_devlink(dev);
> + int err;
> +
> + if (!MLX5_ESWITCH_MANAGER(dev))
> + return;
> +
> + devl_assert_locked(devlink);
> + err = devl_apply_default_esw_mode(devlink);
> + if (err)
> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
> + err);
> +}
> +
> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> {
> bool light_probe = mlx5_dev_is_lightweight(dev);
> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>
> mutex_unlock(&dev->intf_state_mutex);
> + mlx5_devl_apply_default_esw_mode(dev);
> return 0;
>
> err_register:
> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
> goto err_attach;
>
> mutex_unlock(&dev->intf_state_mutex);
> + mlx5_devl_apply_default_esw_mode(dev);
> return 0;
>
> err_attach:

NIPA flagged this patch with a build_allmodconfig_warn failure:
https://netdev-ctrl.bots.linux.dev/logs/build/1098506/14585935/build_allmodconfig_warn/

I do not see how this mlx5 patch is related to the reported issue,
but I looked into it anyway.

After the kernel has been built once, the issue can be reproduced by rerunning sparse
only on version.o, which filters out the unrelated noise. I had an older sparse installed,
so I used a local copy:

  rm -f arch/x86/boot/version.o
  make V=1 C=1 CHECK=/labhome/mbloch/bin/sparse arch/x86/boot/version.o

This gives the same error reported by NIPA:

...
...
make -f ./scripts/Makefile.vmlinux
make -f ./scripts/Makefile.build obj=arch/x86/boot arch/x86/boot/bzImage
make -f ./scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
# CC arch/x86/boot/version.o
gcc -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version -c -o arch/x86/boot/version.o arch/x86/boot/version.c
# CHECK arch/x86/boot/version.c
/labhome/mbloch/bin/sparse -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ -Wbitwise -Wno-return-void -Wno-unknown-attribute -D__x86_64__ --arch=x86 -mlittle-endian -m64 -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c
arch/x86/boot/version.c: note: in included file (through arch/x86/include/uapi/asm/bitsperlong.h, include/uapi/asm-generic/int-ll64.h, include/asm-generic/int-ll64.h, include/uapi/asm-generic/types.h, ...):
./include/asm-generic/bitsperlong.h:23:2: error: Inconsistent word size. Check asm/bitsperlong.h
./include/asm-generic/bitsperlong.h:27:33: error: static assertion failed: "Inconsistent word size. Check asm/bitsperlong.h"
# cmd_gen_symversions_c arch/x86/boot/version.o
if nm arch/x86/boot/version.o 2>/dev/null | grep -q ' __export_symbol_'; then gcc -E -D__GENKSYMS__ -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c | ./scripts/genksyms/genksyms >> arch/x86/boot/.version.o.cmd; fi
# LD arch/x86/boot/setup.elf
ld -m elf_x86_64 -z noexecstack -m elf_i386 -z noexecstack -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
# OBJCOPY arch/x86/boot/setup.bin
objcopy -O binary arch/x86/boot/setup.elf arch/x86/boot/setup.bin
# BUILD arch/x86/boot/bzImage
(dd if=arch/x86/boot/setup.bin bs=4k conv=sync status=none; cat arch/x86/boot/vmlinux.bin) >arch/x86/boot/bzImage
mkdir -p ./arch/x86_64/boot
ln -fsn ../../x86/boot/bzImage ./arch/x86_64/boot/bzImage

To me this looks like sparse is getting a conflicting set of flags.
The command line contains both "-D__x86_64__ -m64" and "-m16 -march=i386 -D_SETUP".

I confirmed that the following patch "fixes" the issue, but I do not know whether
this is the right fix. This area is outside my comfort zone, so it would be
helpful if someone more familiar with the x86 build/sparse flow could take a
look:

diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 3f9fb3698d66..80923864f6f9 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -71,6 +71,10 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
+realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
+REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
+$(SETUP_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
+
sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|efi.._stub_entry\|efi\(32\)\?_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|_e\?data\|_e\?sbat\|z_.*\)$$/\#define ZO_\2 0x\1/p'
quiet_cmd_zoffset = ZOFFSET $@
diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
index a0fb39abc5c8..341b0ff20c3d 100644
--- a/arch/x86/realmode/rm/Makefile
+++ b/arch/x86/realmode/rm/Makefile
@@ -29,6 +29,10 @@ targets += $(realmode-y)
REALMODE_OBJS = $(addprefix $(obj)/,$(realmode-y))
+realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
+REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
+$(REALMODE_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
+
sed-pasyms := -n -r -e 's/^([0-9a-fA-F]+) [ABCDGRSTVW] (.+)$$/pa_\2 = \2;/p'
quiet_cmd_pasyms = PASYMS $@
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-21 13:16 ` Mark Bloch
@ 2026-05-21 13:41 ` Thomas Weißschuh
2026-05-21 21:02 ` Mark Bloch
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Weißschuh @ 2026-05-21 13:41 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Thomas Gleixner, Arnd Bergmann,
Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Borislav Petkov (AMD),
Andrew Morton, Randy Dunlap, Petr Mladek, Peter Zijlstra (Intel),
Tejun Heo, Vlastimil Babka, Feng Tang, Christian Brauner,
Dave Hansen, Dapeng Mi, Kees Cook, Marco Elver, Li RongQing,
Eric Biggers, Paul E. McKenney, linux-doc, linux-kernel, netdev,
linux-rdma, Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
On Thu, May 21, 2026 at 04:16:28PM +0300, Mark Bloch wrote:
(...)
> NIPA flagged this patch with a build_allmodconfig_warn failure:
> https://netdev-ctrl.bots.linux.dev/logs/build/1098506/14585935/build_allmodconfig_warn/
>
> I do not see how this mlx5 patch is related to the reported issue,
> but I looked into it anyway.
>
> After the kernel has been built once, the issue can be reproduced by rerunning sparse
> only on version.o, which filters out the unrelated noise. I had an older sparse installed,
> so I used a local copy:
>
> rm -f arch/x86/boot/version.o
> make V=1 C=1 CHECK=/labhome/mbloch/bin/sparse arch/x86/boot/version.o
>
> This gives the same error reported by NIPA:
>
> ...
> ...
> make -f ./scripts/Makefile.vmlinux
> make -f ./scripts/Makefile.build obj=arch/x86/boot arch/x86/boot/bzImage
> make -f ./scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
> # CC arch/x86/boot/version.o
> gcc -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version -c -o arch/x86/boot/version.o arch/x86/boot/version.c
> # CHECK arch/x86/boot/version.c
> /labhome/mbloch/bin/sparse -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ -Wbitwise -Wno-return-void -Wno-unknown-attribute -D__x86_64__ --arch=x86 -mlittle-endian -m64 -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c
> arch/x86/boot/version.c: note: in included file (through arch/x86/include/uapi/asm/bitsperlong.h, include/uapi/asm-generic/int-ll64.h, include/asm-generic/int-ll64.h, include/uapi/asm-generic/types.h, ...):
> ./include/asm-generic/bitsperlong.h:23:2: error: Inconsistent word size. Check asm/bitsperlong.h
> ./include/asm-generic/bitsperlong.h:27:33: error: static assertion failed: "Inconsistent word size. Check asm/bitsperlong.h"
> # cmd_gen_symversions_c arch/x86/boot/version.o
> if nm arch/x86/boot/version.o 2>/dev/null | grep -q ' __export_symbol_'; then gcc -E -D__GENKSYMS__ -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c | ./scripts/genksyms/genksyms >> arch/x86/boot/.version.o.cmd; fi
> # LD arch/x86/boot/setup.elf
> ld -m elf_x86_64 -z noexecstack -m elf_i386 -z noexecstack -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
> # OBJCOPY arch/x86/boot/setup.bin
> objcopy -O binary arch/x86/boot/setup.elf arch/x86/boot/setup.bin
> # BUILD arch/x86/boot/bzImage
> (dd if=arch/x86/boot/setup.bin bs=4k conv=sync status=none; cat arch/x86/boot/vmlinux.bin) >arch/x86/boot/bzImage
> mkdir -p ./arch/x86_64/boot
> ln -fsn ../../x86/boot/bzImage ./arch/x86_64/boot/bzImage
>
> To me this looks like sparse is getting a conflicting set of flags.
> The command line contains both "-D__x86_64__ -m64" and "-m16 -march=i386 -D_SETUP".
>
> I confirmed that the following patch "fixes" the issue, but I do not know whether
> this is the right fix. This area is outside my comfort zone, so it would be
> helpful if someone more familiar with the x86 build/sparse flow could take a
> look:
>
> diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
> index 3f9fb3698d66..80923864f6f9 100644
> --- a/arch/x86/boot/Makefile
> +++ b/arch/x86/boot/Makefile
> @@ -71,6 +71,10 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
>
> SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
>
> +realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
> +REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
> +$(SETUP_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
> +
> sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|efi.._stub_entry\|efi\(32\)\?_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|_e\?data\|_e\?sbat\|z_.*\)$$/\#define ZO_\2 0x\1/p'
>
> quiet_cmd_zoffset = ZOFFSET $@
> diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
> index a0fb39abc5c8..341b0ff20c3d 100644
> --- a/arch/x86/realmode/rm/Makefile
> +++ b/arch/x86/realmode/rm/Makefile
> @@ -29,6 +29,10 @@ targets += $(realmode-y)
>
> REALMODE_OBJS = $(addprefix $(obj)/,$(realmode-y))
>
> +realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
> +REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
> +$(REALMODE_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
> +
The idea looks good, we do something similar for the 32-bit vDSO:

arch/x86/entry/vdso/vdso32/Makefile

CHECKFLAGS := $(subst -m64,-m32,$(CHECKFLAGS))
CHECKFLAGS := $(subst -D__x86_64__,-D__i386__,$(CHECKFLAGS))

It seems the same kind of substitution would work here.
We can add a helper function to arch/x86/Makefile and
use that also for the compat vDSO.

I am wondering why this didn't show up before.
Are you going to send a patch or should I?
Thomas
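
For readers less familiar with make, the effect of the two substitutions
quoted above can be modelled in C. This is only an illustrative sketch: the
function names are hypothetical, it is not the helper proposed for
arch/x86/Makefile, and it matches whole flags, whereas make's $(subst ...)
replaces substrings.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: map one checker flag from its 64-bit form to the
 * 32-bit form, following the two substitutions quoted above.
 * Flags that are not listed are returned unchanged (same pointer).
 */
static const char *checkflag_to_32bit(const char *flag)
{
	if (strcmp(flag, "-m64") == 0)
		return "-m32";
	if (strcmp(flag, "-D__x86_64__") == 0)
		return "-D__i386__";
	return flag;
}

/* Rewrite every flag in place; returns how many flags were changed. */
static size_t checkflags_to_32bit(const char **flags, size_t n)
{
	size_t changed = 0;

	for (size_t i = 0; i < n; i++) {
		const char *repl = checkflag_to_32bit(flags[i]);

		/* An unchanged flag comes back as the identical pointer. */
		if (repl != flags[i]) {
			flags[i] = repl;
			changed++;
		}
	}
	return changed;
}
```

Applied to a flag list containing -D__x86_64__ and -m64, the sketch leaves
every other flag (including -m16) untouched, so the checker no longer sees
the 64-bit pair next to the 16-bit real-mode flags.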
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-21 13:41 ` Thomas Weißschuh
@ 2026-05-21 21:02 ` Mark Bloch
0 siblings, 0 replies; 15+ messages in thread
From: Mark Bloch @ 2026-05-21 21:02 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Thomas Gleixner, Arnd Bergmann,
Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Borislav Petkov (AMD),
Andrew Morton, Randy Dunlap, Petr Mladek, Peter Zijlstra (Intel),
Tejun Heo, Vlastimil Babka, Feng Tang, Christian Brauner,
Dave Hansen, Dapeng Mi, Kees Cook, Marco Elver, Li RongQing,
Eric Biggers, Paul E. McKenney, linux-doc, linux-kernel, netdev,
linux-rdma, Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
On 21/05/2026 16:41, Thomas Weißschuh wrote:
> On Thu, May 21, 2026 at 04:16:28PM +0300, Mark Bloch wrote:
> (...)
>
>> NIPA flagged this patch with a build_allmodconfig_warn failure:
>> https://netdev-ctrl.bots.linux.dev/logs/build/1098506/14585935/build_allmodconfig_warn/
>>
>> I do not see how this mlx5 patch is related to the reported issue,
>> but I looked into it anyway.
>>
>> After the kernel has been built once, the issue can be reproduced by rerunning sparse
>> only on version.o, which filters out the unrelated noise. I had an older sparse installed,
>> so I used a local copy:
>>
>> rm -f arch/x86/boot/version.o
>> make V=1 C=1 CHECK=/labhome/mbloch/bin/sparse arch/x86/boot/version.o
>>
>> This gives the same error reported by NIPA:
>>
>> ...
>> ...
>> make -f ./scripts/Makefile.vmlinux
>> make -f ./scripts/Makefile.build obj=arch/x86/boot arch/x86/boot/bzImage
>> make -f ./scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
>> # CC arch/x86/boot/version.o
>> gcc -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version -c -o arch/x86/boot/version.o arch/x86/boot/version.c
>> # CHECK arch/x86/boot/version.c
>> /labhome/mbloch/bin/sparse -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ -Wbitwise -Wno-return-void -Wno-unknown-attribute -D__x86_64__ --arch=x86 -mlittle-endian -m64 -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c
>> arch/x86/boot/version.c: note: in included file (through arch/x86/include/uapi/asm/bitsperlong.h, include/uapi/asm-generic/int-ll64.h, include/asm-generic/int-ll64.h, include/uapi/asm-generic/types.h, ...):
>> ./include/asm-generic/bitsperlong.h:23:2: error: Inconsistent word size. Check asm/bitsperlong.h
>> ./include/asm-generic/bitsperlong.h:27:33: error: static assertion failed: "Inconsistent word size. Check asm/bitsperlong.h"
>> # cmd_gen_symversions_c arch/x86/boot/version.o
>> if nm arch/x86/boot/version.o 2>/dev/null | grep -q ' __export_symbol_'; then gcc -E -D__GENKSYMS__ -Wp,-MMD,arch/x86/boot/.version.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fms-extensions -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS -Wall -Wstrict-prototypes -march=i386 -mregparm=3 -fno-strict-aliasing -fomit-frame-pointer -fno-pic -mno-mmx -mno-sse -fcf-protection=none -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -mpreferred-stack-boundary=2 -D_SETUP -fno-asynchronous-unwind-tables -Wimplicit-fallthrough=5 -DKBUILD_MODFILE='"arch/x86/boot/version"' -DKBUILD_BASENAME='"version"' -DKBUILD_MODNAME='"version"' -D__KBUILD_MODNAME=version arch/x86/boot/version.c | ./scripts/genksyms/genksyms >> arch/x86/boot/.version.o.cmd; fi
>> # LD arch/x86/boot/setup.elf
>> ld -m elf_x86_64 -z noexecstack -m elf_i386 -z noexecstack -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
>> # OBJCOPY arch/x86/boot/setup.bin
>> objcopy -O binary arch/x86/boot/setup.elf arch/x86/boot/setup.bin
>> # BUILD arch/x86/boot/bzImage
>> (dd if=arch/x86/boot/setup.bin bs=4k conv=sync status=none; cat arch/x86/boot/vmlinux.bin) >arch/x86/boot/bzImage
>> mkdir -p ./arch/x86_64/boot
>> ln -fsn ../../x86/boot/bzImage ./arch/x86_64/boot/bzImage
>>
>> To me this looks like sparse is getting a conflicting set of flags.
>> The command line contains both "-D__x86_64__ -m64" and "-m16 -march=i386 -D_SETUP".
>>
>> I confirmed that the following patch "fixes" the issue, but I do not know whether
>> this is the right fix. This area is outside my comfort zone, so it would be
>> helpful if someone more familiar with the x86 build/sparse flow could take a
>> look:
>>
>> diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
>> index 3f9fb3698d66..80923864f6f9 100644
>> --- a/arch/x86/boot/Makefile
>> +++ b/arch/x86/boot/Makefile
>> @@ -71,6 +71,10 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
>>
>> SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
>>
>> +realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
>> +REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
>> +$(SETUP_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
>> +
>> sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|efi.._stub_entry\|efi\(32\)\?_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|_e\?data\|_e\?sbat\|z_.*\)$$/\#define ZO_\2 0x\1/p'
>>
>> quiet_cmd_zoffset = ZOFFSET $@
>> diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
>> index a0fb39abc5c8..341b0ff20c3d 100644
>> --- a/arch/x86/realmode/rm/Makefile
>> +++ b/arch/x86/realmode/rm/Makefile
>> @@ -29,6 +29,10 @@ targets += $(realmode-y)
>>
>> REALMODE_OBJS = $(addprefix $(obj)/,$(realmode-y))
>>
>> +realmode-checkflags-$(CONFIG_X86_64) := -m32 -U__x86_64__ -D__i386__
>> +REALMODE_CHECKFLAGS := $(filter-out -m64 -D__x86_64__,$(CHECKFLAGS)) $(realmode-checkflags-y)
>> +$(REALMODE_OBJS): CHECKFLAGS := $(REALMODE_CHECKFLAGS)
>> +
>
> The idea looks good, we do something similar for the 32-bit vDSO:
>
> arch/x86/entry/vdso/vdso32/Makefile
>
> CHECKFLAGS := $(subst -m64,-m32,$(CHECKFLAGS))
> CHECKFLAGS := $(subst -D__x86_64__,-D__i386__,$(CHECKFLAGS))
>
> It seems the same kind of substitution would work here.
> We can add a helper function to arch/x86/Makefile and
> use that also for the compat vDSO.
>
> I am wondering why this didn't show up before.
> Are you going to send a patch or should I?
>
Yes, please take it if you don't mind.
Mark
>
> Thomas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
2026-05-21 13:16 ` Mark Bloch
@ 2026-05-26 7:44 ` Jiri Pirko
2026-05-26 9:44 ` Mark Bloch
1 sibling, 1 reply; 15+ messages in thread
From: Jiri Pirko @ 2026-05-26 7:44 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>From: Mark Bloch <mbloch@nvidia.com>
>
>Apply devlink default eswitch mode for mlx5 devices after successful
>device initialization while holding the devlink instance lock.
>
>At this point the devlink instance is registered and the mlx5 devlink
>operations are available, so the default eswitch mode can be applied to
>the matching PCI devlink handle.
>
>Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>Reviewed-by: Shay Drori <shayd@nvidia.com>
>Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>---
> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>index 0c6e4efe38c8..4528097f3d84 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>@@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
> mlx5_free_bfreg(dev, &dev->priv.bfreg);
> }
>
>+static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>+{
>+ struct devlink *devlink = priv_to_devlink(dev);
>+ int err;
>+
>+ if (!MLX5_ESWITCH_MANAGER(dev))
>+ return;
>+
>+ devl_assert_locked(devlink);
>+ err = devl_apply_default_esw_mode(devlink);
>+ if (err)
>+ mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>+ err);
>+}
>+
> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> {
> bool light_probe = mlx5_dev_is_lightweight(dev);
>@@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>
> mutex_unlock(&dev->intf_state_mutex);
>+ mlx5_devl_apply_default_esw_mode(dev);
I wonder how we can make this work for all. I mean, other drivers would
silently ignore this command line arg, right? Any idea how to make all
drivers follow the arg from the very beginning?
> return 0;
>
> err_register:
>@@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
> goto err_attach;
>
> mutex_unlock(&dev->intf_state_mutex);
>+ mlx5_devl_apply_default_esw_mode(dev);
> return 0;
>
> err_attach:
>--
>2.44.0
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-26 7:44 ` Jiri Pirko
@ 2026-05-26 9:44 ` Mark Bloch
2026-05-26 14:07 ` Jiri Pirko
0 siblings, 1 reply; 15+ messages in thread
From: Mark Bloch @ 2026-05-26 9:44 UTC (permalink / raw)
To: Jiri Pirko, Tariq Toukan
Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Borislav Petkov (AMD),
Andrew Morton, Randy Dunlap, Thomas Gleixner, Petr Mladek,
Peter Zijlstra (Intel), Tejun Heo, Vlastimil Babka, Feng Tang,
Christian Brauner, Dave Hansen, Dapeng Mi, Kees Cook, Marco Elver,
Li RongQing, Eric Biggers, Paul E. McKenney, linux-doc,
linux-kernel, netdev, linux-rdma, Gal Pressman, Dragos Tatulea,
Jiri Pirko, Shay Drori, Moshe Shemesh
On 26/05/2026 10:44, Jiri Pirko wrote:
> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>> From: Mark Bloch <mbloch@nvidia.com>
>>
>> Apply devlink default eswitch mode for mlx5 devices after successful
>> device initialization while holding the devlink instance lock.
>>
>> At this point the devlink instance is registered and the mlx5 devlink
>> operations are available, so the default eswitch mode can be applied to
>> the matching PCI devlink handle.
>>
>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index 0c6e4efe38c8..4528097f3d84 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>> }
>>
>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>> +{
>> + struct devlink *devlink = priv_to_devlink(dev);
>> + int err;
>> +
>> + if (!MLX5_ESWITCH_MANAGER(dev))
>> + return;
>> +
>> + devl_assert_locked(devlink);
>> + err = devl_apply_default_esw_mode(devlink);
>> + if (err)
>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>> + err);
>> +}
>> +
>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>> {
>> bool light_probe = mlx5_dev_is_lightweight(dev);
>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>
>> mutex_unlock(&dev->intf_state_mutex);
>> + mlx5_devl_apply_default_esw_mode(dev);
>
> I wonder how we can make this work for all. I mean, other drivers would
> silently ignore this command line arg, right? Any idea how to make all
> drivers follow the arg from the very beginning?
>
I have a follow-up series that adds the call to all drivers which support
setting eswitch mode. When going over the other drivers, what I found is
that the right point to apply the default is driver-specific. The patches
I have so far:

46e16c6d9836 net: Apply devlink esw mode defaults
ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
b48cce1607bb liquidio: Apply devlink default eswitch mode during init
4ea54b0fe04a ice: Apply devlink default eswitch mode during init
b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
5000e4c3d768 nfp: Apply devlink default eswitch mode during init
97a218e95e41 netdevsim: Apply devlink default eswitch mode during init

I don't think doing this generically from devlink is realistic. devlink
doesn't really know when a given driver is ready to change eswitch mode.
Some drivers need SR-IOV state, representor setup, or other init pieces to
be ready first, and the locking is not identical across drivers either.

Also, since this knob is only about eswitch mode, I don't think we need to
touch every devlink driver. Drivers that don't implement eswitch_mode_set()
would just ignore it anyway. The follow-up only wires the default into
drivers that actually support changing eswitch mode.

Mark
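
The driver-side wiring described above can be sketched as a small userspace
model. Everything prefixed with mock_ is a hypothetical stand-in (not the
real mlx5 or devlink structures); the only behaviour taken from the patch is
the shape of the wrapper: skip devices that are not eswitch managers, call
the devlink helper, and warn without failing init when it returns an error.

```c
#include <stdbool.h>
#include <stdio.h>

/* Mock device state; a stand-in, not the real struct mlx5_core_dev. */
struct mock_dev {
	bool esw_manager;   /* models the MLX5_ESWITCH_MANAGER(dev) check */
	bool init_ok;       /* whether the mocked init sequence succeeds */
	int apply_result;   /* value the mocked devlink helper returns */
	int apply_calls;    /* how often the helper was invoked */
	int warnings;       /* how many warnings were logged */
};

/* Mock of the devlink helper; returns 0 or a negative error code. */
static int mock_devl_apply_default_esw_mode(struct mock_dev *dev)
{
	dev->apply_calls++;
	return dev->apply_result;
}

/* Driver-side wrapper: skip non-managers, warn but do not fail on error. */
static void mock_apply_default_esw_mode(struct mock_dev *dev)
{
	int err;

	if (!dev->esw_manager)
		return;

	err = mock_devl_apply_default_esw_mode(dev);
	if (err) {
		fprintf(stderr, "Couldn't apply default eswitch mode, err %d\n", err);
		dev->warnings++;
	}
}

/* Driver-chosen call site: only after its own init has fully succeeded. */
static int mock_init_one(struct mock_dev *dev)
{
	if (!dev->init_ok)
		return -1;

	mock_apply_default_esw_mode(dev);
	return 0;
}
```

The point of the model is that the call site lives in the driver, so each
driver can place it after whatever init steps it needs, at the cost of one
explicit call per driver.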
>
>> return 0;
>>
>> err_register:
>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>> goto err_attach;
>>
>> mutex_unlock(&dev->intf_state_mutex);
>> + mlx5_devl_apply_default_esw_mode(dev);
>> return 0;
>>
>> err_attach:
>> --
>> 2.44.0
>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-26 9:44 ` Mark Bloch
@ 2026-05-26 14:07 ` Jiri Pirko
2026-05-26 15:03 ` Mark Bloch
0 siblings, 1 reply; 15+ messages in thread
From: Jiri Pirko @ 2026-05-26 14:07 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
Simon Horman, Saeed Mahameed, Leon Romanovsky,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>
>
>On 26/05/2026 10:44, Jiri Pirko wrote:
>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>> From: Mark Bloch <mbloch@nvidia.com>
>>>
>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>> device initialization while holding the devlink instance lock.
>>>
>>> At this point the devlink instance is registered and the mlx5 devlink
>>> operations are available, so the default eswitch mode can be applied to
>>> the matching PCI devlink handle.
>>>
>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>> ---
>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>> 1 file changed, 17 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>> index 0c6e4efe38c8..4528097f3d84 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>> }
>>>
>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>> +{
>>> + struct devlink *devlink = priv_to_devlink(dev);
>>> + int err;
>>> +
>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>> + return;
>>> +
>>> + devl_assert_locked(devlink);
>>> + err = devl_apply_default_esw_mode(devlink);
>>> + if (err)
>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>> + err);
>>> +}
>>> +
>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>> {
>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>
>>> mutex_unlock(&dev->intf_state_mutex);
>>> + mlx5_devl_apply_default_esw_mode(dev);
>>
>> I wonder how we can make this work for all. I mean, other drivers would
>> silently ignore this command line arg, right? Any idea how to make all
>> drivers follow the arg from the very beginning?
>>
>
>I have a follow-up series that adds the call to all drivers which support
>setting eswitch mode. When going over the other drivers, what I found is
>that the right point to apply the default is driver-specific. The patches
>I have so far:
>
>46e16c6d9836 net: Apply devlink esw mode defaults
>ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>
>I don't think doing this generically from devlink is realistic. devlink
>doesn't really know when a given driver is ready to change eswitch mode.
>Some drivers need SR-IOV state, representor setup, or other init pieces to
>be ready first, and the locking is not identical across drivers either.

The low-hanging fruit would be to just call ops->eswitch_mode_set at the end
of devl_register(). Multiple reasons:

1) The end of devl_register() is exactly the point where userspace is free
   to issue the eswitch mode set. The driver should be ready to handle it.
2) All drivers would transparently get this functionality, without
   actually knowing this kernel command line arg ever existed, and without
   an odd wiring call to a related exported function. I strongly prefer that.
3) You should add a warning there for the case where this arg is passed yet
   the driver does not implement eswitch_mode_set. The user should
   get feedback like this, not a silent ignore.

The only loose end I see is the void return of devl_register().
There are multiple ways to handle a possibly failed eswitch_mode_set(). I
would probably just go for pr_warn, which seems to be the most correct.

Make sense?
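
A minimal userspace sketch of the flow proposed above, under stated
assumptions: all mock_ names and types are hypothetical, the op signature is
simplified compared with the in-kernel eswitch_mode_set (which also takes an
extack argument), and fprintf stands in for pr_warn. It is not the devlink
implementation, only a model of "apply at the end of register, warn on a
missing op or on failure".

```c
#include <stddef.h>
#include <stdio.h>

enum mock_esw_mode { MOCK_ESW_MODE_LEGACY, MOCK_ESW_MODE_SWITCHDEV };

struct mock_devlink;

struct mock_devlink_ops {
	/* Simplified stand-in for the driver's eswitch_mode_set op. */
	int (*eswitch_mode_set)(struct mock_devlink *dl, enum mock_esw_mode mode);
};

struct mock_devlink {
	const struct mock_devlink_ops *ops;
	enum mock_esw_mode mode;
	int warnings;   /* counts warnings so the behaviour is observable */
};

/* Models the boot-time default: whether one was given, and its value. */
struct mock_boot_default {
	int set;
	enum mock_esw_mode mode;
};

/* Registration returns void, so failures can only be reported as warnings. */
static void mock_devl_register(struct mock_devlink *dl,
			       const struct mock_boot_default *def)
{
	int err;

	/* ... the actual registration work would happen here ... */

	if (!def->set)
		return;

	if (!dl->ops || !dl->ops->eswitch_mode_set) {
		fprintf(stderr, "default eswitch mode given, but driver cannot set it\n");
		dl->warnings++;
		return;
	}

	err = dl->ops->eswitch_mode_set(dl, def->mode);
	if (err) {
		fprintf(stderr, "failed to apply default eswitch mode, err %d\n", err);
		dl->warnings++;
	}
}

/* Example driver op that accepts the requested mode. */
static int mock_set_ok(struct mock_devlink *dl, enum mock_esw_mode mode)
{
	dl->mode = mode;
	return 0;
}

/* Example driver op that is not ready yet and rejects the request. */
static int mock_set_fail(struct mock_devlink *dl, enum mock_esw_mode mode)
{
	(void)dl;
	(void)mode;
	return -16;
}
```

In this model the driver needs no extra call: a driver with a working op ends
up in the requested mode, while a missing op or a failing op only produces a
warning and leaves the mode unchanged.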
>
>Also, since this knob is only about eswitch mode, I don't think we need to
>touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>would just ignore it anyway. The follow-up only wires the default into
>drivers that actually support changing eswitch mode.
>
>Mark
>
>>
>>> return 0;
>>>
>>> err_register:
>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>> goto err_attach;
>>>
>>> mutex_unlock(&dev->intf_state_mutex);
>>> + mlx5_devl_apply_default_esw_mode(dev);
>>> return 0;
>>>
>>> err_attach:
>>> --
>>> 2.44.0
>>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-26 14:07 ` Jiri Pirko
@ 2026-05-26 15:03 ` Mark Bloch
2026-05-26 16:23 ` Jiri Pirko
0 siblings, 1 reply; 15+ messages in thread
From: Mark Bloch @ 2026-05-26 15:03 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
Simon Horman, Saeed Mahameed, Leon Romanovsky,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
On 26/05/2026 17:07, Jiri Pirko wrote:
> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>
>>
>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>
>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>> device initialization while holding the devlink instance lock.
>>>>
>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>> operations are available, so the default eswitch mode can be applied to
>>>> the matching PCI devlink handle.
>>>>
>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>> ---
>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>> 1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>> }
>>>>
>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>> +{
>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>> + int err;
>>>> +
>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>> + return;
>>>> +
>>>> + devl_assert_locked(devlink);
>>>> + err = devl_apply_default_esw_mode(devlink);
>>>> + if (err)
>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>> + err);
>>>> +}
>>>> +
>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> {
>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>
>>>> mutex_unlock(&dev->intf_state_mutex);
>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>
>>> I wonder how we can make this work for all. I mean, other driver would
>>> silently ignore this command like arg, right? Any idea how to make all
>>> drivers follow the arg from very beginning?
>>>
>>
>> I have a follow-up series that adds the call to all drivers which support
>> setting eswitch mode. When going over the other drivers, what I found is
>> that the right point to apply the default is driver specific, drivers
>> I have patch for:
>>
>> 46e16c6d9836 net: Apply devlink esw mode defaults
>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>
>> I don't think doing this generically from devlink is realistic. devlink
>> doesn't really know when a given driver is ready to change eswitch mode.
>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>> be ready first, and the locking is not identical across drivers either.
>
>
> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
> of register. Multiple reasons:
>
> 1) end of devl_register is exactly the point userspace is free to issue
> the eswitch mode set. Driver should be ready to handle it.
> 2) all drivers would transparently get this functionality, without
> actually knowing this kernel command line arg ever existed, without
> odd wiring call of related exported function. I prefer that stongly.
> 3) you should add a there warning for the case this arg is passed yet
> the driver does not implement eswitch_mode_set. User should
> get a feedback like this, not silent ignore.
>
> The only loose end is see it the void return of devl_register().
> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
> probably just go for pr_warn, seems to be the most correct.
>
> Make sense?

I see the point, but I don't think devl_register() (at least not the only place)
is the right place.

There is a small but important difference between userspace doing
"devlink eswitch set" after register is done, and devlink core calling
eswitch_mode_set() from inside the register flow.

Some drivers call devlink_register() while holding the device lock.
liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
there, we may start the full eswitch mode change while holding that lock.
That mode change can create representors, register netdevs, take rtnl,
allocate resources, etc. I don't think we want this to become an implicit
side effect of devlink registration.

For mlx5, the placement after intf_state_mutex is also intentional:

mutex_unlock(&dev->intf_state_mutex);
mlx5_devl_apply_default_esw_mode(dev);

We can't call it while holding intf_state_mutex because the mode set path
takes it internally, and switchdev mode may also create IB representors.
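
As a rough illustration of the ordering constraint above, here is a userspace
toy model, not mlx5 code. Every name in it (toy_mutex, toy_dev,
toy_eswitch_mode_set, toy_load_one) is hypothetical, and the toy mutex reports
-EDEADLK on a second lock attempt instead of hanging, purely so the outcome is
observable. It assumes only what the paragraph states: the mode set path takes
the same non-recursive mutex that the load flow holds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy non-recursive mutex: a second lock attempt returns -EDEADLK
 * instead of blocking forever, so the problem is visible in a test. */
struct toy_mutex {
	bool held;
};

static int toy_mutex_lock(struct toy_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

/* Hypothetical stand-in for a device with an interface-state mutex. */
struct toy_dev {
	struct toy_mutex intf_state_mutex;
	int esw_mode;
};

/* Stand-in for the mode set path: it takes the mutex internally. */
static int toy_eswitch_mode_set(struct toy_dev *dev, int mode)
{
	int err = toy_mutex_lock(&dev->intf_state_mutex);

	if (err)
		return err;
	dev->esw_mode = mode;
	toy_mutex_unlock(&dev->intf_state_mutex);
	return 0;
}

/* Load flow. apply_before_unlock selects the broken ordering, where the
 * default is applied while the load flow still holds the mutex. */
static int toy_load_one(struct toy_dev *dev, int default_mode,
			bool apply_before_unlock)
{
	int err = toy_mutex_lock(&dev->intf_state_mutex);

	if (err)
		return err;

	if (apply_before_unlock) {
		err = toy_eswitch_mode_set(dev, default_mode);
		toy_mutex_unlock(&dev->intf_state_mutex);
		return err;
	}

	toy_mutex_unlock(&dev->intf_state_mutex);
	return toy_eswitch_mode_set(dev, default_mode);
}
```

In this model, toy_load_one(&dev, 1, true) fails with -EDEADLK and leaves the
mode untouched, while toy_load_one(&dev, 1, false) succeeds. With a real kernel
mutex the first ordering would block rather than return an error.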

Also, devl_register() only covers the first registration. The mlx5 call in
mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
In those flows devlink is already registered, so devl_register() is not
called again, but the driver state was rebuilt and we may need to apply the
default again.

Same for reload, fw reset and pci recovery in general. If the driver tears
down and rebuilds eswitch related state, the place to apply the default is
in that driver's reinit flow, not in devl_register().
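
The point about reinit flows can be sketched with another small userspace
model. All names (toy_rdev, toy_init_one, toy_reload) are hypothetical, and the
model assumes that rebuilding driver state puts the eswitch back into a
built-in legacy mode, which is a simplification for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_esw_mode { TOY_ESW_LEGACY, TOY_ESW_SWITCHDEV };

/* Hypothetical device state for the model. */
struct toy_rdev {
	bool registered;
	int register_calls;
	enum toy_esw_mode mode;
};

/* Assumption: a rebuild resets the eswitch to its built-in mode. */
static void toy_rebuild_state(struct toy_rdev *dev)
{
	dev->mode = TOY_ESW_LEGACY;
}

static void toy_apply_default(struct toy_rdev *dev, enum toy_esw_mode def)
{
	dev->mode = def;
}

/* First probe: the only place where registration happens, so the only
 * place a register-time hook could apply the default. */
static void toy_init_one(struct toy_rdev *dev, enum toy_esw_mode def)
{
	toy_rebuild_state(dev);
	dev->registered = true;
	dev->register_calls++;
	toy_apply_default(dev, def);
}

/* Reload or recovery: the instance is already registered, so no
 * registration happens; only the driver can reapply the default. */
static void toy_reload(struct toy_rdev *dev, enum toy_esw_mode def,
		       bool driver_applies_default)
{
	assert(dev->registered);
	toy_rebuild_state(dev);
	if (driver_applies_default)
		toy_apply_default(dev, def);
}
```

After toy_init_one() the model is in switchdev mode. A toy_reload() with
driver_applies_default set to false drops it back to legacy while
register_calls stays at 1, which is the gap a register-only hook would leave.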

When I went over the other drivers, the right place was not always the same
as devlink registration. I'm not an expert in any of them, so I hope I got
the details right, but for example octeontx2 AF needs sr-iov and the
representor switch state to be initialized first. nfp can do it after
app/vNIC init while the devlink lock is already held. liquidio should do it
only after dropping the PCI device lock.

Mark
>
>
>>
>> Also, since this knob is only about eswitch mode, I don't think we need to
>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>> would just ignore it anyway. The follow-up only wires the default into
>> drivers that actually support changing eswitch mode.
>>
>> Mark
>>
>>>
>>>> return 0;
>>>>
>>>> err_register:
>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>> goto err_attach;
>>>>
>>>> mutex_unlock(&dev->intf_state_mutex);
>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>> return 0;
>>>>
>>>> err_attach:
>>>> --
>>>> 2.44.0
>>>>
>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-26 15:03 ` Mark Bloch
@ 2026-05-26 16:23 ` Jiri Pirko
2026-05-26 17:13 ` Mark Bloch
0 siblings, 1 reply; 15+ messages in thread
From: Jiri Pirko @ 2026-05-26 16:23 UTC (permalink / raw)
To: Mark Bloch
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
Simon Horman, Saeed Mahameed, Leon Romanovsky,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:
>
>
>On 26/05/2026 17:07, Jiri Pirko wrote:
>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>>
>>>
>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>
>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>> device initialization while holding the devlink instance lock.
>>>>>
>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>> operations are available, so the default eswitch mode can be applied to
>>>>> the matching PCI devlink handle.
>>>>>
>>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>>> ---
>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>> 1 file changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>> }
>>>>>
>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>> +{
>>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>>> + int err;
>>>>> +
>>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>>> + return;
>>>>> +
>>>>> + devl_assert_locked(devlink);
>>>>> + err = devl_apply_default_esw_mode(devlink);
>>>>> + if (err)
>>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>> + err);
>>>>> +}
>>>>> +
>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>> {
>>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>
>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>
>>>> I wonder how we can make this work for all. I mean, other driver would
>>>> silently ignore this command like arg, right? Any idea how to make all
>>>> drivers follow the arg from very beginning?
>>>>
>>>
>>> I have a follow-up series that adds the call to all drivers which support
>>> setting eswitch mode. When going over the other drivers, what I found is
>>> that the right point to apply the default is driver specific, drivers
>>> I have patch for:
>>>
>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>
>>> I don't think doing this generically from devlink is realistic. devlink
>>> doesn't really know when a given driver is ready to change eswitch mode.
>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>> be ready first, and the locking is not identical across drivers either.
>>
>>
>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>> of register. Multiple reasons:
>>
>> 1) end of devl_register is exactly the point userspace is free to issue
>> the eswitch mode set. Driver should be ready to handle it.
>> 2) all drivers would transparently get this functionality, without
>> actually knowing this kernel command line arg ever existed, without
>> odd wiring call of related exported function. I prefer that stongly.
>> 3) you should add a there warning for the case this arg is passed yet
>> the driver does not implement eswitch_mode_set. User should
>> get a feedback like this, not silent ignore.
>>
>> The only loose end is see it the void return of devl_register().
>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>> probably just go for pr_warn, seems to be the most correct.
>>
>> Make sense?
>
>I see the point, but I don't think devl_register() (at least not the only place)
>is the right place.
>
>There is a small but important difference between userspace doing
>"devlink eswitch set" after register is done, and devlink core calling
>eswitch_mode_set() from inside the register flow.
>
>Some drivers call devlink_register() while holding the device lock.
>liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>there, we may start the full eswitch mode change while holding that lock.
>That mode change can create representors, register netdevs, take rtnl,
>allocate resources, etc. I don't think we want this to become an implicit
>side effect of devlink registration.
I believe your AI may untangle liquidio locking :)
>
>For mlx5, the placement after intf_state_mutex is also intentional:
>
>mutex_unlock(&dev->intf_state_mutex);
>mlx5_devl_apply_default_esw_mode(dev);
>
>We can't call it while holding intf_state_mutex because the mode set path
>takes it internally, and switchdev mode may also create IB representors.
>
>Also, devl_register() only covers the first registration. The mlx5 call in
>mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>In those flows devlink is already registered, so devl_register() is not
>called again, but the driver state was rebuilt and we may need to apply the
>default again.
Call it from reload too, right?
>
>Same for reload, fw reset and pci recovery in general. If the driver tears
>down and rebuilds eswitch related state, the place to apply the default is
>in that driver's reinit flow, not in devl_register().
>
>When I went over the other drivers, the right place was not always the same
>as devlink registration. I'm not an expert in any of them, so I hope I got
>the details right, but for example octeontx2 AF needs sr-iov and the
>representor switch state to be initialized first. nfp can do it after
>app/vNIC init while the devlink lock is already held. liquidio should do it
>only after dropping the PCI device lock.
Idk, perhaps do it from devlink_post_register_work of some kind? That
would allow you to have the same locking ordering as a userspace call.
>
>Mark
>
>>
>>
>>>
>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>> would just ignore it anyway. The follow-up only wires the default into
>>> drivers that actually support changing eswitch mode.
>>>
>>> Mark
>>>
>>>>
>>>>> return 0;
>>>>>
>>>>> err_register:
>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>> goto err_attach;
>>>>>
>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>> return 0;
>>>>>
>>>>> err_attach:
>>>>> --
>>>>> 2.44.0
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
2026-05-26 16:23 ` Jiri Pirko
@ 2026-05-26 17:13 ` Mark Bloch
0 siblings, 0 replies; 15+ messages in thread
From: Mark Bloch @ 2026-05-26 17:13 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
Simon Horman, Saeed Mahameed, Leon Romanovsky,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Shay Drori,
Moshe Shemesh
On 26/05/2026 19:23, Jiri Pirko wrote:
> Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:
>>
>>
>> On 26/05/2026 17:07, Jiri Pirko wrote:
>>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:
>>>>
>>>>
>>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:
>>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>>
>>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>>> device initialization while holding the devlink instance lock.
>>>>>>
>>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>>> operations are available, so the default eswitch mode can be applied to
>>>>>> the matching PCI devlink handle.
>>>>>>
>>>>>> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
>>>>>> Reviewed-by: Shay Drori <shayd@nvidia.com>
>>>>>> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>>>>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>>>>> ---
>>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>>> 1 file changed, 17 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>>> }
>>>>>>
>>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>>> +{
>>>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>>>> + int err;
>>>>>> +
>>>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>>>> + return;
>>>>>> +
>>>>>> + devl_assert_locked(devlink);
>>>>>> + err = devl_apply_default_esw_mode(devlink);
>>>>>> + if (err)
>>>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>>> + err);
>>>>>> +}
>>>>>> +
>>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>> {
>>>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>>
>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>
>>>>> I wonder how we can make this work for all. I mean, other driver would
>>>>> silently ignore this command like arg, right? Any idea how to make all
>>>>> drivers follow the arg from very beginning?
>>>>>
>>>>
>>>> I have a follow-up series that adds the call to all drivers which support
>>>> setting eswitch mode. When going over the other drivers, what I found is
>>>> that the right point to apply the default is driver specific, drivers
>>>> I have patch for:
>>>>
>>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>>
>>>> I don't think doing this generically from devlink is realistic. devlink
>>>> doesn't really know when a given driver is ready to change eswitch mode.
>>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>>> be ready first, and the locking is not identical across drivers either.
>>>
>>>
>>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>>> of register. Multiple reasons:
>>>
>>> 1) end of devl_register is exactly the point userspace is free to issue
>>> the eswitch mode set. Driver should be ready to handle it.
>>> 2) all drivers would transparently get this functionality, without
>>> actually knowing this kernel command line arg ever existed, without
>>> odd wiring call of related exported function. I prefer that stongly.
>>> 3) you should add a there warning for the case this arg is passed yet
>>> the driver does not implement eswitch_mode_set. User should
>>> get a feedback like this, not silent ignore.
>>>
>>> The only loose end is see it the void return of devl_register().
>>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>>> probably just go for pr_warn, seems to be the most correct.
>>>
>>> Make sense?
>>
>> I see the point, but I don't think devl_register() (at least not the only place)
>> is the right place.
>>
>> There is a small but important difference between userspace doing
>> "devlink eswitch set" after register is done, and devlink core calling
>> eswitch_mode_set() from inside the register flow.
>>
>> Some drivers call devlink_register() while holding the device lock.
>> liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>> there, we may start the full eswitch mode change while holding that lock.
>> That mode change can create representors, register netdevs, take rtnl,
>> allocate resources, etc. I don't think we want this to become an implicit
>> side effect of devlink registration.
>
> I believe your AI may untangle liquidio locking :)
I didn't try to solve that one with ai. Most drivers were fairly simple
so I didn't use ai at all. bnxt was the one where I needed a bit of help :)
>
>
>>
>> For mlx5, the placement after intf_state_mutex is also intentional:
>>
>> mutex_unlock(&dev->intf_state_mutex);
>> mlx5_devl_apply_default_esw_mode(dev);
>>
>> We can't call it while holding intf_state_mutex because the mode set path
>> takes it internally, and switchdev mode may also create IB representors.
>>
>> Also, devl_register() only covers the first registration. The mlx5 call in
>> mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>> In those flows devlink is already registered, so devl_register() is not
>> called again, but the driver state was rebuilt and we may need to apply the
>> default again.
>
> Call it from reload too, right?
Yes, that was my first thought: apply it from devl_register() for the first
registration and from devlink_reload() after a successful DRIVER_REINIT.
That covers the clean devlink reload path, but... (see below)
>
>
>>
>> Same for reload, fw reset and pci recovery in general. If the driver tears
>> down and rebuilds eswitch related state, the place to apply the default is
>> in that driver's reinit flow, not in devl_register().
>>
>> When I went over the other drivers, the right place was not always the same
>> as devlink registration. I'm not an expert in any of them, so I hope I got
>> the details right, but for example octeontx2 AF needs sr-iov and the
>> representor switch state to be initialized first. nfp can do it after
>> app/vNIC init while the devlink lock is already held. liquidio should do it
>> only after dropping the PCI device lock.
>
> Idk, perhaps do it from devlink_post_register_work of some kind? That
> would allow you to have the same locking ordering as a userspace call.

I thought about a workqueue too, it was actually my first idea.

The problem is that then we race with userspace. In the mlx5 version here the
default is applied while the devlink lock is still held, before userspace can
come in and issue its own eswitch set. If we defer it to post-register work,
the devlink instance is already visible and userspace can get there first
and then we might change the user configuration.
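
The race described above can be shown with a sequential toy model that replays
the two orderings. This is only a sketch: toy_devlink, toy_user_set and the two
toy_final_mode_* helpers are hypothetical names, and the deferred variant
replays just the unlucky interleaving in which userspace runs before the
deferred work.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_mode { TOY_LEGACY, TOY_SWITCHDEV };

/* Hypothetical devlink instance for the model. */
struct toy_devlink {
	bool visible;		/* userspace can reach the instance */
	enum toy_mode mode;
};

/* A userspace request is only possible once the instance is visible. */
static void toy_user_set(struct toy_devlink *dl, enum toy_mode mode)
{
	assert(dl->visible);
	dl->mode = mode;
}

/* Default applied before the instance becomes reachable, modelling
 * "while the devlink lock is still held". */
static enum toy_mode toy_final_mode_locked(enum toy_mode def,
					   enum toy_mode user)
{
	struct toy_devlink dl = { .visible = false, .mode = TOY_LEGACY };

	dl.mode = def;		/* default first */
	dl.visible = true;	/* lock dropped, userspace may come in */
	toy_user_set(&dl, user);
	return dl.mode;
}

/* Default applied from deferred work, after userspace got there first. */
static enum toy_mode toy_final_mode_deferred(enum toy_mode def,
					     enum toy_mode user)
{
	struct toy_devlink dl = { .visible = false, .mode = TOY_LEGACY };

	dl.visible = true;	/* registration done, work still pending */
	toy_user_set(&dl, user);
	dl.mode = def;		/* deferred work overrides the user */
	return dl.mode;
}
```

With a switchdev default and a user who asks for legacy, the locked variant
ends in legacy (the user's choice survives), while the deferred variant ends in
switchdev (the user's choice is overwritten).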

Also, the bigger issue for mlx5 is not only initial registration or devlink
reload. Some recovery paths, pci resume, and fw reset flows rebuild the driver
state without going through devlink at all. I did not find a clean way for
devlink core to infer all those points by itself.

To handle that from devlink I would still need to add some API for the driver
to tell devlink "I just reinitialized, apply the default now". But once I had
that driver call, it felt simpler and clearer to let the driver call
the helper directly at the points where it knows eswitch mode is safe.

I agree that handling all of this inside devlink would be the better option.
I just couldn't make it work in a clean way.

Mark
>
>>
>> Mark
>>
>>>
>>>
>>>>
>>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>>> would just ignore it anyway. The follow-up only wires the default into
>>>> drivers that actually support changing eswitch mode.
>>>>
>>>> Mark
>>>>
>>>>>
>>>>>> return 0;
>>>>>>
>>>>>> err_register:
>>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>>> goto err_attach;
>>>>>>
>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>> return 0;
>>>>>>
>>>>>> err_attach:
>>>>>> --
>>>>>> 2.44.0
>>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults
2026-05-21 7:24 [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Tariq Toukan
` (2 preceding siblings ...)
2026-05-21 7:24 ` [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init Tariq Toukan
@ 2026-05-25 19:42 ` Jakub Kicinski
2026-05-26 7:41 ` Jiri Pirko
3 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2026-05-25 19:42 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Jonathan Corbet, Shuah Khan, Jiri Pirko,
Simon Horman, Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea
On Thu, 21 May 2026 10:24:31 +0300 Tariq Toukan wrote:
> This series adds a devlink_eswitch_mode= kernel command line parameter
> for applying a default devlink eswitch mode during device
> initialization.
Jiri? Are you okay with this?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults
2026-05-25 19:42 ` [PATCH net-next 0/3] devlink: Add boot-time eswitch mode defaults Jakub Kicinski
@ 2026-05-26 7:41 ` Jiri Pirko
0 siblings, 0 replies; 15+ messages in thread
From: Jiri Pirko @ 2026-05-26 7:41 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jiri Pirko, Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Borislav Petkov (AMD), Andrew Morton, Randy Dunlap,
Thomas Gleixner, Petr Mladek, Peter Zijlstra (Intel), Tejun Heo,
Vlastimil Babka, Feng Tang, Christian Brauner, Dave Hansen,
Dapeng Mi, Kees Cook, Marco Elver, Li RongQing, Eric Biggers,
Paul E. McKenney, linux-doc, linux-kernel, netdev, linux-rdma,
Gal Pressman, Dragos Tatulea
Mon, May 25, 2026 at 09:42:56PM +0200, kuba@kernel.org wrote:
>On Thu, 21 May 2026 10:24:31 +0300 Tariq Toukan wrote:
>> This series adds a devlink_eswitch_mode= kernel command line parameter
>> for applying a default devlink eswitch mode during device
>> initialization.
>
>Jiri? Are you okay with this?
In general, yes. A couple of details perhaps. Let me look into it more deeply.
^ permalink raw reply [flat|nested] 15+ messages in thread