* [PATCH net-next V3 0/3] devlink eswitch inactive mode
@ 2025-11-08 7:04 Saeed Mahameed
2025-11-08 7:04 ` [PATCH net-next V3 1/3] devlink: Introduce switchdev_inactive eswitch mode Saeed Mahameed
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Saeed Mahameed @ 2025-11-08 7:04 UTC
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
Leon Romanovsky, Jiri Pirko, mbloch
From: Saeed Mahameed <saeedm@nvidia.com>
v2->v3:
- Fix cocci check %pe
- minor improvement: create FDB drop counter once.
v1->v2:
- Introduce new devlink mode instead of state, Jiri's suggestion.
- Address kernel robot issues reported in v1.
no previous prototype for 'mlx5_mpfs_enable'
v2: https://lore.kernel.org/all/20251107000831.157375-1-saeed@kernel.org/
v1: https://lore.kernel.org/all/20251016013618.2030940-1-saeed@kernel.org/
Before traffic flows through an eswitch, a user may want to block traffic
towards the FDB until the FDB is fully programmed and the user is ready to send
traffic to it. For example, when two eswitches are present for vports in a
multi-PF setup, one eswitch may take over the traffic from the other when the
user chooses. Before this takeover, the user may want to program the inactive
eswitch first and, once ready, redirect traffic to it.
This series introduces a user-configurable mode for an eswitch that allows
dynamically switching between active and inactive modes. When inactive, traffic
does not flow through the eswitch. While inactive, steering pipeline
configuration can be done (e.g. adding TC rules, discovering representors,
enabling the desired SDN modes such as bridge/OVS/DPDK/etc). Once configuration
is completed, a user can set the eswitch mode to active and have traffic flow
through. This allows admins to upgrade forwarding pipeline rules with minimal
downtime and packet drops.
A user can start the eswitch in switchdev or switchdev_inactive mode.
Active: Traffic is enabled on this eswitch FDB.
Inactive: Traffic is ignored/dropped on this eswitch FDB.
An example use case:
# start switchdev in the inactive state
$ devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive
# set up the FDB pipeline and netdev representors
...
# once ready to start receiving traffic
$ devlink dev eswitch set pci/0000:08:00.1 mode switchdev
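The behaviour described above can be illustrated with a small userspace model.
This is only a sketch, not driver code: every name in it (esw_model,
esw_model_add_rule, esw_model_rx) is hypothetical, and it assumes nothing beyond
what the cover letter states, namely that FDB programming is allowed while the
eswitch is inactive and that traffic is dropped until the mode is switched to
switchdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three eswitch modes; not the driver's types. */
enum esw_model_mode {
	ESW_MODEL_LEGACY,
	ESW_MODEL_SWITCHDEV,
	ESW_MODEL_SWITCHDEV_INACTIVE,
};

struct esw_model {
	enum esw_model_mode mode;
	size_t fdb_rules;	/* rules programmed into the modelled FDB */
	size_t forwarded;	/* packets delivered */
	size_t dropped;		/* packets dropped while inactive */
};

/* FDB programming is accepted in both switchdev modes, active or not. */
static bool esw_model_add_rule(struct esw_model *esw)
{
	assert(esw != NULL);
	if (esw->mode == ESW_MODEL_LEGACY)
		return false;
	esw->fdb_rules++;
	return true;
}

/* Returns true if the packet is delivered; only inactive mode drops. */
static bool esw_model_rx(struct esw_model *esw)
{
	assert(esw != NULL);
	if (esw->mode == ESW_MODEL_SWITCHDEV_INACTIVE) {
		esw->dropped++;
		return false;
	}
	esw->forwarded++;
	return true;
}
```

In this model, a caller would add rules while the mode is
ESW_MODEL_SWITCHDEV_INACTIVE, observe that esw_model_rx() only increments the
drop counter, and then flip the mode to ESW_MODEL_SWITCHDEV to let packets
through with the rules already in place.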
Saeed Mahameed (3):
devlink: Introduce switchdev_inactive eswitch mode
net/mlx5: MPFS, add support for dynamic enable/disable
net/mlx5: E-Switch, support eswitch inactive mode
Documentation/netlink/specs/devlink.yaml | 2 +
.../devlink/devlink-eswitch-attr.rst | 13 ++
.../mellanox/mlx5/core/esw/adj_vport.c | 15 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 6 +
.../mellanox/mlx5/core/eswitch_offloads.c | 207 +++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/fs_core.c | 5 +
.../ethernet/mellanox/mlx5/core/lib/mpfs.c | 116 ++++++++--
.../ethernet/mellanox/mlx5/core/lib/mpfs.h | 9 +
include/linux/mlx5/fs.h | 1 +
include/uapi/linux/devlink.h | 1 +
net/devlink/netlink_gen.c | 2 +-
11 files changed, 338 insertions(+), 39 deletions(-)
--
2.51.1
* [PATCH net-next V3 1/3] devlink: Introduce switchdev_inactive eswitch mode
From: Saeed Mahameed @ 2025-11-08 7:04 UTC
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
Leon Romanovsky, Jiri Pirko, mbloch
From: Saeed Mahameed <saeedm@nvidia.com>
Add the DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE mode to the UAPI and
documentation.
Before traffic flows through an eswitch, a user may want to block traffic
towards the FDB until the FDB is fully programmed and the user is ready to
send traffic to it. For example, when two eswitches are present for vports
in a multi-PF setup, one eswitch may take over the traffic from the other
when the user chooses.
Before this takeover, the user may want to program the inactive eswitch
first and, once ready, redirect traffic to it.
switchdev mode transition semantics:
legacy->switchdev_inactive: Create switchdev mode normally, traffic is not
allowed to flow yet.
switchdev_inactive->switchdev: Enable traffic to flow.
switchdev->switchdev_inactive: Block traffic on the FDB; FDB and
representor state and content are preserved.
When the eswitch is configured to switchdev_inactive mode, traffic is
ignored/dropped on this eswitch FDB while the current configuration is kept,
e.g. FDB rules and netdev representors remain available and FDB programming
is allowed.
Example:
# start inactive switchdev
devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive
# setup TC rules, representors etc ..
# activate
devlink dev eswitch set pci/0000:08:00.1 mode switchdev
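The transition semantics listed above can be sketched as a small classifier.
This is a hedged illustration only: the sketch_* names are hypothetical, and
treating a request for the current mode as a no-op is an assumption of this
sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three devlink eswitch modes. */
enum sketch_mode {
	SKETCH_LEGACY,
	SKETCH_SWITCHDEV,
	SKETCH_SWITCHDEV_INACTIVE,
};

enum sketch_action {
	SKETCH_NOP,		/* requested mode equals current mode (assumed) */
	SKETCH_FULL_CHANGE,	/* legacy <-> any switchdev mode */
	SKETCH_UNBLOCK,		/* switchdev_inactive -> switchdev */
	SKETCH_BLOCK,		/* switchdev -> switchdev_inactive */
};

static bool sketch_is_switchdev(enum sketch_mode mode)
{
	return mode != SKETCH_LEGACY;
}

/* Classify what a mode-set request implies for the current eswitch. */
static enum sketch_action sketch_transition(enum sketch_mode cur,
					    enum sketch_mode req)
{
	assert(cur <= SKETCH_SWITCHDEV_INACTIVE);
	assert(req <= SKETCH_SWITCHDEV_INACTIVE);

	if (cur == req)
		return SKETCH_NOP;
	/* Entering or leaving legacy creates or tears down switchdev. */
	if (!sketch_is_switchdev(cur) || !sketch_is_switchdev(req))
		return SKETCH_FULL_CHANGE;
	/* Both are switchdev flavours: only the traffic gate changes. */
	return req == SKETCH_SWITCHDEV ? SKETCH_UNBLOCK : SKETCH_BLOCK;
}
```

The point of the split is that SKETCH_BLOCK and SKETCH_UNBLOCK leave FDB rules
and representors untouched, whereas SKETCH_FULL_CHANGE stands for the regular
legacy/switchdev mode change.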
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 2 ++
.../networking/devlink/devlink-eswitch-attr.rst | 13 +++++++++++++
include/uapi/linux/devlink.h | 1 +
net/devlink/netlink_gen.c | 2 +-
4 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 3db59c965869..426d5aa7d955 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -99,6 +99,8 @@ definitions:
name: legacy
-
name: switchdev
+ -
+ name: switchdev-inactive
-
type: enum
name: eswitch-inline-mode
diff --git a/Documentation/networking/devlink/devlink-eswitch-attr.rst b/Documentation/networking/devlink/devlink-eswitch-attr.rst
index 08bb39ab1528..eafe09abc40c 100644
--- a/Documentation/networking/devlink/devlink-eswitch-attr.rst
+++ b/Documentation/networking/devlink/devlink-eswitch-attr.rst
@@ -39,6 +39,10 @@ The following is a list of E-Switch attributes.
rules.
* ``switchdev`` allows for more advanced offloading capabilities of
the E-Switch to hardware.
+ * ``switchdev_inactive`` switchdev mode but starts inactive, doesn't allow traffic
+ until explicitly activated. This mode is useful for orchestrators that
+ want to prepare the device in switchdev mode but only activate it when
+ all configurations are done.
* - ``inline-mode``
- enum
- Some HWs need the VF driver to put part of the packet
@@ -74,3 +78,12 @@ Example Usage
# enable encap-mode with legacy mode
$ devlink dev eswitch set pci/0000:08:00.0 mode legacy inline-mode none encap-mode basic
+
+ # start switchdev mode in inactive state
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev_inactive
+
+ # setup switchdev configurations, representors, FDB entries, etc..
+ ...
+
+ # activate switchdev mode to allow traffic
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index bcad11a787a5..157f11d3fb72 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -181,6 +181,7 @@ enum devlink_sb_threshold_type {
enum devlink_eswitch_mode {
DEVLINK_ESWITCH_MODE_LEGACY,
DEVLINK_ESWITCH_MODE_SWITCHDEV,
+ DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE,
};
enum devlink_eswitch_inline_mode {
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index 9fd00977d59e..5ad435aee29d 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -229,7 +229,7 @@ static const struct nla_policy devlink_eswitch_get_nl_policy[DEVLINK_ATTR_DEV_NA
static const struct nla_policy devlink_eswitch_set_nl_policy[DEVLINK_ATTR_ESWITCH_ENCAP_MODE + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
- [DEVLINK_ATTR_ESWITCH_MODE] = NLA_POLICY_MAX(NLA_U16, 1),
+ [DEVLINK_ATTR_ESWITCH_MODE] = NLA_POLICY_MAX(NLA_U16, 2),
[DEVLINK_ATTR_ESWITCH_INLINE_MODE] = NLA_POLICY_MAX(NLA_U8, 3),
[DEVLINK_ATTR_ESWITCH_ENCAP_MODE] = NLA_POLICY_MAX(NLA_U8, 1),
};
--
2.51.1
* [PATCH net-next V3 2/3] net/mlx5: MPFS, add support for dynamic enable/disable
From: Saeed Mahameed @ 2025-11-08 7:04 UTC
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
Leon Romanovsky, Jiri Pirko, mbloch, Adithya Jayachandran
From: Saeed Mahameed <saeedm@nvidia.com>
MPFS (Multi PF Switch) is enabled by default in Multi-Host environments.
The driver keeps a list of the desired unicast MAC addresses of all vports
(VFs/SFs) and applies it to HW via the L2_table FW command.
Add an API to dynamically apply the list of MACs to HW when needed. The next
patch uses this API for the devlink eswitch active/inactive uAPI.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
.../ethernet/mellanox/mlx5/core/lib/mpfs.c | 116 +++++++++++++++---
.../ethernet/mellanox/mlx5/core/lib/mpfs.h | 9 ++
2 files changed, 108 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
index 4450091e181a..99fb7a53add0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
@@ -65,13 +65,14 @@ static int del_l2table_entry_cmd(struct mlx5_core_dev *dev, u32 index)
/* UC L2 table hash node */
struct l2table_node {
struct l2addr_node node;
- u32 index; /* index in HW l2 table */
+ int index; /* index in HW l2 table */
int ref_count;
};
struct mlx5_mpfs {
struct hlist_head hash[MLX5_L2_ADDR_HASH_SIZE];
struct mutex lock; /* Synchronize l2 table access */
+ bool enabled;
u32 size;
unsigned long *bitmap;
};
@@ -114,6 +115,8 @@ int mlx5_mpfs_init(struct mlx5_core_dev *dev)
return -ENOMEM;
}
+ mpfs->enabled = true;
+
dev->priv.mpfs = mpfs;
return 0;
}
@@ -135,7 +138,7 @@ int mlx5_mpfs_add_mac(struct mlx5_core_dev *dev, u8 *mac)
struct mlx5_mpfs *mpfs = dev->priv.mpfs;
struct l2table_node *l2addr;
int err = 0;
- u32 index;
+ int index;
if (!mpfs)
return 0;
@@ -148,30 +151,34 @@ int mlx5_mpfs_add_mac(struct mlx5_core_dev *dev, u8 *mac)
goto out;
}
- err = alloc_l2table_index(mpfs, &index);
- if (err)
- goto out;
-
l2addr = l2addr_hash_add(mpfs->hash, mac, struct l2table_node, GFP_KERNEL);
if (!l2addr) {
err = -ENOMEM;
- goto hash_add_err;
+ goto out;
}
- err = set_l2table_entry_cmd(dev, index, mac);
- if (err)
- goto set_table_entry_err;
+ index = -1;
+
+ if (mpfs->enabled) {
+ err = alloc_l2table_index(mpfs, &index);
+ if (err)
+ goto hash_del;
+ err = set_l2table_entry_cmd(dev, index, mac);
+ if (err)
+ goto free_l2table_index;
+ mlx5_core_dbg(dev, "MPFS entry %pM, set @index (%d)\n",
+ l2addr->node.addr, l2addr->index);
+ }
l2addr->index = index;
l2addr->ref_count = 1;
mlx5_core_dbg(dev, "MPFS mac added %pM, index (%d)\n", mac, index);
goto out;
-
-set_table_entry_err:
- l2addr_hash_del(l2addr);
-hash_add_err:
+free_l2table_index:
free_l2table_index(mpfs, index);
+hash_del:
+ l2addr_hash_del(l2addr);
out:
mutex_unlock(&mpfs->lock);
return err;
@@ -183,7 +190,7 @@ int mlx5_mpfs_del_mac(struct mlx5_core_dev *dev, u8 *mac)
struct mlx5_mpfs *mpfs = dev->priv.mpfs;
struct l2table_node *l2addr;
int err = 0;
- u32 index;
+ int index;
if (!mpfs)
return 0;
@@ -200,12 +207,87 @@ int mlx5_mpfs_del_mac(struct mlx5_core_dev *dev, u8 *mac)
goto unlock;
index = l2addr->index;
- del_l2table_entry_cmd(dev, index);
+ if (index >= 0) {
+ del_l2table_entry_cmd(dev, index);
+ free_l2table_index(mpfs, index);
+ mlx5_core_dbg(dev, "MPFS entry %pM, deleted @index (%d)\n",
+ mac, index);
+ }
l2addr_hash_del(l2addr);
- free_l2table_index(mpfs, index);
mlx5_core_dbg(dev, "MPFS mac deleted %pM, index (%d)\n", mac, index);
unlock:
mutex_unlock(&mpfs->lock);
return err;
}
EXPORT_SYMBOL(mlx5_mpfs_del_mac);
+
+int mlx5_mpfs_enable(struct mlx5_core_dev *dev)
+{
+ struct mlx5_mpfs *mpfs = dev->priv.mpfs;
+ struct l2table_node *l2addr;
+ struct hlist_node *n;
+ int err = 0, i;
+
+ if (!mpfs)
+ return -ENODEV;
+
+ mutex_lock(&mpfs->lock);
+ if (mpfs->enabled)
+ goto out;
+ mpfs->enabled = true;
+ mlx5_core_dbg(dev, "MPFS enabling mpfs\n");
+
+ mlx5_mpfs_foreach(l2addr, n, mpfs, i) {
+ u32 index;
+
+ err = alloc_l2table_index(mpfs, &index);
+ if (err) {
+ mlx5_core_err(dev, "Failed to allocated MPFS index for %pM, err(%d)\n",
+ l2addr->node.addr, err);
+ goto out;
+ }
+
+ err = set_l2table_entry_cmd(dev, index, l2addr->node.addr);
+ if (err) {
+ mlx5_core_err(dev, "Failed to set MPFS l2table entry for %pM index=%d, err(%d)\n",
+ l2addr->node.addr, index, err);
+ free_l2table_index(mpfs, index);
+ goto out;
+ }
+
+ l2addr->index = index;
+ mlx5_core_dbg(dev, "MPFS entry %pM, set @index (%d)\n",
+ l2addr->node.addr, l2addr->index);
+ }
+out:
+ mutex_unlock(&mpfs->lock);
+ return err;
+}
+
+void mlx5_mpfs_disable(struct mlx5_core_dev *dev)
+{
+ struct mlx5_mpfs *mpfs = dev->priv.mpfs;
+ struct l2table_node *l2addr;
+ struct hlist_node *n;
+ int i;
+
+ if (!mpfs)
+ return;
+
+ mutex_lock(&mpfs->lock);
+ if (!mpfs->enabled)
+ goto unlock;
+ mlx5_mpfs_foreach(l2addr, n, mpfs, i) {
+ if (l2addr->index < 0)
+ continue;
+ del_l2table_entry_cmd(dev, l2addr->index);
+ free_l2table_index(mpfs, l2addr->index);
+ mlx5_core_dbg(dev, "MPFS entry %pM, deleted @index (%d)\n",
+ l2addr->node.addr, l2addr->index);
+ l2addr->index = -1;
+ }
+ mpfs->enabled = false;
+ mlx5_core_dbg(dev, "MPFS disabled\n");
+unlock:
+ mutex_unlock(&mpfs->lock);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.h
index 4a293542a7aa..9c63838ce1f3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.h
@@ -45,6 +45,10 @@ struct l2addr_node {
u8 addr[ETH_ALEN];
};
+#define mlx5_mpfs_foreach(hs, tmp, mpfs, i) \
+ for (i = 0; i < MLX5_L2_ADDR_HASH_SIZE; i++) \
+ hlist_for_each_entry_safe(hs, tmp, &(mpfs)->hash[i], node.hlist)
+
#define for_each_l2hash_node(hn, tmp, hash, i) \
for (i = 0; i < MLX5_L2_ADDR_HASH_SIZE; i++) \
hlist_for_each_entry_safe(hn, tmp, &(hash)[i], hlist)
@@ -82,11 +86,16 @@ struct l2addr_node {
})
#ifdef CONFIG_MLX5_MPFS
+struct mlx5_core_dev;
int mlx5_mpfs_init(struct mlx5_core_dev *dev);
void mlx5_mpfs_cleanup(struct mlx5_core_dev *dev);
+int mlx5_mpfs_enable(struct mlx5_core_dev *dev);
+void mlx5_mpfs_disable(struct mlx5_core_dev *dev);
#else /* #ifndef CONFIG_MLX5_MPFS */
static inline int mlx5_mpfs_init(struct mlx5_core_dev *dev) { return 0; }
static inline void mlx5_mpfs_cleanup(struct mlx5_core_dev *dev) {}
+static inline int mlx5_mpfs_enable(struct mlx5_core_dev *dev) { return 0; }
+static inline void mlx5_mpfs_disable(struct mlx5_core_dev *dev) {}
#endif
#endif
--
2.51.1
* [PATCH net-next V3 3/3] net/mlx5: E-Switch, support eswitch inactive mode
From: Saeed Mahameed @ 2025-11-08 7:04 UTC
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
Leon Romanovsky, Jiri Pirko, mbloch, Adithya Jayachandran
From: Saeed Mahameed <saeedm@nvidia.com>
Add support for eswitch switchdev inactive mode.
Inactive mode: Drop all traffic going to the FDB, remove MPFS L2 rules and
disconnect adjacent vports.
Active mode: Traffic flows through the FDB, the MPFS table is populated and
adjacent vports are connected.
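One way to picture how the two switchdev flavours can share a single internal
mode is the sketch below: both devlink values map to the same driver mode, and
a separate flag records whether the FDB is inactive. This is a simplified
userspace model under that assumption. The demo_* names are hypothetical and
the enums are local stand-ins, not the UAPI or driver definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the devlink modes and the internal driver modes. */
enum demo_devlink_mode {
	DEMO_DL_LEGACY,
	DEMO_DL_SWITCHDEV,
	DEMO_DL_SWITCHDEV_INACTIVE,
};

enum demo_drv_mode {
	DEMO_DRV_LEGACY,
	DEMO_DRV_OFFLOADS,
};

struct demo_esw {
	enum demo_drv_mode mode;
	bool offloads_inactive;	/* meaningful only in DEMO_DRV_OFFLOADS */
};

/* Both switchdev flavours select the same internal mode. */
static int demo_set_mode(struct demo_esw *esw, enum demo_devlink_mode req)
{
	assert(esw != NULL);
	switch (req) {
	case DEMO_DL_LEGACY:
		esw->mode = DEMO_DRV_LEGACY;
		esw->offloads_inactive = false;	/* legacy is always active */
		return 0;
	case DEMO_DL_SWITCHDEV:
	case DEMO_DL_SWITCHDEV_INACTIVE:
		esw->mode = DEMO_DRV_OFFLOADS;
		esw->offloads_inactive = (req == DEMO_DL_SWITCHDEV_INACTIVE);
		return 0;
	}
	return -1;	/* unknown mode */
}

/* Report the mode back, recovering the flavour from the flag. */
static enum demo_devlink_mode demo_get_mode(const struct demo_esw *esw)
{
	assert(esw != NULL);
	if (esw->mode == DEMO_DRV_LEGACY)
		return DEMO_DL_LEGACY;
	return esw->offloads_inactive ? DEMO_DL_SWITCHDEV_INACTIVE :
					DEMO_DL_SWITCHDEV;
}
```

With this split, a get after a set returns the same devlink mode that was
requested, while code that only cares about "legacy versus offloads" keeps
checking the single internal mode.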
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
.../mellanox/mlx5/core/esw/adj_vport.c | 15 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 6 +
.../mellanox/mlx5/core/eswitch_offloads.c | 207 +++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/fs_core.c | 5 +
.../ethernet/mellanox/mlx5/core/lib/mpfs.c | 2 +-
include/linux/mlx5/fs.h | 1 +
6 files changed, 214 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/adj_vport.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/adj_vport.c
index 0091ba697bae..250af09b5af2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/adj_vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/adj_vport.c
@@ -4,13 +4,8 @@
#include "fs_core.h"
#include "eswitch.h"
-enum {
- MLX5_ADJ_VPORT_DISCONNECT = 0x0,
- MLX5_ADJ_VPORT_CONNECT = 0x1,
-};
-
-static int mlx5_esw_adj_vport_modify(struct mlx5_core_dev *dev,
- u16 vport, bool connect)
+int mlx5_esw_adj_vport_modify(struct mlx5_core_dev *dev, u16 vport,
+ bool connect)
{
u32 in[MLX5_ST_SZ_DW(modify_vport_state_in)] = {};
@@ -24,7 +19,7 @@ static int mlx5_esw_adj_vport_modify(struct mlx5_core_dev *dev,
MLX5_SET(modify_vport_state_in, in, egress_connect_valid, 1);
MLX5_SET(modify_vport_state_in, in, ingress_connect, connect);
MLX5_SET(modify_vport_state_in, in, egress_connect, connect);
-
+ MLX5_SET(modify_vport_state_in, in, admin_state, connect);
return mlx5_cmd_exec_in(dev, modify_vport_state, in);
}
@@ -96,7 +91,6 @@ static int mlx5_esw_adj_vport_create(struct mlx5_eswitch *esw, u16 vhca_id,
if (err)
goto acl_ns_remove;
- mlx5_esw_adj_vport_modify(esw->dev, vport_num, MLX5_ADJ_VPORT_CONNECT);
return 0;
acl_ns_remove:
@@ -117,8 +111,7 @@ static void mlx5_esw_adj_vport_destroy(struct mlx5_eswitch *esw,
esw_debug(esw->dev, "Destroying adjacent vport %d for vhca_id 0x%x\n",
vport_num, vport->vhca_id);
- mlx5_esw_adj_vport_modify(esw->dev, vport_num,
- MLX5_ADJ_VPORT_DISCONNECT);
+
mlx5_esw_offloads_rep_remove(esw, vport);
mlx5_fs_vport_egress_acl_ns_remove(esw->dev->priv.steering,
vport->index);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 16eb99aba2a7..beaec450a734 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -264,6 +264,9 @@ struct mlx5_eswitch_fdb {
struct offloads_fdb {
struct mlx5_flow_namespace *ns;
+ struct mlx5_flow_table *drop_root;
+ struct mlx5_flow_handle *drop_root_rule;
+ struct mlx5_fc *drop_root_fc;
struct mlx5_flow_table *tc_miss_table;
struct mlx5_flow_table *slow_fdb;
struct mlx5_flow_group *send_to_vport_grp;
@@ -392,6 +395,7 @@ struct mlx5_eswitch {
struct mlx5_esw_offload offloads;
u32 last_vport_idx;
int mode;
+ bool offloads_inactive;
u16 manager_vport;
u16 first_host_vport;
u8 num_peers;
@@ -634,6 +638,8 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev);
void mlx5_esw_adjacent_vhcas_setup(struct mlx5_eswitch *esw);
void mlx5_esw_adjacent_vhcas_cleanup(struct mlx5_eswitch *esw);
+int mlx5_esw_adj_vport_modify(struct mlx5_core_dev *dev, u16 vport,
+ bool connect);
#define MLX5_DEBUG_ESWITCH_MASK BIT(3)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 4092ea29c630..0b1a180ef238 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1577,6 +1577,7 @@ esw_chains_create(struct mlx5_eswitch *esw, struct mlx5_flow_table *miss_fdb)
attr.max_grp_num = esw->params.large_group_num;
attr.default_ft = miss_fdb;
attr.mapping = esw->offloads.reg_c0_obj_pool;
+ attr.fs_base_prio = FDB_BYPASS_PATH;
chains = mlx5_chains_create(dev, &attr);
if (IS_ERR(chains)) {
@@ -2355,6 +2356,131 @@ static void esw_mode_change(struct mlx5_eswitch *esw, u16 mode)
mlx5_devcom_comp_unlock(esw->dev->priv.hca_devcom_comp);
}
+static void mlx5_esw_fdb_drop_destroy(struct mlx5_eswitch *esw)
+{
+ if (!esw->fdb_table.offloads.drop_root)
+ return;
+
+ esw_debug(esw->dev, "Destroying FDB drop root table %#x fc %#x\n",
+ esw->fdb_table.offloads.drop_root->id,
+ esw->fdb_table.offloads.drop_root_fc->id);
+ mlx5_del_flow_rules(esw->fdb_table.offloads.drop_root_rule);
+ /* Don't free flow counter here, can be reused on a later activation */
+ mlx5_destroy_flow_table(esw->fdb_table.offloads.drop_root);
+ esw->fdb_table.offloads.drop_root_rule = NULL;
+ esw->fdb_table.offloads.drop_root = NULL;
+}
+
+static int mlx5_esw_fdb_drop_create(struct mlx5_eswitch *esw)
+{
+ struct mlx5_flow_destination drop_fc_dst = {};
+ struct mlx5_flow_table_attr ft_attr = {};
+ struct mlx5_flow_destination *dst = NULL;
+ struct mlx5_core_dev *dev = esw->dev;
+ struct mlx5_flow_namespace *root_ns;
+ struct mlx5_flow_act flow_act = {};
+ struct mlx5_flow_handle *flow_rule;
+ struct mlx5_flow_table *table;
+ int err = 0, dst_num = 0;
+
+ if (esw->fdb_table.offloads.drop_root)
+ return 0;
+
+ root_ns = esw->fdb_table.offloads.ns;
+
+ ft_attr.prio = FDB_DROP_ROOT;
+ ft_attr.max_fte = 1;
+ ft_attr.autogroup.max_num_groups = 1;
+ table = mlx5_create_auto_grouped_flow_table(root_ns, &ft_attr);
+ if (IS_ERR(table)) {
+ esw_warn(dev, "Failed to create fdb drop root table, err %pe\n",
+ table);
+ return PTR_ERR(table);
+ }
+
+ /* Drop FC reusable, create once on first deactivation of FDB */
+ if (!esw->fdb_table.offloads.drop_root_fc) {
+ struct mlx5_fc *counter = mlx5_fc_create(dev, 0);
+
+ err = PTR_ERR_OR_ZERO(counter);
+ if (err)
+ esw_warn(esw->dev, "create fdb drop fc err %d\n", err);
+ else
+ esw->fdb_table.offloads.drop_root_fc = counter;
+ }
+
+ flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
+
+ if (esw->fdb_table.offloads.drop_root_fc) {
+ flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_COUNT;
+ drop_fc_dst.type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
+ drop_fc_dst.counter = esw->fdb_table.offloads.drop_root_fc;
+ dst = &drop_fc_dst;
+ dst_num++;
+ }
+
+ flow_rule = mlx5_add_flow_rules(table, NULL, &flow_act, dst, dst_num);
+ err = PTR_ERR_OR_ZERO(flow_rule);
+ if (err) {
+ esw_warn(esw->dev,
+ "fs offloads: Failed to add vport rx drop rule err %d\n",
+ err);
+ goto err_flow_rule;
+ }
+
+ esw->fdb_table.offloads.drop_root = table;
+ esw->fdb_table.offloads.drop_root_rule = flow_rule;
+ esw_debug(esw->dev, "Created FDB drop root table %#x fc %#x\n",
+ table->id, dst ? dst->counter->id : 0);
+ return 0;
+
+err_flow_rule:
+ /* no need to free drop fc, esw_offloads_steering_cleanup will do it */
+ mlx5_destroy_flow_table(table);
+ return err;
+}
+
+static void mlx5_esw_fdb_active(struct mlx5_eswitch *esw)
+{
+ struct mlx5_vport *vport;
+ unsigned long i;
+
+ mlx5_esw_fdb_drop_destroy(esw);
+ mlx5_mpfs_enable(esw->dev);
+
+ mlx5_esw_for_each_vf_vport(esw, i, vport, U16_MAX) {
+ if (!vport->adjacent)
+ continue;
+ esw_debug(esw->dev, "Connecting vport %d to eswitch\n",
+ vport->vport);
+ mlx5_esw_adj_vport_modify(esw->dev, vport->vport, true);
+ }
+
+ esw->offloads_inactive = false;
+ esw_warn(esw->dev, "MPFS/FDB active\n");
+}
+
+static void mlx5_esw_fdb_inactive(struct mlx5_eswitch *esw)
+{
+ struct mlx5_vport *vport;
+ unsigned long i;
+
+ mlx5_mpfs_disable(esw->dev);
+ mlx5_esw_fdb_drop_create(esw);
+
+ mlx5_esw_for_each_vf_vport(esw, i, vport, U16_MAX) {
+ if (!vport->adjacent)
+ continue;
+ esw_debug(esw->dev, "Disconnecting vport %u from eswitch\n",
+ vport->vport);
+
+ mlx5_esw_adj_vport_modify(esw->dev, vport->vport, false);
+ }
+
+ esw->offloads_inactive = true;
+ esw_warn(esw->dev, "MPFS/FDB inactive\n");
+}
+
static int esw_offloads_start(struct mlx5_eswitch *esw,
struct netlink_ext_ack *extack)
{
@@ -3438,6 +3564,10 @@ static int esw_offloads_steering_init(struct mlx5_eswitch *esw)
static void esw_offloads_steering_cleanup(struct mlx5_eswitch *esw)
{
+ mlx5_esw_fdb_drop_destroy(esw);
+ if (esw->fdb_table.offloads.drop_root_fc)
+ mlx5_fc_destroy(esw->dev, esw->fdb_table.offloads.drop_root_fc);
+ esw->fdb_table.offloads.drop_root_fc = NULL;
esw_destroy_vport_rx_drop_rule(esw);
esw_destroy_vport_rx_drop_group(esw);
esw_destroy_vport_rx_group(esw);
@@ -3600,6 +3730,11 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
if (err)
goto err_steering_init;
+ if (esw->offloads_inactive)
+ mlx5_esw_fdb_inactive(esw);
+ else
+ mlx5_esw_fdb_active(esw);
+
/* Representor will control the vport link state */
mlx5_esw_for_each_vf_vport(esw, i, vport, esw->esw_funcs.num_vfs)
vport->info.link_state = MLX5_VPORT_ADMIN_STATE_DOWN;
@@ -3666,6 +3801,9 @@ void esw_offloads_disable(struct mlx5_eswitch *esw)
esw_offloads_metadata_uninit(esw);
mlx5_rdma_disable_roce(esw->dev);
mlx5_esw_adjacent_vhcas_cleanup(esw);
+ /* must be done after vhcas cleanup to avoid adjacent vports connect */
+ if (esw->offloads_inactive)
+ mlx5_esw_fdb_active(esw); /* legacy mode always active */
mutex_destroy(&esw->offloads.termtbl_mutex);
}
@@ -3676,6 +3814,7 @@ static int esw_mode_from_devlink(u16 mode, u16 *mlx5_mode)
*mlx5_mode = MLX5_ESWITCH_LEGACY;
break;
case DEVLINK_ESWITCH_MODE_SWITCHDEV:
+ case DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE:
*mlx5_mode = MLX5_ESWITCH_OFFLOADS;
break;
default:
@@ -3685,14 +3824,17 @@ static int esw_mode_from_devlink(u16 mode, u16 *mlx5_mode)
return 0;
}
-static int esw_mode_to_devlink(u16 mlx5_mode, u16 *mode)
+static int esw_mode_to_devlink(struct mlx5_eswitch *esw, u16 *mode)
{
- switch (mlx5_mode) {
+ switch (esw->mode) {
case MLX5_ESWITCH_LEGACY:
*mode = DEVLINK_ESWITCH_MODE_LEGACY;
break;
case MLX5_ESWITCH_OFFLOADS:
- *mode = DEVLINK_ESWITCH_MODE_SWITCHDEV;
+ if (esw->offloads_inactive)
+ *mode = DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE;
+ else
+ *mode = DEVLINK_ESWITCH_MODE_SWITCHDEV;
break;
default:
return -EINVAL;
@@ -3798,6 +3940,45 @@ static bool mlx5_devlink_netdev_netns_immutable_set(struct devlink *devlink,
return ret;
}
+/* Returns true when only changing between active and inactive switchdev mode */
+static bool mlx5_devlink_switchdev_active_mode_change(struct mlx5_eswitch *esw,
+ u16 devlink_mode)
+{
+ /* current mode is not switchdev */
+ if (esw->mode != MLX5_ESWITCH_OFFLOADS)
+ return false;
+
+ /* new mode is not switchdev */
+ if (devlink_mode != DEVLINK_ESWITCH_MODE_SWITCHDEV &&
+ devlink_mode != DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE)
+ return false;
+
+ /* already inactive: no change in current state */
+ if (devlink_mode == DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE &&
+ esw->offloads_inactive)
+ return false;
+
+ /* already active: no change in current state */
+ if (devlink_mode == DEVLINK_ESWITCH_MODE_SWITCHDEV &&
+ !esw->offloads_inactive)
+ return false;
+
+ down_write(&esw->mode_lock);
+ esw->offloads_inactive = !esw->offloads_inactive;
+ esw->eswitch_operation_in_progress = true;
+ up_write(&esw->mode_lock);
+
+ if (esw->offloads_inactive)
+ mlx5_esw_fdb_inactive(esw);
+ else
+ mlx5_esw_fdb_active(esw);
+
+ down_write(&esw->mode_lock);
+ esw->eswitch_operation_in_progress = false;
+ up_write(&esw->mode_lock);
+ return true;
+}
+
int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
struct netlink_ext_ack *extack)
{
@@ -3812,12 +3993,16 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
if (esw_mode_from_devlink(mode, &mlx5_mode))
return -EINVAL;
- if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV && mlx5_get_sd(esw->dev)) {
+ if (mlx5_mode == MLX5_ESWITCH_OFFLOADS && mlx5_get_sd(esw->dev)) {
NL_SET_ERR_MSG_MOD(extack,
"Can't change E-Switch mode to switchdev when multi-PF netdev (Socket Direct) is configured.");
return -EPERM;
}
+ /* Avoid try_lock, active/inactive mode change is not restricted */
+ if (mlx5_devlink_switchdev_active_mode_change(esw, mode))
+ return 0;
+
mlx5_lag_disable_change(esw->dev);
err = mlx5_esw_try_lock(esw);
if (err < 0) {
@@ -3840,7 +4025,7 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
esw->eswitch_operation_in_progress = true;
up_write(&esw->mode_lock);
- if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV &&
+ if (mlx5_mode == MLX5_ESWITCH_OFFLOADS &&
!mlx5_devlink_netdev_netns_immutable_set(devlink, true)) {
NL_SET_ERR_MSG_MOD(extack,
"Can't change E-Switch mode to switchdev when netdev net namespace has diverged from the devlink's.");
@@ -3848,25 +4033,27 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
goto skip;
}
- if (mode == DEVLINK_ESWITCH_MODE_LEGACY)
+ if (mlx5_mode == MLX5_ESWITCH_LEGACY)
esw->dev->priv.flags |= MLX5_PRIV_FLAGS_SWITCH_LEGACY;
mlx5_eswitch_disable_locked(esw);
- if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV) {
+ if (mlx5_mode == MLX5_ESWITCH_OFFLOADS) {
if (mlx5_devlink_trap_get_num_active(esw->dev)) {
NL_SET_ERR_MSG_MOD(extack,
"Can't change mode while devlink traps are active");
err = -EOPNOTSUPP;
goto skip;
}
+ esw->offloads_inactive =
+ (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE);
err = esw_offloads_start(esw, extack);
- } else if (mode == DEVLINK_ESWITCH_MODE_LEGACY) {
+ } else if (mlx5_mode == MLX5_ESWITCH_LEGACY) {
err = esw_offloads_stop(esw, extack);
} else {
err = -EINVAL;
}
skip:
- if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV && err)
+ if (mlx5_mode == MLX5_ESWITCH_OFFLOADS && err)
mlx5_devlink_netdev_netns_immutable_set(devlink, false);
down_write(&esw->mode_lock);
esw->eswitch_operation_in_progress = false;
@@ -3885,7 +4072,7 @@ int mlx5_devlink_eswitch_mode_get(struct devlink *devlink, u16 *mode)
if (IS_ERR(esw))
return PTR_ERR(esw);
- return esw_mode_to_devlink(esw->mode, mode);
+ return esw_mode_to_devlink(esw, mode);
}
static int mlx5_esw_vports_inline_set(struct mlx5_eswitch *esw, u8 mlx5_mode,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 2db3ffb0a2b2..2ca3bddbdf05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -3520,6 +3520,11 @@ static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
if (!steering->fdb_root_ns)
return -ENOMEM;
+ maj_prio = fs_create_prio(&steering->fdb_root_ns->ns, FDB_DROP_ROOT, 1);
+ err = PTR_ERR_OR_ZERO(maj_prio);
+ if (err)
+ goto out_err;
+
err = create_fdb_bypass(steering);
if (err)
goto out_err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
index 99fb7a53add0..4a88a42ae4f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mpfs.c
@@ -167,7 +167,7 @@ int mlx5_mpfs_add_mac(struct mlx5_core_dev *dev, u8 *mac)
if (err)
goto free_l2table_index;
mlx5_core_dbg(dev, "MPFS entry %pM, set @index (%d)\n",
- l2addr->node.addr, l2addr->index);
+ l2addr->node.addr, index);
}
l2addr->index = index;
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 6ac76a0c3827..7bf2449c53b2 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -116,6 +116,7 @@ enum mlx5_flow_namespace_type {
};
enum {
+ FDB_DROP_ROOT,
FDB_BYPASS_PATH,
FDB_CRYPTO_INGRESS,
FDB_TC_OFFLOAD,
--
2.51.1
* Re: [PATCH net-next V3 0/3] devlink eswitch inactive mode
From: patchwork-bot+netdevbpf @ 2025-11-11 12:20 UTC
To: Saeed Mahameed
Cc: davem, kuba, pabeni, edumazet, saeedm, netdev, tariqt, gal,
leonro, jiri, mbloch
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Fri, 7 Nov 2025 23:04:01 -0800 you wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> v2->v3:
> - Fix cocci check %pe
> - minor improvement: create FDB drop counter once.
>
> v1->v2:
> - Introduce new devlink mode instead of state, Jiri's suggestion.
> - Address kernel robot issues reported in v1.
> no previous prototype for 'mlx5_mpfs_enable'
>
> [...]
Here is the summary with links:
- [net-next,V3,1/3] devlink: Introduce switchdev_inactive eswitch mode
https://git.kernel.org/netdev/net-next/c/0e535824d0bc
- [net-next,V3,2/3] net/mlx5: MPFS, add support for dynamic enable/disable
https://git.kernel.org/netdev/net-next/c/9902b6381d76
- [net-next,V3,3/3] net/mlx5: E-Switch, support eswitch inactive mode
https://git.kernel.org/netdev/net-next/c/9da611df15aa
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html