* [PATCH net-next V3 01/15] net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 02/15] net/mlx5: devcom, expose locked variant of send_event Tariq Toukan
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
SD secondary devices share the primary's uplink and do not have
their own uplink representor. When reloading IB reps on secondary
devices, skip the uplink and only load VF/SF vport IB reps.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
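The skip rule in the commit message above can be sketched in plain
userspace C. This is only an illustration under assumptions: struct
toy_rep, TOY_VPORT_UPLINK and toy_reload_ib_reps() are hypothetical
stand-ins, not the driver's types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_VPORT_UPLINK 0xffff	/* hypothetical stand-in for MLX5_VPORT_UPLINK */

struct toy_rep {
	unsigned int vport;
	bool eth_loaded;	/* ETH rep already loaded */
	bool ib_loaded;		/* IB rep loaded by the reload below */
};

/* Reload IB reps; returns how many reps were (re)loaded. */
static size_t toy_reload_ib_reps(struct toy_rep *reps, size_t n,
				 bool is_primary)
{
	size_t i, loaded = 0;

	assert(reps || n == 0);
	for (i = 0; i < n; i++) {
		/* A secondary has no uplink representor of its own. */
		if (!is_primary && reps[i].vport == TOY_VPORT_UPLINK)
			continue;
		/* Only vports whose ETH rep is loaded get an IB rep. */
		if (!reps[i].eth_loaded)
			continue;
		reps[i].ib_loaded = true;
		loaded++;
	}
	return loaded;
}
```

Called with is_primary == false, the uplink entry is left untouched and
only the VF/SF entries with a loaded ETH rep are processed.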
.../mellanox/mlx5/core/eswitch_offloads.c | 25 ++++++++++++++++---
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 830fc910a080..12805e80ce57 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3643,11 +3643,19 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
if (atomic_read(&rep->rep_data[REP_ETH].state) != REP_LOADED)
return 0;
- ret = __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
- if (ret)
- return ret;
+ /* SD secondary devices share the primary's uplink and do not
+ * have their own uplink representor. Only load VF/SF vports.
+ */
+ if (mlx5_sd_is_primary(esw->dev)) {
+ ret = __esw_offloads_load_rep(esw, rep, REP_IB, NULL);
+ if (ret)
+ return ret;
+ }
mlx5_esw_for_each_rep(esw, i, rep) {
+ if (!mlx5_sd_is_primary(esw->dev) &&
+ rep->vport == MLX5_VPORT_UPLINK)
+ continue;
if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
__esw_offloads_load_rep(esw, rep, REP_IB, NULL);
}
@@ -4586,14 +4594,23 @@ mlx5_eswitch_register_vport_reps_blocked(struct mlx5_eswitch *esw,
static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
{
+ struct mlx5_eswitch_rep *uplink;
struct mlx5_vport *vport;
+ bool newly_loaded;
unsigned long i;
if (esw->mode != MLX5_ESWITCH_OFFLOADS)
return;
- if (mlx5_esw_offloads_rep_load(esw, MLX5_VPORT_UPLINK))
+ uplink = mlx5_eswitch_get_rep(esw, MLX5_VPORT_UPLINK);
+ if (__esw_offloads_load_rep(esw, uplink, REP_ETH, &newly_loaded))
+ return;
+ if (mlx5_sd_is_primary(esw->dev) &&
+ __esw_offloads_load_rep(esw, uplink, REP_IB, NULL)) {
+ if (newly_loaded)
+ __esw_offloads_unload_rep(esw, uplink, REP_ETH);
return;
+ }
mlx5_esw_for_each_vport(esw, i, vport) {
if (!vport)
--
2.44.0
* [PATCH net-next V3 02/15] net/mlx5: devcom, expose locked variant of send_event
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 01/15] net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 03/15] net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events Tariq Toukan
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Factor mlx5_devcom_send_event() into two functions:
- mlx5_devcom_locked_send_event(): performs the dispatch (and
rollback) with comp->sem already held by the caller.
- mlx5_devcom_send_event(): unchanged wrapper that takes comp->sem,
calls the locked variant, and releases it.
This lets callers bracket multiple event broadcasts under a single
held write lock, eliminating the gap between consecutive dispatches
where peer state could change.
The locked variant will be used by a later patch in this series.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
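The split described in the commit message above follows a common
"locked variant plus wrapper" pattern. A minimal userspace sketch,
assuming a toy lock that only records whether the write side is held
(struct lk_comp and the lk_* helpers are hypothetical, not devcom API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for comp->sem: tracks only "write side held". */
struct lk_comp {
	bool write_held;
	int dispatched;		/* number of events delivered so far */
};

static void lk_down_write(struct lk_comp *c)
{
	assert(!c->write_held);
	c->write_held = true;
}

static void lk_up_write(struct lk_comp *c)
{
	assert(c->write_held);
	c->write_held = false;
}

/* Locked variant: the caller must already hold the write lock. */
static int lk_locked_send_event(struct lk_comp *c, int event)
{
	if (!c)
		return -1;
	/* Plays the role of lockdep_assert_held_write(). */
	assert(c->write_held);
	(void)event;
	c->dispatched++;
	return 0;
}

/* Wrapper: take the lock, dispatch, release. */
static int lk_send_event(struct lk_comp *c, int event)
{
	int err;

	if (!c)
		return -1;
	lk_down_write(c);
	err = lk_locked_send_event(c, event);
	lk_up_write(c);
	return err;
}

/* Two broadcasts under one lock hold, with no gap between them. */
static int lk_send_two_events(struct lk_comp *c, int first, int second)
{
	int err;

	lk_down_write(c);
	err = lk_locked_send_event(c, first);
	if (!err)
		err = lk_locked_send_event(c, second);
	lk_up_write(c);
	return err;
}
```

lk_send_two_events() shows the case the commit message motivates: both
dispatches run inside a single critical section, so nothing can change
peer state between them.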
.../ethernet/mellanox/mlx5/core/lib/devcom.c | 29 ++++++++++++++-----
.../ethernet/mellanox/mlx5/core/lib/devcom.h | 3 ++
2 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
index d40c53193ea8..96b4f06d6184 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
@@ -287,9 +287,9 @@ int mlx5_devcom_comp_get_size(struct mlx5_devcom_comp_dev *devcom)
return kref_read(&comp->ref);
}
-int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
- int event, int rollback_event,
- void *event_data)
+int mlx5_devcom_locked_send_event(struct mlx5_devcom_comp_dev *devcom,
+ int event, int rollback_event,
+ void *event_data)
{
struct mlx5_devcom_comp_dev *pos;
struct mlx5_devcom_comp *comp;
@@ -299,8 +299,8 @@ int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
if (!devcom)
return -ENODEV;
+ lockdep_assert_held_write(&devcom->comp->sem);
comp = devcom->comp;
- down_write(&comp->sem);
list_for_each_entry(pos, &comp->comp_dev_list_head, list) {
data = rcu_dereference_protected(pos->data, lockdep_is_held(&comp->sem));
@@ -311,12 +311,11 @@ int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
}
}
- up_write(&comp->sem);
return 0;
rollback:
if (list_entry_is_head(pos, &comp->comp_dev_list_head, list))
- goto out;
+ return err;
pos = list_prev_entry(pos, list);
list_for_each_entry_from_reverse(pos, &comp->comp_dev_list_head, list) {
data = rcu_dereference_protected(pos->data, lockdep_is_held(&comp->sem));
@@ -324,7 +323,23 @@ int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
if (pos != devcom && data)
comp->handler(rollback_event, data, event_data);
}
-out:
+ return err;
+}
+
+int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
+ int event, int rollback_event,
+ void *event_data)
+{
+ struct mlx5_devcom_comp *comp;
+ int err;
+
+ if (!devcom)
+ return -ENODEV;
+
+ comp = devcom->comp;
+ down_write(&comp->sem);
+ err = mlx5_devcom_locked_send_event(devcom, event, rollback_event,
+ event_data);
up_write(&comp->sem);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
index 316052a85ca5..d5c60c03e55c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
@@ -46,6 +46,9 @@ mlx5_devcom_register_component(struct mlx5_devcom_dev *devc,
void *data);
void mlx5_devcom_unregister_component(struct mlx5_devcom_comp_dev *devcom);
+int mlx5_devcom_locked_send_event(struct mlx5_devcom_comp_dev *devcom,
+ int event, int rollback_event,
+ void *event_data);
int mlx5_devcom_send_event(struct mlx5_devcom_comp_dev *devcom,
int event, int rollback_event,
void *event_data);
--
2.44.0
* [PATCH net-next V3 03/15] net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 01/15] net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 02/15] net/mlx5: devcom, expose locked variant of send_event Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 04/15] net/mlx5: SD, make primary/secondary role determination more robust Tariq Toukan
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Some devcom events are not expected to fail. Rather than attempting
a rollback that may not be meaningful, allow callers to pass
DEVCOM_CANT_FAIL as the rollback_event to indicate that the event
handler should not fail. If it does, emit a warning and stop
propagating to further peers, but skip the rollback path.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
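The two failure paths described in the commit message above can be
sketched as follows. This is a simplified userspace model under
assumptions: struct cf_group, the CF_* names and the cf_* functions are
hypothetical, and unlike the real dispatcher it does not skip the
sending device.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define CF_CANT_FAIL	INT_MAX	/* mirrors DEVCOM_CANT_FAIL */
#define CF_EV_APPLY	1
#define CF_EV_UNDO	2
#define CF_MAX_PEERS	8

struct cf_group {
	int n_peers;
	int fail_at;			/* peer whose handler fails, or -1 */
	int applied[CF_MAX_PEERS];	/* 1 while the event is applied */
};

static int cf_handler(struct cf_group *g, int event, int peer)
{
	if (event == CF_EV_APPLY) {
		if (peer == g->fail_at)
			return -5;
		g->applied[peer] = 1;
	} else if (event == CF_EV_UNDO) {
		g->applied[peer] = 0;
	}
	return 0;
}

static int cf_send_event(struct cf_group *g, int event, int rollback_event)
{
	int i, err;

	assert(g->n_peers <= CF_MAX_PEERS);
	for (i = 0; i < g->n_peers; i++) {
		err = cf_handler(g, event, i);
		if (!err)
			continue;
		if (rollback_event == CF_CANT_FAIL) {
			/* Warn and stop propagating; no rollback. */
			fprintf(stderr, "event %d failed on peer %d: %d\n",
				event, i, err);
			return err;
		}
		/* Roll back the peers that already handled the event. */
		while (--i >= 0)
			cf_handler(g, rollback_event, i);
		return err;
	}
	return 0;
}
```

With a real rollback event, a failure leaves no peer in the applied
state. With CF_CANT_FAIL, peers that ran before the failure keep their
state and later peers never see the event.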
drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c | 7 ++++++-
drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h | 2 ++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
index 96b4f06d6184..64f92427602d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c
@@ -306,8 +306,13 @@ int mlx5_devcom_locked_send_event(struct mlx5_devcom_comp_dev *devcom,
if (pos != devcom && data) {
err = comp->handler(event, data, event_data);
- if (err)
+ if (err && rollback_event != DEVCOM_CANT_FAIL) {
goto rollback;
+ } else if (err && rollback_event == DEVCOM_CANT_FAIL) {
+ WARN_ONCE(1, "devcom component %d event %d failed: %d\n",
+ comp->id, event, err);
+ return err;
+ }
}
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
index d5c60c03e55c..7a704fafdbd3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
@@ -46,6 +46,8 @@ mlx5_devcom_register_component(struct mlx5_devcom_dev *devc,
void *data);
void mlx5_devcom_unregister_component(struct mlx5_devcom_comp_dev *devcom);
+#define DEVCOM_CANT_FAIL (INT_MAX)
+
int mlx5_devcom_locked_send_event(struct mlx5_devcom_comp_dev *devcom,
int event, int rollback_event,
void *event_data);
--
2.44.0
* [PATCH net-next V3 04/15] net/mlx5: SD, make primary/secondary role determination more robust
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (2 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 03/15] net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 05/15] net/mlx5: SD, add L2 table silent mode query support Tariq Toukan
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Refactor SD group registration so that roles are determined through
devcom events. SD is marked ready only after roles are fully assigned
and the group state is consistent, so the outside accessors added in
later patches can be used without races.
The devcom events:
- SD_PRIMARY_SET event: each device compares bus numbers with peers
to determine which should be primary
- SD_SECONDARIES_SET event: secondaries register themselves with the
elected primary device
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
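The two phases listed in the commit message above can be sketched in
userspace C. Note the simplification: this sketch scans the group
directly instead of exchanging devcom events, and struct role_dev and
the role_* functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ROLE_MAX_GROUP 4

struct role_dev {
	unsigned int bus;		/* PCI bus number */
	int primary;			/* 1 if elected primary */
	struct role_dev *primary_dev;	/* set on secondaries */
	struct role_dev *secondaries[ROLE_MAX_GROUP - 1];
	unsigned int next_secondary_idx;
};

/* Phase 1: the device with the lowest bus number becomes primary. */
static struct role_dev *role_elect_primary(struct role_dev *devs, size_t n)
{
	struct role_dev *primary = &devs[0];
	size_t i;

	assert(n >= 1 && n <= ROLE_MAX_GROUP);
	for (i = 1; i < n; i++)
		if (devs[i].bus < primary->bus)
			primary = &devs[i];
	for (i = 0; i < n; i++) {
		devs[i].primary = (&devs[i] == primary);
		devs[i].primary_dev = devs[i].primary ? NULL : primary;
	}
	return primary;
}

/*
 * Phase 2: every secondary registers with the primary.
 * Returns 1 when the group is complete and may be marked ready.
 */
static int role_register_secondaries(struct role_dev *devs, size_t n,
				     struct role_dev *primary)
{
	size_t i;

	primary->next_secondary_idx = 0;
	for (i = 0; i < n; i++) {
		if (devs[i].primary)
			continue;
		primary->secondaries[primary->next_secondary_idx++] = &devs[i];
	}
	return primary->next_secondary_idx + 1 == n;
}
```

The return value of role_register_secondaries() mirrors the readiness
check in the patch: the group is ready only when the number of
registered secondaries plus the primary equals the group size.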
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 138 +++++++++++++-----
1 file changed, 104 insertions(+), 34 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 25286ecd724e..5209a27f82ed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -26,6 +26,8 @@ struct mlx5_sd {
struct { /* primary */
struct mlx5_core_dev *secondaries[MLX5_SD_MAX_GROUP_SZ - 1];
struct mlx5_flow_table *tx_ft;
+ /* Next index for secondary registration */
+ u8 next_secondary_idx;
};
struct { /* secondary */
struct mlx5_core_dev *primary_dev;
@@ -374,62 +376,128 @@ static void sd_lag_cleanup(struct mlx5_core_dev *dev)
mutex_unlock(&ldev->lock);
}
+enum {
+ SD_PRIMARY_SET,
+ SD_SECONDARIES_SET,
+};
+
+static void sd_handle_primary_set(struct mlx5_core_dev *dev,
+ struct mlx5_core_dev *peer)
+{
+ struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
+ struct mlx5_sd *sd = mlx5_get_sd(dev);
+ struct mlx5_core_dev *candidate;
+ struct mlx5_sd *candidate_sd;
+
+ /* Peer is the device that is being sent to all the other devices in
+ * the group. Hence, use peer to get the candidate device.
+ */
+ candidate = peer_sd->primary ? peer : peer_sd->primary_dev;
+
+ if (dev->pdev->bus->number >= candidate->pdev->bus->number)
+ return;
+
+ candidate_sd = mlx5_get_sd(candidate);
+
+ sd->primary = true;
+ candidate_sd->primary = false;
+ candidate_sd->primary_dev = dev;
+ peer_sd->primary = false;
+ peer_sd->primary_dev = dev;
+}
+
+static void sd_handle_secondaries_set(struct mlx5_core_dev *dev,
+ struct mlx5_core_dev *peer)
+{
+ struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
+ struct mlx5_sd *sd = mlx5_get_sd(dev);
+ u8 idx;
+
+ /* Primary has nothing to register with itself. */
+ if (sd->primary)
+ return;
+
+ /* dev is a secondary device, peer is the primary device.
+ * Secondary registers itself with the primary.
+ */
+ idx = peer_sd->next_secondary_idx++;
+ peer_sd->secondaries[idx] = dev;
+ sd->primary_dev = peer;
+}
+
+static int mlx5_sd_devcom_event(int event, void *my_data, void *event_data)
+{
+ struct mlx5_core_dev *peer = event_data;
+ struct mlx5_core_dev *dev = my_data;
+
+ switch (event) {
+ case SD_PRIMARY_SET:
+ sd_handle_primary_set(dev, peer);
+ break;
+ case SD_SECONDARIES_SET:
+ sd_handle_secondaries_set(dev, peer);
+ break;
+ }
+
+ return 0;
+}
+
static int sd_register(struct mlx5_core_dev *dev)
{
- struct mlx5_devcom_comp_dev *devcom, *pos;
struct mlx5_devcom_match_attr attr = {};
- struct mlx5_core_dev *peer, *primary;
- struct mlx5_sd *sd, *primary_sd;
- int err, i;
+ struct mlx5_devcom_comp_dev *devcom;
+ struct mlx5_core_dev *primary;
+ struct mlx5_sd *primary_sd;
+ struct mlx5_sd *sd;
+ int err;
sd = mlx5_get_sd(dev);
attr.key.val = sd->group_id;
attr.flags = MLX5_DEVCOM_MATCH_FLAGS_NS;
attr.net = mlx5_core_net(dev);
- devcom = mlx5_devcom_register_component(dev->priv.devc, MLX5_DEVCOM_SD_GROUP,
- &attr, NULL, dev);
+ devcom = mlx5_devcom_register_component(dev->priv.devc,
+ MLX5_DEVCOM_SD_GROUP,
+ &attr, mlx5_sd_devcom_event,
+ dev);
if (!devcom)
return -EINVAL;
sd->devcom = devcom;
- if (mlx5_devcom_comp_get_size(devcom) != sd->host_buses)
- return 0;
-
mlx5_devcom_comp_lock(devcom);
- mlx5_devcom_comp_set_ready(devcom, true);
- mlx5_devcom_comp_unlock(devcom);
+ if (mlx5_devcom_comp_get_size(devcom) != sd->host_buses ||
+ mlx5_devcom_comp_is_ready(devcom))
+ goto out;
- if (!mlx5_devcom_for_each_peer_begin(devcom)) {
- err = -ENODEV;
+ /* Send SD_PRIMARY_SET event with this device.
+ * All peers will receive this event and compare to this device.
+ * The one with lowest bus number will be marked as primary.
+ */
+ sd->primary = true;
+ err = mlx5_devcom_locked_send_event(devcom, SD_PRIMARY_SET,
+ SD_PRIMARY_SET, dev);
+ if (err)
goto err_devcom_unreg;
- }
- primary = dev;
- mlx5_devcom_for_each_peer_entry(devcom, peer, pos)
- if (peer->pdev->bus->number < primary->pdev->bus->number)
- primary = peer;
+ /* Broadcast SD_SECONDARIES_SET. Each non-sender peer's handler runs;
+ * the primary's handler returns early so only secondaries register.
+ */
+ primary = sd->primary ? dev : sd->primary_dev;
+ if (!sd->primary)
+ sd_handle_secondaries_set(dev, primary);
+ mlx5_devcom_locked_send_event(devcom, SD_SECONDARIES_SET,
+ DEVCOM_CANT_FAIL, primary);
primary_sd = mlx5_get_sd(primary);
- primary_sd->primary = true;
- i = 0;
- /* loop the secondaries */
- mlx5_devcom_for_each_peer_entry(primary_sd->devcom, peer, pos) {
- struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
-
- primary_sd->secondaries[i++] = peer;
- peer_sd->primary = false;
- peer_sd->primary_dev = primary;
- }
-
- mlx5_devcom_for_each_peer_end(devcom);
+ if (primary_sd->next_secondary_idx + 1 == sd->host_buses)
+ mlx5_devcom_comp_set_ready(devcom, true);
+out:
+ mlx5_devcom_comp_unlock(devcom);
return 0;
err_devcom_unreg:
- mlx5_devcom_comp_lock(sd->devcom);
- mlx5_devcom_comp_set_ready(sd->devcom, false);
- mlx5_devcom_comp_unlock(sd->devcom);
- mlx5_devcom_unregister_component(sd->devcom);
+ mlx5_devcom_comp_unlock(devcom);
+ mlx5_devcom_unregister_component(devcom);
return err;
}
@@ -672,6 +740,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
peer_sd->primary_dev = NULL;
}
primary_sd->primary = false;
+ primary_sd->next_secondary_idx = 0;
mlx5_devcom_comp_set_ready(sd->devcom, false);
mlx5_devcom_comp_unlock(sd->devcom);
sd_unregister(dev);
@@ -719,6 +788,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
peer_sd->primary_dev = NULL;
}
primary_sd->primary = false;
+ primary_sd->next_secondary_idx = 0;
out_ready_false:
mlx5_devcom_comp_set_ready(sd->devcom, false);
out_unlock:
--
2.44.0
* [PATCH net-next V3 05/15] net/mlx5: SD, add L2 table silent mode query support
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (3 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 04/15] net/mlx5: SD, make primary/secondary role determination more robust Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 06/15] net/mlx5: SD, expend vport metadata for SD secondary devices Tariq Toukan
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Add mlx5_fs_cmd_query_l2table_silent() to query the current silent mode
state from firmware. This allows detecting if firmware has already put
secondary devices into silent mode.
During SD group registration, query the silent mode of each device. If
a device is already in silent mode (set by firmware), record this in
the fw_silents_secondaries flag and use it to help determine the
primary/secondary roles.
When fw_silents_secondaries is set, skip the driver-initiated silent
mode set/unset operations since firmware manages this state. This
handles configurations where firmware persistently silences secondary
devices.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
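The role selection described in the commit message above can be
sketched as a single decision function. This is an illustration under
assumptions: struct fw_dev, fw_pick_primary() and
fw_driver_sets_silent() are hypothetical, and the firmware query is
replaced by a precomputed fw_silent field.

```c
#include <assert.h>
#include <stddef.h>

struct fw_dev {
	unsigned int bus;	/* PCI bus number */
	int fw_silent;		/* 1 if firmware reports silent mode */
};

/*
 * Returns the index of the primary device, or -1 if every device is
 * silent. *fw_silents_secondaries is set when any device is silent.
 */
static int fw_pick_primary(const struct fw_dev *devs, int n,
			   int *fw_silents_secondaries)
{
	int i, best = 0, any_silent = 0;

	assert(devs && n > 0 && fw_silents_secondaries);
	for (i = 0; i < n; i++)
		any_silent |= devs[i].fw_silent;
	*fw_silents_secondaries = any_silent;

	if (any_silent) {
		/* Firmware decided: the non-silent device is primary. */
		for (i = 0; i < n; i++)
			if (!devs[i].fw_silent)
				return i;
		return -1;
	}

	/* No firmware hint: fall back to the lowest bus number. */
	for (i = 1; i < n; i++)
		if (devs[i].bus < devs[best].bus)
			best = i;
	return best;
}

/* The driver sets/unsets silent mode only when firmware does not. */
static int fw_driver_sets_silent(int fw_silents_secondaries)
{
	return !fw_silents_secondaries;
}
```

In the first case below the device with the lower bus number is silent,
so the firmware state overrides the bus-number rule.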
.../net/ethernet/mellanox/mlx5/core/fs_cmd.c | 21 ++++
.../net/ethernet/mellanox/mlx5/core/fs_cmd.h | 2 +
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 105 +++++++++++++++---
3 files changed, 114 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 1cd4cd898ec2..8af73393770c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -1217,3 +1217,24 @@ int mlx5_fs_cmd_set_tx_flow_table_root(struct mlx5_core_dev *dev, u32 ft_id, boo
return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
}
+
+int mlx5_fs_cmd_query_l2table_silent(struct mlx5_core_dev *dev, u8 *silent_mode)
+{
+ u32 out[MLX5_ST_SZ_DW(query_l2_table_entry_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(query_l2_table_entry_in)] = {};
+ int err;
+
+ if (!MLX5_CAP_GEN(dev, silent_mode_query))
+ return -EOPNOTSUPP;
+
+ MLX5_SET(query_l2_table_entry_in, in, opcode,
+ MLX5_CMD_OP_QUERY_L2_TABLE_ENTRY);
+ MLX5_SET(query_l2_table_entry_in, in, silent_mode_query, 1);
+
+ err = mlx5_cmd_exec_inout(dev, query_l2_table_entry, in, out);
+ if (err)
+ return err;
+
+ *silent_mode = MLX5_GET(query_l2_table_entry_out, out, silent_mode);
+ return 0;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
index 7eb7b3ffe3d8..60280ff7da50 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
@@ -124,6 +124,8 @@ const struct mlx5_flow_cmds *mlx5_fs_cmd_get_fw_cmds(void);
int mlx5_fs_cmd_set_l2table_entry_silent(struct mlx5_core_dev *dev, u8 silent_mode);
int mlx5_fs_cmd_set_tx_flow_table_root(struct mlx5_core_dev *dev, u32 ft_id, bool disconnect);
+int mlx5_fs_cmd_query_l2table_silent(struct mlx5_core_dev *dev,
+ u8 *silent_mode);
static inline bool mlx5_fs_cmd_is_fw_term_table(struct mlx5_flow_table *ft)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 5209a27f82ed..6b007b038f8b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -22,6 +22,7 @@ struct mlx5_sd {
struct dentry *dfs;
u8 state;
bool primary;
+ bool fw_silents_secondaries;
union {
struct { /* primary */
struct mlx5_core_dev *secondaries[MLX5_SD_MAX_GROUP_SZ - 1];
@@ -167,7 +168,8 @@ static bool mlx5_sd_caps_supported(struct mlx5_core_dev *dev, u8 host_buses)
/* Disconnect secondaries from the network */
if (!MLX5_CAP_GEN(dev, eswitch_manager))
return false;
- if (!MLX5_CAP_GEN(dev, silent_mode_set))
+ if (!MLX5_CAP_GEN(dev, silent_mode_set) &&
+ !MLX5_CAP_GEN(dev, silent_mode_query))
return false;
/* RX steering from primary to secondaries */
@@ -379,23 +381,77 @@ static void sd_lag_cleanup(struct mlx5_core_dev *dev)
enum {
SD_PRIMARY_SET,
SD_SECONDARIES_SET,
+ SD_FW_SILENT_CHECK,
};
-static void sd_handle_primary_set(struct mlx5_core_dev *dev,
- struct mlx5_core_dev *peer)
+static int sd_handle_fw_silent_check(struct mlx5_core_dev *dev,
+ struct mlx5_core_dev *peer)
+{
+ struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
+ struct mlx5_sd *sd = mlx5_get_sd(dev);
+ u8 dev_silent = 0, peer_silent = 0;
+ int err;
+
+ if (peer_sd->fw_silents_secondaries) {
+ sd->fw_silents_secondaries = true;
+ return 0;
+ }
+
+ err = mlx5_fs_cmd_query_l2table_silent(dev, &dev_silent);
+ if (err) {
+ sd_warn(dev, "Failed to query silent mode for dev: %d\n", err);
+ return err;
+ }
+
+ err = mlx5_fs_cmd_query_l2table_silent(peer, &peer_silent);
+ if (err) {
+ sd_warn(dev, "Failed to query silent mode for peer: %d\n", err);
+ return err;
+ }
+
+ if (dev_silent || peer_silent) {
+ sd->fw_silents_secondaries = true;
+ peer_sd->fw_silents_secondaries = true;
+ sd_info(dev, "FW indicates at least one device is silent\n");
+ }
+ return 0;
+}
+
+static int sd_handle_primary_set(struct mlx5_core_dev *dev,
+ struct mlx5_core_dev *peer)
{
struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
struct mlx5_sd *sd = mlx5_get_sd(dev);
struct mlx5_core_dev *candidate;
struct mlx5_sd *candidate_sd;
+ bool dev_should_be_primary;
/* Peer is the device that is being sent to all the other devices in
* the group. Hence, use peer to get the candidate device.
*/
candidate = peer_sd->primary ? peer : peer_sd->primary_dev;
- if (dev->pdev->bus->number >= candidate->pdev->bus->number)
- return;
+ if (sd->fw_silents_secondaries) {
+ u8 candidate_silent = 0;
+ int err;
+
+ err = mlx5_fs_cmd_query_l2table_silent(candidate,
+ &candidate_silent);
+ if (err) {
+ sd_warn(candidate, "Failed to query silent mode for dev: %d\n",
+ err);
+ return err;
+ }
+ /* Candidate is silent, dev should be primary */
+ dev_should_be_primary = candidate_silent;
+ } else {
+ /* No FW silent mode, use bus number */
+ dev_should_be_primary =
+ dev->pdev->bus->number < candidate->pdev->bus->number;
+ }
+
+ if (!dev_should_be_primary)
+ return 0;
candidate_sd = mlx5_get_sd(candidate);
@@ -404,6 +460,7 @@ static void sd_handle_primary_set(struct mlx5_core_dev *dev,
candidate_sd->primary_dev = dev;
peer_sd->primary = false;
peer_sd->primary_dev = dev;
+ return 0;
}
static void sd_handle_secondaries_set(struct mlx5_core_dev *dev,
@@ -431,12 +488,13 @@ static int mlx5_sd_devcom_event(int event, void *my_data, void *event_data)
struct mlx5_core_dev *dev = my_data;
switch (event) {
+ case SD_FW_SILENT_CHECK:
+ return sd_handle_fw_silent_check(dev, peer);
case SD_PRIMARY_SET:
- sd_handle_primary_set(dev, peer);
- break;
+ return sd_handle_primary_set(dev, peer);
case SD_SECONDARIES_SET:
sd_handle_secondaries_set(dev, peer);
- break;
+ return 0;
}
return 0;
@@ -469,9 +527,21 @@ static int sd_register(struct mlx5_core_dev *dev)
mlx5_devcom_comp_is_ready(devcom))
goto out;
+ /* If silent mode query is supported, ask each device whether it is
+ * silent and propagate the result to the whole group. In each group
+ * only one device is not silent.
+ */
+ if (MLX5_CAP_GEN(dev, silent_mode_query)) {
+ err = mlx5_devcom_locked_send_event(devcom, SD_FW_SILENT_CHECK,
+ SD_FW_SILENT_CHECK, dev);
+ if (err)
+ goto err_devcom_unreg;
+ }
+
/* Send SD_PRIMARY_SET event with this device.
* All peers will receive this event and compare to this device.
- * The one with lowest bus number will be marked as primary.
+ * If fw_silents_secondaries is set, choose non-silent device.
+ * Otherwise use bus number.
*/
sd->primary = true;
err = mlx5_devcom_locked_send_event(devcom, SD_PRIMARY_SET,
@@ -589,9 +659,11 @@ static int sd_cmd_set_secondary(struct mlx5_core_dev *secondary,
struct mlx5_sd *sd = mlx5_get_sd(secondary);
int err;
- err = mlx5_fs_cmd_set_l2table_entry_silent(secondary, 1);
- if (err)
- return err;
+ if (!primary_sd->fw_silents_secondaries) {
+ err = mlx5_fs_cmd_set_l2table_entry_silent(secondary, 1);
+ if (err)
+ return err;
+ }
err = sd_secondary_create_alias_ft(secondary, primary, primary_sd->tx_ft,
&sd->alias_obj_id, alias_key);
@@ -607,15 +679,20 @@ static int sd_cmd_set_secondary(struct mlx5_core_dev *secondary,
err_destroy_alias_ft:
sd_secondary_destroy_alias_ft(secondary);
err_unset_silent:
- mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
+ if (!primary_sd->fw_silents_secondaries)
+ mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
return err;
}
static void sd_cmd_unset_secondary(struct mlx5_core_dev *secondary)
{
+ struct mlx5_sd *primary_sd;
+
+ primary_sd = mlx5_get_sd(mlx5_sd_get_primary(secondary));
mlx5_fs_cmd_set_tx_flow_table_root(secondary, 0, true);
sd_secondary_destroy_alias_ft(secondary);
- mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
+ if (!primary_sd->fw_silents_secondaries)
+ mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
}
static void sd_print_group(struct mlx5_core_dev *primary)
--
2.44.0
* [PATCH net-next V3 06/15] net/mlx5: SD, expend vport metadata for SD secondary devices
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (4 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 05/15] net/mlx5: SD, add L2 table silent mode query support Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 07/15] net/mlx5: SD, support switchdev mode transition with shared FDB Tariq Toukan
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
In Socket Direct configurations the primary and secondary PFs share the
same native_port_num. The eswitch vport metadata encodes pf_num in its
upper bits to distinguish vports across PFs. Without SD-awareness, both
PFs generate identical metadata, causing FDB rules to steer traffic to
the wrong representor.
Add mlx5_sd_pf_num_get() which remaps the pf_num for SD devices.
Use it so each PF in an SD group produces unique vport metadata.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
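The remapping in the commit message above is plain arithmetic and can
be checked in isolation. The sketch below assumes the metadata layout
stated in the code comments of this patch (4 bits of pf_num above 12
bits of unique id, with 0xf reserved) and assumes position 0 in the SD
group is the primary; the PFN_* and pfn_* names are hypothetical.

```c
#include <assert.h>

#define PFN_PFNUM_BITS	4
#define PFN_VPORT_BITS	12
#define PFN_MAX_PF_NUM	((1 << PFN_PFNUM_BITS) - 2)	/* 0xf is reserved */

/*
 * dev_index: what mlx5_get_dev_index() would return for the device.
 * sd_pos:    position of the device inside its SD group.
 */
static int pfn_sd_pf_num(int dev_index, int host_buses, int sd_pos)
{
	assert(host_buses >= 1 && sd_pos >= 0 && sd_pos < host_buses);
	return dev_index * host_buses + sd_pos;
}

/* Returns 0 when pf_num does not fit, as the allocator does. */
static unsigned int pfn_metadata(int pf_num, unsigned int id)
{
	if (pf_num < 0 || pf_num > PFN_MAX_PF_NUM)
		return 0;
	return ((unsigned int)pf_num << PFN_VPORT_BITS) |
	       (id & ((1u << PFN_VPORT_BITS) - 1));
}
```

For a group with host_buses == 2 and dev_index == 0, the primary maps
to pf_num 0 and the secondary to pf_num 1, so the same unique id yields
different metadata on each PF.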
.../mellanox/mlx5/core/eswitch_offloads.c | 6 +++---
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 21 +++++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/lib/sd.h | 1 +
3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 12805e80ce57..366531d8ef02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3472,12 +3472,12 @@ u32 mlx5_esw_match_metadata_alloc(struct mlx5_eswitch *esw)
u32 vport_end_ida = (1 << ESW_VPORT_BITS) - 1;
/* Reserve 0xf for internal port offload */
u32 max_pf_num = (1 << ESW_PFNUM_BITS) - 2;
- u32 pf_num;
+ int pf_num;
int id;
/* Only 4 bits of pf_num */
- pf_num = mlx5_get_dev_index(esw->dev);
- if (pf_num > max_pf_num)
+ pf_num = mlx5_sd_pf_num_get(esw->dev);
+ if (pf_num < 0 || pf_num > max_pf_num)
return 0;
/* Metadata is 4 bits of PFNUM and 12 bits of unique id */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 6b007b038f8b..c670ed1dd63c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -85,6 +85,27 @@ bool mlx5_sd_is_primary(struct mlx5_core_dev *dev)
return sd->primary;
}
+int mlx5_sd_pf_num_get(struct mlx5_core_dev *dev)
+{
+ struct mlx5_sd *sd = mlx5_get_sd(dev);
+ int pf_num = mlx5_get_dev_index(dev);
+ struct mlx5_core_dev *pos;
+ int i;
+
+ if (!sd)
+ return pf_num;
+
+ mlx5_devcom_comp_assert_locked(sd->devcom);
+ if (!mlx5_devcom_comp_is_ready(sd->devcom))
+ return -ENODEV;
+
+ mlx5_sd_for_each_dev(i, mlx5_sd_get_primary(dev), pos)
+ if (pos == dev)
+ break;
+
+ return pf_num * sd->host_buses + i;
+}
+
struct mlx5_core_dev *
mlx5_sd_primary_get_peer(struct mlx5_core_dev *primary, int idx)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
index 011702ff6f02..7a41adbcee71 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -12,6 +12,7 @@ struct mlx5_sd;
struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev);
bool mlx5_sd_is_primary(struct mlx5_core_dev *dev);
+int mlx5_sd_pf_num_get(struct mlx5_core_dev *dev);
struct mlx5_core_dev *mlx5_sd_primary_get_peer(struct mlx5_core_dev *primary, int idx);
int mlx5_sd_ch_ix_get_dev_ix(struct mlx5_core_dev *dev, int ch_ix);
int mlx5_sd_ch_ix_get_vec_ix(struct mlx5_core_dev *dev, int ch_ix);
--
2.44.0
* [PATCH net-next V3 07/15] net/mlx5: SD, support switchdev mode transition with shared FDB
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (5 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 06/15] net/mlx5: SD, expend vport metadata for SD secondary devices Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 08/15] net/mlx5: E-Switch, notify SD on eswitch disable Tariq Toukan
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
When the eswitch transitions, propagate the change to SD: secondaries
get their TX flow table root reconfigured for the new mode, and when
all group devices move to switchdev, the per-group shared FDB is
activated.
Shared FDB activation is best-effort: a failure does not block the
eswitch transition, and the next transition retries it.
Note: the existing mlx5_get_sd() guard that blocks switchdev for SD
devices is intentionally retained. It will be removed once all
supporting patches are in place.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../mellanox/mlx5/core/eswitch_offloads.c | 12 +-
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 138 +++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/lib/sd.h | 7 +
3 files changed, 154 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 366531d8ef02..915571a1586c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -46,6 +46,7 @@
#include "fs_core.h"
#include "lib/mlx5.h"
#include "lib/devcom.h"
+#include "lib/sd.h"
#include "lib/eq.h"
#include "lib/fs_chains.h"
#include "en_tc.h"
@@ -3164,6 +3165,9 @@ static void esw_unset_master_egress_rule(struct mlx5_core_dev *dev,
vport = mlx5_eswitch_get_vport(dev->priv.eswitch,
dev->priv.eswitch->manager_vport);
+ if (!vport->egress.acl)
+ return;
+
esw_acl_egress_ofld_bounce_rule_destroy(vport, MLX5_CAP_GEN(slave_dev, vhca_id));
if (xa_empty(&vport->egress.offloads.bounce_rules)) {
@@ -3182,6 +3186,9 @@ int mlx5_eswitch_offloads_single_fdb_add_one(struct mlx5_eswitch *master_esw,
if (err)
return err;
+ if (!mlx5_sd_is_primary(slave_esw->dev))
+ return 0;
+
err = esw_set_master_egress_rule(master_esw->dev,
slave_esw->dev, max_slaves);
if (err)
@@ -3401,7 +3408,7 @@ void mlx5_esw_offloads_devcom_init(struct mlx5_eswitch *esw,
return;
if ((MLX5_VPORT_MANAGER(esw->dev) || mlx5_core_is_ecpf_esw_manager(esw->dev)) &&
- !mlx5_lag_is_supported(esw->dev))
+ (!mlx5_lag_is_supported(esw->dev) && !mlx5_get_sd(esw->dev)))
return;
xa_init(&esw->paired);
@@ -4306,6 +4313,9 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
mlx5_esw_unlock(esw);
enable_lag:
mlx5_lag_enable_change(esw->dev);
+ /* Shared FDB activation creates a LAG, which changes the reps. */
+ if (!err)
+ mlx5_sd_eswitch_mode_set(esw->dev, mlx5_mode);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index c670ed1dd63c..b35795bac098 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -5,6 +5,8 @@
#include "../lag/lag.h"
#include "mlx5_core.h"
#include "lib/mlx5.h"
+#include "devlink.h"
+#include "eswitch.h"
#include "fs_cmd.h"
#include <linux/mlx5/eswitch.h>
#include <linux/mlx5/vport.h>
@@ -33,6 +35,8 @@ struct mlx5_sd {
struct { /* secondary */
struct mlx5_core_dev *primary_dev;
u32 alias_obj_id;
+ /* TX flow table root in switchdev (silent) config */
+ bool tx_root_silent;
};
};
};
@@ -672,6 +676,29 @@ static void sd_secondary_destroy_alias_ft(struct mlx5_core_dev *secondary)
MLX5_GENERAL_OBJECT_TYPES_FLOW_TABLE_ALIAS);
}
+static int mlx5_sd_secondary_conf_tx_root(struct mlx5_core_dev *secondary,
+ bool disconnect)
+{
+ struct mlx5_sd *sd = mlx5_get_sd(secondary);
+ int err;
+
+ /* Idempotent: skip if TX root is already in the requested state. */
+ if (sd->tx_root_silent == disconnect)
+ return 0;
+
+ if (disconnect)
+ err = mlx5_fs_cmd_set_tx_flow_table_root(secondary, 0, true);
+ else
+ err = mlx5_fs_cmd_set_tx_flow_table_root(secondary,
+ sd->alias_obj_id,
+ false);
+ if (err)
+ return err;
+
+ sd->tx_root_silent = disconnect;
+ return 0;
+}
+
static int sd_cmd_set_secondary(struct mlx5_core_dev *secondary,
struct mlx5_core_dev *primary,
u8 *alias_key)
@@ -691,9 +718,11 @@ static int sd_cmd_set_secondary(struct mlx5_core_dev *secondary,
if (err)
goto err_unset_silent;
- err = mlx5_fs_cmd_set_tx_flow_table_root(secondary, sd->alias_obj_id, false);
+ err = mlx5_fs_cmd_set_tx_flow_table_root(secondary, sd->alias_obj_id,
+ false);
if (err)
goto err_destroy_alias_ft;
+ sd->tx_root_silent = false;
return 0;
@@ -710,7 +739,7 @@ static void sd_cmd_unset_secondary(struct mlx5_core_dev *secondary)
struct mlx5_sd *primary_sd;
primary_sd = mlx5_get_sd(mlx5_sd_get_primary(secondary));
- mlx5_fs_cmd_set_tx_flow_table_root(secondary, 0, true);
+ mlx5_sd_secondary_conf_tx_root(secondary, true);
sd_secondary_destroy_alias_ft(secondary);
if (!primary_sd->fw_silents_secondaries)
mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
@@ -939,6 +968,111 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
return &primary_adev->adev;
}
+#ifdef CONFIG_MLX5_ESWITCH
+/* All SD members must have completed esw_offloads_enable (i.e., reached
+ * mlx5_esw_offloads_devcom_init) and become eswitch-peers of the primary.
+ * Until then, mlx5_eswitch_is_peer() returns false for the not-yet-paired
+ * member and shared_fdb_supported_filter would reject. When all PFs transition
+ * in parallel, only the last one to finish satisfies this gate; the earlier
+ * ones return 0 silently here.
+ */
+static bool mlx5_sd_all_paired(struct mlx5_core_dev *primary)
+{
+ struct mlx5_eswitch *primary_esw = primary->priv.eswitch;
+ struct mlx5_core_dev *pos;
+ int i;
+
+ mlx5_sd_for_each_secondary(i, primary, pos) {
+ if (!mlx5_eswitch_is_peer(primary_esw, pos->priv.eswitch))
+ return false;
+ }
+ return true;
+}
+
+static void mlx5_sd_activate_shared_fdb(struct mlx5_core_dev *primary)
+{
+ struct mlx5_sd *sd = mlx5_get_sd(primary);
+ struct mlx5_lag *ldev;
+ struct lag_func *pf;
+ int err;
+ int i;
+
+ ldev = mlx5_lag_dev(primary);
+ if (!ldev) {
+ sd_warn(primary, "Shared FDB MUST have ldev\n");
+ return;
+ }
+
+ mutex_lock(&ldev->lock);
+
+ if (ldev->mode_changes_in_progress)
+ goto unlock;
+
+ if (!mlx5_sd_all_paired(primary))
+ goto unlock;
+
+ /* Check if SD FDB is already active for this group */
+ mlx5_lag_for_each(i, 0, ldev, sd->group_id) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (pf->sd_fdb_active)
+ goto unlock;
+ break;
+ }
+
+ if (!mlx5_lag_shared_fdb_supported_filter(ldev, sd->group_id)) {
+ sd_warn(primary, "Shared FDB not supported\n");
+ goto unlock;
+ }
+
+ err = mlx5_lag_shared_fdb_create(ldev, NULL, 0, sd->group_id);
+ if (err)
+ sd_warn(primary, "Failed to create shared FDB: %d\n", err);
+ else
+ sd_info(primary, "Shared FDB created\n");
+
+unlock:
+ mutex_unlock(&ldev->lock);
+}
+
+void mlx5_sd_eswitch_mode_set(struct mlx5_core_dev *dev, u16 mlx5_mode)
+{
+ struct mlx5_core_dev *primary;
+ struct mlx5_sd *sd;
+ int err;
+
+ sd = mlx5_get_sd(dev);
+ if (!sd || !mlx5_devcom_comp_is_ready(sd->devcom))
+ return;
+
+ mlx5_devcom_comp_lock(sd->devcom);
+ if (!mlx5_devcom_comp_is_ready(sd->devcom))
+ goto unlock;
+
+ primary = mlx5_sd_get_primary(dev);
+
+ /* Secondary devices need TX root reconfiguration */
+ if (dev != primary) {
+ bool disconnect = (mlx5_mode == MLX5_ESWITCH_OFFLOADS);
+
+ err = mlx5_sd_secondary_conf_tx_root(dev, disconnect);
+ if (err) {
+ sd_warn(dev, "Failed to set TX root: %d\n", err);
+ goto unlock;
+ }
+ }
+
+ /* Try to activate shared FDB when all devices are in switchdev.
+ * Shared FDB is optional - failure here doesn't fail the transition.
+ */
+ if (mlx5_mode == MLX5_ESWITCH_OFFLOADS)
+ mlx5_sd_activate_shared_fdb(primary);
+
+unlock:
+ mlx5_devcom_comp_unlock(sd->devcom);
+}
+
+#endif /* CONFIG_MLX5_ESWITCH */
+
void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
struct auxiliary_device *adev)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
index 7a41adbcee71..cb88bf34079a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -45,6 +45,13 @@ mlx5_sd_get_devcom(struct mlx5_core_dev *dev)
}
#endif
+#ifdef CONFIG_MLX5_ESWITCH
+void mlx5_sd_eswitch_mode_set(struct mlx5_core_dev *dev, u16 mlx5_mode);
+#else
+static inline void
+mlx5_sd_eswitch_mode_set(struct mlx5_core_dev *dev, u16 mlx5_mode) { return; }
+#endif
+
#define mlx5_sd_for_each_dev_from_to(i, primary, ix_from, to, pos) \
for (i = ix_from; \
(pos = mlx5_sd_primary_get_peer(primary, i)) && pos != (to); i++)
--
2.44.0
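The idempotent TX root handling in mlx5_sd_secondary_conf_tx_root() above can be sketched outside the kernel as follows. This is a minimal model under stated assumptions: `struct sketch_secondary`, `sketch_fw_set_tx_root()` and the `fail_next` error injection are hypothetical stand-ins for the device state and the firmware command, and only the cached-state pattern is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a secondary device's SD state. */
struct sketch_secondary {
	bool tx_root_silent;	/* cached state, as in the patch */
	int fw_cmds_issued;	/* counts simulated firmware commands */
	int fail_next;		/* if non-zero, next command fails with this */
};

/* Simulated firmware command; not a real mlx5 call. */
static int sketch_fw_set_tx_root(struct sketch_secondary *s, bool disconnect)
{
	(void)disconnect;
	s->fw_cmds_issued++;
	if (s->fail_next) {
		int err = s->fail_next;

		s->fail_next = 0;
		return err;
	}
	return 0;
}

static int sketch_conf_tx_root(struct sketch_secondary *s, bool disconnect)
{
	int err;

	/* Skip the command if the requested state is already in place. */
	if (s->tx_root_silent == disconnect)
		return 0;

	err = sketch_fw_set_tx_root(s, disconnect);
	if (err)
		return err;	/* cached state stays unchanged on failure */

	s->tx_root_silent = disconnect;
	return 0;
}
```

Because the cached flag is updated only after the command succeeds, a repeated request is a no-op, and a failed request leaves the flag unchanged so that a later call attempts the command again.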
* [PATCH net-next V3 08/15] net/mlx5: E-Switch, notify SD on eswitch disable
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (6 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 07/15] net/mlx5: SD, support switchdev mode transition with shared FDB Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 09/15] net/mlx5: LAG, store demux resources per master lag_func Tariq Toukan
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
When eswitch is disabled, notify the SD layer so it can clean up
SD-specific resources such as the TX flow table root configuration
on secondary devices.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index f8cfbf76dd6a..93d51f09b17f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2072,6 +2072,7 @@ void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
mlx5_esw_reps_unblock(esw);
esw->mode = MLX5_ESWITCH_LEGACY;
+ mlx5_sd_eswitch_mode_set(esw->dev, MLX5_ESWITCH_LEGACY);
mlx5_lag_enable_change(esw->dev);
}
--
2.44.0
* [PATCH net-next V3 09/15] net/mlx5: LAG, store demux resources per master lag_func
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (7 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 08/15] net/mlx5: E-Switch, notify SD on eswitch disable Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change Tariq Toukan
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
The lag demux resources (flow table, flow group, and rules xarray)
are stored on the shared ldev. With Socket Direct, multiple SD groups
each create their own demux FT/FG during their master's IB device
initialization. Since they all write to the same ldev fields, the
second group's init overwrites the first group's pointers, leaking
the first group's FT/FG.
During teardown, the cleanup uses the overwritten pointers, destroying
the wrong group's resources and leaving leaked flow tables in the LAG
namespace. These leaked tables can interfere with subsequently created
demux tables.
Move the demux resources from the shared ldev to per-master lag_func
instances. Each master device now owns its own independent demux
state. The rule_add and rule_del helpers look up the appropriate
master's lag_func via the existing filter/group infrastructure.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 95 +++++++++++++------
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 7 +-
2 files changed, 68 insertions(+), 34 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index dd3f18f85466..e23c1e81b98f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1590,7 +1590,7 @@ struct mlx5_devcom_comp_dev *mlx5_lag_get_devcom_comp(struct mlx5_lag *ldev)
static int mlx5_lag_demux_ft_fg_init(struct mlx5_core_dev *dev,
struct mlx5_flow_table_attr *ft_attr,
- struct mlx5_lag *ldev)
+ struct lag_func *pf)
{
#ifdef CONFIG_MLX5_ESWITCH
struct mlx5_flow_namespace *ns;
@@ -1601,20 +1601,20 @@ static int mlx5_lag_demux_ft_fg_init(struct mlx5_core_dev *dev,
if (!ns)
return 0;
- ldev->lag_demux_ft = mlx5_create_flow_table(ns, ft_attr);
- if (IS_ERR(ldev->lag_demux_ft))
- return PTR_ERR(ldev->lag_demux_ft);
+ pf->lag_demux_ft = mlx5_create_flow_table(ns, ft_attr);
+ if (IS_ERR(pf->lag_demux_ft))
+ return PTR_ERR(pf->lag_demux_ft);
fg = mlx5_esw_lag_demux_fg_create(dev->priv.eswitch,
- ldev->lag_demux_ft);
+ pf->lag_demux_ft);
if (IS_ERR(fg)) {
err = PTR_ERR(fg);
- mlx5_destroy_flow_table(ldev->lag_demux_ft);
- ldev->lag_demux_ft = NULL;
+ mlx5_destroy_flow_table(pf->lag_demux_ft);
+ pf->lag_demux_ft = NULL;
return err;
}
- ldev->lag_demux_fg = fg;
+ pf->lag_demux_fg = fg;
return 0;
#else
return -EOPNOTSUPP;
@@ -1623,7 +1623,7 @@ static int mlx5_lag_demux_ft_fg_init(struct mlx5_core_dev *dev,
static int mlx5_lag_demux_fw_init(struct mlx5_core_dev *dev,
struct mlx5_flow_table_attr *ft_attr,
- struct mlx5_lag *ldev)
+ struct lag_func *pf)
{
struct mlx5_flow_namespace *ns;
int err;
@@ -1632,12 +1632,12 @@ static int mlx5_lag_demux_fw_init(struct mlx5_core_dev *dev,
if (!ns)
return 0;
- ldev->lag_demux_fg = NULL;
+ pf->lag_demux_fg = NULL;
ft_attr->max_fte = 1;
- ldev->lag_demux_ft = mlx5_create_lag_demux_flow_table(ns, ft_attr);
- if (IS_ERR(ldev->lag_demux_ft)) {
- err = PTR_ERR(ldev->lag_demux_ft);
- ldev->lag_demux_ft = NULL;
+ pf->lag_demux_ft = mlx5_create_lag_demux_flow_table(ns, ft_attr);
+ if (IS_ERR(pf->lag_demux_ft)) {
+ err = PTR_ERR(pf->lag_demux_ft);
+ pf->lag_demux_ft = NULL;
return err;
}
@@ -1648,6 +1648,7 @@ int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
struct mlx5_flow_table_attr *ft_attr)
{
struct mlx5_lag *ldev;
+ struct lag_func *pf;
if (!ft_attr)
return -EINVAL;
@@ -1656,12 +1657,16 @@ int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
if (!ldev)
return -ENODEV;
- xa_init(&ldev->lag_demux_rules);
+ pf = mlx5_lag_pf_by_dev(ldev, dev);
+ if (!pf)
+ return -ENODEV;
+
+ xa_init(&pf->lag_demux_rules);
if (mlx5_get_sd(dev))
- return mlx5_lag_demux_ft_fg_init(dev, ft_attr, ldev);
+ return mlx5_lag_demux_ft_fg_init(dev, ft_attr, pf);
- return mlx5_lag_demux_fw_init(dev, ft_attr, ldev);
+ return mlx5_lag_demux_fw_init(dev, ft_attr, pf);
}
EXPORT_SYMBOL(mlx5_lag_demux_init);
@@ -1670,40 +1675,63 @@ void mlx5_lag_demux_cleanup(struct mlx5_core_dev *dev)
struct mlx5_flow_handle *rule;
struct mlx5_lag *ldev;
unsigned long vport_num;
+ struct lag_func *pf;
ldev = mlx5_lag_dev(dev);
if (!ldev)
return;
- xa_for_each(&ldev->lag_demux_rules, vport_num, rule)
+ pf = mlx5_lag_pf_by_dev(ldev, dev);
+ if (!pf)
+ return;
+
+ xa_for_each(&pf->lag_demux_rules, vport_num, rule)
mlx5_del_flow_rules(rule);
- xa_destroy(&ldev->lag_demux_rules);
+ xa_destroy(&pf->lag_demux_rules);
- if (ldev->lag_demux_fg)
- mlx5_destroy_flow_group(ldev->lag_demux_fg);
- if (ldev->lag_demux_ft)
- mlx5_destroy_flow_table(ldev->lag_demux_ft);
- ldev->lag_demux_fg = NULL;
- ldev->lag_demux_ft = NULL;
+ if (pf->lag_demux_fg)
+ mlx5_destroy_flow_group(pf->lag_demux_fg);
+ if (pf->lag_demux_ft)
+ mlx5_destroy_flow_table(pf->lag_demux_ft);
+ pf->lag_demux_fg = NULL;
+ pf->lag_demux_ft = NULL;
}
EXPORT_SYMBOL(mlx5_lag_demux_cleanup);
+static struct lag_func *mlx5_lag_dev_get_master_pf(struct mlx5_lag *ldev,
+ struct mlx5_core_dev *dev)
+{
+ u32 filter = mlx5_lag_get_filter(ldev, dev);
+ int idx;
+
+ idx = mlx5_lag_get_dev_index_by_seq_filter(ldev, MLX5_LAG_P1, filter);
+ if (idx < 0)
+ return NULL;
+
+ return mlx5_lag_pf(ldev, idx);
+}
+
int mlx5_lag_demux_rule_add(struct mlx5_core_dev *vport_dev, u16 vport_num,
int index)
{
struct mlx5_flow_handle *rule;
+ struct lag_func *master;
struct mlx5_lag *ldev;
int err;
ldev = mlx5_lag_dev(vport_dev);
- if (!ldev || !ldev->lag_demux_fg)
+ if (!ldev)
return 0;
- if (xa_load(&ldev->lag_demux_rules, index))
+ master = mlx5_lag_dev_get_master_pf(ldev, vport_dev);
+ if (!master || !master->lag_demux_fg)
+ return 0;
+
+ if (xa_load(&master->lag_demux_rules, index))
return 0;
rule = mlx5_esw_lag_demux_rule_create(vport_dev->priv.eswitch,
- vport_num, ldev->lag_demux_ft);
+ vport_num, master->lag_demux_ft);
if (IS_ERR(rule)) {
err = PTR_ERR(rule);
mlx5_core_warn(vport_dev,
@@ -1712,7 +1740,7 @@ int mlx5_lag_demux_rule_add(struct mlx5_core_dev *vport_dev, u16 vport_num,
return err;
}
- err = xa_err(xa_store(&ldev->lag_demux_rules, index, rule,
+ err = xa_err(xa_store(&master->lag_demux_rules, index, rule,
GFP_KERNEL));
if (err) {
mlx5_del_flow_rules(rule);
@@ -1728,13 +1756,18 @@ EXPORT_SYMBOL(mlx5_lag_demux_rule_add);
void mlx5_lag_demux_rule_del(struct mlx5_core_dev *dev, int index)
{
struct mlx5_flow_handle *rule;
+ struct lag_func *master_pf;
struct mlx5_lag *ldev;
ldev = mlx5_lag_dev(dev);
- if (!ldev || !ldev->lag_demux_fg)
+ if (!ldev)
+ return;
+
+ master_pf = mlx5_lag_dev_get_master_pf(ldev, dev);
+ if (!master_pf || !master_pf->lag_demux_fg)
return;
- rule = xa_erase(&ldev->lag_demux_rules, index);
+ rule = xa_erase(&master_pf->lag_demux_rules, index);
if (rule)
mlx5_del_flow_rules(rule);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 0296f752bb4c..d645c2cfca4d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -59,6 +59,10 @@ struct lag_func {
struct mlx5_nb port_change_nb;
u32 group_id; /* SD group ID, 0 = not SD */
bool sd_fdb_active; /* set on all SD group members */
+ /* Lag demux resources - only populated on master devices */
+ struct mlx5_flow_table *lag_demux_ft;
+ struct mlx5_flow_group *lag_demux_fg;
+ struct xarray lag_demux_rules;
};
/* Used for collection of netdev event info. */
@@ -95,9 +99,6 @@ struct mlx5_lag {
/* Protect lag fields/state changes */
struct mutex lock;
struct lag_mpesw lag_mpesw;
- struct mlx5_flow_table *lag_demux_ft;
- struct mlx5_flow_group *lag_demux_fg;
- struct xarray lag_demux_rules;
};
static inline struct mlx5_lag *
--
2.44.0
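The overwrite problem described in the commit message above, and the per-master layout that avoids it, can be modelled with a small standalone sketch. All `sketch_` names are hypothetical, integer handles stand in for flow table pointers, and treating the lowest-index member of a group as its master is an assumption made here for brevity; the driver resolves the master through its own filter helpers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_FUNCS 4

struct sketch_lag_func {
	int group_id;	/* SD group ID, 0 = not SD */
	int demux_ft;	/* stand-in for a flow table handle, 0 = none */
};

struct sketch_lag {
	int shared_demux_ft;	/* old layout: one slot for all groups */
	struct sketch_lag_func pf[SKETCH_MAX_FUNCS];
};

/* Assumption: the master of a group is its lowest-index member. */
static struct sketch_lag_func *sketch_master_pf(struct sketch_lag *ldev,
						int group_id)
{
	for (size_t i = 0; i < SKETCH_MAX_FUNCS; i++)
		if (ldev->pf[i].group_id == group_id)
			return &ldev->pf[i];
	return NULL;
}

/* Old layout: every group writes the same field on the shared ldev. */
static void sketch_init_shared(struct sketch_lag *ldev, int ft)
{
	ldev->shared_demux_ft = ft;
}

/* New layout: each group's master keeps its own handle. */
static int sketch_init_per_master(struct sketch_lag *ldev, int group_id,
				  int ft)
{
	struct sketch_lag_func *master = sketch_master_pf(ldev, group_id);

	if (!master)
		return -1;
	master->demux_ft = ft;
	return 0;
}
```

With two groups, the shared slot ends up holding only the second group's handle, so the first handle is no longer reachable for cleanup. In the per-master layout each group's handle remains reachable through its own master entry.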
* [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (8 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 09/15] net/mlx5: LAG, store demux resources per master lag_func Tariq Toukan
@ 2026-06-12 11:38 ` Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 11/15] net/mlx5: LAG, introduce software vport LAG implementation Tariq Toukan
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:38 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Extend mlx5_lag_disable_change() to properly disable both regular LAG
and SD LAG when requested. Each LAG type uses its own devcom component
for locking.
Use the mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
which is needed for proper locking when disabling SD LAG.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 29 +++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index e23c1e81b98f..84eff995cad1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -2494,13 +2494,22 @@ EXPORT_SYMBOL(mlx5_lag_is_shared_fdb);
void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
{
+ struct mlx5_devcom_comp_dev *sd_devcom = mlx5_sd_get_devcom(dev);
+ struct mlx5_core_dev *primary = dev;
struct mlx5_lag *ldev;
+ struct lag_func *pf;
+ int i;
ldev = mlx5_lag_dev(dev);
if (!ldev)
return;
- mlx5_devcom_comp_lock(dev->priv.hca_devcom_comp);
+ if (sd_devcom) {
+ mlx5_devcom_comp_lock(sd_devcom);
+ primary = mlx5_sd_get_primary(dev) ?: dev;
+ mlx5_devcom_comp_unlock(sd_devcom);
+ }
+ mlx5_devcom_comp_lock(primary->priv.hca_devcom_comp);
mutex_lock(&ldev->lock);
ldev->mode_changes_in_progress++;
@@ -2512,7 +2521,23 @@ void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
}
mutex_unlock(&ldev->lock);
- mlx5_devcom_comp_unlock(dev->priv.hca_devcom_comp);
+ mlx5_devcom_comp_unlock(primary->priv.hca_devcom_comp);
+
+ if (!sd_devcom)
+ return;
+
+ /* Teardown SD shared FDB for this device's group if active */
+ mlx5_devcom_comp_lock(sd_devcom);
+ mutex_lock(&ldev->lock);
+ mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (pf->dev == dev && pf->sd_fdb_active) {
+ mlx5_lag_shared_fdb_destroy(ldev, pf->group_id);
+ break;
+ }
+ }
+ mutex_unlock(&ldev->lock);
+ mlx5_devcom_comp_unlock(sd_devcom);
}
void mlx5_lag_enable_change(struct mlx5_core_dev *dev)
--
2.44.0
* [PATCH net-next V3 11/15] net/mlx5: LAG, introduce software vport LAG implementation
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (9 preceding siblings ...)
2026-06-12 11:38 ` [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change Tariq Toukan
@ 2026-06-12 11:39 ` Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 12/15] net/mlx5: LAG, add MPESW over SD LAG support Tariq Toukan
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:39 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
SD LAG is a virtual LAG without hardware LAG support, so it cannot use
the firmware vport LAG commands. Implement a software-based vport LAG
using egress ACL bounce rules.
Add esw_set_slave_egress_rule() to create an egress ACL rule on the
slave's manager vport that bounces traffic to the master's manager
vport. This achieves the same traffic steering as hardware vport LAG.
Redirect mlx5_cmd_create_vport_lag() and mlx5_cmd_destroy_vport_lag()
to the software implementation when operating in SD LAG mode.
Also adjust lag_demux creation to check for SD LAG mode.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 4 +
.../mellanox/mlx5/core/eswitch_offloads.c | 142 ++++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 49 +++++-
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 14 ++
.../mellanox/mlx5/core/lag/shared_fdb.c | 74 ++++++++-
5 files changed, 280 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 94a530d19828..a5f0774834fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -950,6 +950,10 @@ void esw_vport_change_handle_locked(struct mlx5_vport *vport);
bool mlx5_esw_offloads_controller_valid(const struct mlx5_eswitch *esw, u32 controller);
+int mlx5_eswitch_offloads_vport_lag_add_one(struct mlx5_eswitch *master_esw,
+ struct mlx5_eswitch *slave_esw);
+void mlx5_eswitch_offloads_vport_lag_del_one(struct mlx5_eswitch *master_esw,
+ struct mlx5_eswitch *slave_esw);
int mlx5_eswitch_offloads_single_fdb_add_one(struct mlx5_eswitch *master_esw,
struct mlx5_eswitch *slave_esw, int max_slaves);
void mlx5_eswitch_offloads_single_fdb_del_one(struct mlx5_eswitch *master_esw,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 915571a1586c..a24719cfba34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3041,6 +3041,136 @@ static int __esw_set_master_egress_rule(struct mlx5_core_dev *master,
return err;
}
+static int esw_slave_egress_create_resources(struct mlx5_eswitch *esw,
+ struct mlx5_vport *vport)
+{
+ struct mlx5_flow_table_attr ft_attr = {
+ .max_fte = 1, .prio = 0, .level = 0,
+ };
+ int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+ struct mlx5_flow_namespace *ns;
+ struct mlx5_flow_table *acl;
+ struct mlx5_flow_group *g;
+ u32 *flow_group_in;
+ int err = 0;
+
+ if (vport->egress.acl)
+ return 0;
+
+ xa_init_flags(&vport->egress.offloads.bounce_rules, XA_FLAGS_ALLOC);
+ ns = mlx5_get_flow_vport_namespace(esw->dev,
+ MLX5_FLOW_NAMESPACE_ESW_EGRESS,
+ vport->index);
+ if (!ns)
+ return -EINVAL;
+
+ flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+ if (!flow_group_in)
+ return -ENOMEM;
+
+ if (vport->vport || mlx5_core_is_ecpf(esw->dev))
+ ft_attr.flags = MLX5_FLOW_TABLE_OTHER_VPORT;
+
+ acl = mlx5_create_vport_flow_table(ns, &ft_attr, vport->vport);
+ if (IS_ERR(acl)) {
+ err = PTR_ERR(acl);
+ goto out;
+ }
+
+ g = mlx5_create_flow_group(acl, flow_group_in);
+ if (IS_ERR(g)) {
+ err = PTR_ERR(g);
+ goto err_table;
+ }
+
+ vport->egress.acl = acl;
+ vport->egress.offloads.bounce_grp = g;
+ vport->egress.type = VPORT_EGRESS_ACL_TYPE_SHARED_FDB;
+ err = 0;
+
+err_table:
+ if (err && !IS_ERR_OR_NULL(acl)) {
+ mlx5_destroy_flow_table(acl);
+ vport->egress.acl = NULL;
+ }
+out:
+ kvfree(flow_group_in);
+ return err;
+}
+
+static void esw_slave_egress_destroy_resources(struct mlx5_vport *vport)
+{
+ if (!IS_ERR_OR_NULL(vport->egress.offloads.bounce_grp)) {
+ mlx5_destroy_flow_group(vport->egress.offloads.bounce_grp);
+ vport->egress.offloads.bounce_grp = NULL;
+ }
+ if (!IS_ERR_OR_NULL(vport->egress.acl)) {
+ esw_acl_egress_ofld_cleanup(vport);
+ xa_destroy(&vport->egress.offloads.bounce_rules);
+ }
+}
+
+static int esw_set_slave_egress_rule(struct mlx5_core_dev *master,
+ struct mlx5_core_dev *slave)
+{
+ struct mlx5_eswitch *slave_esw = slave->priv.eswitch;
+ u16 master_vhca = MLX5_CAP_GEN(master, vhca_id);
+ struct mlx5_flow_destination dest = {};
+ struct mlx5_flow_handle *bounce_rule;
+ struct mlx5_flow_act flow_act = {};
+ struct mlx5_vport *slave_vport;
+ int err;
+
+ slave_vport = mlx5_eswitch_get_vport(slave_esw,
+ slave_esw->manager_vport);
+ if (IS_ERR(slave_vport))
+ return PTR_ERR(slave_vport);
+
+ err = esw_slave_egress_create_resources(slave_esw, slave_vport);
+ if (err)
+ return err;
+
+ flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+ dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
+ dest.vport.num = master->priv.eswitch->manager_vport;
+ dest.vport.vhca_id = master_vhca;
+ dest.vport.flags = MLX5_FLOW_DEST_VPORT_VHCA_ID;
+
+ bounce_rule = mlx5_add_flow_rules(slave_vport->egress.acl, NULL,
+ &flow_act, &dest, 1);
+ if (IS_ERR(bounce_rule)) {
+ err = PTR_ERR(bounce_rule);
+ goto err_rule;
+ }
+ err = xa_insert(&slave_vport->egress.offloads.bounce_rules,
+ master_vhca, bounce_rule, GFP_KERNEL);
+ if (err)
+ goto err_insert;
+
+ return 0;
+err_insert:
+ mlx5_del_flow_rules(bounce_rule);
+err_rule:
+ esw_slave_egress_destroy_resources(slave_vport);
+ return err;
+}
+
+static void esw_unset_slave_egress_rule(struct mlx5_core_dev *master,
+ struct mlx5_core_dev *slave)
+{
+ struct mlx5_eswitch *slave_esw = slave->priv.eswitch;
+ u16 master_vhca = MLX5_CAP_GEN(master, vhca_id);
+ struct mlx5_vport *slave_vport;
+
+ slave_vport = mlx5_eswitch_get_vport(slave_esw,
+ slave_esw->manager_vport);
+ if (IS_ERR(slave_vport))
+ return;
+
+ esw_acl_egress_ofld_bounce_rule_destroy(slave_vport, master_vhca);
+ esw_slave_egress_destroy_resources(slave_vport);
+}
+
static int esw_master_egress_create_resources(struct mlx5_eswitch *esw,
struct mlx5_flow_namespace *egress_ns,
struct mlx5_vport *vport, size_t count)
@@ -3208,6 +3338,18 @@ void mlx5_eswitch_offloads_single_fdb_del_one(struct mlx5_eswitch *master_esw,
esw_unset_master_egress_rule(master_esw->dev, slave_esw->dev);
}
+int mlx5_eswitch_offloads_vport_lag_add_one(struct mlx5_eswitch *master_esw,
+ struct mlx5_eswitch *slave_esw)
+{
+ return esw_set_slave_egress_rule(master_esw->dev, slave_esw->dev);
+}
+
+void mlx5_eswitch_offloads_vport_lag_del_one(struct mlx5_eswitch *master_esw,
+ struct mlx5_eswitch *slave_esw)
+{
+ esw_unset_slave_egress_rule(master_esw->dev, slave_esw->dev);
+}
+
#define ESW_OFFLOADS_DEVCOM_PAIR (0)
#define ESW_OFFLOADS_DEVCOM_UNPAIR (1)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 84eff995cad1..06e1a61d1f58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -139,9 +139,44 @@ static int mlx5_cmd_modify_lag(struct mlx5_core_dev *dev, struct mlx5_lag *ldev,
return mlx5_cmd_exec_in(dev, modify_lag, in);
}
+static u32 mlx5_lag_dev_group_id(struct mlx5_core_dev *dev)
+{
+ struct mlx5_lag *ldev = mlx5_lag_dev(dev);
+ struct lag_func *pf;
+ int i;
+
+ if (!ldev)
+ return 0;
+
+ mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (pf->dev == dev)
+ return pf->sd_fdb_active ? pf->group_id : 0;
+ }
+ return 0;
+}
+
+static int mlx5_lag_is_sw_lag(struct mlx5_core_dev *dev)
+{
+ return mlx5_lag_is_sd(dev);
+}
+
int mlx5_cmd_create_vport_lag(struct mlx5_core_dev *dev)
{
u32 in[MLX5_ST_SZ_DW(create_vport_lag_in)] = {};
+ struct mlx5_lag *ldev = mlx5_lag_dev(dev);
+ int ret;
+
+ if (mlx5_lag_is_sw_lag(dev)) {
+ if (!ldev)
+ return -ENODEV;
+
+ mutex_lock(&ldev->lock);
+ ret = mlx5_lag_create_vport_lag(mlx5_lag_dev(dev),
+ mlx5_lag_dev_group_id(dev));
+ mutex_unlock(&ldev->lock);
+ return ret;
+ }
MLX5_SET(create_vport_lag_in, in, opcode, MLX5_CMD_OP_CREATE_VPORT_LAG);
@@ -152,6 +187,18 @@ EXPORT_SYMBOL(mlx5_cmd_create_vport_lag);
int mlx5_cmd_destroy_vport_lag(struct mlx5_core_dev *dev)
{
u32 in[MLX5_ST_SZ_DW(destroy_vport_lag_in)] = {};
+ struct mlx5_lag *ldev = mlx5_lag_dev(dev);
+
+ if (mlx5_lag_is_sw_lag(dev)) {
+ if (!ldev)
+ return 0;
+
+ mutex_lock(&ldev->lock);
+ mlx5_lag_destroy_vport_lag(mlx5_lag_dev(dev),
+ mlx5_lag_dev_group_id(dev));
+ mutex_unlock(&ldev->lock);
+ return 0;
+ }
MLX5_SET(destroy_vport_lag_in, in, opcode, MLX5_CMD_OP_DESTROY_VPORT_LAG);
@@ -1663,7 +1710,7 @@ int mlx5_lag_demux_init(struct mlx5_core_dev *dev,
xa_init(&pf->lag_demux_rules);
- if (mlx5_get_sd(dev))
+ if (mlx5_lag_is_sw_lag(dev))
return mlx5_lag_demux_ft_fg_init(dev, ft_attr, pf);
return mlx5_lag_demux_fw_init(dev, ft_attr, pf);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index d645c2cfca4d..57e6f82713b0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -175,6 +175,8 @@ int mlx5_lag_shared_fdb_create(struct mlx5_lag *ldev,
enum mlx5_lag_mode mode,
u32 group_id);
void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev, u32 group_id);
+int mlx5_lag_create_vport_lag(struct mlx5_lag *ldev, u32 group_id);
+int mlx5_lag_destroy_vport_lag(struct mlx5_lag *ldev, u32 group_id);
int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev);
void mlx5_lag_destroy_single_fdb(struct mlx5_lag *ldev);
bool mlx5_lag_shared_fdb_supported(struct mlx5_lag *ldev);
@@ -191,6 +193,18 @@ static inline int mlx5_lag_shared_fdb_create(struct mlx5_lag *ldev,
static inline void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev,
u32 group_id) {}
+static inline int mlx5_lag_create_vport_lag(struct mlx5_lag *ldev,
+ u32 group_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int mlx5_lag_destroy_vport_lag(struct mlx5_lag *ldev,
+ u32 group_id)
+{
+ return -EOPNOTSUPP;
+}
+
static inline int mlx5_lag_create_single_fdb(struct mlx5_lag *ldev)
{
return -EOPNOTSUPP;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
index 1371e14c4c13..8d4f2903a101 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
@@ -89,6 +89,76 @@ static int mlx5_lag_create_single_fdb_filter(struct mlx5_lag *ldev, u32 filter)
return err;
}
+int mlx5_lag_create_vport_lag(struct mlx5_lag *ldev, u32 group_id)
+{
+ u32 filter = group_id ? group_id : MLX5_LAG_FILTER_ALL;
+ int master_idx = mlx5_lag_get_dev_index_by_seq_filter(ldev, MLX5_LAG_P1,
+ filter);
+ struct mlx5_eswitch *master_esw;
+ struct mlx5_core_dev *dev0;
+ int i, j;
+ int err;
+
+ if (master_idx < 0)
+ return -EINVAL;
+
+ dev0 = mlx5_lag_pf(ldev, master_idx)->dev;
+ master_esw = dev0->priv.eswitch;
+
+ mlx5_lag_for_each(i, 0, ldev, filter) {
+ struct mlx5_eswitch *slave_esw;
+
+ if (i == master_idx)
+ continue;
+
+ slave_esw = mlx5_lag_pf(ldev, i)->dev->priv.eswitch;
+ err = mlx5_eswitch_offloads_vport_lag_add_one(master_esw,
+ slave_esw);
+ if (err)
+ goto err;
+ }
+
+ return 0;
+
+err:
+ mlx5_lag_for_each_reverse(j, i - 1, 0, ldev, filter) {
+ struct mlx5_eswitch *slave_esw;
+
+ if (j == master_idx)
+ continue;
+ slave_esw = mlx5_lag_pf(ldev, j)->dev->priv.eswitch;
+ mlx5_eswitch_offloads_vport_lag_del_one(master_esw, slave_esw);
+ }
+ return err;
+}
+
+int mlx5_lag_destroy_vport_lag(struct mlx5_lag *ldev, u32 group_id)
+{
+ u32 filter = group_id ? group_id : MLX5_LAG_FILTER_ALL;
+ int master_idx = mlx5_lag_get_dev_index_by_seq_filter(ldev, MLX5_LAG_P1,
+ filter);
+ struct mlx5_eswitch *master_esw;
+ struct mlx5_core_dev *dev0;
+ int i;
+
+ if (master_idx < 0)
+ return 0;
+
+ dev0 = mlx5_lag_pf(ldev, master_idx)->dev;
+ master_esw = dev0->priv.eswitch;
+
+ mlx5_lag_for_each(i, 0, ldev, filter) {
+ struct mlx5_core_dev *dev;
+
+ if (i == master_idx)
+ continue;
+ dev = mlx5_lag_pf(ldev, i)->dev;
+ mlx5_eswitch_offloads_vport_lag_del_one(master_esw,
+ dev->priv.eswitch);
+ }
+ return 0;
+}
+
static void mlx5_lag_destroy_single_fdb_filter(struct mlx5_lag *ldev,
u32 filter)
{
@@ -141,7 +211,7 @@ int mlx5_lag_shared_fdb_create(struct mlx5_lag *ldev,
enum mlx5_lag_mode mode,
u32 group_id)
{
- u32 filter = group_id ? group_id : MLX5_LAG_FILTER_PORTS;
+ u32 filter = group_id ? group_id : MLX5_LAG_FILTER_ALL;
int idx = mlx5_lag_get_dev_index_by_seq_filter(ldev, MLX5_LAG_P1,
filter);
struct mlx5_core_dev *dev0;
@@ -209,7 +279,7 @@ int mlx5_lag_shared_fdb_create(struct mlx5_lag *ldev,
void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev, u32 group_id)
{
- u32 filter = group_id ? group_id : MLX5_LAG_FILTER_PORTS;
+ u32 filter = group_id ? group_id : MLX5_LAG_FILTER_ALL;
struct lag_func *pf;
int err;
int i;
--
2.44.0
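
The error path of mlx5_lag_create_vport_lag() in the patch above undoes
only the slaves that were attached before the failing index. A minimal
userspace sketch of that unwind pattern follows. It is a model under
stated assumptions, not the driver code: every model_* name is
hypothetical, and the per-slave egress rule is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical model of one LAG group; not an mlx5 structure. */
struct model_group {
	int attached[MODEL_MAX_DEVS]; /* 1 if slave i has its egress rule */
	size_t num_devs;
	size_t master_idx;            /* the master is never attached */
	int fail_at;                  /* index whose add fails, or -1 */
};

static int model_add_one(struct model_group *g, size_t i)
{
	if ((int)i == g->fail_at)
		return -1;
	g->attached[i] = 1;
	return 0;
}

static void model_del_one(struct model_group *g, size_t i)
{
	g->attached[i] = 0;
}

/* Attach every slave; on failure detach the ones already attached. */
static int model_create_vport_lag(struct model_group *g)
{
	size_t i;
	int err;

	assert(g->num_devs <= MODEL_MAX_DEVS);

	for (i = 0; i < g->num_devs; i++) {
		if (i == g->master_idx)
			continue;
		err = model_add_one(g, i);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	/* Walk back from the index just below the failing one down to 0. */
	while (i-- > 0) {
		if (i == g->master_idx)
			continue;
		model_del_one(g, i);
	}
	return err;
}
```

Starting the reverse walk below the failing index avoids tearing down a
rule that was never installed.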
* [PATCH net-next V3 12/15] net/mlx5: LAG, add MPESW over SD LAG support
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (10 preceding siblings ...)
2026-06-12 11:39 ` [PATCH net-next V3 11/15] net/mlx5: LAG, introduce software vport LAG implementation Tariq Toukan
@ 2026-06-12 11:39 ` Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state Tariq Toukan
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:39 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Enable MPESW LAG creation over SD LAG members, forming a composite LAG
hierarchy. This allows bonding multiple SD groups together under a
single MPESW configuration with shared FDB.
While composite MPESW is enabled, the individual SD LAG shared FDB
configurations are torn down; they are recreated when the composite
LAG is disabled.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
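The teardown and restore sequence described above can be sketched as a
small userspace model. This is a minimal sketch under stated
assumptions, not the driver code: every model_* name is hypothetical,
the shared FDB state is reduced to flags, and the restore step is
simplified to re-enable every group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_GROUPS 4

/* Hypothetical model; none of these names exist in mlx5. */
struct model_lag {
	bool sd_fdb_active[MODEL_MAX_GROUPS]; /* per-SD-group shared FDB */
	size_t num_groups;
	bool composite_active;      /* MPESW shared FDB over all groups */
	bool fail_composite_create; /* knob: force the creation to fail */
};

static void model_teardown_sd_fdb(struct model_lag *lag)
{
	assert(lag->num_groups <= MODEL_MAX_GROUPS);
	for (size_t i = 0; i < lag->num_groups; i++)
		lag->sd_fdb_active[i] = false;
}

static void model_restore_sd_fdb(struct model_lag *lag)
{
	assert(lag->num_groups <= MODEL_MAX_GROUPS);
	for (size_t i = 0; i < lag->num_groups; i++)
		lag->sd_fdb_active[i] = true;
}

/* Returns 0 on success, -1 if the composite FDB cannot be created. */
static int model_enable_composite(struct model_lag *lag)
{
	model_teardown_sd_fdb(lag);
	if (lag->fail_composite_create) {
		/* Roll back so each SD group keeps its own shared FDB. */
		model_restore_sd_fdb(lag);
		return -1;
	}
	lag->composite_active = true;
	return 0;
}

static void model_disable_composite(struct model_lag *lag)
{
	if (!lag->composite_active)
		return;
	lag->composite_active = false;
	model_restore_sd_fdb(lag);
}
```

The model keeps one invariant: the per-group shared FDBs and the
composite shared FDB are never active at the same time.
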
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 6 ++
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 8 ++
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 95 +++++++++++++++++--
.../ethernet/mellanox/mlx5/core/lag/mpesw.h | 4 +
4 files changed, 105 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 06e1a61d1f58..424478e649ef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -2545,6 +2545,7 @@ void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
struct mlx5_core_dev *primary = dev;
struct mlx5_lag *ldev;
struct lag_func *pf;
+ bool mpesw;
int i;
ldev = mlx5_lag_dev(dev);
@@ -2557,6 +2558,9 @@ void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
mlx5_devcom_comp_unlock(sd_devcom);
}
mlx5_devcom_comp_lock(primary->priv.hca_devcom_comp);
+ mpesw = ldev->mode == MLX5_LAG_MODE_MPESW;
+ if (mpesw)
+ mlx5_mpesw_sd_devcoms_lock(ldev);
mutex_lock(&ldev->lock);
ldev->mode_changes_in_progress++;
@@ -2568,6 +2572,8 @@ void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
}
mutex_unlock(&ldev->lock);
+ if (mpesw)
+ mlx5_mpesw_sd_devcoms_unlock(ldev);
mlx5_devcom_comp_unlock(primary->priv.hca_devcom_comp);
if (!sd_devcom)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 57e6f82713b0..8481ce55c10a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -157,6 +157,14 @@ __mlx5_lag_is_sd(struct mlx5_lag *ldev, struct mlx5_core_dev *dev)
return pf && pf->group_id != 0;
}
+static inline bool
+__mlx5_lag_dev_is_port(struct mlx5_lag *ldev, struct mlx5_core_dev *dev)
+{
+ struct lag_func *pf = mlx5_lag_pf_by_dev(ldev, dev);
+
+ return pf && xa_get_mark(&ldev->pfs, pf->idx, MLX5_LAG_XA_MARK_PORT);
+}
+
static inline bool
__mlx5_lag_is_active(struct mlx5_lag *ldev)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
index 2cb44084e239..50bfb450c71e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.c
@@ -15,7 +15,7 @@ static void mlx5_mpesw_metadata_cleanup(struct mlx5_lag *ldev)
u32 pf_metadata;
int i;
- mlx5_ldev_for_each(i, 0, ldev) {
+ mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
dev = mlx5_lag_pf(ldev, i)->dev;
esw = dev->priv.eswitch;
pf_metadata = ldev->lag_mpesw.pf_metadata[i];
@@ -36,7 +36,7 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
u32 pf_metadata;
int i, err;
- mlx5_ldev_for_each(i, 0, ldev) {
+ mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
dev = mlx5_lag_pf(ldev, i)->dev;
esw = dev->priv.eswitch;
pf_metadata = mlx5_esw_match_metadata_alloc(esw);
@@ -52,7 +52,7 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
goto err_metadata;
}
- mlx5_ldev_for_each(i, 0, ldev) {
+ mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
dev = mlx5_lag_pf(ldev, i)->dev;
mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_MULTIPORT_ESW,
(void *)0);
@@ -65,6 +65,48 @@ static int mlx5_mpesw_metadata_set(struct mlx5_lag *ldev)
return err;
}
+static void mlx5_mpesw_restore_sd_fdb(struct mlx5_lag *ldev)
+{
+ struct lag_func *pf;
+ int err, i;
+
+ mlx5_ldev_for_each(i, 0, ldev) {
+ pf = mlx5_lag_pf(ldev, i);
+ err = mlx5_lag_shared_fdb_create(ldev, NULL, 0, pf->group_id);
+ if (err)
+ mlx5_core_warn(pf->dev,
+ "Failed to restore SD shared FDB (%d)\n",
+ err);
+ }
+}
+
+static int mlx5_mpesw_teardown_sd_fdb(struct mlx5_lag *ldev)
+{
+ struct lag_func *pf;
+ int i;
+
+ mlx5_ldev_for_each(i, 0, ldev) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (!pf->sd_fdb_active)
+ continue;
+ mlx5_lag_shared_fdb_destroy(ldev, pf->group_id);
+ }
+ return 0;
+}
+
+static bool mlx5_lag_has_sd_group(struct mlx5_lag *ldev)
+{
+ struct lag_func *pf;
+ int i;
+
+ mlx5_ldev_for_each(i, 0, ldev) {
+ pf = mlx5_lag_pf(ldev, i);
+ if (pf->group_id)
+ return true;
+ }
+ return false;
+}
+
static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
{
int idx = mlx5_lag_get_dev_index_by_seq(ldev, MLX5_LAG_P1);
@@ -92,10 +134,17 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
if (err)
return err;
+ if (mlx5_lag_has_sd_group(ldev))
+ mlx5_mpesw_teardown_sd_fdb(ldev);
+
err = mlx5_lag_shared_fdb_create(ldev, NULL, MLX5_LAG_MODE_MPESW,
MLX5_LAG_FILTER_ALL);
if (err) {
- mlx5_core_warn(dev0, "Failed to create LAG in MPESW mode (%d)\n", err);
+ mlx5_core_warn(dev0,
+ "Failed to create LAG in MPESW mode (%d)\n",
+ err);
+ if (mlx5_lag_has_sd_group(ldev))
+ mlx5_mpesw_restore_sd_fdb(ldev);
mlx5_mpesw_metadata_cleanup(ldev);
return err;
}
@@ -105,9 +154,36 @@ static int mlx5_lag_enable_mpesw(struct mlx5_lag *ldev)
void mlx5_lag_disable_mpesw(struct mlx5_lag *ldev)
{
- if (ldev->mode == MLX5_LAG_MODE_MPESW) {
- mlx5_mpesw_metadata_cleanup(ldev);
- mlx5_lag_shared_fdb_destroy(ldev, MLX5_LAG_FILTER_ALL);
+ if (ldev->mode != MLX5_LAG_MODE_MPESW)
+ return;
+
+ mlx5_mpesw_metadata_cleanup(ldev);
+ mlx5_lag_shared_fdb_destroy(ldev, MLX5_LAG_FILTER_ALL);
+ if (mlx5_lag_has_sd_group(ldev))
+ mlx5_mpesw_restore_sd_fdb(ldev);
+}
+
+void mlx5_mpesw_sd_devcoms_lock(struct mlx5_lag *ldev)
+{
+ struct mlx5_devcom_comp_dev *sd_devcom;
+ int i;
+
+ mlx5_ldev_for_each(i, 0, ldev) {
+ sd_devcom = mlx5_sd_get_devcom(mlx5_lag_pf(ldev, i)->dev);
+ if (sd_devcom)
+ mlx5_devcom_comp_lock(sd_devcom);
+ }
+}
+
+void mlx5_mpesw_sd_devcoms_unlock(struct mlx5_lag *ldev)
+{
+ struct mlx5_devcom_comp_dev *sd_devcom;
+ int i;
+
+ mlx5_ldev_for_each_reverse(i, MLX5_MAX_PORTS, 0, ldev) {
+ sd_devcom = mlx5_sd_get_devcom(mlx5_lag_pf(ldev, i)->dev);
+ if (sd_devcom)
+ mlx5_devcom_comp_unlock(sd_devcom);
}
}
@@ -122,6 +198,7 @@ static void mlx5_mpesw_work(struct work_struct *work)
return;
mlx5_devcom_comp_lock(devcom);
+ mlx5_mpesw_sd_devcoms_lock(ldev);
mutex_lock(&ldev->lock);
if (ldev->mode_changes_in_progress) {
mpesww->result = -EAGAIN;
@@ -134,6 +211,7 @@ static void mlx5_mpesw_work(struct work_struct *work)
mlx5_lag_disable_mpesw(ldev);
unlock:
mutex_unlock(&ldev->lock);
+ mlx5_mpesw_sd_devcoms_unlock(ldev);
mlx5_devcom_comp_unlock(devcom);
complete(&mpesww->comp);
}
@@ -199,7 +277,8 @@ bool mlx5_lag_is_mpesw(struct mlx5_core_dev *dev)
{
struct mlx5_lag *ldev = mlx5_lag_dev(dev);
- return ldev && ldev->mode == MLX5_LAG_MODE_MPESW;
+ return ldev && ldev->mode == MLX5_LAG_MODE_MPESW &&
+ __mlx5_lag_dev_is_port(ldev, dev);
}
EXPORT_SYMBOL(mlx5_lag_is_mpesw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.h
index b767dbb4f457..5099723ba0f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/mpesw.h
@@ -33,8 +33,12 @@ void mlx5_lag_mpesw_disable(struct mlx5_core_dev *dev);
int mlx5_lag_mpesw_enable(struct mlx5_core_dev *dev);
#ifdef CONFIG_MLX5_ESWITCH
void mlx5_lag_disable_mpesw(struct mlx5_lag *ldev);
+void mlx5_mpesw_sd_devcoms_lock(struct mlx5_lag *ldev);
+void mlx5_mpesw_sd_devcoms_unlock(struct mlx5_lag *ldev);
#else
static inline void mlx5_lag_disable_mpesw(struct mlx5_lag *ldev) {}
+static inline void mlx5_mpesw_sd_devcoms_lock(struct mlx5_lag *ldev) {}
+static inline void mlx5_mpesw_sd_devcoms_unlock(struct mlx5_lag *ldev) {}
#endif /* CONFIG_MLX5_ESWITCH */
#ifdef CONFIG_MLX5_ESWITCH
--
2.44.0
* [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (11 preceding siblings ...)
2026-06-12 11:39 ` [PATCH net-next V3 12/15] net/mlx5: LAG, add MPESW over SD LAG support Tariq Toukan
@ 2026-06-12 11:39 ` Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 15/15] net/mlx5: SD, enable SD over ECPF and allow switchdev transition Tariq Toukan
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:39 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
On an SD device, vport representors are not functional until the SD
group is combined and shared FDB is active. Skip the initial load and
the reload paths in that window; reps are loaded as part of the SD LAG
activation flow once it becomes active.
In addition, explicitly unload representors when SD LAG is destroyed.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
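The gating condition described above can be written as a single
predicate. The sketch below is a userspace model of the check this
patch adds to mlx5_esw_offloads_rep_load() and
mlx5_eswitch_reload_ib_reps(); the model_* names and the
MODEL_VPORT_UPLINK value are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary stand-in for MLX5_VPORT_UPLINK; the value is an assumption. */
#define MODEL_VPORT_UPLINK 0xffffu

/* Hypothetical model of the two device properties the check reads. */
struct model_dev {
	bool is_sd;      /* models mlx5_get_sd(dev) returning non-NULL */
	bool lag_active; /* models mlx5_lag_is_active(dev) */
};

/*
 * True when a representor load should be skipped: a non-uplink vport
 * on an SD device whose LAG is not active yet.
 */
static bool model_should_skip_rep_load(const struct model_dev *dev,
				       unsigned int vport_num)
{
	assert(dev != NULL);
	return vport_num != MODEL_VPORT_UPLINK &&
	       dev->is_sd && !dev->lag_active;
}
```

Non-SD devices and the uplink vport are never skipped in this model,
which matches the two load paths named above.
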
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 4 +++
.../mellanox/mlx5/core/eswitch_offloads.c | 26 +++++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 26 +++++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 1 +
.../mellanox/mlx5/core/lag/shared_fdb.c | 1 +
5 files changed, 58 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index a5f0774834fe..b2b3150f1f04 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -959,6 +959,7 @@ int mlx5_eswitch_offloads_single_fdb_add_one(struct mlx5_eswitch *master_esw,
void mlx5_eswitch_offloads_single_fdb_del_one(struct mlx5_eswitch *master_esw,
struct mlx5_eswitch *slave_esw);
int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw);
+void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw);
bool mlx5_eswitch_is_peer(struct mlx5_eswitch *esw,
struct mlx5_eswitch *peer_esw);
@@ -1063,6 +1064,9 @@ mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
return 0;
}
+static inline void
+mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw) {}
+
static inline bool
mlx5_eswitch_block_encap(struct mlx5_core_dev *dev, bool from_fdb)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index a24719cfba34..4dc190a4e7b2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -2863,6 +2863,10 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
int rep_type;
int err;
+ if (vport_num != MLX5_VPORT_UPLINK &&
+ mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
+ return 0;
+
rep = mlx5_eswitch_get_rep(esw, vport_num);
for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) {
err = __esw_offloads_load_rep(esw, rep, rep_type,
@@ -3779,6 +3783,21 @@ static void esw_destroy_offloads_acl_tables(struct mlx5_eswitch *esw)
esw_vport_destroy_offloads_acl_tables(esw, vport);
}
+void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw)
+{
+ struct mlx5_eswitch_rep *rep;
+ unsigned long i;
+
+ if (!esw || esw->mode != MLX5_ESWITCH_OFFLOADS)
+ return;
+
+ mlx5_esw_for_each_rep(esw, i, rep) {
+ if (rep->vport == MLX5_VPORT_UPLINK)
+ continue;
+ mlx5_esw_offloads_rep_unload(esw, rep->vport);
+ }
+}
+
int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
{
struct mlx5_eswitch_rep *rep;
@@ -3805,6 +3824,10 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
if (!mlx5_sd_is_primary(esw->dev) &&
rep->vport == MLX5_VPORT_UPLINK)
continue;
+ if (rep->vport != MLX5_VPORT_UPLINK &&
+ mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
+ continue;
+
if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
__esw_offloads_load_rep(esw, rep, REP_IB, NULL);
}
@@ -4764,6 +4787,9 @@ static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
return;
}
+ if (mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
+ return;
+
mlx5_esw_for_each_vport(esw, i, vport) {
if (!vport)
continue;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index 424478e649ef..28d16fdc3f06 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -1312,6 +1312,32 @@ int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
return mlx5_lag_reload_ib_reps(ldev, flags, filter, cont_on_fail);
}
+static void mlx5_lag_unload_reps_unlocked(struct mlx5_lag *ldev, u32 filter)
+{
+ struct lag_func *pf;
+ int i;
+
+ mlx5_lag_for_each(i, 0, ldev, filter) {
+ struct mlx5_eswitch *esw;
+
+ pf = mlx5_lag_pf(ldev, i);
+ esw = pf->dev->priv.eswitch;
+ mlx5_esw_reps_block(esw);
+ mlx5_eswitch_unload_reps(esw);
+ mlx5_esw_reps_unblock(esw);
+ }
+}
+
+void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter)
+{
+ /* Same lock dance as mlx5_lag_reload_ib_reps: drop ldev->lock around
+ * the per-eswitch reps_lock to keep the reps_lock -> ldev->lock order.
+ */
+ mlx5_lag_drop_lock_for_reps(ldev, filter);
+ mlx5_lag_unload_reps_unlocked(ldev, filter);
+ mlx5_lag_retake_lock_after_reps(ldev);
+}
+
void mlx5_disable_lag(struct mlx5_lag *ldev)
{
bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
index 8481ce55c10a..e9f0ef83ce1d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
@@ -310,6 +310,7 @@ int mlx5_lag_num_devs(struct mlx5_lag *ldev);
int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
u32 filter, bool cont_on_fail);
+void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter);
int mlx5_ldev_add_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev,
u32 group_id);
void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
index 8d4f2903a101..113866494d16 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
@@ -296,6 +296,7 @@ void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev, u32 group_id)
pf->sd_fdb_active = false;
}
mlx5_lag_destroy_single_fdb_filter(ldev, group_id);
+ mlx5_lag_unload_reps_from_locked(ldev, filter);
}
mlx5_lag_add_devices_filter(ldev, filter);
--
2.44.0
* [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (12 preceding siblings ...)
2026-06-12 11:39 ` [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state Tariq Toukan
@ 2026-06-12 11:39 ` Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 15/15] net/mlx5: SD, enable SD over ECPF and allow switchdev transition Tariq Toukan
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:39 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Allow SD devices to transition to switchdev before the SD group is
fully up. Metadata allocation requires the SD group to be ready, so
defer it from esw_offloads_enable() until SD shared-FDB activation.
Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
metadata and refreshes the ingress ACLs that were previously programmed
with metadata=0. The helper is idempotent and can be called multiple
times.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
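The idempotence described above relies on a zero default_metadata
meaning "not yet allocated", which is why the cleanup path in this
patch resets the field to 0. A minimal userspace sketch of that
sentinel pattern follows; the model_* names are hypothetical and the
ACL refresh is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state the deferred init helper inspects. */
struct model_esw {
	uint32_t manager_metadata;  /* 0 means "not yet allocated" */
	uint32_t next_metadata;     /* trivial allocator state */
	unsigned int acl_refreshes; /* how often the ACLs were reprogrammed */
};

/* Allocate metadata and refresh the ACLs once; later calls are no-ops. */
static int model_init_deferred_metadata(struct model_esw *esw)
{
	if (esw->manager_metadata)
		return 0; /* already initialized */

	esw->manager_metadata = ++esw->next_metadata;
	assert(esw->manager_metadata != 0);
	esw->acl_refreshes++;
	return 0;
}

/* Reset the sentinel so that a later init allocates again. */
static void model_metadata_cleanup(struct model_esw *esw)
{
	esw->manager_metadata = 0;
}
```

Without the reset in the cleanup step, a second activation would see a
stale non-zero value and skip the allocation.
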
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 1 +
.../mellanox/mlx5/core/eswitch_offloads.c | 79 ++++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 16 ++++
3 files changed, 93 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index b2b3150f1f04..fea72b1dedab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -440,6 +440,7 @@ struct mlx5_eswitch {
void esw_offloads_disable(struct mlx5_eswitch *esw);
int esw_offloads_enable(struct mlx5_eswitch *esw);
+int mlx5_esw_offloads_init_deferred_metadata(struct mlx5_eswitch *esw);
void esw_offloads_cleanup(struct mlx5_eswitch *esw);
int esw_offloads_init(struct mlx5_eswitch *esw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 4dc190a4e7b2..8fa7e633451c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3675,6 +3675,7 @@ static void esw_offloads_vport_metadata_cleanup(struct mlx5_eswitch *esw,
WARN_ON(vport->metadata != vport->default_metadata);
mlx5_esw_match_metadata_free(esw, vport->default_metadata);
+ vport->default_metadata = 0;
}
static void esw_offloads_metadata_uninit(struct mlx5_eswitch *esw)
@@ -3711,6 +3712,73 @@ static int esw_offloads_metadata_init(struct mlx5_eswitch *esw)
return err;
}
+/* Deferred metadata init for SD devices: allocate vport metadata and
+ * refresh the ingress ACL for every vport whose ACL was created with
+ * metadata=0 in esw_create_offloads_acl_tables() / esw_vport_setup().
+ *
+ * No Rep is loaded at this point ==> no Rep net-dev exists, so no need
+ * to take rtnl lock.
+ *
+ * Safe to call multiple times - subsequent calls are no-ops.
+ */
+int mlx5_esw_offloads_init_deferred_metadata(struct mlx5_eswitch *esw)
+{
+ struct mlx5_vport *manager, *vport;
+ unsigned long i;
+ int err;
+
+ if (!mlx5_eswitch_vport_match_metadata_enabled(esw))
+ return 0;
+
+ manager = mlx5_eswitch_get_vport(esw, esw->manager_vport);
+ if (IS_ERR(manager))
+ return PTR_ERR(manager);
+
+ /* Sanity check: skip if metadata was already initialized */
+ if (manager->default_metadata)
+ return 0;
+
+ err = esw_offloads_metadata_init(esw);
+ if (err)
+ return err;
+
+ mutex_lock(&esw->state_lock);
+ /* Manager vport doesn't have a rep/netdev loaded but its ingress ACL
+ * was programmed with metadata=0 - refresh it explicitly.
+ */
+ err = mlx5_esw_acl_ingress_vport_metadata_update(esw,
+ esw->manager_vport,
+ 0);
+ if (err)
+ goto err_acl;
+
+ /* UPLINK is never marked enabled but its ACL is programmed in
+ * esw_create_offloads_acl_tables(); refresh it explicitly.
+ */
+ err = mlx5_esw_acl_ingress_vport_metadata_update(esw, MLX5_VPORT_UPLINK,
+ 0);
+ if (err)
+ goto err_acl;
+
+ mlx5_esw_for_each_vport(esw, i, vport) {
+ if (!vport || !vport->enabled)
+ continue;
+ err = mlx5_esw_acl_ingress_vport_metadata_update(esw,
+ vport->vport,
+ 0);
+ if (err)
+ goto err_acl;
+ }
+
+ mutex_unlock(&esw->state_lock);
+ return 0;
+
+err_acl:
+ esw_offloads_metadata_uninit(esw);
+ mutex_unlock(&esw->state_lock);
+ return err;
+}
+
int
esw_vport_create_offloads_acl_tables(struct mlx5_eswitch *esw,
struct mlx5_vport *vport)
@@ -4072,9 +4140,14 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
if (err)
goto err_roce;
- err = esw_offloads_metadata_init(esw);
- if (err)
- goto err_metadata;
+ /* SD devices defer metadata init until SD is ready and
+ * mlx5_sd_pf_num_get() can return the correct pf_num.
+ */
+ if (!mlx5_get_sd(esw->dev)) {
+ err = esw_offloads_metadata_init(esw);
+ if (err)
+ goto err_metadata;
+ }
err = esw_set_passing_vport_metadata(esw, true);
if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index b35795bac098..2fcccd329eb5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -992,6 +992,7 @@ static bool mlx5_sd_all_paired(struct mlx5_core_dev *primary)
static void mlx5_sd_activate_shared_fdb(struct mlx5_core_dev *primary)
{
struct mlx5_sd *sd = mlx5_get_sd(primary);
+ struct mlx5_core_dev *pos;
struct mlx5_lag *ldev;
struct lag_func *pf;
int err;
@@ -1024,6 +1025,21 @@ static void mlx5_sd_activate_shared_fdb(struct mlx5_core_dev *primary)
goto unlock;
}
+ /* Initialize vport metadata for all group devices. This is deferred
+ * from esw_offloads_enable() because mlx5_sd_pf_num_get() requires
+ * the SD group to be ready.
+ */
+ mlx5_sd_for_each_dev(i, primary, pos) {
+ struct mlx5_eswitch *esw = pos->priv.eswitch;
+
+ err = mlx5_esw_offloads_init_deferred_metadata(esw);
+ if (err) {
+ sd_warn(primary, "Failed to init metadata for %s: %d\n",
+ dev_name(pos->device), err);
+ goto unlock;
+ }
+ }
+
err = mlx5_lag_shared_fdb_create(ldev, NULL, 0, sd->group_id);
if (err)
sd_warn(primary, "Failed to create shared FDB: %d\n", err);
--
2.44.0
* [PATCH net-next V3 15/15] net/mlx5: SD, enable SD over ECPF and allow switchdev transition
2026-06-12 11:38 [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2 Tariq Toukan
` (13 preceding siblings ...)
2026-06-12 11:39 ` [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready Tariq Toukan
@ 2026-06-12 11:39 ` Tariq Toukan
14 siblings, 0 replies; 16+ messages in thread
From: Tariq Toukan @ 2026-06-12 11:39 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Shay Drory, Or Har-Toov, Edward Srouji, Maher Sanalla,
Simon Horman, Gerd Bayer, Kees Cook, Moshe Shemesh, Parav Pandit,
Patrisious Haddad, netdev, linux-rdma, linux-kernel, Gal Pressman
From: Shay Drory <shayd@nvidia.com>
Remove the restriction that blocks SD on embedded CPU PFs (ECPF),
enabling SD functionality on BlueField DPUs. Also remove the check that
prevents SD devices from transitioning to switchdev mode.
The infrastructure added in earlier patches of this series handles
these configurations.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 6 ------
drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c | 8 --------
2 files changed, 14 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 8fa7e633451c..907ee83a722d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -4472,12 +4472,6 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
if (esw_mode_from_devlink(mode, &mlx5_mode))
return -EINVAL;
- if (mlx5_mode == MLX5_ESWITCH_OFFLOADS && mlx5_get_sd(esw->dev)) {
- NL_SET_ERR_MSG_MOD(extack,
- "Can't change E-Switch mode to switchdev when multi-PF netdev (Socket Direct) is configured.");
- return -EPERM;
- }
-
/* Avoid try_lock, active/inactive mode change is not restricted */
if (mlx5_devlink_switchdev_active_mode_change(esw, mode))
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 2fcccd329eb5..ee2fdefa1945 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -222,10 +222,6 @@ bool mlx5_sd_is_supported(struct mlx5_core_dev *dev)
if (!mlx5_core_is_pf(dev))
return false;
- /* Block on embedded CPU PFs */
- if (mlx5_core_is_ecpf(dev))
- return false;
-
err = mlx5_query_nic_vport_sd_group(dev, &sd_group);
if (err || !sd_group)
return false;
@@ -252,10 +248,6 @@ static int sd_init(struct mlx5_core_dev *dev)
if (!mlx5_core_is_pf(dev))
return 0;
- /* Block on embedded CPU PFs */
- if (mlx5_core_is_ecpf(dev))
- return 0;
-
err = mlx5_query_nic_vport_sd_group(dev, &sd_group);
if (err)
return err;
--
2.44.0