[PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2

Netdev List
 help / color / mirror / Atom feed

From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	Simon Horman <horms@kernel.org>,
	Gerd Bayer <gbayer@linux.ibm.com>, Kees Cook <kees@kernel.org>,
	Moshe Shemesh <moshe@nvidia.com>, Parav Pandit <parav@nvidia.com>,
	Patrisious Haddad <phaddad@nvidia.com>, <netdev@vger.kernel.org>,
	<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	Gal Pressman <gal@nvidia.com>
Subject: [PATCH net-next V3 00/15] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2
Date: Fri, 12 Jun 2026 14:38:49 +0300	[thread overview]
Message-ID: <20260612113904.537595-1-tariqt@nvidia.com> (raw)

Hi,

This is part 2. Find part 1 here:
https://lore.kernel.org/all/20260531113954.395443-1-tariqt@nvidia.com/

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF. e.g.: SD primary devices

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.

Patches

E-Switch preparation (patch 1):
  - Skip uplink IB rep load for SD secondary devices

Devcom support (patches 2-3):
  - Expose locked variant of send_event
  - Add DEVCOM_CANT_FAIL for non-rollback events

SD core hardening (patches 4-6):
  - Make primary/secondary role determination more robust
  - Add L2 table silent mode query support
  - Expand vport metadata for SD secondary devices

SD switchdev transition (patches 7-8):
  - Support switchdev mode transition with shared FDB
  - Notify SD on eswitch disable

LAG integration (patches 9-12):
  - Store demux resources per master lag_func
  - Disable both regular and SD LAG on lag_disable_change
  - Introduce software vport LAG implementation
  - Add MPESW over SD LAG support

Deferred init (patches 13-14):
  - Tie rep load/unload to SD LAG state
  - Defer vport metadata init until SD is ready

Enablement (patch 15):
  - Enable SD over ECPF and allow switchdev transition

V3:
- Patch 4: Mark sd_devcom as ready only if
  next_secondary_idx + 1 == sd->host_buses.
- Patch 7: Set tx_root_silent explicitly in sd_cmd_set_secondary().
- Patch 13: Expand this patch to completely tie rep load/unload to SD
  LAG state.
- Patch 14: Take esw->state_lock when refreshing vports ACL.

V2:
https://lore.kernel.org/all/20260608135547.482825-1-tariqt@nvidia.com/

V1:
https://lore.kernel.org/all/20260604114455.434711-1-tariqt@nvidia.com/

Sashiko comments from both runs on the ML, and the internal run here:

- commit "net/mlx5: SD, make primary/secondary role determination more
  robust"

> + sd->primary = true;
> + err = mlx5_devcom_locked_send_event(devcom, SD_PRIMARY_SET,
> +    SD_PRIMARY_SET, dev);

sashiko.dev says:

What happens during failure rollback if a subsequent peer fails during
registration?
SD_PRIMARY_SET is passed as both the forward event and the rollback
event.  If the event handler fails for a peer, the devcom rollback loop
invokes sd_handle_primary_set() again for the previously successful
peers.
Since the handler receives the exact same arguments and logic, does it
unconditionally repeat the assignment peer_sd->primary_dev = dev instead
of clearing the state?
If sd_register() subsequently returns an error and the device unbinds
and frees the dev structure, does this leave the successfully processed
peers with dangling primary_dev pointers referencing the freed device?

[SD] on broadcast failure sd_register goes to err_devcom_unreg
(sd.c:566) and never sets the comp ready. Every reader of primary_dev
gates on mlx5_devcom_comp_is_ready() (sd.c:66), so the stale pointer is
written and but never dereferenced. sd_register returns failure, so all
is ok - no fix needed.

- commit "net/mlx5: SD, support switchdev mode transition with shared FDB"

> Because mlx5_sd_eswitch_mode_set() returns void, does ignoring this
> error
> leave the secondary device in an inconsistent state?
> If TX root reconfiguration fails, the execution aborts via goto unlock,
> but mlx5_devlink_eswitch_mode_set() will still report a successful
> transition to userspace.

[SD] This is by design-any SD switchdev related operations are best
effort.

- commit "net/mlx5: LAG, store demux resources per master lag_func":
- this note was on sashiko.dev as well.

> Can this lockless lookup lead to a use-after-free if the master device
> is
> removed concurrently?
> mlx5_lag_dev_get_master_pf() internally uses mlx5_lag_pf(), which
> performs a
> lockless xa_load() from ldev->pfs. If the master device is unbound or
> hot-removed concurrently, mlx5_ldev_remove_mdev() will remove the
> master's
> lag_func from the XArray and immediately free it using a synchronous
> kfree(pf).
> Since this path doesn't appear to hold ldev->lock or an overarching lock
> that
> protects the master device's lag_func lifecycle, could a race like this
> occur?
> CPU 1 (Adding rule)
> master_pf = mlx5_lag_dev_get_master_pf(ldev, vport_dev);
> CPU 2 (Master removal)
> mlx5_ldev_remove_mdev()
>    xa_erase(&ldev->pfs, idx);
>    kfree(pf);
> CPU 1 (Resumes)
> if (xa_load(&master_pf->lag_demux_rules, index))
> Previously, these resources were stored directly in struct mlx5_lag
> (ldev),
> which is refcounted and remains alive as long as any device in the LAG
> (including the secondary device) still references it.

[SD] no, before ldev is removed, the IB device is destroyed, and with
it the demux rules, group and table.

- commit "net/mlx5: LAG, introduce software vport LAG implementation"
- this note was on sashiko.dev as well

> Does this code safely handle shared ACLs?
> The esw_slave_egress_create_resources() function correctly reuses an
> existing
> vport->egress.acl if one is present. However, this teardown function
> unconditionally destroys both vport->egress.offloads.bounce_grp and
> vport->egress.acl without checking if the bounce_rules xarray is empty.
> In contrast, esw_unset_master_egress_rule() handles this safely:
> if (xa_empty(&vport->egress.offloads.bounce_rules)) {
>            esw_acl_egress_ofld_cleanup(vport);
>            xa_destroy(&vport->egress.offloads.bounce_rules);
>
> }
> Could this unconditional destruction tear down an ACL that is still
> being actively used by other features, potentially leading to a
> use-after-free when other subsystems attempt to access their flow rules?

[SD] this is an ACL with a single rule, so it is ok.

> +static u32 mlx5_lag_dev_group_id(struct mlx5_core_dev *dev)
> +{
> +     struct mlx5_lag *ldev = mlx5_lag_dev(dev);
> +     struct lag_func *pf;
> +     int i;
> +
> +     if (!ldev)
> +             return 0;
> +
> +     mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
> +             pf = mlx5_lag_pf(ldev, i);
> +             if (pf->dev == dev)
> +                     return pf->sd_fdb_active ? pf->group_id : 0;
> +     }
> +     return 0;
> +}

sashiko.dev says:

In mlx5_lag_shared_fdb_destroy(), pf->sd_fdb_active is set to false for
the target group before calling mlx5_lag_reload_ib_reps_from_locked() to
unload the IB representors.
During the IB unload process, mlx5_eth_lag_cleanup() calls
mlx5_cmd_destroy_vport_lag(), which queries mlx5_lag_dev_group_id().
Since sd_fdb_active was already cleared, will mlx5_lag_dev_group_id()
evaluate to 0 here?
If it does, there might be a risk that mlx5_cmd_destroy_vport_lag() will
pass 0 to mlx5_lag_destroy_vport_lag(), which treats 0 as
MLX5_LAG_FILTER_ALL and unconditionally destroys the manager vports'
egress ACLs for all independent SD groups on the device.

[SD] sd_fdb_active is cleared after, not before, the IB unload.
mlx5_lag_remove_devices_filter() is removing the IB rep, and The
mlx5_lag_reload_ib_reps_from_locked() call at the end of
mlx5_lag_shared_fdb_destroy() is the re-add path, not an unload.

- commit "net/mlx5: LAG, add MPESW over SD LAG support"
> Does this code introduce a use-after-free when iterating over ldev->pfs?
> In the concurrent LAG teardown path, elements are removed from the
> XArray and synchronously freed via kfree(pf) (not kfree_rcu()),
> protected only by ldev->lock.  Since xa_for_each() (used in
> mlx5_lag_pf_by_dev()) only holds the RCU read lock internally while
> retrieving the element and releases it before the loop body, calling
> this function locklessly could dereference freed memory when accessing
> Pf->idx.

[SD] this API caller assume dev is a valid device not in teardown flow.
If this break, the caller is broken as well.

> Can this call create an invalid global FDB or perform destructive
> redundant operations?
> Unlike mlx5_mpesw_teardown_sd_fdb() which filters PFs by checking if
> (!pf->sd_fdb_active), this loop iterates unconditionally.
> For non-SD ports, pf->group_id is 0, which instructs
> mlx5_lag_shared_fdb_create() to create a global shared FDB covering the
> entire LAG (MLX5_LAG_FILTER_ALL), resulting in a broken global LAG
> activation.
> Additionally, for SD groups containing multiple PFs, the first call
> creates the FDB, but subsequent calls for other PFs in the same group
> will re-execute the creation process, improperly attempting to recreate
> the already active FDB.

[SD] no, the iterator is running only over the PORT devices and for each
SD group there is only one such device, and when NIC is SD, all its PFs
(devices) are SD.

> +static void mlx5_mpesw_restore_sd_fdb(struct mlx5_lag *ldev)
> +{
> +     struct lag_func *pf;
> +     int err, i;
> +
> +     mlx5_ldev_for_each(i, 0, ldev) {
> +             pf = mlx5_lag_pf(ldev, i);
> +             err = mlx5_lag_shared_fdb_create(ldev, NULL, 0,
> pf->group_id);

sashiko.dev says:

Does this accidentally activate global FW LAG for non-SD ports?
This loop unconditionally calls mlx5_lag_shared_fdb_create() using
pf->group_id. For standalone non-SD ports, pf->group_id is 0.
Inside mlx5_lag_shared_fdb_create(), passing group_id = 0 sets the
filter to MLX5_LAG_FILTER_ALL, which bypasses the SD-specific single-FDB
path and incorrectly triggers a full global LAG activation via
mlx5_activate_lag().
Should there be a check for pf->group_id != 0 or pf->sd_fdb_active here,
similar to the check in mlx5_mpesw_teardown_sd_fdb()?

[SD] pf->sd_fdb_active is false here by definition-we are re-creating
it.  And this API is already gated by pf->group_id check.  When MPESW is
destroyed, all PFs have group ID (if this is SD NIC)

- commit "net/mlx5: SD, defer vport metadata init until SD is ready"

> +err_acl:
> +   esw_offloads_metadata_uninit(esw);
> +   return err;
sashiko.dev says:

If a vport update fails, this jumps to err_acl and calls
esw_offloads_metadata_uninit().
Looking at esw_offloads_metadata_uninit(), it frees the default_metadata
IDAs and zeros vport->default_metadata:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:esw_offloads_vport_metadata_cleanup() {
    ...
    mlx5_esw_match_metadata_free(esw, vport->default_metadata);
    vport->default_metadata = 0;
}
However, it does not reset vport->metadata or revert the hardware ACL
rules for the vports that were already successfully updated.
Will this leave the successfully updated vports with hardware rules
actively matching against freed IDAs? If these IDAs are reallocated
later, could multiple vports share the same metadata tag and cause
packet misdirection?

[SD] no rep is loaded until SD shared fdb is created.
No reps -> no traffic -> the stale HW rules / freed IDAs can't misdirect
anything.

- commit "Tie rep load/unload to SD LAG state",
was "net/mlx5: E-Switch, defer rep load while SD LAG is not active"

> Does this rely on the InfiniBand representor probe to trigger the reload
> of the ethernet representors once the SD LAG becomes active?
> If the kernel is built without CONFIG_MLX5_INFINIBAND or if the ib-rep
> fails to probe, the ib-rep driver is never registered. Consequently, the
> mlx5_eswitch_reload_reps work is never queued.
> Since there is no explicit call to reload REP_ETH anywhere in the
> activation path, would the ethernet representors for the VF/SFs remain
> permanently unloaded? This seems like it would leave the vport
> representors completely non-functional for ethernet-only deployments.

[SD] This code isn't relevant if IB rep device doesn't probed.

- commit "net/mlx5: SD, defer vport metadata init until SD is ready"
> Does this operation need to hold pos->priv.eswitch->mode_lock?
> If this runs concurrently with a devlink command changing the peer
> device's mode back to legacy, esw_offloads_disable() could tear down the
> ACL tables and free offloads objects while this locklessly accesses and
> modifies metadata and ACLs. Could this lead to a Use-After-Free?

[SD] it won't. Peer E-switch and it vports are destroyed only after SD
is cleanup. Switching to legacy don't destroy resources used in
meta_date_init().

Shay Drory (15):
  net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices
  net/mlx5: devcom, expose locked variant of send_event
  net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events
  net/mlx5: SD, make primary/secondary role determination more robust
  net/mlx5: SD, add L2 table silent mode query support
  net/mlx5: SD, expend vport metadata for SD secondary devices
  net/mlx5: SD, support switchdev mode transition with shared FDB
  net/mlx5: E-Switch, notify SD on eswitch disable
  net/mlx5: LAG, store demux resources per master lag_func
  net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change
  net/mlx5: LAG, introduce software vport LAG implementation
  net/mlx5: LAG, add MPESW over SD LAG support
  net/mlx5: E-Switch, defer rep load while SD LAG is not active
  net/mlx5: SD, defer vport metadata init until SD is ready
  net/mlx5: SD, enable SD over ECPF and allow switchdev transition

 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   1 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   5 +
 .../mellanox/mlx5/core/eswitch_offloads.c     | 275 +++++++++++-
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.c  |  21 +
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.h  |   2 +
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 179 ++++++--
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  29 +-
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  95 +++-
 .../ethernet/mellanox/mlx5/core/lag/mpesw.h   |   4 +
 .../mellanox/mlx5/core/lag/shared_fdb.c       |  74 +++-
 .../ethernet/mellanox/mlx5/core/lib/devcom.c  |  36 +-
 .../ethernet/mellanox/mlx5/core/lib/devcom.h  |   5 +
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 406 +++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |   8 +
 14 files changed, 1018 insertions(+), 122 deletions(-)

base-commit: 5855479abc796c3b5d7b2f2ca147d68fc56cae1f
-- 
2.44.0

next             reply	other threads:[~2026-06-12 11:39 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12 11:38 Tariq Toukan [this message]
2026-06-12 11:38 ` [PATCH net-next V3 01/15] net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 02/15] net/mlx5: devcom, expose locked variant of send_event Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 03/15] net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 04/15] net/mlx5: SD, make primary/secondary role determination more robust Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 05/15] net/mlx5: SD, add L2 table silent mode query support Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 06/15] net/mlx5: SD, expend vport metadata for SD secondary devices Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 07/15] net/mlx5: SD, support switchdev mode transition with shared FDB Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 08/15] net/mlx5: E-Switch, notify SD on eswitch disable Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 09/15] net/mlx5: LAG, store demux resources per master lag_func Tariq Toukan
2026-06-12 11:38 ` [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 11/15] net/mlx5: LAG, introduce software vport LAG implementation Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 12/15] net/mlx5: LAG, add MPESW over SD LAG support Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready Tariq Toukan
2026-06-12 11:39 ` [PATCH net-next V3 15/15] net/mlx5: SD, enable SD over ECPF and allow switchdev transition Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260612113904.537595-1-tariqt@nvidia.com \
    --to=tariqt@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=gbayer@linux.ibm.com \
    --cc=horms@kernel.org \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox