From: Jacob Keller <jacob.e.keller@intel.com>
To: Tariq Toukan <tariqt@nvidia.com>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
"Andrew Lunn" <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>, Mark Bloch <mbloch@nvidia.com>,
Nimrod Oren <noren@nvidia.com>, Yael Chemla <ychemla@nvidia.com>,
Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
Edward Srouji <edwards@nvidia.com>,
Maher Sanalla <msanalla@nvidia.com>,
Simon Horman <horms@kernel.org>, Parav Pandit <parav@nvidia.com>,
Patrisious Haddad <phaddad@nvidia.com>,
Kees Cook <kees@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, Gal Pressman <gal@nvidia.com>
Subject: Re: [PATCH net-next 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2
Date: Wed, 27 May 2026 15:08:04 -0700
Message-ID: <0a432449-2409-4e55-b17d-9d2fe1cc4860@intel.com>
In-Reply-To: <20260527125427.385976-1-tariqt@nvidia.com>

On 5/27/2026 5:54 AM, Tariq Toukan wrote:
> Hi,
>
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. See detailed feature description by Shay below.
>
> Regards,
> Tariq
>
>
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. SD single netdev combines multiple PCI functions
> behind a single netdev interface. To support switchdev offloads, these
> functions must participate in virtual LAG (shared FDB).
>
> Design
>
> Rather than introducing a separate LAG instance for SD, this series
> integrates SD secondary devices into the existing LAG structure
> (priv.lag) created at probe time. Each lag_func entry carries a
> group_id field that identifies its SD group membership (0 means not
> part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
> physical port entries from SD secondaries, enabling a single unified
> iterator that filters by group:
>
> - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
> behavior, used by bonding, FW LAG commands, v2p_map)
> - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
> (used by MPESW shared FDB across all devices)
> - specific group_id: iterate only devices in that SD group (used by
> per-group SD shared FDB operations)
>
> Existing callers use mlx5_ldev_for_each() which maps to
> MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
> configurations.
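To make the three filter modes above concrete, here is a minimal user-space sketch of the matching rule as described in the cover letter. It is not the driver code: the `sketch_*` names and the negative filter values are assumptions chosen only so they cannot collide with a real group_id, and a plain array stands in for the xarray and its XA_MARK_PORT mark.

```c
#include <stdbool.h>

/* Hypothetical filter values; the real definitions in lag.h may differ. */
#define SKETCH_FILTER_PORTS (-1)
#define SKETCH_FILTER_ALL   (-2)

/* Simplified stand-in for a lag_func entry. */
struct sketch_lag_func {
	int  idx;
	bool is_port;   /* models the XA_MARK_PORT mark on the entry */
	int  group_id;  /* 0 means not part of any SD group */
};

/* Return true if the entry should be visited under the given filter. */
bool sketch_match(const struct sketch_lag_func *f, int filter)
{
	if (filter == SKETCH_FILTER_ALL)
		return true;            /* ports and SD secondaries */
	if (filter == SKETCH_FILTER_PORTS)
		return f->is_port;      /* port-level entries only */
	/* Otherwise the filter is a specific SD group_id. */
	return f->group_id != 0 && f->group_id == filter;
}

/* Count the entries a filtered iteration would visit. */
int sketch_count(const struct sketch_lag_func *tbl, int n, int filter)
{
	int i, cnt = 0;

	for (i = 0; i < n; i++)
		if (sketch_match(&tbl[i], filter))
			cnt++;
	return cnt;
}
```

With two port entries that are the primaries of groups 1 and 2, plus one secondary in each group, the PORTS filter visits 2 entries, the ALL filter visits 4, and a filter of group_id 1 visits that group's primary and its secondary.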
>
> Lifecycle and ownership
>
> The SD LAG lifecycle is tied to the SD group, not to bonding events:
>
> 1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
> (priv.lag) for each LAG-capable PF, e.g. SD primary devices.
>
> 2. During mlx5_sd_init(), after the SD group is fully formed (primary
> and secondaries paired), sd_lag_init() registers the secondary
> devices into the primary's existing priv.lag by calling
> mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
> also gets its group_id set. No separate LAG instance is created.
>
> 3. After all the devices in the SD group transition to switchdev,
> mlx5_lag_shared_fdb_create() is invoked with the group_id to create
> a software-only shared FDB scoped to that SD group. This sets
> sd_fdb_active on all lag_func entries in the group. No FW LAG
> commands are issued since SD devices share the same physical port.
>
> 4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
> per-group SD shared FDB is torn down first, then MPESW shared FDB is
> created spanning all devices (ports + SD secondaries) using
> MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
> restored.
>
> 5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
> removes secondaries from priv.lag and clears the primary's group_id.
> The LAG structure itself is not destroyed.
>
> The sd_fdb_active flag is set on all lag_func entries in a group (not
> just the primary), so any device can detect the SD shared FDB state
> during lag_disable_change teardown without needing to look up peer
> entries.
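The point about setting sd_fdb_active on every entry in the group can be illustrated with a small user-space model. This is only a sketch under assumptions: `sketch_member`, `sketch_sd_fdb_set()` and `sketch_sd_fdb_active()` are hypothetical names, and the real code operates on lag_func entries in priv.lag rather than on a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the two new lag_func fields. */
struct sketch_member {
	int  group_id;       /* 0 means not part of any SD group */
	bool sd_fdb_active;  /* SD shared FDB state for this entry */
};

/*
 * Set or clear the flag on every member of one SD group, as in
 * step 3 (create) and its teardown counterpart. Returns the number
 * of entries touched. group_id 0 never matches, because it means
 * "no group".
 */
int sketch_sd_fdb_set(struct sketch_member *tbl, size_t n,
		      int group_id, bool active)
{
	size_t i;
	int touched = 0;

	if (group_id == 0)
		return 0;
	for (i = 0; i < n; i++) {
		if (tbl[i].group_id == group_id) {
			tbl[i].sd_fdb_active = active;
			touched++;
		}
	}
	return touched;
}

/* Any single entry can answer locally, with no peer lookup. */
bool sketch_sd_fdb_active(const struct sketch_member *m)
{
	return m->sd_fdb_active;
}
```

The trade-off is a redundant flag per entry in exchange for a teardown path that only needs the entry it already holds.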
>
> SD shared FDB is a pure software construct -- unlike regular LAG modes
> (ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
> commands. The software vport LAG for SD is implemented via eswitch
> egress ACL bounce rules, managed by the IB layer through
> mlx5_eth_lag_init(). The software LAG demux is implemented via
> steering rules that use a new destination, VHCA_RX.
>
I appreciate the overall details on the lifecycle and ownership. That
made it easier to follow the patches and understand the changes.

> Patches
>
> Infrastructure (patches 1, 5-6):
> - Factor out shared FDB code into a dedicated file
> - Extend lag_func with group_id and sd_fdb_active fields;
> add XA_MARK_PORT and unified iterator with group_id filter
> - Extend shared FDB API with group_id parameter
>
> E-Switch preparation (patches 2-3):
> - Align eswitch disable sequence ordering
> - Move devcom init from TC to eswitch layer
>
> SD group management (patches 4, 7-9):
> - Replace peer count check with direct peer lookup
> - Register SD secondaries in the existing LAG at SD init time
> - Block RoCE and VF LAG for SD devices
> - Block multipath LAG for SD devices
>
> Switchdev integration (patch 10):
> - Keep netdev resources local in switchdev mode
>
> Steering (patches 11-12):
> - Track peer flow slots with bitmap for selective peer flow deletion
> - Enable TC flow steering for SD LAG
>
> Enablement (patch 13):
> - Verify unique vhca_id count for cross-VHCA RQT
>
Patch 13 being labeled "enablement" is a bit confusing to me, since I had
trouble seeing from its description how it "enables" the Socket Direct
support. But the cover letter does say "part 1/2", so I am guessing
that's addressed in part 2?

> Shay Drory (13):
> net/mlx5: LAG, factor out shared FDB code into dedicated file
> net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
> transition
> net/mlx5: E-Switch, move devcom init from TC to eswitch layer
> net/mlx5: LAG, replace peer count check with direct peer lookup
> net/mlx5: LAG, prepare for SD device integration
> net/mlx5: LAG, extend shared FDB API with group_id filter
> net/mlx5: SD, introduce Socket Direct LAG
> net/mlx5: LAG, block RoCE and VF LAG for SD devices
> net/mlx5: LAG, block multipath LAG for SD devices
> net/mlx5: SD, keep netdev resources on same PF in switchdev mode
> net/mlx5e: TC, track peer flow slots with bitmap
> net/mlx5e: TC, enable steering for SD LAG
> net/mlx5e: Verify unique vhca_id count instead of range
>
> .../net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
> .../net/ethernet/mellanox/mlx5/core/en/rqt.c | 27 +-
> .../ethernet/mellanox/mlx5/core/en/tc_priv.h | 7 +
> .../net/ethernet/mellanox/mlx5/core/en_tc.c | 83 ++--
> .../net/ethernet/mellanox/mlx5/core/eswitch.h | 11 +-
> .../mellanox/mlx5/core/eswitch_offloads.c | 26 ++
> .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
> .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
> .../net/ethernet/mellanox/mlx5/core/lag/mp.c | 4 +
> .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 28 +-
> .../mellanox/mlx5/core/lag/shared_fdb.c | 233 ++++++++++
> .../net/ethernet/mellanox/mlx5/core/lib/sd.c | 227 +++++++--
> .../net/ethernet/mellanox/mlx5/core/lib/sd.h | 23 +
> .../net/ethernet/mellanox/mlx5/core/main.c | 3 +-
> 14 files changed, 914 insertions(+), 289 deletions(-)
> create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
>
>
> base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10