From: Jacob Keller <jacob.e.keller@intel.com>
To: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Mark Bloch <mbloch@nvidia.com>,
	Nimrod Oren <noren@nvidia.com>, Yael Chemla <ychemla@nvidia.com>,
	Shay Drory <shayd@nvidia.com>, Or Har-Toov <ohartoov@nvidia.com>,
	Edward Srouji <edwards@nvidia.com>,
	Maher Sanalla <msanalla@nvidia.com>,
	Simon Horman <horms@kernel.org>, Parav Pandit <parav@nvidia.com>,
	Patrisious Haddad <phaddad@nvidia.com>,
	Kees Cook <kees@kernel.org>, Moshe Shemesh <moshe@nvidia.com>,
	<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>,
	<linux-rdma@vger.kernel.org>, Gal Pressman <gal@nvidia.com>
Subject: Re: [PATCH net-next 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2
Date: Wed, 27 May 2026 15:08:04 -0700
Message-ID: <0a432449-2409-4e55-b17d-9d2fe1cc4860@intel.com>
In-Reply-To: <20260527125427.385976-1-tariqt@nvidia.com>

On 5/27/2026 5:54 AM, Tariq Toukan wrote:
> Hi,
> 
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. See detailed feature description by Shay below.
> 
> Regards,
> Tariq
> 
> 
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. SD single netdev combines multiple PCI functions
> behind a single netdev interface. To support switchdev offloads, these
> functions must participate in virtual LAG (shared FDB).
> 
> Design
> 
> Rather than introducing a separate LAG instance for SD, this series
> integrates SD secondary devices into the existing LAG structure
> (priv.lag) created at probe time. Each lag_func entry carries a
> group_id field that identifies its SD group membership (0 means not
> part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
> physical port entries from SD secondaries, enabling a single unified
> iterator that filters by group:
> 
>   - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
>     behavior, used by bonding, FW LAG commands, v2p_map)
>   - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
>     (used by MPESW shared FDB across all devices)
>   - specific group_id: iterate only devices in that SD group (used by
>     per-group SD shared FDB operations)
> 
> Existing callers use mlx5_ldev_for_each() which maps to
> MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
> configurations.
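
The three filter modes described above can be modeled in plain userspace C.
This is only a sketch under stated assumptions, not the code from the series:
the sentinel values, struct lag_func_model, and both helper names are
hypothetical, and a bool stands in for the XA_MARK_PORT xarray mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel values; the real encoding in lag.h may differ. */
#define LAG_FILTER_PORTS (-1)
#define LAG_FILTER_ALL   (-2)

/* Simplified stand-in for a lag_func entry. */
struct lag_func_model {
	int  group_id; /* 0 means not part of any SD group */
	bool is_port;  /* models the XA_MARK_PORT mark */
};

/* Return true if an iteration using this filter would visit entry f. */
static bool lag_filter_match(const struct lag_func_model *f, int filter)
{
	assert(f != NULL);
	if (filter == LAG_FILTER_ALL)
		return true;
	if (filter == LAG_FILTER_PORTS)
		return f->is_port;
	/* Any other value is treated as a specific SD group_id. */
	return f->group_id != 0 && f->group_id == filter;
}

/* Count the entries that an iteration with this filter would visit. */
static size_t lag_count(const struct lag_func_model *funcs, size_t n,
			int filter)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (lag_filter_match(&funcs[i], filter))
			count++;
	return count;
}
```

With two SD groups of one primary and one secondary each, plus one plain
port, the ports filter visits three entries, the all filter visits five, and
each group filter visits two.
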
> 
> Lifecycle and ownership
> 
> The SD LAG lifecycle is tied to the SD group, not to bonding events:
> 
> 1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
>    (priv.lag) for each LAG-capable PF, e.g. SD primary devices.
> 
> 2. During mlx5_sd_init(), after the SD group is fully formed (primary
>    and secondaries paired), sd_lag_init() registers the secondary
>    devices into the primary's existing priv.lag by calling
>    mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
>    also gets its group_id set. No separate LAG instance is created.
> 
> 3. After all the devices in the SD group transition to switchdev,
>    mlx5_lag_shared_fdb_create() is invoked with the group_id to create
>    a software-only shared FDB scoped to that SD group. This sets
>    sd_fdb_active on all lag_func entries in the group. No FW LAG
>    commands are issued since SD devices share the same physical port.
> 
> 4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
>    per-group SD shared FDB is torn down first, then MPESW shared FDB is
>    created spanning all devices (ports + SD secondaries) using
>    MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
>    restored.
> 
> 5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
>    removes secondaries from priv.lag and clears the primary's group_id.
>    The LAG structure itself is not destroyed.
> 
> The sd_fdb_active flag is set on all lag_func entries in a group (not
> just the primary), so any device can detect the SD shared FDB state
> during lag_disable_change teardown without needing to look up peer
> entries.
> 
> SD shared FDB is a pure software construct -- unlike regular LAG modes
> (ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
> commands. The software vport LAG for SD is implemented via eswitch
> egress ACL bounce rules, managed by the IB layer through
> mlx5_eth_lag_init(). The software LAG demux is implemented via
> steering rules that use a new destination, VHCA_RX.
> 

I appreciate the overall details on the lifecycle and ownership. That
made it easier to follow the patches and understand the changes.
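
The flag handling in steps 3 and 4 of the lifecycle can be sketched as a
small userspace model. Everything here is an assumption made for
illustration: struct sd_func_model and the three helpers are hypothetical
names, and the restore path assumes every SD group had an active shared FDB
before MPESW was enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a lag_func entry. */
struct sd_func_model {
	int  group_id;      /* 0 means not part of any SD group */
	bool sd_fdb_active; /* set on every entry of a group, not only the primary */
};

/* Step 3: set or clear sd_fdb_active on every entry in one SD group. */
static void sd_fdb_set(struct sd_func_model *funcs, size_t n, int group_id,
		       bool active)
{
	size_t i;

	assert(group_id != 0);
	for (i = 0; i < n; i++)
		if (funcs[i].group_id == group_id)
			funcs[i].sd_fdb_active = active;
}

/* Step 4, enable: per-group SD shared FDB is torn down first. */
static void mpesw_enable(struct sd_func_model *funcs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		funcs[i].sd_fdb_active = false;
}

/* Step 4, disable: per-group SD shared FDB is restored (assumed for all groups). */
static void mpesw_disable(struct sd_func_model *funcs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (funcs[i].group_id != 0)
			funcs[i].sd_fdb_active = true;
}
```

Because the flag lives on every entry, a teardown path that holds any one
device of the group can read the state directly, which matches the rationale
given for sd_fdb_active in the cover letter.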

> Patches
> 
> Infrastructure (patches 1, 5-6):
>   - Factor out shared FDB code into a dedicated file
>   - Extend lag_func with group_id and sd_fdb_active fields;
>     add XA_MARK_PORT and unified iterator with group_id filter
>   - Extend shared FDB API with group_id parameter
> 
> E-Switch preparation (patches 2-3):
>   - Align eswitch disable sequence ordering
>   - Move devcom init from TC to eswitch layer
> 
> SD group management (patches 4, 7-9):
>   - Replace peer count check with direct peer lookup
>   - Register SD secondaries in the existing LAG at SD init time
>   - Block RoCE and VF LAG for SD devices
>   - Block multipath LAG for SD devices
> 
> Switchdev integration (patch 10):
>   - Keep netdev resources local in switchdev mode
> 
> Steering (patches 11-12):
>   - Track peer flow slots with bitmap for selective peer flow deletion
>   - Enable TC flow steering for SD LAG
> 
> Enablement (patch 13):
>   - Verify unique vhca_id count for cross-VHCA RQT
> 

Labeling patch 13 as "enablement" is a bit confusing to me, since I had
trouble understanding how the patch, as described, "enables" the Socket
Direct support. The cover letter does say "part 1/2", though, so I am
guessing that is addressed in part 2?
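
Going only by the subject line of patch 13, the difference between a range
check and a unique-count check might look like the sketch below. This is a
guess at the idea, not the logic in en/rqt.c: both helper names are
hypothetical, and the IDs are plain uint16_t values in a userspace array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count distinct values in ids[0..n-1]; O(n^2), fine for a few devices. */
static size_t count_unique_ids(const uint16_t *ids, size_t n)
{
	size_t i, j, unique = 0;

	assert(ids != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Look for an earlier occurrence of ids[i]. */
		for (j = 0; j < i; j++)
			if (ids[j] == ids[i])
				break;
		if (j == i)
			unique++;
	}
	return unique;
}

/* Range-style measure for comparison: max - min + 1, or 0 if empty. */
static size_t id_range_span(const uint16_t *ids, size_t n)
{
	uint16_t lo, hi;
	size_t i;

	if (n == 0)
		return 0;
	lo = hi = ids[0];
	for (i = 1; i < n; i++) {
		if (ids[i] < lo)
			lo = ids[i];
		if (ids[i] > hi)
			hi = ids[i];
	}
	return (size_t)(hi - lo) + 1;
}
```

For IDs that are few but far apart, the span is large while the unique count
stays small, so a limit applied to the count would accept sets that a limit
applied to the span would reject.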

> Shay Drory (13):
>   net/mlx5: LAG, factor out shared FDB code into dedicated file
>   net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
>     transition
>   net/mlx5: E-Switch, move devcom init from TC to eswitch layer
>   net/mlx5: LAG, replace peer count check with direct peer lookup
>   net/mlx5: LAG, prepare for SD device integration
>   net/mlx5: LAG, extend shared FDB API with group_id filter
>   net/mlx5: SD, introduce Socket Direct LAG
>   net/mlx5: LAG, block RoCE and VF LAG for SD devices
>   net/mlx5: LAG, block multipath LAG for SD devices
>   net/mlx5: SD, keep netdev resources on same PF in switchdev mode
>   net/mlx5e: TC, track peer flow slots with bitmap
>   net/mlx5e: TC, enable steering for SD LAG
>   net/mlx5e: Verify unique vhca_id count instead of range
> 
>  .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
>  .../net/ethernet/mellanox/mlx5/core/en/rqt.c  |  27 +-
>  .../ethernet/mellanox/mlx5/core/en/tc_priv.h  |   7 +
>  .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  83 ++--
>  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  11 +-
>  .../mellanox/mlx5/core/eswitch_offloads.c     |  26 ++
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
>  .../net/ethernet/mellanox/mlx5/core/lag/mp.c  |   4 +
>  .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  28 +-
>  .../mellanox/mlx5/core/lag/shared_fdb.c       | 233 ++++++++++
>  .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 227 +++++++--
>  .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |  23 +
>  .../net/ethernet/mellanox/mlx5/core/main.c    |   3 +-
>  14 files changed, 914 insertions(+), 289 deletions(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> 
> 
> base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10

