From: Jakub Kicinski <kuba@kernel.org>
To: tariqt@nvidia.com
Cc: Jakub Kicinski <kuba@kernel.org>,
edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
davem@davemloft.net, saeedm@nvidia.com, leon@kernel.org,
mbloch@nvidia.com, noren@nvidia.com, ychemla@nvidia.com,
shayd@nvidia.com, ohartoov@nvidia.com, edwards@nvidia.com,
horms@kernel.org, msanalla@nvidia.com, parav@nvidia.com,
kees@kernel.org, phaddad@nvidia.com, moshe@nvidia.com,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, gal@nvidia.com,
jacob.e.keller@intel.com
Subject: Re: [PATCH net-next V2 12/13] net/mlx5e: TC, enable steering for SD LAG
Date: Tue, 2 Jun 2026 19:26:45 -0700 [thread overview]
Message-ID: <20260603022645.2298955-1-kuba@kernel.org> (raw)
In-Reply-To: <20260531113954.395443-13-tariqt@nvidia.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: TC, enable steering for SD LAG
Extends TC flow steering to support the SD LAG mode by adding
mlx5_lag_is_sd()/mlx5_sd_is_primary() helpers, redirecting uplink
priv/proto_dev queries to the primary eswitch in SD setups, and adding
peer-flow handling for secondary devices.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index 2a16368a948e..910492eb51f2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
[ ... ]
> @@ -4196,9 +4197,26 @@ static bool is_lag_dev(struct mlx5e_priv *priv,
> same_hw_reps(priv, peer_netdev));
> }
>
> +static bool is_sd_eligible(struct mlx5e_priv *priv,
> + struct net_device *peer_netdev)
> +{
> + struct mlx5e_priv *peer_priv;
> +
> + peer_priv = netdev_priv(peer_netdev);
> + return same_hw_reps(priv, peer_netdev) &&
> + mlx5_lag_is_sd(priv->mdev) &&
> + (mlx5_sd_get_primary(priv->mdev) ==
> + mlx5_sd_get_primary(peer_priv->mdev));
> +}
[Medium]
With the redefinition of mlx5_sd_get_primary() to return NULL when
sd->devcom is not ready, can both calls here legitimately return NULL
during a teardown race and make the equality test succeed?
In that situation is_sd_eligible() would report two unrelated devices
as belonging to the same SD primary.
Would something like this be more robust:
struct mlx5_core_dev *primary = mlx5_sd_get_primary(priv->mdev);
return same_hw_reps(priv, peer_netdev) &&
mlx5_lag_is_sd(priv->mdev) &&
primary &&
primary == mlx5_sd_get_primary(peer_priv->mdev);
> +
> static bool is_multiport_eligible(struct mlx5e_priv *priv, struct net_device *out_dev)
> {
> - return same_hw_reps(priv, out_dev) && mlx5_lag_is_mpesw(priv->mdev);
> + struct mlx5_core_dev *primary = mlx5_sd_get_primary(priv->mdev);
> +
> + if (!primary)
> + return false;
> +
> + return same_hw_reps(priv, out_dev) && mlx5_lag_is_mpesw(primary);
> }
[ ... ]
> @@ -4351,7 +4372,7 @@ static struct rhashtable *get_tc_ht(struct mlx5e_priv *priv,
> return &tc->ht;
> }
>
> -static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow)
> +static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow, bool *is_sd)
> {
> struct mlx5_esw_flow_attr *esw_attr = flow->attr->esw_attr;
> struct mlx5_flow_attr *attr = flow->attr;
> @@ -4372,6 +4393,13 @@ static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow)
> if (mlx5_lag_is_mpesw(esw_attr->in_mdev))
> return true;
>
> + if (mlx5_lag_is_sd(esw_attr->in_mdev) &&
> + !mlx5_sd_is_primary(esw_attr->in_mdev)) {
> + if (!mlx5_lag_is_mpesw(mlx5_sd_get_primary(esw_attr->in_mdev)))
> + *is_sd = true;
> + return true;
> + }
> +
> return false;
> }
[High]
Can mlx5_sd_get_primary(esw_attr->in_mdev) return NULL here?
The patch updates mlx5_sd_get_primary() to return NULL when
sd->devcom is not ready:
struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
{
struct mlx5_sd *sd = mlx5_get_sd(dev);
if (!sd)
return dev;
if (!mlx5_devcom_comp_is_ready(sd->devcom))
return NULL;
return sd->primary ? dev : sd->primary_dev;
}
The result is then handed straight to mlx5_lag_is_mpesw(), which
dereferences its argument:
bool mlx5_lag_is_mpesw(struct mlx5_core_dev *dev)
{
struct mlx5_lag *ldev = mlx5_lag_dev(dev);
return ldev && ldev->mode == MLX5_LAG_MODE_MPESW;
}
mlx5_lag_dev(NULL) evaluates NULL->priv.lag.
Note that the companion is_multiport_eligible() helper added in the
same patch does add the if (!primary) return false; guard for the
same reason; would the same guard be appropriate here?
Could the sequence "mlx5_lag_is_sd() returns true (group_id != 0
observed under lag_lock) but mlx5_sd_get_primary() returns NULL
(devcom->ready observed false)" occur if SD teardown runs concurrently
with a TC flow add?
[Medium]
On the topic of unsynchronized SD-state reads in this branch, the
writer (mlx5_sd_cleanup) updates sd->primary and devcom->ready under
the devcom rwsem, but the readers introduced by this patch take none
of those locks:
- mlx5_sd_is_primary() reads sd->primary with no lock and no
READ_ONCE.
- mlx5_sd_get_primary() pairs READ_ONCE(ready) with a plain read of
sd->primary and sd->primary_dev.
- mlx5_lag_is_sd() reads pf->group_id under lag_lock, which does
not order against the devcom rwsem at all.
Since sd->primary also selects which arm of the union in struct
mlx5_sd is valid (secondaries[] vs primary_dev), can a torn or stale
read of sd->primary make these helpers read the wrong union member?
> @@ -4628,19 +4657,26 @@ mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
> struct mlx5e_tc_flow **__flow)
> {
> struct mlx5_devcom_comp_dev *devcom = priv->mdev->priv.eswitch->devcom, *pos;
> + struct netlink_ext_ack *extack = f->common.extack;
> struct mlx5e_rep_priv *rpriv = priv->ppriv;
> struct mlx5_eswitch_rep *in_rep = rpriv->rep;
> struct mlx5_core_dev *in_mdev = priv->mdev;
> struct mlx5_eswitch *peer_esw;
> struct mlx5e_tc_flow *flow;
> + bool is_sd = false;
> int err;
>
> + if (mlx5_lag_is_sd(in_mdev) && !mlx5_lag_is_active(in_mdev)) {
> + NL_SET_ERR_MSG_MOD(extack, "SD shared FDB not yet active");
> + return -EOPNOTSUPP;
> + }
> +
> flow = __mlx5e_add_fdb_flow(priv, f, flow_flags, filter_dev, in_rep,
> in_mdev);
> if (IS_ERR(flow))
> return PTR_ERR(flow);
>
> - if (!is_peer_flow_needed(flow)) {
> + if (!is_peer_flow_needed(flow, &is_sd)) {
> *__flow = flow;
> return 0;
> }
> @@ -4651,6 +4687,15 @@ mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
> }
>
> mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
> + if (is_sd) {
> + /* SD shared FDB: only the matching SD primary. */
> + if (mlx5_sd_get_primary(in_mdev) !=
> + mlx5_sd_get_primary(peer_esw->dev))
> + continue;
> + } else {
> + if (!mlx5_sd_is_primary(peer_esw->dev))
> + continue;
> + }
> err = mlx5e_tc_add_fdb_peer_flow(f, flow, flow_flags, peer_esw);
> if (err)
> goto peer_clean;
[Medium]
Same NULL == NULL question applies here. The devcom being iterated is
priv->mdev->priv.eswitch->devcom, i.e. the eswitch devcom, so peer_esw
entries are not necessarily SD peers and may belong to different SD
groups (or to none).
If mlx5_sd_get_primary() can transiently return NULL on either side
during an SD teardown race, both sides may evaluate to NULL and the
!= test will not skip the unrelated peer:
if (mlx5_sd_get_primary(in_mdev) !=
mlx5_sd_get_primary(peer_esw->dev))
continue;
Would explicitly requiring a non-NULL primary on at least one side, or
caching mlx5_sd_get_primary(in_mdev) once and rejecting NULL up front,
be more correct?
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index ec606851feb8..25286ecd724e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> @@ -49,13 +49,16 @@ static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
> return sd->host_buses;
> }
>
> -static struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
> +struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
> {
> struct mlx5_sd *sd = mlx5_get_sd(dev);
>
> if (!sd)
> return dev;
>
> + if (!mlx5_devcom_comp_is_ready(sd->devcom))
> + return NULL;
> +
> return sd->primary ? dev : sd->primary_dev;
> }
[ ... ]
next prev parent reply other threads:[~2026-06-03 2:26 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-31 11:39 [PATCH net-next V2 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2 Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 01/13] net/mlx5: LAG, factor out shared FDB code into dedicated file Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 02/13] net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 03/13] net/mlx5: E-Switch, move devcom init from TC to eswitch layer Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 04/13] net/mlx5: LAG, replace peer count check with direct peer lookup Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 05/13] net/mlx5: LAG, prepare for SD device integration Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 06/13] net/mlx5: LAG, extend shared FDB API with group_id filter Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 07/13] net/mlx5: SD, introduce Socket Direct LAG Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 08/13] net/mlx5: LAG, block RoCE and VF LAG for SD devices Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 09/13] net/mlx5: LAG, block multipath " Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 10/13] net/mlx5: SD, keep netdev resources on same PF in switchdev mode Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 11/13] net/mlx5e: TC, track peer flow slots with bitmap Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 12/13] net/mlx5e: TC, enable steering for SD LAG Tariq Toukan
2026-06-03 2:26 ` Jakub Kicinski [this message]
2026-06-03 6:43 ` Shay Drori
2026-05-31 11:39 ` [PATCH net-next V2 13/13] net/mlx5e: Verify unique vhca_id count instead of range Tariq Toukan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260603022645.2298955-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=edwards@nvidia.com \
--cc=gal@nvidia.com \
--cc=horms@kernel.org \
--cc=jacob.e.keller@intel.com \
--cc=kees@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=moshe@nvidia.com \
--cc=msanalla@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=noren@nvidia.com \
--cc=ohartoov@nvidia.com \
--cc=pabeni@redhat.com \
--cc=parav@nvidia.com \
--cc=phaddad@nvidia.com \
--cc=saeedm@nvidia.com \
--cc=shayd@nvidia.com \
--cc=tariqt@nvidia.com \
--cc=ychemla@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox