Netdev List
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: tariqt@nvidia.com
Cc: Jakub Kicinski <kuba@kernel.org>,
	edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, saeedm@nvidia.com, leon@kernel.org,
	mbloch@nvidia.com, noren@nvidia.com, ychemla@nvidia.com,
	shayd@nvidia.com, ohartoov@nvidia.com, edwards@nvidia.com,
	horms@kernel.org, msanalla@nvidia.com, parav@nvidia.com,
	kees@kernel.org, phaddad@nvidia.com, moshe@nvidia.com,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gal@nvidia.com,
	jacob.e.keller@intel.com
Subject: Re: [PATCH net-next V2 12/13] net/mlx5e: TC, enable steering for SD LAG
Date: Tue,  2 Jun 2026 19:26:45 -0700	[thread overview]
Message-ID: <20260603022645.2298955-1-kuba@kernel.org> (raw)
In-Reply-To: <20260531113954.395443-13-tariqt@nvidia.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: TC, enable steering for SD LAG

Extends TC flow steering to support the SD LAG mode by adding
mlx5_lag_is_sd()/mlx5_sd_is_primary() helpers, redirecting uplink
priv/proto_dev queries to the primary eswitch in SD setups, and adding
peer-flow handling for secondary devices.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index 2a16368a948e..910492eb51f2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c

[ ... ]

> @@ -4196,9 +4197,26 @@ static bool is_lag_dev(struct mlx5e_priv *priv,
>  		 same_hw_reps(priv, peer_netdev));
>  }
>  
> +static bool is_sd_eligible(struct mlx5e_priv *priv,
> +			   struct net_device *peer_netdev)
> +{
> +	struct mlx5e_priv *peer_priv;
> +
> +	peer_priv = netdev_priv(peer_netdev);
> +	return same_hw_reps(priv, peer_netdev) &&
> +		mlx5_lag_is_sd(priv->mdev) &&
> +		(mlx5_sd_get_primary(priv->mdev) ==
> +		 mlx5_sd_get_primary(peer_priv->mdev));
> +}

[Medium]
With the redefinition of mlx5_sd_get_primary() to return NULL when
sd->devcom is not ready, can both calls here legitimately return NULL
during a teardown race and make the equality test succeed?

In that situation is_sd_eligible() would report two unrelated devices
as belonging to the same SD primary.

Would something like this be more robust:

    struct mlx5_core_dev *primary = mlx5_sd_get_primary(priv->mdev);

    return same_hw_reps(priv, peer_netdev) &&
           mlx5_lag_is_sd(priv->mdev) &&
           primary &&
           primary == mlx5_sd_get_primary(peer_priv->mdev);

> +
>  static bool is_multiport_eligible(struct mlx5e_priv *priv, struct net_device *out_dev)
>  {
> -	return same_hw_reps(priv, out_dev) && mlx5_lag_is_mpesw(priv->mdev);
> +	struct mlx5_core_dev *primary = mlx5_sd_get_primary(priv->mdev);
> +
> +	if (!primary)
> +		return false;
> +
> +	return same_hw_reps(priv, out_dev) && mlx5_lag_is_mpesw(primary);
>  }

[ ... ]

> @@ -4351,7 +4372,7 @@ static struct rhashtable *get_tc_ht(struct mlx5e_priv *priv,
>  		return &tc->ht;
>  }
>  
> -static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow)
> +static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow, bool *is_sd)
>  {
>  	struct mlx5_esw_flow_attr *esw_attr = flow->attr->esw_attr;
>  	struct mlx5_flow_attr *attr = flow->attr;
> @@ -4372,6 +4393,13 @@ static bool is_peer_flow_needed(struct mlx5e_tc_flow *flow)
>  	if (mlx5_lag_is_mpesw(esw_attr->in_mdev))
>  		return true;
>  
> +	if (mlx5_lag_is_sd(esw_attr->in_mdev) &&
> +	    !mlx5_sd_is_primary(esw_attr->in_mdev)) {
> +		if (!mlx5_lag_is_mpesw(mlx5_sd_get_primary(esw_attr->in_mdev)))
> +			*is_sd = true;
> +		return true;
> +	}
> +
>  	return false;
>  }

[High]
Can mlx5_sd_get_primary(esw_attr->in_mdev) return NULL here?

The patch updates mlx5_sd_get_primary() to return NULL when
sd->devcom is not ready:

    struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
    {
        struct mlx5_sd *sd = mlx5_get_sd(dev);

        if (!sd)
            return dev;

        if (!mlx5_devcom_comp_is_ready(sd->devcom))
            return NULL;

        return sd->primary ? dev : sd->primary_dev;
    }

The result is then handed straight to mlx5_lag_is_mpesw(), which
dereferences its argument:

    bool mlx5_lag_is_mpesw(struct mlx5_core_dev *dev)
    {
        struct mlx5_lag *ldev = mlx5_lag_dev(dev);

        return ldev && ldev->mode == MLX5_LAG_MODE_MPESW;
    }

mlx5_lag_dev(NULL) evaluates NULL->priv.lag.

Note that the companion is_multiport_eligible() helper added in the
same patch does add the if (!primary) return false; guard for the
same reason; would the same guard be appropriate here?

Could the sequence "mlx5_lag_is_sd() returns true (group_id != 0
observed under lag_lock) but mlx5_sd_get_primary() returns NULL
(devcom->ready observed false)" occur if SD teardown runs concurrently
with a TC flow add?

[Medium]
On the topic of unsynchronized SD-state reads in this branch, the
writer (mlx5_sd_cleanup) updates sd->primary and devcom->ready under
the devcom rwsem, but the readers introduced by this patch take none
of those locks:

  - mlx5_sd_is_primary() reads sd->primary with no lock and no
    READ_ONCE.
  - mlx5_sd_get_primary() pairs READ_ONCE(ready) with a plain read of
    sd->primary and sd->primary_dev.
  - mlx5_lag_is_sd() reads pf->group_id under lag_lock, which does
    not order against the devcom rwsem at all.

Since sd->primary also selects which arm of the union in struct
mlx5_sd is valid (secondaries[] vs primary_dev), can a torn or stale
read of sd->primary make these helpers read the wrong union member?

> @@ -4628,19 +4657,26 @@ mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
>  		   struct mlx5e_tc_flow **__flow)
>  {
>  	struct mlx5_devcom_comp_dev *devcom = priv->mdev->priv.eswitch->devcom, *pos;
> +	struct netlink_ext_ack *extack = f->common.extack;
>  	struct mlx5e_rep_priv *rpriv = priv->ppriv;
>  	struct mlx5_eswitch_rep *in_rep = rpriv->rep;
>  	struct mlx5_core_dev *in_mdev = priv->mdev;
>  	struct mlx5_eswitch *peer_esw;
>  	struct mlx5e_tc_flow *flow;
> +	bool is_sd = false;
>  	int err;
>  
> +	if (mlx5_lag_is_sd(in_mdev) && !mlx5_lag_is_active(in_mdev)) {
> +		NL_SET_ERR_MSG_MOD(extack, "SD shared FDB not yet active");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	flow = __mlx5e_add_fdb_flow(priv, f, flow_flags, filter_dev, in_rep,
>  				    in_mdev);
>  	if (IS_ERR(flow))
>  		return PTR_ERR(flow);
>  
> -	if (!is_peer_flow_needed(flow)) {
> +	if (!is_peer_flow_needed(flow, &is_sd)) {
>  		*__flow = flow;
>  		return 0;
>  	}
> @@ -4651,6 +4687,15 @@ mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
>  	}
>  
>  	mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
> +		if (is_sd) {
> +			/* SD shared FDB: only the matching SD primary. */
> +			if (mlx5_sd_get_primary(in_mdev) !=
> +			    mlx5_sd_get_primary(peer_esw->dev))
> +				continue;
> +		} else {
> +			if (!mlx5_sd_is_primary(peer_esw->dev))
> +				continue;
> +		}
>  		err = mlx5e_tc_add_fdb_peer_flow(f, flow, flow_flags, peer_esw);
>  		if (err)
>  			goto peer_clean;

[Medium]
Same NULL == NULL question applies here. The devcom being iterated is
priv->mdev->priv.eswitch->devcom, i.e. the eswitch devcom, so peer_esw
entries are not necessarily SD peers and may belong to different SD
groups (or to none).

If mlx5_sd_get_primary() can transiently return NULL on either side
during an SD teardown race, both sides may evaluate to NULL and the
!= test will not skip the unrelated peer:

    if (mlx5_sd_get_primary(in_mdev) !=
        mlx5_sd_get_primary(peer_esw->dev))
        continue;

Would explicitly requiring a non-NULL primary on at least one side, or
caching mlx5_sd_get_primary(in_mdev) once and rejecting NULL up front,
be more correct?

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index ec606851feb8..25286ecd724e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> @@ -49,13 +49,16 @@ static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
>  	return sd->host_buses;
>  }
>  
> -static struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
> +struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
>  {
>  	struct mlx5_sd *sd = mlx5_get_sd(dev);
>  
>  	if (!sd)
>  		return dev;
>  
> +	if (!mlx5_devcom_comp_is_ready(sd->devcom))
> +		return NULL;
> +
>  	return sd->primary ? dev : sd->primary_dev;
>  }

[ ... ]

  reply	other threads:[~2026-06-03  2:26 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-31 11:39 [PATCH net-next V2 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2 Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 01/13] net/mlx5: LAG, factor out shared FDB code into dedicated file Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 02/13] net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 03/13] net/mlx5: E-Switch, move devcom init from TC to eswitch layer Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 04/13] net/mlx5: LAG, replace peer count check with direct peer lookup Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 05/13] net/mlx5: LAG, prepare for SD device integration Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 06/13] net/mlx5: LAG, extend shared FDB API with group_id filter Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 07/13] net/mlx5: SD, introduce Socket Direct LAG Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 08/13] net/mlx5: LAG, block RoCE and VF LAG for SD devices Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 09/13] net/mlx5: LAG, block multipath " Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 10/13] net/mlx5: SD, keep netdev resources on same PF in switchdev mode Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 11/13] net/mlx5e: TC, track peer flow slots with bitmap Tariq Toukan
2026-05-31 11:39 ` [PATCH net-next V2 12/13] net/mlx5e: TC, enable steering for SD LAG Tariq Toukan
2026-06-03  2:26   ` Jakub Kicinski [this message]
2026-06-03  6:43     ` Shay Drori
2026-05-31 11:39 ` [PATCH net-next V2 13/13] net/mlx5e: Verify unique vhca_id count instead of range Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260603022645.2298955-1-kuba@kernel.org \
    --to=kuba@kernel.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=edwards@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jacob.e.keller@intel.com \
    --cc=kees@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=msanalla@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=noren@nvidia.com \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=phaddad@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=tariqt@nvidia.com \
    --cc=ychemla@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox