public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/2] mlx5e PSP fixes
@ 2026-04-17  5:01 Tariq Toukan
  2026-04-17  5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan
  2026-04-17  5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan
  0 siblings, 2 replies; 11+ messages in thread
From: Tariq Toukan @ 2026-04-17  5:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
	Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
	linux-rdma, linux-kernel, Gal Pressman

Hi,

This patchset provides bug fixes from Cosmin to the mlx5e PSP feature.

Thanks,
Tariq.

Cosmin Ratiu (2):
  net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable

 .../mellanox/mlx5/core/en_accel/psp.c         | 36 ++++++++++---------
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 +--
 2 files changed, 22 insertions(+), 18 deletions(-)


base-commit: 82c21069028c5db3463f851ae8ac9cc2e38a3827
-- 
2.44.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-17  5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan
@ 2026-04-17  5:02 ` Tariq Toukan
  2026-04-18 19:08   ` Jakub Kicinski
  2026-04-17  5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan
  1 sibling, 1 reply; 11+ messages in thread
From: Tariq Toukan @ 2026-04-17  5:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
	Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
	linux-rdma, linux-kernel, Gal Pressman

From: Cosmin Ratiu <cratiu@nvidia.com>

priv->psp->psp is initialized with the PSP device as returned by
psp_dev_create(). This could also return an error, in which case a
future psp_dev_unregister() will result in unpleasantness.

Avoid that by using a local variable and only saving the PSP device when
registration succeeds.
Also apply some light refactoring of the functions managing the PSP
device in order to make them more readable/safe.

Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/en_accel/psp.c         | 36 ++++++++++---------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
index 6a50b6dec0fa..d9adb993e64d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
@@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = {
 
 void mlx5e_psp_unregister(struct mlx5e_priv *priv)
 {
-	if (!priv->psp || !priv->psp->psp)
+	struct mlx5e_psp *psp = priv->psp;
+
+	if (!psp || !psp->psp)
 		return;
 
-	psp_dev_unregister(priv->psp->psp);
+	psp_dev_unregister(psp->psp);
+	psp->psp = NULL;
 }
 
 void mlx5e_psp_register(struct mlx5e_priv *priv)
 {
+	struct mlx5e_psp *psp = priv->psp;
+	struct psp_dev *psd;
+
 	/* FW Caps missing */
 	if (!priv->psp)
 		return;
 
-	priv->psp->caps.assoc_drv_spc = sizeof(u32);
-	priv->psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128;
+	psp->caps.assoc_drv_spc = sizeof(u32);
+	psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128;
 	if (MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_encrypt) &&
 	    MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_decrypt))
-		priv->psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256;
+		psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256;
 
-	priv->psp->psp = psp_dev_create(priv->netdev, &mlx5_psp_ops,
-					&priv->psp->caps, NULL);
-	if (IS_ERR(priv->psp->psp))
+	psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp->caps, NULL);
+	if (IS_ERR(psd)) {
 		mlx5_core_err(priv->mdev, "PSP failed to register due to %pe\n",
-			      priv->psp->psp);
+			      psd);
+		return;
+	}
+	psp->psp = psd;
 }
 
 int mlx5e_psp_init(struct mlx5e_priv *priv)
@@ -1131,22 +1139,18 @@ int mlx5e_psp_init(struct mlx5e_priv *priv)
 	if (!psp)
 		return -ENOMEM;
 
-	priv->psp = psp;
 	fs = mlx5e_accel_psp_fs_init(priv);
 	if (IS_ERR(fs)) {
 		err = PTR_ERR(fs);
-		goto out_err;
+		kfree(psp);
+		return err;
 	}
 
 	psp->fs = fs;
+	priv->psp = psp;
 
 	mlx5_core_dbg(priv->mdev, "PSP attached to netdevice\n");
 	return 0;
-
-out_err:
-	priv->psp = NULL;
-	kfree(psp);
-	return err;
 }
 
 void mlx5e_psp_cleanup(struct mlx5e_priv *priv)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable
  2026-04-17  5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan
  2026-04-17  5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan
@ 2026-04-17  5:02 ` Tariq Toukan
  1 sibling, 0 replies; 11+ messages in thread
From: Tariq Toukan @ 2026-04-17  5:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
	Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
	linux-rdma, linux-kernel, Gal Pressman

From: Cosmin Ratiu <cratiu@nvidia.com>

devlink reload while PSP connections are active does:

mlx5_unload_one_devl_locked() -> mlx5_detach_device()
-> _mlx5e_suspend()
  -> mlx5e_detach_netdev()
    -> profile->cleanup_rx
    -> profile->cleanup_tx
  -> mlx5e_destroy_mdev_resources() -> mlx5_core_dealloc_pd() fails:
...
mlx5_core 0000:08:00.0: mlx5_cmd_out_err:821:(pid 19722):
DEALLOC_PD(0x801) op_mod(0x0) failed, status bad resource state(0x9),
syndrome (0xef0c8a), err(-22)
...

The reason for failure is the existence of TX keys, which are removed by
the PSP dev unregistration happening in:
profile->cleanup() -> mlx5e_psp_unregister() -> mlx5e_psp_cleanup()
  -> psp_dev_unregister()
...but this isn't invoked in the devlink reload flow, only when changing
the NIC profile (e.g. when transitioning to switchdev mode) or on dev
teardown.

Move PSP device registration into mlx5e_nic_enable(), and unregistration
into the corresponding mlx5e_nic_disable(). These functions are called
during netdev attach/detach after RX & TX are set up.
This ensures that the keys will be gone by the time the PD is destroyed.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6c4eeb88588c..c3938a2dbbfe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6021,7 +6021,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	if (take_rtnl)
 		rtnl_lock();
 
-	mlx5e_psp_register(priv);
 	/* update XDP supported features */
 	mlx5e_set_xdp_feature(priv);
 
@@ -6034,7 +6033,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
 static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
 	mlx5e_health_destroy_reporters(priv);
-	mlx5e_psp_unregister(priv);
 	mlx5e_ktls_cleanup(priv);
 	mlx5e_psp_cleanup(priv);
 	mlx5e_fs_cleanup(priv->fs);
@@ -6158,6 +6156,7 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
 
 	mlx5e_fs_init_l2_addr(priv->fs, netdev);
 	mlx5e_ipsec_init(priv);
+	mlx5e_psp_register(priv);
 
 	err = mlx5e_macsec_init(priv);
 	if (err)
@@ -6228,6 +6227,7 @@ static void mlx5e_nic_disable(struct mlx5e_priv *priv)
 	mlx5_lag_remove_netdev(mdev, priv->netdev);
 	mlx5_vxlan_reset_to_default(mdev->vxlan);
 	mlx5e_macsec_cleanup(priv);
+	mlx5e_psp_unregister(priv);
 	mlx5e_ipsec_cleanup(priv);
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-17  5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan
@ 2026-04-18 19:08   ` Jakub Kicinski
  2026-04-20 10:30     ` Cosmin Ratiu
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-18 19:08 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, borisp,
	saeedm, leon, mbloch, daniel.zahka, willemdebruijn.kernel, cratiu,
	raeds, rrameshbabu, dtatulea, kees, netdev, linux-rdma,
	linux-kernel, gal

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: psp: Fix invalid access on PSP dev registration fail

This commit updates PSP device registration to use a local variable,
only saving the PSP device pointer if psp_dev_create() succeeds. This
prevents an error pointer from being stored in priv->psp->psp.

> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
[ ... ]
> @@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = {
[ ... ]
>  void mlx5e_psp_register(struct mlx5e_priv *priv)
>  {
> +	struct mlx5e_psp *psp = priv->psp;
> +	struct psp_dev *psd;
> +
[ ... ]
> -	priv->psp->psp = psp_dev_create(priv->netdev, &mlx5_psp_ops,
> -					&priv->psp->caps, NULL);
> -	if (IS_ERR(priv->psp->psp))
> +	psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp->caps, NULL);
> +	if (IS_ERR(psd)) {
>  		mlx5_core_err(priv->mdev, "PSP failed to register due to %pe\n",
> -			      priv->psp->psp);
> +			      psd);
> +		return;
> +	}
> +	psp->psp = psd;
>  }

When psp_dev_create() fails, this function now returns without setting
psp->psp, leaving it as NULL. However, priv->psp remains allocated and
non-NULL.

Does this leave the RX datapath vulnerable to a NULL pointer dereference?

If priv->psp is non-NULL, the NIC RX initialization path can still call
mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow steering
rules to intercept UDP traffic.

If a UDP packet triggers these rules, the hardware flags the CQE with
MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes
mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer
unconditionally:

u16 dev_id = priv->psp->psp->id;

Since priv->psp->psp is NULL, this will cause a kernel panic. Should
priv->psp be cleaned up, or the error propagated, to prevent flow rules
from being installed when registration fails?
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-18 19:08   ` Jakub Kicinski
@ 2026-04-20 10:30     ` Cosmin Ratiu
  2026-04-20 17:09       ` Jakub Kicinski
  0 siblings, 1 reply; 11+ messages in thread
From: Cosmin Ratiu @ 2026-04-20 10:30 UTC (permalink / raw)
  To: Tariq Toukan, kuba@kernel.org
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu,
	pabeni@redhat.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Raed Salem, Dragos Tatulea,
	kees@kernel.org, Mark Bloch, edumazet@google.com, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman

On Sat, 2026-04-18 at 12:08 -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5e: psp: Fix invalid access on PSP dev registration fail
> 
> This commit updates PSP device registration to use a local variable,
> only saving the PSP device pointer if psp_dev_create() succeeds. This
> prevents an error pointer from being stored in priv->psp->psp.
> 
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
> [ ... ]
> > @@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = {
> [ ... ]
> >  void mlx5e_psp_register(struct mlx5e_priv *priv)
> >  {
> > +	struct mlx5e_psp *psp = priv->psp;
> > +	struct psp_dev *psd;
> > +
> [ ... ]
> > -	priv->psp->psp = psp_dev_create(priv->netdev,
> > &mlx5_psp_ops,
> > -					&priv->psp->caps, NULL);
> > -	if (IS_ERR(priv->psp->psp))
> > +	psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp-
> > >caps, NULL);
> > +	if (IS_ERR(psd)) {
> >  		mlx5_core_err(priv->mdev, "PSP failed to register
> > due to %pe\n",
> > -			      priv->psp->psp);
> > +			      psd);
> > +		return;
> > +	}
> > +	psp->psp = psd;
> >  }
> 
> When psp_dev_create() fails, this function now returns without
> setting
> psp->psp, leaving it as NULL. However, priv->psp remains allocated
> and
> non-NULL.
> 
> Does this leave the RX datapath vulnerable to a NULL pointer
> dereference?
> 
> If priv->psp is non-NULL, the NIC RX initialization path can still
> call
> mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow
> steering
> rules to intercept UDP traffic.
> 
> If a UDP packet triggers these rules, the hardware flags the CQE with
> MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes
> mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer
> unconditionally:
> 
> u16 dev_id = priv->psp->psp->id;
> 
> Since priv->psp->psp is NULL, this will cause a kernel panic. Should
> priv->psp be cleaned up, or the error propagated, to prevent flow
> rules
> from being installed when registration fails?

First, this is preexisting. But more importantly, it's impossible to
trigger:
- with no PSP devs, there can be no PSP SAs installed.
- with no SAs, PSP decryption cannot succeed.
- all unsuccessfully decrypted PSP packets are dropped by steering.
- the RX handler will not see any PSP packets with the marker set.

This patch fixes the comparatively way more likely scenario of
psp_dev_register failing and then mlx5e_psp_unregister passing the
error pointer to psp_dev_unregister, which will do unpleasant things
with it.

Cosmin.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-20 10:30     ` Cosmin Ratiu
@ 2026-04-20 17:09       ` Jakub Kicinski
  2026-04-21 12:29         ` Cosmin Ratiu
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-20 17:09 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Tariq Toukan, Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu,
	pabeni@redhat.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Raed Salem, Dragos Tatulea,
	kees@kernel.org, Mark Bloch, edumazet@google.com, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman

On Mon, 20 Apr 2026 10:30:46 +0000 Cosmin Ratiu wrote:
> > When psp_dev_create() fails, this function now returns without
> > setting
> > psp->psp, leaving it as NULL. However, priv->psp remains allocated
> > and
> > non-NULL.
> > 
> > Does this leave the RX datapath vulnerable to a NULL pointer
> > dereference?
> > 
> > If priv->psp is non-NULL, the NIC RX initialization path can still
> > call
> > mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow
> > steering
> > rules to intercept UDP traffic.
> > 
> > If a UDP packet triggers these rules, the hardware flags the CQE with
> > MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes
> > mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer
> > unconditionally:
> > 
> > u16 dev_id = priv->psp->psp->id;
> > 
> > Since priv->psp->psp is NULL, this will cause a kernel panic. Should
> > priv->psp be cleaned up, or the error propagated, to prevent flow
> > rules
> > from being installed when registration fails?  
> 
> First, this is preexisting. But more importantly, it's impossible to
> trigger:
> - with no PSP devs, there can be no PSP SAs installed.
> - with no SAs, PSP decryption cannot succeed.
> - all unsuccessfully decrypted PSP packets are dropped by steering.
> - the RX handler will not see any PSP packets with the marker set.
> 
> This patch fixes the comparatively way more likely scenario of
> psp_dev_register failing and then mlx5e_psp_unregister passing the
> error pointer to psp_dev_unregister, which will do unpleasant things
> with it.

Sure but why are you leaving the priv->psp struct in place and whatever
FS init has been done? IOW if you really want PSP init to not block
probe why is mlx5e_psp_register() a void function rather than
mlx5e_psp_init() ? Ignoring errors from psp_dev_create()
makes no sense to me - what are you protecting from? kmalloc(GFP_KERNEL)
failing?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-20 17:09       ` Jakub Kicinski
@ 2026-04-21 12:29         ` Cosmin Ratiu
  2026-04-21 14:26           ` Jakub Kicinski
  0 siblings, 1 reply; 11+ messages in thread
From: Cosmin Ratiu @ 2026-04-21 12:29 UTC (permalink / raw)
  To: kuba@kernel.org
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	pabeni@redhat.com, Raed Salem, Dragos Tatulea, kees@kernel.org,
	Mark Bloch, edumazet@google.com, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman

On Mon, 2026-04-20 at 10:09 -0700, Jakub Kicinski wrote:
> On Mon, 20 Apr 2026 10:30:46 +0000 Cosmin Ratiu wrote:
> > > When psp_dev_create() fails, this function now returns without
> > > setting
> > > psp->psp, leaving it as NULL. However, priv->psp remains
> > > allocated
> > > and
> > > non-NULL.
> > > 
> > > Does this leave the RX datapath vulnerable to a NULL pointer
> > > dereference?
> > > 
> > > If priv->psp is non-NULL, the NIC RX initialization path can
> > > still
> > > call
> > > mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow
> > > steering
> > > rules to intercept UDP traffic.
> > > 
> > > If a UDP packet triggers these rules, the hardware flags the CQE
> > > with
> > > MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and
> > > invokes
> > > mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer
> > > unconditionally:
> > > 
> > > u16 dev_id = priv->psp->psp->id;
> > > 
> > > Since priv->psp->psp is NULL, this will cause a kernel panic.
> > > Should
> > > priv->psp be cleaned up, or the error propagated, to prevent flow
> > > rules
> > > from being installed when registration fails?  
> > 
> > First, this is preexisting. But more importantly, it's impossible
> > to
> > trigger:
> > - with no PSP devs, there can be no PSP SAs installed.
> > - with no SAs, PSP decryption cannot succeed.
> > - all unsuccessfully decrypted PSP packets are dropped by steering.
> > - the RX handler will not see any PSP packets with the marker set.
> > 
> > This patch fixes the comparatively way more likely scenario of
> > psp_dev_register failing and then mlx5e_psp_unregister passing the
> > error pointer to psp_dev_unregister, which will do unpleasant
> > things
> > with it.
> 
> Sure but why are you leaving the priv->psp struct in place and
> whatever
> FS init has been done? IOW if you really want PSP init to not block
> probe why is mlx5e_psp_register() a void function rather than
> mlx5e_psp_init() ? Ignoring errors from psp_dev_create()
> makes no sense to me - what are you protecting from?
> kmalloc(GFP_KERNEL)
> failing?

priv->psp and steering at the time of mlx5e_psp_register() is inert
without the PSP device. Cleaning it on psp_dev_create() failure would
be weird, it's cleaned up anyway on netdev teardown. The fact that only
memory allocations can fail inside psp_dev_create() is irrelevant here.
psp_dev_create() failing shouldn't bring down the whole netdevice, so
logging a message and continuing is ok (which is what is also done for
macsec and ktls).

mlx5e_psp_register() is void because it's called from
mlx5e_nic_enable() which can't fail, so it really can't do much other
than complain to dmesg.

But while thinking about this, I suppose we could change the entire PSP
initialization to happen at the time of the current
mlx5e_psp_register(), and that would simplify the number of states.
I will do that in the next planned PSP series for net-next.

Meanwhile, could you please take the 2nd patch and leave this one out?
It should apply with no conflicts by itself.

Or you would like to see a separate submission with the 2nd patch
alone?

Cosmin.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-21 12:29         ` Cosmin Ratiu
@ 2026-04-21 14:26           ` Jakub Kicinski
  2026-04-21 14:33             ` Cosmin Ratiu
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-21 14:26 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	pabeni@redhat.com, Raed Salem, Dragos Tatulea, kees@kernel.org,
	Mark Bloch, edumazet@google.com, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman

On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote:
> > Sure but why are you leaving the priv->psp struct in place and
> > whatever
> > FS init has been done? IOW if you really want PSP init to not block
> > probe why is mlx5e_psp_register() a void function rather than
> > mlx5e_psp_init() ? Ignoring errors from psp_dev_create()
> > makes no sense to me - what are you protecting from?
> > kmalloc(GFP_KERNEL)
> > failing?  
> 
> priv->psp and steering at the time of mlx5e_psp_register() is inert
> without the PSP device. Cleaning it on psp_dev_create() failure would
> be weird, it's cleaned up anyway on netdev teardown. The fact that only
> memory allocations can fail inside psp_dev_create() is irrelevant here.
> psp_dev_create() failing shouldn't bring down the whole netdevice, so
> logging a message and continuing is ok (which is what is also done for
> macsec and ktls).

This is a misguided cargo cult. Or something motivated by OOT
compatibility. Alex D sometimes tries to do the same thing with Meta
drivers. I don't get it. Of course we want the device to be operational
if some *device* init fails. The compatibility matrix with all device
generations and fw versions could justify that. But continuing init
when a single-page kmalloc failed is pure silliness.

> mlx5e_psp_register() is void because it's called from
> mlx5e_nic_enable() which can't fail, so it really can't do much other
> than complain to dmesg.
> 
> But while thinking about this, I suppose we could change the entire PSP
> initialization to happen at the time of the current
> mlx5e_psp_register(), and that would simplify the number of states.
> I will do that in the next planned PSP series for net-next.
> 
> Meanwhile, could you please take the 2nd patch and leave this one out?
> It should apply with no conflicts by itself.
> 
> Or you would like to see a separate submission with the 2nd patch
> alone?

Please resubmit.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-21 14:26           ` Jakub Kicinski
@ 2026-04-21 14:33             ` Cosmin Ratiu
  2026-04-21 15:09               ` Jakub Kicinski
  0 siblings, 1 reply; 11+ messages in thread
From: Cosmin Ratiu @ 2026-04-21 14:33 UTC (permalink / raw)
  To: kuba@kernel.org
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem,
	Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com,
	Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org,
	Gal Pressman

On Tue, 2026-04-21 at 07:26 -0700, Jakub Kicinski wrote:
> On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote:
> > > Sure but why are you leaving the priv->psp struct in place and
> > > whatever
> > > FS init has been done? IOW if you really want PSP init to not
> > > block
> > > probe why is mlx5e_psp_register() a void function rather than
> > > mlx5e_psp_init() ? Ignoring errors from psp_dev_create()
> > > makes no sense to me - what are you protecting from?
> > > kmalloc(GFP_KERNEL)
> > > failing?  
> > 
> > priv->psp and steering at the time of mlx5e_psp_register() is inert
> > without the PSP device. Cleaning it on psp_dev_create() failure
> > would
> > be weird, it's cleaned up anyway on netdev teardown. The fact that
> > only
> > memory allocations can fail inside psp_dev_create() is irrelevant
> > here.
> > psp_dev_create() failing shouldn't bring down the whole netdevice,
> > so
> > logging a message and continuing is ok (which is what is also done
> > for
> > macsec and ktls).
> 
> This is a misguided cargo cult. Or something motivated by OOT
> compatibility. Alex D sometimes tries to do the same thing with Meta
> drivers. I don't get it. Of course we want the device to be
> operational
> if some *device* init fails. The compatibility matrix with all device
> generations and fw versions could justify that. But continuing init
> when a single-page kmalloc failed is pure silliness.

I am not sure about the wider context, but from the POV of the driver,
it's calling $thing from the kernel which can fail and it needs to do
something about it, either fail the entire netdev bringup or accept
that $thing won't be functional and continue without it. The driver
shouldn't need to know what $thing does inside and how it can fail,
which can change over time. Today it's a kmalloc(), tomorrow it may be
something else. It doesn't and shouldn't matter for the local decision
to continue or not without $thing working.

Isn't this reasonable?

> 
> > mlx5e_psp_register() is void because it's called from
> > mlx5e_nic_enable() which can't fail, so it really can't do much
> > other
> > than complain to dmesg.
> > 
> > But while thinking about this, I suppose we could change the entire
> > PSP
> > initialization to happen at the time of the current
> > mlx5e_psp_register(), and that would simplify the number of states.
> > I will do that in the next planned PSP series for net-next.
> > 
> > Meanwhile, could you please take the 2nd patch and leave this one
> > out?
> > It should apply with no conflicts by itself.
> > 
> > Or you would like to see a separate submission with the 2nd patch
> > alone?
> 
> Please resubmit.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-21 14:33             ` Cosmin Ratiu
@ 2026-04-21 15:09               ` Jakub Kicinski
  2026-04-21 17:34                 ` Cosmin Ratiu
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-21 15:09 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem,
	Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com,
	Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org,
	Gal Pressman

On Tue, 21 Apr 2026 14:33:51 +0000 Cosmin Ratiu wrote:
> > > priv->psp and steering at the time of mlx5e_psp_register() is inert
> > > without the PSP device. Cleaning it on psp_dev_create() failure
> > > would
> > > be weird, it's cleaned up anyway on netdev teardown. The fact that
> > > only
> > > memory allocations can fail inside psp_dev_create() is irrelevant
> > > here.
> > > psp_dev_create() failing shouldn't bring down the whole netdevice,
> > > so
> > > logging a message and continuing is ok (which is what is also done
> > > for
> > > macsec and ktls).  
> > 
> > This is a misguided cargo cult. Or something motivated by OOT
> > compatibility. Alex D sometimes tries to do the same thing with Meta
> > drivers. I don't get it. Of course we want the device to be
> > operational
> > if some *device* init fails. The compatibility matrix with all device
> > generations and fw versions could justify that. But continuing init
> > when a single-page kmalloc failed is pure silliness.  
> 
> I am not sure about the wider context, but from the POV of the driver,
> it's calling $thing from the kernel which can fail and it needs to do
> something about it, either fail the entire netdev bringup or accept
> that $thing won't be functional and continue without it. The driver
> shouldn't need to know what $thing does inside and how it can fail,
> which can change over time. Today it's a kmalloc(), tomorrow it may be
> something else.

Like what?

> It doesn't and shouldn't matter for the local decision
> to continue or not without $thing working.
> 
> Isn't this reasonable?

No, the normal thing to do is to propagate errors.
If you want to diverge from that _you_ should have a reason,
a better reason than a vague "kernel can fail".
I'd prefer for the driver to fail in an obvious way.
Which will be immediately spotted by the operator, not 2 weeks
later when 10% of the fleet is upgraded already.
The only exception I'd make is to keep devlink registered in
case the fix is to flash a different FW.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
  2026-04-21 15:09               ` Jakub Kicinski
@ 2026-04-21 17:34                 ` Cosmin Ratiu
  0 siblings, 0 replies; 11+ messages in thread
From: Cosmin Ratiu @ 2026-04-21 17:34 UTC (permalink / raw)
  To: kuba@kernel.org
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem,
	Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com,
	Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org,
	Gal Pressman

On Tue, 2026-04-21 at 08:09 -0700, Jakub Kicinski wrote:
> On Tue, 21 Apr 2026 14:33:51 +0000 Cosmin Ratiu wrote:
> > > > priv->psp and steering at the time of mlx5e_psp_register() is
> > > > inert
> > > > without the PSP device. Cleaning it on psp_dev_create() failure
> > > > would
> > > > be weird, it's cleaned up anyway on netdev teardown. The fact
> > > > that
> > > > only
> > > > memory allocations can fail inside psp_dev_create() is
> > > > irrelevant
> > > > here.
> > > > psp_dev_create() failing shouldn't bring down the whole
> > > > netdevice,
> > > > so
> > > > logging a message and continuing is ok (which is what is also
> > > > done
> > > > for
> > > > macsec and ktls).  
> > > 
> > > This is a misguided cargo cult. Or something motivated by OOT
> > > compatibility. Alex D sometimes tries to do the same thing with
> > > Meta
> > > drivers. I don't get it. Of course we want the device to be
> > > operational
> > > if some *device* init fails. The compatibility matrix with all
> > > device
> > > generations and fw versions could justify that. But continuing
> > > init
> > > when a single-page kmalloc failed is pure silliness.  
> > 
> > I am not sure about the wider context, but from the POV of the
> > driver,
> > it's calling $thing from the kernel which can fail and it needs to
> > do
> > something about it, either fail the entire netdev bringup or accept
> > that $thing won't be functional and continue without it. The driver
> > shouldn't need to know what $thing does inside and how it can fail,
> > which can change over time. Today it's a kmalloc(), tomorrow it may
> > be
> > something else.
> 
> Like what?

The inner workings of $thing aren't and shouldn't be relevant, no?
Maybe tomorrow the kernel will lazy-init some TCP shenanigans for the
first PSP device being initialized or whatever, or maybe some other
moving parts inside can fail. It's an abstraction, why make it
unnecessarily leaky for the purpose of writing driver code?

> 
> > It doesn't and shouldn't matter for the local decision
> > to continue or not without $thing working.
> > 
> > Isn't this reasonable?
> 
> No, the normal thing to do is to propagate errors.
> If you want to diverge from that _you_ should have a reason,
> a better reason than a vague "kernel can fail".
> I'd prefer for the driver to fail in an obvious way.
> Which will be immediately spotted by the operator, not 2 weeks
> later when 10% of the fleet is upgraded already.
> The only exception I'd make is to keep devlink registered in
> case the fix is to flash a different FW.

In this case, PSP not working would be spotted on the next PSP dev-get
op which produces zilch instead of working devices.

But I understand what you want. You'd like the netdevice to either be
fully initialized with all supported+configured protocols or fail the
open operation. No intermediate/partial states. This is a non-trivial
refactor for mlx5, because mlx5_nic_enable() returns nothing.
Refactoring seems possible though, its only caller is
mlx5e_attach_netdev(), which returns errors. It's certainly not
something that should be done for a net fix though.

I have a series pending for net-next where the PSP configuration is
hooked to mlx5e_psp_set_config(). I will look into implementing what
you propose there and propagate errors.

Meanwhile, do you want to take these fixes (1 and 2) or maybe just 2
for net or not?

Cosmin.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-21 17:34 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-17  5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan
2026-04-17  5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan
2026-04-18 19:08   ` Jakub Kicinski
2026-04-20 10:30     ` Cosmin Ratiu
2026-04-20 17:09       ` Jakub Kicinski
2026-04-21 12:29         ` Cosmin Ratiu
2026-04-21 14:26           ` Jakub Kicinski
2026-04-21 14:33             ` Cosmin Ratiu
2026-04-21 15:09               ` Jakub Kicinski
2026-04-21 17:34                 ` Cosmin Ratiu
2026-04-17  5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox