* [PATCH net 0/2] mlx5e PSP fixes @ 2026-04-17 5:01 Tariq Toukan 2026-04-17 5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan 2026-04-17 5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan 0 siblings, 2 replies; 12+ messages in thread From: Tariq Toukan @ 2026-04-17 5:01 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu, Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev, linux-rdma, linux-kernel, Gal Pressman Hi, This patchset provides bug fixes from Cosmin to the mlx5e PSP feature. Thanks, Tariq. Cosmin Ratiu (2): net/mlx5e: psp: Fix invalid access on PSP dev registration fail net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable .../mellanox/mlx5/core/en_accel/psp.c | 36 ++++++++++--------- .../net/ethernet/mellanox/mlx5/core/en_main.c | 4 +-- 2 files changed, 22 insertions(+), 18 deletions(-) base-commit: 82c21069028c5db3463f851ae8ac9cc2e38a3827 -- 2.44.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-17 5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan @ 2026-04-17 5:02 ` Tariq Toukan 2026-04-18 19:08 ` Jakub Kicinski 2026-04-17 5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan 1 sibling, 1 reply; 12+ messages in thread From: Tariq Toukan @ 2026-04-17 5:02 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu, Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev, linux-rdma, linux-kernel, Gal Pressman From: Cosmin Ratiu <cratiu@nvidia.com> priv->psp->psp is initialized with the PSP device as returned by psp_dev_create(). This could also return an error, in which case a future psp_dev_unregister() will result in unpleasantness. Avoid that by using a local variable and only saving the PSP device when registration succeeds. Also apply some light refactoring of the functions managing the PSP device in order to make them more readable/safe. Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- .../mellanox/mlx5/core/en_accel/psp.c | 36 ++++++++++--------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c index 6a50b6dec0fa..d9adb993e64d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c @@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = { void mlx5e_psp_unregister(struct mlx5e_priv *priv) { - if (!priv->psp || !priv->psp->psp) + struct mlx5e_psp *psp = priv->psp; + + if (!psp || !psp->psp) return; - psp_dev_unregister(priv->psp->psp); + psp_dev_unregister(psp->psp); + psp->psp = NULL; } void mlx5e_psp_register(struct mlx5e_priv *priv) { + struct mlx5e_psp *psp = priv->psp; + struct psp_dev *psd; + /* FW Caps missing */ if (!priv->psp) return; - priv->psp->caps.assoc_drv_spc = sizeof(u32); - priv->psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128; + psp->caps.assoc_drv_spc = sizeof(u32); + psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128; if (MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_encrypt) && MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_decrypt)) - priv->psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256; + psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256; - priv->psp->psp = psp_dev_create(priv->netdev, &mlx5_psp_ops, - &priv->psp->caps, NULL); - if (IS_ERR(priv->psp->psp)) + psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp->caps, NULL); + if (IS_ERR(psd)) { mlx5_core_err(priv->mdev, "PSP failed to register due to %pe\n", - priv->psp->psp); + psd); + return; + } + psp->psp = psd; } int mlx5e_psp_init(struct mlx5e_priv *priv) @@ -1131,22 +1139,18 @@ int mlx5e_psp_init(struct mlx5e_priv *priv) if (!psp) return -ENOMEM; - priv->psp = psp; fs = mlx5e_accel_psp_fs_init(priv); if (IS_ERR(fs)) { err = PTR_ERR(fs); - goto out_err; + kfree(psp); + return err; } psp->fs = fs; + priv->psp = psp; mlx5_core_dbg(priv->mdev, "PSP attached to netdevice\n"); return 0; - -out_err: - priv->psp = NULL; - kfree(psp); - return err; } void mlx5e_psp_cleanup(struct mlx5e_priv *priv) -- 2.44.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-17 5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan @ 2026-04-18 19:08 ` Jakub Kicinski 2026-04-20 10:30 ` Cosmin Ratiu 0 siblings, 1 reply; 12+ messages in thread From: Jakub Kicinski @ 2026-04-18 19:08 UTC (permalink / raw) To: tariqt Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, borisp, saeedm, leon, mbloch, daniel.zahka, willemdebruijn.kernel, cratiu, raeds, rrameshbabu, dtatulea, kees, netdev, linux-rdma, linux-kernel, gal This is an AI-generated review of your patch. The human sending this email has considered the AI review valid, or at least plausible. --- net/mlx5e: psp: Fix invalid access on PSP dev registration fail This commit updates PSP device registration to use a local variable, only saving the PSP device pointer if psp_dev_create() succeeds. This prevents an error pointer from being stored in priv->psp->psp. > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c [ ... ] > @@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = { [ ... ] > void mlx5e_psp_register(struct mlx5e_priv *priv) > { > + struct mlx5e_psp *psp = priv->psp; > + struct psp_dev *psd; > + [ ... ] > - priv->psp->psp = psp_dev_create(priv->netdev, &mlx5_psp_ops, > - &priv->psp->caps, NULL); > - if (IS_ERR(priv->psp->psp)) > + psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp->caps, NULL); > + if (IS_ERR(psd)) { > mlx5_core_err(priv->mdev, "PSP failed to register due to %pe\n", > - priv->psp->psp); > + psd); > + return; > + } > + psp->psp = psd; > } When psp_dev_create() fails, this function now returns without setting psp->psp, leaving it as NULL. However, priv->psp remains allocated and non-NULL. Does this leave the RX datapath vulnerable to a NULL pointer dereference? If priv->psp is non-NULL, the NIC RX initialization path can still call mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow steering rules to intercept UDP traffic. If a UDP packet triggers these rules, the hardware flags the CQE with MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer unconditionally: u16 dev_id = priv->psp->psp->id; Since priv->psp->psp is NULL, this will cause a kernel panic. Should priv->psp be cleaned up, or the error propagated, to prevent flow rules from being installed when registration fails? -- pw-bot: cr ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-18 19:08 ` Jakub Kicinski @ 2026-04-20 10:30 ` Cosmin Ratiu 2026-04-20 17:09 ` Jakub Kicinski 0 siblings, 1 reply; 12+ messages in thread From: Cosmin Ratiu @ 2026-04-20 10:30 UTC (permalink / raw) To: Tariq Toukan, kuba@kernel.org Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu, pabeni@redhat.com, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, edumazet@google.com, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Sat, 2026-04-18 at 12:08 -0700, Jakub Kicinski wrote: > This is an AI-generated review of your patch. The human sending this > email has considered the AI review valid, or at least plausible. > --- > net/mlx5e: psp: Fix invalid access on PSP dev registration fail > > This commit updates PSP device registration to use a local variable, > only saving the PSP device pointer if psp_dev_create() succeeds. This > prevents an error pointer from being stored in priv->psp->psp. > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c > [ ... ] > > @@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = { > [ ... ] > > void mlx5e_psp_register(struct mlx5e_priv *priv) > > { > > + struct mlx5e_psp *psp = priv->psp; > > + struct psp_dev *psd; > > + > [ ... ] > > - priv->psp->psp = psp_dev_create(priv->netdev, > > &mlx5_psp_ops, > > - &priv->psp->caps, NULL); > > - if (IS_ERR(priv->psp->psp)) > > + psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp- > > >caps, NULL); > > + if (IS_ERR(psd)) { > > mlx5_core_err(priv->mdev, "PSP failed to register > > due to %pe\n", > > - priv->psp->psp); > > + psd); > > + return; > > + } > > + psp->psp = psd; > > } > > When psp_dev_create() fails, this function now returns without > setting > psp->psp, leaving it as NULL. However, priv->psp remains allocated > and > non-NULL. > > Does this leave the RX datapath vulnerable to a NULL pointer > dereference? > > If priv->psp is non-NULL, the NIC RX initialization path can still > call > mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow > steering > rules to intercept UDP traffic. > > If a UDP packet triggers these rules, the hardware flags the CQE with > MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes > mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer > unconditionally: > > u16 dev_id = priv->psp->psp->id; > > Since priv->psp->psp is NULL, this will cause a kernel panic. Should > priv->psp be cleaned up, or the error propagated, to prevent flow > rules > from being installed when registration fails? First, this is preexisting. But more importantly, it's impossible to trigger: - with no PSP devs, there can be no PSP SAs installed. - with no SAs, PSP decryption cannot succeed. - all unsuccessfully decrypted PSP packets are dropped by steering. - the RX handler will not see any PSP packets with the marker set. This patch fixes the comparatively way more likely scenario of psp_dev_register failing and then mlx5e_psp_unregister passing the error pointer to psp_dev_unregister, which will do unpleasant things with it. Cosmin. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-20 10:30 ` Cosmin Ratiu @ 2026-04-20 17:09 ` Jakub Kicinski 2026-04-21 12:29 ` Cosmin Ratiu 0 siblings, 1 reply; 12+ messages in thread From: Jakub Kicinski @ 2026-04-20 17:09 UTC (permalink / raw) To: Cosmin Ratiu Cc: Tariq Toukan, Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu, pabeni@redhat.com, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, edumazet@google.com, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Mon, 20 Apr 2026 10:30:46 +0000 Cosmin Ratiu wrote: > > When psp_dev_create() fails, this function now returns without > > setting > > psp->psp, leaving it as NULL. However, priv->psp remains allocated > > and > > non-NULL. > > > > Does this leave the RX datapath vulnerable to a NULL pointer > > dereference? > > > > If priv->psp is non-NULL, the NIC RX initialization path can still > > call > > mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow > > steering > > rules to intercept UDP traffic. > > > > If a UDP packet triggers these rules, the hardware flags the CQE with > > MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and invokes > > mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer > > unconditionally: > > > > u16 dev_id = priv->psp->psp->id; > > > > Since priv->psp->psp is NULL, this will cause a kernel panic. Should > > priv->psp be cleaned up, or the error propagated, to prevent flow > > rules > > from being installed when registration fails? > > First, this is preexisting. But more importantly, it's impossible to > trigger: > - with no PSP devs, there can be no PSP SAs installed. > - with no SAs, PSP decryption cannot succeed. > - all unsuccessfully decrypted PSP packets are dropped by steering. > - the RX handler will not see any PSP packets with the marker set. > > This patch fixes the comparatively way more likely scenario of > psp_dev_register failing and then mlx5e_psp_unregister passing the > error pointer to psp_dev_unregister, which will do unpleasant things > with it. Sure but why are you leaving the priv->psp struct in place and whatever FS init has been done? IOW if you really want PSP init to not block probe why is mlx5e_psp_register() a void function rather than mlx5e_psp_init() ? Ignoring errors from psp_dev_create() makes no sense to me - what are you protecting from? kmalloc(GFP_KERNEL) failing? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-20 17:09 ` Jakub Kicinski @ 2026-04-21 12:29 ` Cosmin Ratiu 2026-04-21 14:26 ` Jakub Kicinski 0 siblings, 1 reply; 12+ messages in thread From: Cosmin Ratiu @ 2026-04-21 12:29 UTC (permalink / raw) To: kuba@kernel.org Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, pabeni@redhat.com, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, edumazet@google.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Mon, 2026-04-20 at 10:09 -0700, Jakub Kicinski wrote: > On Mon, 20 Apr 2026 10:30:46 +0000 Cosmin Ratiu wrote: > > > When psp_dev_create() fails, this function now returns without > > > setting > > > psp->psp, leaving it as NULL. However, priv->psp remains > > > allocated > > > and > > > non-NULL. > > > > > > Does this leave the RX datapath vulnerable to a NULL pointer > > > dereference? > > > > > > If priv->psp is non-NULL, the NIC RX initialization path can > > > still > > > call > > > mlx5_accel_psp_fs_init_rx_tables(), which creates hardware flow > > > steering > > > rules to intercept UDP traffic. > > > > > > If a UDP packet triggers these rules, the hardware flags the CQE > > > with > > > MLX5E_PSP_MARKER_BIT. The RX fast-path sees the marker and > > > invokes > > > mlx5e_psp_offload_handle_rx_skb(), which dereferences the pointer > > > unconditionally: > > > > > > u16 dev_id = priv->psp->psp->id; > > > > > > Since priv->psp->psp is NULL, this will cause a kernel panic. > > > Should > > > priv->psp be cleaned up, or the error propagated, to prevent flow > > > rules > > > from being installed when registration fails? > > > > First, this is preexisting. But more importantly, it's impossible > > to > > trigger: > > - with no PSP devs, there can be no PSP SAs installed. > > - with no SAs, PSP decryption cannot succeed. > > - all unsuccessfully decrypted PSP packets are dropped by steering. > > - the RX handler will not see any PSP packets with the marker set. > > > > This patch fixes the comparatively way more likely scenario of > > psp_dev_register failing and then mlx5e_psp_unregister passing the > > error pointer to psp_dev_unregister, which will do unpleasant > > things > > with it. > > Sure but why are you leaving the priv->psp struct in place and > whatever > FS init has been done? IOW if you really want PSP init to not block > probe why is mlx5e_psp_register() a void function rather than > mlx5e_psp_init() ? Ignoring errors from psp_dev_create() > makes no sense to me - what are you protecting from? > kmalloc(GFP_KERNEL) > failing? priv->psp and steering at the time of mlx5e_psp_register() is inert without the PSP device. Cleaning it on psp_dev_create() failure would be weird, it's cleaned up anyway on netdev teardown. The fact that only memory allocations can fail inside psp_dev_create() is irrelevant here. psp_dev_create() failing shouldn't bring down the whole netdevice, so logging a message and continuing is ok (which is what is also done for macsec and ktls). mlx5e_psp_register() is void because it's called from mlx5e_nic_enable() which can't fail, so it really can't do much other than complain to dmesg. But while thinking about this, I suppose we could change the entire PSP initialization to happen at the time of the current mlx5e_psp_register(), and that would simplify the number of states. I will do that in the next planned PSP series for net-next. Meanwhile, could you please take the 2nd patch and leave this one out? It should apply with no conflicts by itself. Or you would like to see a separate submission with the 2nd patch alone? Cosmin. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-21 12:29 ` Cosmin Ratiu @ 2026-04-21 14:26 ` Jakub Kicinski 2026-04-21 14:33 ` Cosmin Ratiu 0 siblings, 1 reply; 12+ messages in thread From: Jakub Kicinski @ 2026-04-21 14:26 UTC (permalink / raw) To: Cosmin Ratiu Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, pabeni@redhat.com, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, edumazet@google.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote: > > Sure but why are you leaving the priv->psp struct in place and > > whatever > > FS init has been done? IOW if you really want PSP init to not block > > probe why is mlx5e_psp_register() a void function rather than > > mlx5e_psp_init() ? Ignoring errors from psp_dev_create() > > makes no sense to me - what are you protecting from? > > kmalloc(GFP_KERNEL) > > failing? > > priv->psp and steering at the time of mlx5e_psp_register() is inert > without the PSP device. Cleaning it on psp_dev_create() failure would > be weird, it's cleaned up anyway on netdev teardown. The fact that only > memory allocations can fail inside psp_dev_create() is irrelevant here. > psp_dev_create() failing shouldn't bring down the whole netdevice, so > logging a message and continuing is ok (which is what is also done for > macsec and ktls). This is a misguided cargo cult. Or something motivated by OOT compatibility. Alex D sometimes tries to do the same thing with Meta drivers. I don't get it. Of course we want the device to be operational if some *device* init fails. The compatibility matrix with all device generations and fw versions could justify that. But continuing init when a single-page kmalloc failed is pure silliness. > mlx5e_psp_register() is void because it's called from > mlx5e_nic_enable() which can't fail, so it really can't do much other > than complain to dmesg. > > But while thinking about this, I suppose we could change the entire PSP > initialization to happen at the time of the current > mlx5e_psp_register(), and that would simplify the number of states. > I will do that in the next planned PSP series for net-next. > > Meanwhile, could you please take the 2nd patch and leave this one out? > It should apply with no conflicts by itself. > > Or you would like to see a separate submission with the 2nd patch > alone? Please resubmit. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-21 14:26 ` Jakub Kicinski @ 2026-04-21 14:33 ` Cosmin Ratiu 2026-04-21 15:09 ` Jakub Kicinski 0 siblings, 1 reply; 12+ messages in thread From: Cosmin Ratiu @ 2026-04-21 14:33 UTC (permalink / raw) To: kuba@kernel.org Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, linux-kernel@vger.kernel.org, edumazet@google.com, linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Tue, 2026-04-21 at 07:26 -0700, Jakub Kicinski wrote: > On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote: > > > Sure but why are you leaving the priv->psp struct in place and > > > whatever > > > FS init has been done? IOW if you really want PSP init to not > > > block > > > probe why is mlx5e_psp_register() a void function rather than > > > mlx5e_psp_init() ? Ignoring errors from psp_dev_create() > > > makes no sense to me - what are you protecting from? > > > kmalloc(GFP_KERNEL) > > > failing? > > > > priv->psp and steering at the time of mlx5e_psp_register() is inert > > without the PSP device. Cleaning it on psp_dev_create() failure > > would > > be weird, it's cleaned up anyway on netdev teardown. The fact that > > only > > memory allocations can fail inside psp_dev_create() is irrelevant > > here. > > psp_dev_create() failing shouldn't bring down the whole netdevice, > > so > > logging a message and continuing is ok (which is what is also done > > for > > macsec and ktls). > > This is a misguided cargo cult. Or something motivated by OOT > compatibility. Alex D sometimes tries to do the same thing with Meta > drivers. I don't get it. Of course we want the device to be > operational > if some *device* init fails. The compatibility matrix with all device > generations and fw versions could justify that. But continuing init > when a single-page kmalloc failed is pure silliness. I am not sure about the wider context, but from the POV of the driver, it's calling $thing from the kernel which can fail and it needs to do something about it, either fail the entire netdev bringup or accept that $thing won't be functional and continue without it. The driver shouldn't need to know what $thing does inside and how it can fail, which can change over time. Today it's a kmalloc(), tomorrow it may be something else. It doesn't and shouldn't matter for the local decision to continue or not without $thing working. Isn't this reasonable? > > > mlx5e_psp_register() is void because it's called from > > mlx5e_nic_enable() which can't fail, so it really can't do much > > other > > than complain to dmesg. > > > > But while thinking about this, I suppose we could change the entire > > PSP > > initialization to happen at the time of the current > > mlx5e_psp_register(), and that would simplify the number of states. > > I will do that in the next planned PSP series for net-next. > > > > Meanwhile, could you please take the 2nd patch and leave this one > > out? > > It should apply with no conflicts by itself. > > > > Or you would like to see a separate submission with the 2nd patch > > alone? > > Please resubmit. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-21 14:33 ` Cosmin Ratiu @ 2026-04-21 15:09 ` Jakub Kicinski 2026-04-21 17:34 ` Cosmin Ratiu 0 siblings, 1 reply; 12+ messages in thread From: Jakub Kicinski @ 2026-04-21 15:09 UTC (permalink / raw) To: Cosmin Ratiu Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, linux-kernel@vger.kernel.org, edumazet@google.com, linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Tue, 21 Apr 2026 14:33:51 +0000 Cosmin Ratiu wrote: > > > priv->psp and steering at the time of mlx5e_psp_register() is inert > > > without the PSP device. Cleaning it on psp_dev_create() failure > > > would > > > be weird, it's cleaned up anyway on netdev teardown. The fact that > > > only > > > memory allocations can fail inside psp_dev_create() is irrelevant > > > here. > > > psp_dev_create() failing shouldn't bring down the whole netdevice, > > > so > > > logging a message and continuing is ok (which is what is also done > > > for > > > macsec and ktls). > > > > This is a misguided cargo cult. Or something motivated by OOT > > compatibility. Alex D sometimes tries to do the same thing with Meta > > drivers. I don't get it. Of course we want the device to be > > operational > > if some *device* init fails. The compatibility matrix with all device > > generations and fw versions could justify that. But continuing init > > when a single-page kmalloc failed is pure silliness. > > I am not sure about the wider context, but from the POV of the driver, > it's calling $thing from the kernel which can fail and it needs to do > something about it, either fail the entire netdev bringup or accept > that $thing won't be functional and continue without it. The driver > shouldn't need to know what $thing does inside and how it can fail, > which can change over time. Today it's a kmalloc(), tomorrow it may be > something else. Like what? > It doesn't and shouldn't matter for the local decision > to continue or not without $thing working. > > Isn't this reasonable? No, the normal thing to do is to propagate errors. If you want to diverge from that _you_ should have a reason, a better reason than a vague "kernel can fail". I'd prefer for the driver to fail in an obvious way. Which will be immediately spotted by the operator, not 2 weeks later when 10% of the fleet is upgraded already. The only exception I'd make is to keep devlink registered in case the fix is to flash a different FW. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-21 15:09 ` Jakub Kicinski @ 2026-04-21 17:34 ` Cosmin Ratiu 2026-04-21 18:32 ` Jakub Kicinski 0 siblings, 1 reply; 12+ messages in thread From: Cosmin Ratiu @ 2026-04-21 17:34 UTC (permalink / raw) To: kuba@kernel.org Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, linux-kernel@vger.kernel.org, edumazet@google.com, linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Tue, 2026-04-21 at 08:09 -0700, Jakub Kicinski wrote: > On Tue, 21 Apr 2026 14:33:51 +0000 Cosmin Ratiu wrote: > > > > priv->psp and steering at the time of mlx5e_psp_register() is > > > > inert > > > > without the PSP device. Cleaning it on psp_dev_create() failure > > > > would > > > > be weird, it's cleaned up anyway on netdev teardown. The fact > > > > that > > > > only > > > > memory allocations can fail inside psp_dev_create() is > > > > irrelevant > > > > here. > > > > psp_dev_create() failing shouldn't bring down the whole > > > > netdevice, > > > > so > > > > logging a message and continuing is ok (which is what is also > > > > done > > > > for > > > > macsec and ktls). > > > > > > This is a misguided cargo cult. Or something motivated by OOT > > > compatibility. Alex D sometimes tries to do the same thing with > > > Meta > > > drivers. I don't get it. Of course we want the device to be > > > operational > > > if some *device* init fails. The compatibility matrix with all > > > device > > > generations and fw versions could justify that. But continuing > > > init > > > when a single-page kmalloc failed is pure silliness. > > > > I am not sure about the wider context, but from the POV of the > > driver, > > it's calling $thing from the kernel which can fail and it needs to > > do > > something about it, either fail the entire netdev bringup or accept > > that $thing won't be functional and continue without it. The driver > > shouldn't need to know what $thing does inside and how it can fail, > > which can change over time. Today it's a kmalloc(), tomorrow it may > > be > > something else. > > Like what? The inner workings of $thing aren't and shouldn't be relevant, no? Maybe tomorrow the kernel will lazy-init some TCP shenanigans for the first PSP device being initialized or whatever, or maybe some other moving parts inside can fail. It's an abstraction, why make it unnecessarily leaky for the purpose of writing driver code? > > > It doesn't and shouldn't matter for the local decision > > to continue or not without $thing working. > > > > Isn't this reasonable? > > No, the normal thing to do is to propagate errors. > If you want to diverge from that _you_ should have a reason, > a better reason than a vague "kernel can fail". > I'd prefer for the driver to fail in an obvious way. > Which will be immediately spotted by the operator, not 2 weeks > later when 10% of the fleet is upgraded already. > The only exception I'd make is to keep devlink registered in > case the fix is to flash a different FW. In this case, PSP not working would be spotted on the next PSP dev-get op which produces zilch instead of working devices. But I understand what you want. You'd like the netdevice to either be fully initialized with all supported+configured protocols or fail the open operation. No intermediate/partial states. This is a non-trivial refactor for mlx5, because mlx5_nic_enable() returns nothing. Refactoring seems possible though, its only caller is mlx5e_attach_netdev(), which returns errors. It's certainly not something that should be done for a net fix though. I have a series pending for net-next where the PSP configuration is hooked to mlx5e_psp_set_config(). I will look into implementing what you propose there and propagate errors. Meanwhile, do you want to take these fixes (1 and 2) or maybe just 2 for net or not? Cosmin. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail 2026-04-21 17:34 ` Cosmin Ratiu @ 2026-04-21 18:32 ` Jakub Kicinski 0 siblings, 0 replies; 12+ messages in thread From: Jakub Kicinski @ 2026-04-21 18:32 UTC (permalink / raw) To: Cosmin Ratiu Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch, daniel.zahka@gmail.com, davem@davemloft.net, leon@kernel.org, linux-kernel@vger.kernel.org, edumazet@google.com, linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem, Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com, Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org, Gal Pressman On Tue, 21 Apr 2026 17:34:32 +0000 Cosmin Ratiu wrote: > > No, the normal thing to do is to propagate errors. > > If you want to diverge from that _you_ should have a reason, > > a better reason than a vague "kernel can fail". > > I'd prefer for the driver to fail in an obvious way. > > Which will be immediately spotted by the operator, not 2 weeks > > later when 10% of the fleet is upgraded already. > > The only exception I'd make is to keep devlink registered in > > case the fix is to flash a different FW. > > In this case, PSP not working would be spotted on the next PSP dev-get > op which produces zilch instead of working devices. When you have X vendors times Y device generations times Z FW versions in your fleet dev-get returning nothing is not a failure. It just means you're running on a machine that's not capable. Best you can do to spot a buggy kernel is to notice that the fraction of PSP traffic is decreasing over time. After significant portion of the fleet is already on the bad kernel. > But I understand what you want. You'd like the netdevice to either be > fully initialized with all supported+configured protocols or fail the > open operation. No intermediate/partial states. This is a non-trivial > refactor for mlx5, because mlx5_nic_enable() returns nothing. > Refactoring seems possible though, its only caller is > mlx5e_attach_netdev(), which returns errors. It's certainly not > something that should be done for a net fix though. > > I have a series pending for net-next where the PSP configuration is > hooked to mlx5e_psp_set_config(). I will look into implementing what > you propose there and propagate errors. > > Meanwhile, do you want to take these fixes (1 and 2) or maybe just 2 > for net or not? Can you call mlx5e_psp_cleanup() when register fails for now? ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable 2026-04-17 5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan 2026-04-17 5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan @ 2026-04-17 5:02 ` Tariq Toukan 1 sibling, 0 replies; 12+ messages in thread From: Tariq Toukan @ 2026-04-17 5:02 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu, Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev, linux-rdma, linux-kernel, Gal Pressman From: Cosmin Ratiu <cratiu@nvidia.com> devlink reload while PSP connections are active does: mlx5_unload_one_devl_locked() -> mlx5_detach_device() -> _mlx5e_suspend() -> mlx5e_detach_netdev() -> profile->cleanup_rx -> profile->cleanup_tx -> mlx5e_destroy_mdev_resources() -> mlx5_core_dealloc_pd() fails: ... mlx5_core 0000:08:00.0: mlx5_cmd_out_err:821:(pid 19722): DEALLOC_PD(0x801) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0xef0c8a), err(-22) ... The reason for failure is the existence of TX keys, which are removed by the PSP dev unregistration happening in: profile->cleanup() -> mlx5e_psp_unregister() -> mlx5e_psp_cleanup() -> psp_dev_unregister() ...but this isn't invoked in the devlink reload flow, only when changing the NIC profile (e.g. when transitioning to switchdev mode) or on dev teardown. Move PSP device registration into mlx5e_nic_enable(), and unregistration into the corresponding mlx5e_nic_disable(). These functions are called during netdev attach/detach after RX & TX are set up. This ensures that the keys will be gone by the time the PD is destroyed. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality") Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 6c4eeb88588c..c3938a2dbbfe 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -6021,7 +6021,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev, if (take_rtnl) rtnl_lock(); - mlx5e_psp_register(priv); /* update XDP supported features */ mlx5e_set_xdp_feature(priv); @@ -6034,7 +6033,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev, static void mlx5e_nic_cleanup(struct mlx5e_priv *priv) { mlx5e_health_destroy_reporters(priv); - mlx5e_psp_unregister(priv); mlx5e_ktls_cleanup(priv); mlx5e_psp_cleanup(priv); mlx5e_fs_cleanup(priv->fs); @@ -6158,6 +6156,7 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv) mlx5e_fs_init_l2_addr(priv->fs, netdev); mlx5e_ipsec_init(priv); + mlx5e_psp_register(priv); err = mlx5e_macsec_init(priv); if (err) @@ -6228,6 +6227,7 @@ static void mlx5e_nic_disable(struct mlx5e_priv *priv) mlx5_lag_remove_netdev(mdev, priv->netdev); mlx5_vxlan_reset_to_default(mdev->vxlan); mlx5e_macsec_cleanup(priv); + mlx5e_psp_unregister(priv); mlx5e_ipsec_cleanup(priv); } -- 2.44.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-04-21 18:32 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-17 5:01 [PATCH net 0/2] mlx5e PSP fixes Tariq Toukan 2026-04-17 5:02 ` [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail Tariq Toukan 2026-04-18 19:08 ` Jakub Kicinski 2026-04-20 10:30 ` Cosmin Ratiu 2026-04-20 17:09 ` Jakub Kicinski 2026-04-21 12:29 ` Cosmin Ratiu 2026-04-21 14:26 ` Jakub Kicinski 2026-04-21 14:33 ` Cosmin Ratiu 2026-04-21 15:09 ` Jakub Kicinski 2026-04-21 17:34 ` Cosmin Ratiu 2026-04-21 18:32 ` Jakub Kicinski 2026-04-17 5:02 ` [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable Tariq Toukan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox