public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct
@ 2026-04-28  6:01 Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Tariq Toukan @ 2026-04-28  6:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Patrisious Haddad, Kees Cook,
	Parav Pandit, Gal Pressman, netdev, linux-rdma, linux-kernel,
	Dragos Tatulea

Hi,

This series fixes several race conditions and bugs in the mlx5
Socket-Direct (SD) single netdev flow.

Patch 1 serializes mlx5_sd_init()/mlx5_sd_cleanup() with
mlx5_devcom_comp_lock() and tracks the SD group state on the primary
device, preventing concurrent or duplicate bring-up/tear-down.

Patch 2 fixes the debugfs "multi-pf" directory being stored on the
calling device's sd struct instead of the primary's, which caused
memory leaks and recreation errors when cleanup ran from a different PF.

Patch 3 fixes a race where a secondary PF could access the primary's
auxiliary device after it had been unbound, by holding the primary's
device lock while operating on its auxiliary device.

Patch 4 fixes missing cleanup on ETH probe errors. The analogous gap on
the resume path requires introducing sd_suspend/resume APIs that only
destroy FW resources and is left for a follow-up series.

Regards,
Tariq

V4:
- Link to V3:
  https://lore.kernel.org/all/20260423123104.201552-1-tariqt@nvidia.com/
- Adjust "net/mlx5e: SD, Fix missing cleanup on probe/resume error" to
  cleanup SD only on probe; the resume gap is deferred to a follow-up
  series that will introduce sd_suspend/resume APIs.
- Fix concurrent cleanup vs. init race in
  "net/mlx5: SD: Serialize init/cleanup".
- Remove leftover sentence in commit message of
  "net/mlx5: SD: Serialize init/cleanup"

Shay Drory (4):
  net/mlx5: SD: Serialize init/cleanup
  net/mlx5: SD, Keep multi-pf debugfs entries on primary
  net/mlx5e: SD, Fix missing cleanup on probe error
  net/mlx5e: SD, Fix race condition in secondary device probe/remove

 .../net/ethernet/mellanox/mlx5/core/en_main.c | 26 +++++--
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 76 ++++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |  2 +
 3 files changed, 87 insertions(+), 17 deletions(-)


base-commit: 3bc179bc7146c26c9dff75d2943d10528274e301
-- 
2.44.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup
  2026-04-28  6:01 [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct Tariq Toukan
@ 2026-04-28  6:01 ` Tariq Toukan
  2026-04-30  1:42   ` Jakub Kicinski
  2026-04-28  6:01 ` [PATCH net V4 2/4] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Tariq Toukan @ 2026-04-28  6:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Patrisious Haddad, Kees Cook,
	Parav Pandit, Gal Pressman, netdev, linux-rdma, linux-kernel,
	Dragos Tatulea

From: Shay Drory <shayd@nvidia.com>

mlx5_sd_init() / mlx5_sd_cleanup() may run from multiple PFs in the same
Socket-Direct group. This can cause the SD bring-up/tear-down sequence
to be executed more than once or interleaved across PFs.

Protect SD init/cleanup with mlx5_devcom_comp_lock() and track the SD
group state on the primary device. Skip init if the primary is already
UP, and skip cleanup unless the primary is UP.

The state check on cleanup is needed because sd_register() drops the
devcom comp lock between marking the comp ready and assigning
primary_dev on each peer. A concurrent cleanup that acquires the lock
in this window would observe devcom_is_ready==true while primary_dev
is still NULL (causing mlx5_sd_get_primary() to return NULL) or while
the FW alias setup performed by mlx5_sd_init()'s body has not yet run
(causing sd_cmd_unset_primary() to dereference a NULL tx_ft). Gate the
cleanup body on primary_sd->state == MLX5_SD_STATE_UP, which is set
only at the very end of mlx5_sd_init() under the same comp lock - so
observing UP guarantees primary_dev, secondaries[], tx_ft, and dfs are
all populated. Also bail explicitly if mlx5_sd_get_primary() returns
NULL, in case state is checked on a peer whose primary_dev hasn't been
assigned yet.

In addition, move mlx5_devcom_comp_set_ready(false) from sd_unregister()
into the cleanup's locked section. A concurrent init acquiring the
devcom lock will now observe devcom is no longer ready and bail out
immediately.

Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 40 ++++++++++++++++---
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 762c783156b4..d42c283cbb38 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -18,6 +18,7 @@ struct mlx5_sd {
 	u8 host_buses;
 	struct mlx5_devcom_comp_dev *devcom;
 	struct dentry *dfs;
+	u8 state;
 	bool primary;
 	union {
 		struct { /* primary */
@@ -31,6 +32,11 @@ struct mlx5_sd {
 	};
 };
 
+enum mlx5_sd_state {
+	MLX5_SD_STATE_DOWN = 0,
+	MLX5_SD_STATE_UP,
+};
+
 static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
@@ -270,9 +276,6 @@ static void sd_unregister(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
 
-	mlx5_devcom_comp_lock(sd->devcom);
-	mlx5_devcom_comp_set_ready(sd->devcom, false);
-	mlx5_devcom_comp_unlock(sd->devcom);
 	mlx5_devcom_unregister_component(sd->devcom);
 }
 
@@ -426,6 +429,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	struct mlx5_core_dev *primary, *pos, *to;
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
 	u8 alias_key[ACCESS_KEY_LEN];
+	struct mlx5_sd *primary_sd;
 	int err, i;
 
 	err = sd_init(dev);
@@ -440,10 +444,17 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_cleanup;
 
+	mlx5_devcom_comp_lock(sd->devcom);
 	if (!mlx5_devcom_comp_is_ready(sd->devcom))
-		return 0;
+		goto out;
 
 	primary = mlx5_sd_get_primary(dev);
+	if (!primary)
+		goto out;
+
+	primary_sd = mlx5_get_sd(primary);
+	if (primary_sd->state != MLX5_SD_STATE_DOWN)
+		goto out;
 
 	for (i = 0; i < ACCESS_KEY_LEN; i++)
 		alias_key[i] = get_random_u8();
@@ -472,6 +483,9 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 		sd->group_id, mlx5_devcom_comp_get_size(sd->devcom));
 	sd_print_group(primary);
 
+	primary_sd->state = MLX5_SD_STATE_UP;
+out:
+	mlx5_devcom_comp_unlock(sd->devcom);
 	return 0;
 
 err_unset_secondaries:
@@ -481,6 +495,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	sd_cmd_unset_primary(primary);
 	debugfs_remove_recursive(sd->dfs);
 err_sd_unregister:
+	mlx5_devcom_comp_set_ready(sd->devcom, false);
+	mlx5_devcom_comp_unlock(sd->devcom);
 	sd_unregister(dev);
 err_sd_cleanup:
 	sd_cleanup(dev);
@@ -491,22 +507,34 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
 	struct mlx5_core_dev *primary, *pos;
+	struct mlx5_sd *primary_sd;
 	int i;
 
 	if (!sd)
 		return;
 
+	mlx5_devcom_comp_lock(sd->devcom);
 	if (!mlx5_devcom_comp_is_ready(sd->devcom))
-		goto out;
+		goto out_unlock;
 
 	primary = mlx5_sd_get_primary(dev);
+	if (!primary)
+		goto out_unlock;
+
+	primary_sd = mlx5_get_sd(primary);
+	if (primary_sd->state != MLX5_SD_STATE_UP)
+		goto out_unlock;
+
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
 	debugfs_remove_recursive(sd->dfs);
 
 	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
-out:
+	primary_sd->state = MLX5_SD_STATE_DOWN;
+	mlx5_devcom_comp_set_ready(sd->devcom, false);
+out_unlock:
+	mlx5_devcom_comp_unlock(sd->devcom);
 	sd_unregister(dev);
 	sd_cleanup(dev);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net V4 2/4] net/mlx5: SD, Keep multi-pf debugfs entries on primary
  2026-04-28  6:01 [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
@ 2026-04-28  6:01 ` Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
  3 siblings, 0 replies; 11+ messages in thread
From: Tariq Toukan @ 2026-04-28  6:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Patrisious Haddad, Kees Cook,
	Parav Pandit, Gal Pressman, netdev, linux-rdma, linux-kernel,
	Dragos Tatulea

From: Shay Drory <shayd@nvidia.com>

mlx5_sd_init() creates the "multi-pf" debugfs directory under the
primary device debugfs root, but stored the dentry in the calling
device's sd struct. When sd_cleanup() run on a different PF,
this leads to using the wrong sd->dfs for removing entries, which
results in memory leak and an error in when re-creating the SD.[1]

Fix it by explicitly storing the debugfs dentry in the primary
device sd struct and use it for all per-group files.

[1]
debugfs: 'multi-pf' already exists in '0000:08:00.1'

Fixes: 4375130bf527 ("net/mlx5: SD, Add debugfs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index d42c283cbb38..7a1787f15320 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -463,9 +463,13 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_unregister;
 
-	sd->dfs = debugfs_create_dir("multi-pf", mlx5_debugfs_get_dev_root(primary));
-	debugfs_create_x32("group_id", 0400, sd->dfs, &sd->group_id);
-	debugfs_create_file("primary", 0400, sd->dfs, primary, &dev_fops);
+	primary_sd->dfs =
+		debugfs_create_dir("multi-pf",
+				   mlx5_debugfs_get_dev_root(primary));
+	debugfs_create_x32("group_id", 0400, primary_sd->dfs,
+			   &primary_sd->group_id);
+	debugfs_create_file("primary", 0400, primary_sd->dfs, primary,
+			    &dev_fops);
 
 	mlx5_sd_for_each_secondary(i, primary, pos) {
 		char name[32];
@@ -475,7 +479,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 			goto err_unset_secondaries;
 
 		snprintf(name, sizeof(name), "secondary_%d", i - 1);
-		debugfs_create_file(name, 0400, sd->dfs, pos, &dev_fops);
+		debugfs_create_file(name, 0400, primary_sd->dfs, pos,
+				    &dev_fops);
 
 	}
 
@@ -493,7 +498,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary_to(i, primary, to, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
-	debugfs_remove_recursive(sd->dfs);
+	debugfs_remove_recursive(primary_sd->dfs);
+	primary_sd->dfs = NULL;
 err_sd_unregister:
 	mlx5_devcom_comp_set_ready(sd->devcom, false);
 	mlx5_devcom_comp_unlock(sd->devcom);
@@ -528,7 +534,8 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
-	debugfs_remove_recursive(sd->dfs);
+	debugfs_remove_recursive(primary_sd->dfs);
+	primary_sd->dfs = NULL;
 
 	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
 	primary_sd->state = MLX5_SD_STATE_DOWN;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error
  2026-04-28  6:01 [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
  2026-04-28  6:01 ` [PATCH net V4 2/4] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
@ 2026-04-28  6:01 ` Tariq Toukan
  2026-04-30  1:42   ` Jakub Kicinski
  2026-04-28  6:01 ` [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
  3 siblings, 1 reply; 11+ messages in thread
From: Tariq Toukan @ 2026-04-28  6:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Patrisious Haddad, Kees Cook,
	Parav Pandit, Gal Pressman, netdev, linux-rdma, linux-kernel,
	Dragos Tatulea

From: Shay Drory <shayd@nvidia.com>

When _mlx5e_probe() fails, the preceding successful mlx5_sd_init() is
not undone. Auxiliary bus probe failure skips binding, so mlx5e_remove()
is never called for that adev and the matching mlx5_sd_cleanup() never
runs - leaking the per-dev SD struct.

Call mlx5_sd_cleanup() on the probe error path to balance
mlx5_sd_init().

A similar gap exists on the resume path: mlx5_sd_init() and
mlx5_sd_cleanup() are currently bundled with both probe/remove and
suspend/resume, even though only the FW alias state actually needs to
follow the suspend/resume lifecycle - the sd struct allocation and
devcom membership are software state that should track the full bound
lifetime. As a result, a failed resume can leave a still-bound device
with sd == NULL, which mlx5_sd_get_adev() can't distinguish from a
non-SD device. Fixing this requires sd_suspend/resume APIs which will
only destroy FW resources and is left for a follow-up series.

Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a46870c4b74..e21affd0ffc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6775,8 +6775,8 @@ static int mlx5e_resume(struct auxiliary_device *adev)
 
 	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
 	if (actual_adev)
-		return _mlx5e_resume(actual_adev);
-	return 0;
+		err = _mlx5e_resume(actual_adev);
+	return err;
 }
 
 static int _mlx5e_suspend(struct auxiliary_device *adev, bool pre_netdev_reg)
@@ -6912,9 +6912,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
 		return err;
 
 	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
-	if (actual_adev)
-		return _mlx5e_probe(actual_adev);
+	if (actual_adev) {
+		err = _mlx5e_probe(actual_adev);
+		if (err)
+			goto sd_cleanup;
+	}
 	return 0;
+
+sd_cleanup:
+	mlx5_sd_cleanup(mdev);
+	return err;
 }
 
 static void _mlx5e_remove(struct auxiliary_device *adev)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-28  6:01 [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct Tariq Toukan
                   ` (2 preceding siblings ...)
  2026-04-28  6:01 ` [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error Tariq Toukan
@ 2026-04-28  6:01 ` Tariq Toukan
  2026-04-30  1:42   ` Jakub Kicinski
  3 siblings, 1 reply; 11+ messages in thread
From: Tariq Toukan @ 2026-04-28  6:01 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Patrisious Haddad, Kees Cook,
	Parav Pandit, Gal Pressman, netdev, linux-rdma, linux-kernel,
	Dragos Tatulea

From: Shay Drory <shayd@nvidia.com>

When utilizing Socket-Direct single netdev functionality the driver
resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
the current implementation returns the primary ETH auxiliary device
without holding the device lock, leading to a potential race condition
where the ETH device could be unbound or removed concurrently during
probe, suspend, resume, or remove operations.[1]

Fix this by introducing mlx5_sd_put_adev() and updating
mlx5_sd_get_adev() so that secondaries devices would acquire the device
lock of the returned auxiliary device. After the lock is acquired, a
second devcom check is needed[2].
In addition, update The callers to pair the get operation with the new
put operation, ensuring the lock is held while the auxiliary device is
being operated on and released afterwards.

The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock -> primary_lock. The
reverse never happens, so ABBA deadlock is impossible.

[1]
for example:
BUG: kernel NULL pointer dereference, address: 0000000000000370
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 4 UID: 0 PID: 3945 Comm: bash Not tainted 6.19.0-rc3+ #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5e_dcbnl_dscp_app+0x23/0x100 [mlx5_core]
Call Trace:
 <TASK>
 mlx5e_remove+0x82/0x12a [mlx5_core]
 device_release_driver_internal+0x194/0x1f0
 bus_remove_device+0xc6/0x140
 device_del+0x159/0x3c0
 ? devl_param_driverinit_value_get+0x29/0x80
 mlx5_rescan_drivers_locked+0x92/0x160 [mlx5_core]
 mlx5_unregister_device+0x34/0x50 [mlx5_core]
 mlx5_uninit_one+0x43/0xb0 [mlx5_core]
 remove_one+0x4e/0xc0 [mlx5_core]
 pci_device_remove+0x39/0xa0
 device_release_driver_internal+0x194/0x1f0
 unbind_store+0x99/0xa0
 kernfs_fop_write_iter+0x12e/0x1e0
 vfs_write+0x215/0x3d0
 ksys_write+0x5f/0xd0
 do_syscall_64+0x55/0xe90
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

[2]
    CPU0 (primary)                     CPU1 (secondary)
==========================================================================
mlx5e_remove() (device_lock held)
                                     mlx5e_remove() (2nd device_lock held)
                                      mlx5_sd_get_adev()
                                       mlx5_devcom_comp_is_ready() => true
                                       device_lock(primary)
 mlx5_sd_get_adev() ==> ret adev
 _mlx5e_remove()
 mlx5_sd_cleanup()
 // mlx5e_remove finished
 // releasing device_lock
                                       //need another check here...
                                       mlx5_devcom_comp_is_ready() => false

Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c   | 11 ++++++++++-
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c    | 17 +++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h    |  2 ++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e21affd0ffc4..b09806dfebe5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6774,8 +6774,10 @@ static int mlx5e_resume(struct auxiliary_device *adev)
 		return err;
 
 	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
-	if (actual_adev)
+	if (actual_adev) {
 		err = _mlx5e_resume(actual_adev);
+		mlx5_sd_put_adev(actual_adev, adev);
+	}
 	return err;
 }
 
@@ -6815,6 +6817,8 @@ static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
 		err = _mlx5e_suspend(actual_adev, false);
 
 	mlx5_sd_cleanup(mdev);
+	if (actual_adev)
+		mlx5_sd_put_adev(actual_adev, adev);
 	return err;
 }
 
@@ -6916,11 +6920,14 @@ static int mlx5e_probe(struct auxiliary_device *adev,
 		err = _mlx5e_probe(actual_adev);
 		if (err)
 			goto sd_cleanup;
+		mlx5_sd_put_adev(actual_adev, adev);
 	}
 	return 0;
 
 sd_cleanup:
 	mlx5_sd_cleanup(mdev);
+	if (actual_adev)
+		mlx5_sd_put_adev(actual_adev, adev);
 	return err;
 }
 
@@ -6973,6 +6980,8 @@ static void mlx5e_remove(struct auxiliary_device *adev)
 		_mlx5e_remove(actual_adev);
 
 	mlx5_sd_cleanup(mdev);
+	if (actual_adev)
+		mlx5_sd_put_adev(actual_adev, adev);
 }
 
 static const struct auxiliary_device_id mlx5e_id_table[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 7a1787f15320..a43ae482a679 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -546,6 +546,10 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	sd_cleanup(dev);
 }
 
+/* Cannot take devcom lock as a gate for device lock. ABBA deadlock:
+ * primary:  actual_adev_lock -> SD devcom comp lock
+ * secondary: SD devcom comp lock -> actual_adev_lock
+ */
 struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 					  struct auxiliary_device *adev,
 					  int idx)
@@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 	if (dev == primary)
 		return adev;
 
+	device_lock(&primary->priv.adev[idx]->adev.dev);
+	/* In case primary finish removing its adev */
+	if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
+		device_unlock(&primary->priv.adev[idx]->adev.dev);
+		return NULL;
+	}
 	return &primary->priv.adev[idx]->adev;
 }
+
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+		      struct auxiliary_device *adev)
+{
+	if (actual_adev != adev)
+		device_unlock(&actual_adev->dev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
index 137efaf9aabc..9bfd5b9756b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -15,6 +15,8 @@ struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int c
 struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 					  struct auxiliary_device *adev,
 					  int idx);
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+		      struct auxiliary_device *adev);
 
 int mlx5_sd_init(struct mlx5_core_dev *dev);
 void mlx5_sd_cleanup(struct mlx5_core_dev *dev);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup
  2026-04-28  6:01 ` [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
@ 2026-04-30  1:42   ` Jakub Kicinski
  2026-04-30  9:38     ` Shay Drori
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-30  1:42 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	mbloch, leon, shayd, horms, phaddad, kees, parav, gal, netdev,
	linux-rdma, linux-kernel, dtatulea

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: SD: Serialize init/cleanup

This commit protects the Socket-Direct (SD) init/cleanup paths with the
devcom component lock and tracks the SD group state on the primary device.
It skips init if the primary is already UP, and skips cleanup unless the
primary is UP, serializing the bring-up/tear-down sequence across PFs.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index 762c783156b4..d42c283cbb38 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c

[ ... ]

> @@ -491,22 +507,34 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>  {
>  	struct mlx5_sd *sd = mlx5_get_sd(dev);
>  	struct mlx5_core_dev *primary, *pos;
> +	struct mlx5_sd *primary_sd;
>  	int i;
>  
>  	if (!sd)
>  		return;
>  
> +	mlx5_devcom_comp_lock(sd->devcom);
>  	if (!mlx5_devcom_comp_is_ready(sd->devcom))
> -		goto out;
> +		goto out_unlock;
>  
>  	primary = mlx5_sd_get_primary(dev);
> +	if (!primary)
> +		goto out_unlock;
> +
> +	primary_sd = mlx5_get_sd(primary);
> +	if (primary_sd->state != MLX5_SD_STATE_UP)
> +		goto out_unlock;

If the device is unbound during the race window where the devcom component
is ready but the state is still MLX5_SD_STATE_DOWN (for example, just after
sd_register() finishes but before mlx5_sd_init() sets the state to UP), does
this early exit leave the component marked as ready?

Since mlx5_devcom_comp_set_ready(sd->devcom, false) was moved out of
sd_unregister() and into this function, skipping it here means the device will
unregister and free its struct mlx5_sd while the group is still ready.

Could a concurrent mlx5_sd_init() on another peer acquire the devcom lock,
observe that it is ready, and proceed? If the unregistered device was the
primary, mlx5_get_sd(primary) would return NULL, causing a NULL pointer
dereference at primary_sd->state. If it was a secondary, mlx5_sd_init()
would iterate over primary_sd->secondaries and access the freed device
structure, resulting in a use-after-free.

> +
>  	mlx5_sd_for_each_secondary(i, primary, pos)
>  		sd_cmd_unset_secondary(pos);
>  	sd_cmd_unset_primary(primary);
>  	debugfs_remove_recursive(sd->dfs);

This isn't a regression introduced in this patch, but does storing sd->dfs
locally instead of on the primary device cause debugfs entries to leak?

If a different peer is unbound first, its sd->dfs is NULL, making the call
to debugfs_remove_recursive() a no-op. The cleanup then finishes and marks
the group as unready.

When the original device that created the debugfs directory is later unbound,
it sees the group is no longer ready and skips this block entirely, never
freeing its sd->dfs. Since these files hold pointers to the mlx5_core_dev
structs, accessing them from userspace after the devices are freed could
trigger a use-after-free.

I note this was fixed later in the series by commit "net/mlx5: SD, Keep
multi-pf debugfs entries on primary", but wanted to highlight it here.

>  
>  	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
> -out:
> +	primary_sd->state = MLX5_SD_STATE_DOWN;
> +	mlx5_devcom_comp_set_ready(sd->devcom, false);
> +out_unlock:
> +	mlx5_devcom_comp_unlock(sd->devcom);
>  	sd_unregister(dev);
>  	sd_cleanup(dev);
>  }
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error
  2026-04-28  6:01 ` [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error Tariq Toukan
@ 2026-04-30  1:42   ` Jakub Kicinski
  2026-04-30 12:52     ` Shay Drori
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-30  1:42 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	mbloch, leon, shayd, horms, phaddad, kees, parav, gal, netdev,
	linux-rdma, linux-kernel, dtatulea

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix missing cleanup on probe error

This commit adds a missing mlx5_sd_cleanup() call to the probe error path.
The commit message notes a similar gap exists on the resume path leaving
sd == NULL on failure, but defers the fix to a follow-up series.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 5a46870c4b74..e21affd0ffc4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
[ ... ]
> @@ -6912,9 +6912,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
>  		return err;
>  
>  	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> -	if (actual_adev)
> -		return _mlx5e_probe(actual_adev);
> +	if (actual_adev) {
> +		err = _mlx5e_probe(actual_adev);
> +		if (err)
> +			goto sd_cleanup;
> +	}
>  	return 0;
> +
> +sd_cleanup:
> +	mlx5_sd_cleanup(mdev);
> +	return err;
>  }
>  
>  static void _mlx5e_remove(struct auxiliary_device *adev)

Will leaving the resume gap unfixed as mentioned in the commit message cause
a NULL pointer dereference when a secondary device is later removed?

If a secondary device fails resume, sd becomes NULL. Later during removal,
mlx5_sd_get_adev() will check !sd and return the secondary adev.

This causes _mlx5e_remove() to be executed on the secondary adev:

_mlx5e_remove()
{
        struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
        struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
        struct net_device *netdev = mlx5e_dev->netdev;
        ...
}

Since _mlx5e_probe() is never called on secondary devices,
auxiliary_get_drvdata(adev) returns NULL. Does this mean accessing
mlx5e_dev->netdev will crash?

Would it be safer to add a NULL check for mlx5e_dev in _mlx5e_remove()
until the proper PM lifecycle fixes are implemented?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-28  6:01 ` [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
@ 2026-04-30  1:42   ` Jakub Kicinski
  2026-04-30 13:03     ` Shay Drori
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2026-04-30  1:42 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	mbloch, leon, shayd, horms, phaddad, kees, parav, gal, netdev,
	linux-rdma, linux-kernel, dtatulea

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix race condition in secondary device probe/remove

This commit updates secondary devices to acquire the device lock of the
primary auxiliary device during probe, suspend, resume, and remove
operations. It introduces mlx5_sd_get_adev() and mlx5_sd_put_adev() to
manage this lock ordering and add a second devcom readiness check.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index 7a1787f15320..a43ae482a679 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c

[ ... ]

> @@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
>  	if (dev == primary)
>  		return adev;
>  
> +	device_lock(&primary->priv.adev[idx]->adev.dev);

If mlx5_devcom_comp_is_ready() returned true just before this, but the
secondary thread is preempted before reaching device_lock(), could a
concurrent removal of the primary device cause primary->priv.adev[idx]
to become NULL or freed?

It looks like the removal path (such as mlx5_unregister_device) frees the
mlx5_adev memory and explicitly sets primary->priv.adev[idx] to NULL. If
the secondary thread then resumes, it might dereference a NULL or freed
pointer when attempting to acquire the lock.

Since the second readiness check happens after the device_lock() call, does
this leave a window where memory corruption or a panic can still occur?

Additionally, since the driver core already holds device_lock(&adev->dev)
for the secondary device during callbacks like probe() and remove(), will
acquiring the primary device's lock here trigger a lockdep warning for
possible recursive locking?

Both the primary and secondary auxiliary devices use the same generic lock
class assigned in device_initialize(). While the fixed secondary to primary
ordering avoids an actual deadlock, lockdep might not distinguish between
the two mutexes without explicit lockdep subclassing.

> +	/* In case primary finish removing its adev */
> +	if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
> +		device_unlock(&primary->priv.adev[idx]->adev.dev);
> +		return NULL;
> +	}
>  	return &primary->priv.adev[idx]->adev;
>  }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup
  2026-04-30  1:42   ` Jakub Kicinski
@ 2026-04-30  9:38     ` Shay Drori
  0 siblings, 0 replies; 11+ messages in thread
From: Shay Drori @ 2026-04-30  9:38 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, mbloch, leon,
	horms, phaddad, kees, parav, gal, netdev, linux-rdma,
	linux-kernel, dtatulea



On 30/04/2026 4:42, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: SD: Serialize init/cleanup
> 
> This commit protects the Socket-Direct (SD) init/cleanup paths with the
> devcom component lock and tracks the SD group state on the primary device.
> It skips init if the primary is already UP, and skips cleanup unless the
> primary is UP, serializing the bring-up/tear-down sequence across PFs.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> index 762c783156b4..d42c283cbb38 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> 
> [ ... ]
> 
>> @@ -491,22 +507,34 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>>   {
>>        struct mlx5_sd *sd = mlx5_get_sd(dev);
>>        struct mlx5_core_dev *primary, *pos;
>> +     struct mlx5_sd *primary_sd;
>>        int i;
>>
>>        if (!sd)
>>                return;
>>
>> +     mlx5_devcom_comp_lock(sd->devcom);
>>        if (!mlx5_devcom_comp_is_ready(sd->devcom))
>> -             goto out;
>> +             goto out_unlock;
>>
>>        primary = mlx5_sd_get_primary(dev);
>> +     if (!primary)
>> +             goto out_unlock;
>> +
>> +     primary_sd = mlx5_get_sd(primary);
>> +     if (primary_sd->state != MLX5_SD_STATE_UP)
>> +             goto out_unlock;
> 
> If the device is unbound during the race window where the devcom component
> is ready but the state is still MLX5_SD_STATE_DOWN (for example, just after
> sd_register() finishes but before mlx5_sd_init() sets the state to UP), does
> this early exit leave the component marked as ready?
> 
> Since mlx5_devcom_comp_set_ready(sd->devcom, false) was moved out of
> sd_unregister() and into this function, skipping it here means the device will
> unregister and free its struct mlx5_sd while the group is still ready.
> 
> Could a concurrent mlx5_sd_init() on another peer acquire the devcom lock,
> observe that it is ready, and proceed? If the unregistered device was the
> primary, mlx5_get_sd(primary) would return NULL, causing a NULL pointer
> dereference at primary_sd->state. If it was a secondary, mlx5_sd_init()
> would iterate over primary_sd->secondaries and access the freed device
> structure, resulting in a use-after-free.

This is correct :(
will fix in next version

> 
>> +
>>        mlx5_sd_for_each_secondary(i, primary, pos)
>>                sd_cmd_unset_secondary(pos);
>>        sd_cmd_unset_primary(primary);
>>        debugfs_remove_recursive(sd->dfs);
> 
> This isn't a regression introduced in this patch, but does storing sd->dfs
> locally instead of on the primary device cause debugfs entries to leak?
> 
> If a different peer is unbound first, its sd->dfs is NULL, making the call
> to debugfs_remove_recursive() a no-op. The cleanup then finishes and marks
> the group as unready.
> 
> When the original device that created the debugfs directory is later unbound,
> it sees the group is no longer ready and skips this block entirely, never
> freeing its sd->dfs. Since these files hold pointers to the mlx5_core_dev
> structs, accessing them from userspace after the devices are freed could
> trigger a use-after-free.
> 
> I note this was fixed later in the series by commit "net/mlx5: SD, Keep
> multi-pf debugfs entries on primary", but wanted to highlight it here.

like mention, this is fixed later and I don't see a reason to change the
patches order

> 
>>
>>        sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
>> -out:
>> +     primary_sd->state = MLX5_SD_STATE_DOWN;
>> +     mlx5_devcom_comp_set_ready(sd->devcom, false);
>> +out_unlock:
>> +     mlx5_devcom_comp_unlock(sd->devcom);
>>        sd_unregister(dev);
>>        sd_cleanup(dev);
>>   }
> --
> pw-bot: cr


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error
  2026-04-30  1:42   ` Jakub Kicinski
@ 2026-04-30 12:52     ` Shay Drori
  0 siblings, 0 replies; 11+ messages in thread
From: Shay Drori @ 2026-04-30 12:52 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, mbloch, leon,
	horms, phaddad, kees, parav, gal, netdev, linux-rdma,
	linux-kernel, dtatulea



On 30/04/2026 4:42, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5e: SD, Fix missing cleanup on probe error
> 
> This commit adds a missing mlx5_sd_cleanup() call to the probe error path.
> The commit message notes a similar gap exists on the resume path leaving
> sd == NULL on failure, but defers the fix to a follow-up series.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index 5a46870c4b74..e21affd0ffc4 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> [ ... ]
>> @@ -6912,9 +6912,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
>>                return err;
>>
>>        actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
>> -     if (actual_adev)
>> -             return _mlx5e_probe(actual_adev);
>> +     if (actual_adev) {
>> +             err = _mlx5e_probe(actual_adev);
>> +             if (err)
>> +                     goto sd_cleanup;
>> +     }
>>        return 0;
>> +
>> +sd_cleanup:
>> +     mlx5_sd_cleanup(mdev);
>> +     return err;
>>   }
>>
>>   static void _mlx5e_remove(struct auxiliary_device *adev)
> 
> Will leaving the resume gap unfixed as mentioned in the commit message cause
> a NULL pointer dereference when a secondary device is later removed?
> 
> If a secondary device fails resume, sd becomes NULL. Later during removal,
> mlx5_sd_get_adev() will check !sd and return the secondary adev.
> 
> This causes _mlx5e_remove() to be executed on the secondary adev:
> 
> _mlx5e_remove()
> {
>          struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
>          struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
>          struct net_device *netdev = mlx5e_dev->netdev;
>          ...
> }
> 
> Since _mlx5e_probe() is never called on secondary devices,
> auxiliary_get_drvdata(adev) returns NULL. Does this mean accessing
> mlx5e_dev->netdev will crash?
> 
> Would it be safer to add a NULL check for mlx5e_dev in _mlx5e_remove()
> until the proper PM lifecycle fixes are implemented?

IMO, a proper fix is needed here. PM lifecycle is implemented, the bug
is in case of error flow...


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-30  1:42   ` Jakub Kicinski
@ 2026-04-30 13:03     ` Shay Drori
  0 siblings, 0 replies; 11+ messages in thread
From: Shay Drori @ 2026-04-30 13:03 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, mbloch, leon,
	horms, phaddad, kees, parav, gal, netdev, linux-rdma,
	linux-kernel, dtatulea



On 30/04/2026 4:42, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5e: SD, Fix race condition in secondary device probe/remove
> 
> This commit updates secondary devices to acquire the device lock of the
> primary auxiliary device during probe, suspend, resume, and remove
> operations. It introduces mlx5_sd_get_adev() and mlx5_sd_put_adev() to
> manage this lock ordering and add a second devcom readiness check.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> index 7a1787f15320..a43ae482a679 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> 
> [ ... ]
> 
>> @@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
>>        if (dev == primary)
>>                return adev;
>>
>> +     device_lock(&primary->priv.adev[idx]->adev.dev);
> 
> If mlx5_devcom_comp_is_ready() returned true just before this, but the
> secondary thread is preempted before reaching device_lock(), could a
> concurrent removal of the primary device cause primary->priv.adev[idx]
> to become NULL or freed?
> 
> It looks like the removal path (such as mlx5_unregister_device) frees the
> mlx5_adev memory and explicitly sets primary->priv.adev[idx] to NULL. If
> the secondary thread then resumes, it might dereference a NULL or freed
> pointer when attempting to acquire the lock.
> 
> Since the second readiness check happens after the device_lock() call, does
> this leave a window where memory corruption or a panic can still occur?

This is also correct and will fix in next version.

> 
> Additionally, since the driver core already holds device_lock(&adev->dev)
> for the secondary device during callbacks like probe() and remove(), will
> acquiring the primary device's lock here trigger a lockdep warning for
> possible recursive locking?
> 
> Both the primary and secondary auxiliary devices use the same generic lock
> class assigned in device_initialize(). While the fixed secondary to primary
> ordering avoids an actual deadlock, lockdep might not distinguish between
> the two mutexes without explicit lockdep subclassing.
> 
>> +     /* In case primary finish removing its adev */
>> +     if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
>> +             device_unlock(&primary->priv.adev[idx]->adev.dev);
>> +             return NULL;
>> +     }
>>        return &primary->priv.adev[idx]->adev;
>>   }

I test this code with KASAN and LOCKDEP enable, I didn't get any
splat...


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-30 13:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-28  6:01 [PATCH net V4 0/4] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-04-28  6:01 ` [PATCH net V4 1/4] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
2026-04-30  1:42   ` Jakub Kicinski
2026-04-30  9:38     ` Shay Drori
2026-04-28  6:01 ` [PATCH net V4 2/4] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
2026-04-28  6:01 ` [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error Tariq Toukan
2026-04-30  1:42   ` Jakub Kicinski
2026-04-30 12:52     ` Shay Drori
2026-04-28  6:01 ` [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2026-04-30  1:42   ` Jakub Kicinski
2026-04-30 13:03     ` Shay Drori

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox