public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct
@ 2026-03-30 19:34 Tariq Toukan
  2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Tariq Toukan @ 2026-03-30 19:34 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
	Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel

Hi,

This series by Shay addresses issues in the Socket Direct (SD) layer.
It includes adding proper locking for auxiliary device access, fixing
a debugfs dentry mismatch in multi-PF setups, and serializing SD
init/cleanup using devcom locks.

Regards,
Tariq

Shay Drory (3):
  net/mlx5e: SD, Fix race condition in secondary device probe/remove
  net/mlx5: SD, Keep multi-pf debugfs entries on primary
  net/mlx5: SD: Serialize init/cleanup

 .../net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++++--
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 64 ++++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |  2 +
 3 files changed, 71 insertions(+), 13 deletions(-)


base-commit: d9c2a509c96378d77435e5845561c4afd3eaedad
-- 
2.44.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-03-30 19:34 [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
@ 2026-03-30 19:34 ` Tariq Toukan
  2026-04-02  3:08   ` Jakub Kicinski
  2026-03-30 19:34 ` [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
  2026-03-30 19:34 ` [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
  2 siblings, 1 reply; 12+ messages in thread
From: Tariq Toukan @ 2026-03-30 19:34 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
	Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel

From: Shay Drory <shayd@nvidia.com>

When utilizing Socket-Direct single netdev functionality the driver
resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
the current implementation returns the primary ETH auxiliary device
without holding the device lock, leading to a potential race condition
where the ETH device could be unbound or removed concurrently during
probe, suspend, resume, or remove operations.[1]

Fix this by introducing mlx5_sd_put_adev() and updating
mlx5_sd_get_adev() so that secondaries devices would acquire the device
lock of the returned auxiliary device. After the lock is acquired, a
second devcom check is needed[2].
In addition, update The callers to pair the get operation with the new
put operation, ensuring the lock is held while the auxiliary device is
being operated on and released afterwards.

[1]
for example:
BUG: kernel NULL pointer dereference, address: 0000000000000370
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 4 UID: 0 PID: 3945 Comm: bash Not tainted 6.19.0-rc3+ #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5e_dcbnl_dscp_app+0x23/0x100 [mlx5_core]
Call Trace:
 <TASK>
 mlx5e_remove+0x82/0x12a [mlx5_core]
 device_release_driver_internal+0x194/0x1f0
 bus_remove_device+0xc6/0x140
 device_del+0x159/0x3c0
 ? devl_param_driverinit_value_get+0x29/0x80
 mlx5_rescan_drivers_locked+0x92/0x160 [mlx5_core]
 mlx5_unregister_device+0x34/0x50 [mlx5_core]
 mlx5_uninit_one+0x43/0xb0 [mlx5_core]
 remove_one+0x4e/0xc0 [mlx5_core]
 pci_device_remove+0x39/0xa0
 device_release_driver_internal+0x194/0x1f0
 unbind_store+0x99/0xa0
 kernfs_fop_write_iter+0x12e/0x1e0
 vfs_write+0x215/0x3d0
 ksys_write+0x5f/0xd0
 do_syscall_64+0x55/0xe90
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

[2]
    CPU0 (primary)                     CPU1 (secondary)
==========================================================================
mlx5e_remove() (device_lock held)
                                     mlx5e_remove() (2nd device_lock held)
                                      mlx5_sd_get_adev()
                                       mlx5_devcom_comp_is_ready() => true
                                       device_lock(primary)
 mlx5_sd_get_adev() ==> ret adev
 _mlx5e_remove()
 mlx5_sd_cleanup()
 // mlx5e_remove finished
 // releasing device_lock
                                       //need another check here...
                                       mlx5_devcom_comp_is_ready() => false

Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c  | 18 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c   | 13 +++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h   |  2 ++
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b6c12460b54a..5761f655f488 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
 		return err;
 
 	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
-	if (actual_adev)
-		return _mlx5e_resume(actual_adev);
+	if (actual_adev) {
+		err = _mlx5e_resume(actual_adev);
+		mlx5_sd_put_adev(actual_adev, adev);
+		return err;
+	}
 	return 0;
 }
 
@@ -6698,6 +6701,8 @@ static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
 		err = _mlx5e_suspend(actual_adev, false);
 
 	mlx5_sd_cleanup(mdev);
+	if (actual_adev)
+		mlx5_sd_put_adev(actual_adev, adev);
 	return err;
 }
 
@@ -6787,8 +6792,11 @@ static int mlx5e_probe(struct auxiliary_device *adev,
 		return err;
 
 	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
-	if (actual_adev)
-		return _mlx5e_probe(actual_adev);
+	if (actual_adev) {
+		err = _mlx5e_probe(actual_adev);
+		mlx5_sd_put_adev(actual_adev, adev);
+		return err;
+	}
 	return 0;
 }
 
@@ -6841,6 +6849,8 @@ static void mlx5e_remove(struct auxiliary_device *adev)
 		_mlx5e_remove(actual_adev);
 
 	mlx5_sd_cleanup(mdev);
+	if (actual_adev)
+		mlx5_sd_put_adev(actual_adev, adev);
 }
 
 static const struct auxiliary_device_id mlx5e_id_table[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 954942ad93c5..060649645012 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -528,5 +528,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 	if (dev == primary)
 		return adev;
 
+	device_lock(&primary->priv.adev[idx]->adev.dev);
+	/* In case primary finish removing its adev */
+	if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
+		device_unlock(&primary->priv.adev[idx]->adev.dev);
+		return NULL;
+	}
 	return &primary->priv.adev[idx]->adev;
 }
+
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+		      struct auxiliary_device *adev)
+{
+	if (actual_adev != adev)
+		device_unlock(&actual_adev->dev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
index 137efaf9aabc..9bfd5b9756b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -15,6 +15,8 @@ struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int c
 struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 					  struct auxiliary_device *adev,
 					  int idx);
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+		      struct auxiliary_device *adev);
 
 int mlx5_sd_init(struct mlx5_core_dev *dev);
 void mlx5_sd_cleanup(struct mlx5_core_dev *dev);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary
  2026-03-30 19:34 [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
  2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
@ 2026-03-30 19:34 ` Tariq Toukan
  2026-04-02  3:09   ` Jakub Kicinski
  2026-03-30 19:34 ` [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
  2 siblings, 1 reply; 12+ messages in thread
From: Tariq Toukan @ 2026-03-30 19:34 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
	Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel

From: Shay Drory <shayd@nvidia.com>

mlx5_sd_init() creates the "multi-pf" debugfs directory under the
primary device debugfs root, but stored the dentry in the calling
device's sd struct. When sd_cleanup() run on a different PF,
this leads to using the wrong sd->dfs for removing entries, which
results in memory leak and an error in when re-creating the SD.[1]

Fix it by explicitly storing the debugfs dentry in the primary
device sd struct and use it for all per-group files.

[1]
debugfs: 'multi-pf' already exists in '0000:08:00.1'

Fixes: 4375130bf527 ("net/mlx5: SD, Add debugfs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 23 ++++++++++++++-----
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 060649645012..4c80b9d25283 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -426,6 +426,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	struct mlx5_core_dev *primary, *pos, *to;
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
 	u8 alias_key[ACCESS_KEY_LEN];
+	struct mlx5_sd *primary_sd;
 	int err, i;
 
 	err = sd_init(dev);
@@ -444,6 +445,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 		return 0;
 
 	primary = mlx5_sd_get_primary(dev);
+	primary_sd = mlx5_get_sd(primary);
 
 	for (i = 0; i < ACCESS_KEY_LEN; i++)
 		alias_key[i] = get_random_u8();
@@ -452,9 +454,13 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_unregister;
 
-	sd->dfs = debugfs_create_dir("multi-pf", mlx5_debugfs_get_dev_root(primary));
-	debugfs_create_x32("group_id", 0400, sd->dfs, &sd->group_id);
-	debugfs_create_file("primary", 0400, sd->dfs, primary, &dev_fops);
+	primary_sd->dfs =
+		debugfs_create_dir("multi-pf",
+				   mlx5_debugfs_get_dev_root(primary));
+	debugfs_create_x32("group_id", 0400, primary_sd->dfs,
+			   &primary_sd->group_id);
+	debugfs_create_file("primary", 0400, primary_sd->dfs, primary,
+			    &dev_fops);
 
 	mlx5_sd_for_each_secondary(i, primary, pos) {
 		char name[32];
@@ -464,7 +470,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 			goto err_unset_secondaries;
 
 		snprintf(name, sizeof(name), "secondary_%d", i - 1);
-		debugfs_create_file(name, 0400, sd->dfs, pos, &dev_fops);
+		debugfs_create_file(name, 0400, primary_sd->dfs, pos,
+				    &dev_fops);
 
 	}
 
@@ -479,7 +486,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary_to(i, primary, to, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
-	debugfs_remove_recursive(sd->dfs);
+	debugfs_remove_recursive(primary_sd->dfs);
+	primary_sd->dfs = NULL;
 err_sd_unregister:
 	sd_unregister(dev);
 err_sd_cleanup:
@@ -491,6 +499,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
 	struct mlx5_core_dev *primary, *pos;
+	struct mlx5_sd *primary_sd;
 	int i;
 
 	if (!sd)
@@ -500,10 +509,12 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 		goto out;
 
 	primary = mlx5_sd_get_primary(dev);
+	primary_sd = mlx5_get_sd(primary);
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
-	debugfs_remove_recursive(sd->dfs);
+	debugfs_remove_recursive(primary_sd->dfs);
+	primary_sd->dfs = NULL;
 
 	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
 out:
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup
  2026-03-30 19:34 [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
  2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
  2026-03-30 19:34 ` [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
@ 2026-03-30 19:34 ` Tariq Toukan
  2026-04-02  3:09   ` Jakub Kicinski
  2 siblings, 1 reply; 12+ messages in thread
From: Tariq Toukan @ 2026-03-30 19:34 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
	Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel

From: Shay Drory <shayd@nvidia.com>

mlx5_sd_init() / mlx5_sd_cleanup() may run from multiple PFs in the same
Socket-Direct group. This can cause the SD bring-up/tear-down sequence
to be executed more than once or interleaved across PFs.

Protect SD init/cleanup with mlx5_devcom_comp_lock() and track the SD
group state on the primary device. Skip init if the primary is already
UP, and skip cleanup unless the primary is UP.

Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 28 +++++++++++++++++--
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 4c80b9d25283..374f27b78fbe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -18,6 +18,7 @@ struct mlx5_sd {
 	u8 host_buses;
 	struct mlx5_devcom_comp_dev *devcom;
 	struct dentry *dfs;
+	u8 state;
 	bool primary;
 	union {
 		struct { /* primary */
@@ -31,6 +32,11 @@ struct mlx5_sd {
 	};
 };
 
+enum mlx5_sd_state {
+	MLX5_SD_STATE_DOWN = 0,
+	MLX5_SD_STATE_UP,
+};
+
 static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
@@ -441,12 +447,16 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_cleanup;
 
+	mlx5_devcom_comp_lock(sd->devcom);
 	if (!mlx5_devcom_comp_is_ready(sd->devcom))
-		return 0;
+		goto out;
 
 	primary = mlx5_sd_get_primary(dev);
 	primary_sd = mlx5_get_sd(primary);
 
+	if (primary_sd->state == MLX5_SD_STATE_UP)
+		goto out;
+
 	for (i = 0; i < ACCESS_KEY_LEN; i++)
 		alias_key[i] = get_random_u8();
 
@@ -479,6 +489,9 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 		sd->group_id, mlx5_devcom_comp_get_size(sd->devcom));
 	sd_print_group(primary);
 
+	primary_sd->state = MLX5_SD_STATE_UP;
+out:
+	mlx5_devcom_comp_unlock(sd->devcom);
 	return 0;
 
 err_unset_secondaries:
@@ -489,6 +502,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	debugfs_remove_recursive(primary_sd->dfs);
 	primary_sd->dfs = NULL;
 err_sd_unregister:
+	primary_sd->state = MLX5_SD_STATE_DOWN;
+	mlx5_devcom_comp_unlock(sd->devcom);
 	sd_unregister(dev);
 err_sd_cleanup:
 	sd_cleanup(dev);
@@ -505,11 +520,16 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	if (!sd)
 		return;
 
+	mlx5_devcom_comp_lock(sd->devcom);
 	if (!mlx5_devcom_comp_is_ready(sd->devcom))
-		goto out;
+		goto out_unlock;
 
 	primary = mlx5_sd_get_primary(dev);
 	primary_sd = mlx5_get_sd(primary);
+
+	if (primary_sd->state != MLX5_SD_STATE_UP)
+		goto out_unlock;
+
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
@@ -517,7 +537,9 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	primary_sd->dfs = NULL;
 
 	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
-out:
+	primary_sd->state = MLX5_SD_STATE_DOWN;
+out_unlock:
+	mlx5_devcom_comp_unlock(sd->devcom);
 	sd_unregister(dev);
 	sd_cleanup(dev);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
@ 2026-04-02  3:08   ` Jakub Kicinski
  2026-04-02 20:03     ` Shay Drori
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-02  3:08 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	Saeed Mahameed, Mark Bloch, Leon Romanovsky, Shay Drory,
	Simon Horman, Kees Cook, Parav Pandit, Patrisious Haddad,
	Gal Pressman, netdev, linux-rdma, linux-kernel

On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> When utilizing Socket-Direct single netdev functionality the driver
> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
> the current implementation returns the primary ETH auxiliary device
> without holding the device lock, leading to a potential race condition
> where the ETH device could be unbound or removed concurrently during
> probe, suspend, resume, or remove operations.[1]
> 
> Fix this by introducing mlx5_sd_put_adev() and updating
> mlx5_sd_get_adev() so that secondaries devices would acquire the device
> lock of the returned auxiliary device. After the lock is acquired, a
> second devcom check is needed[2].
> In addition, update The callers to pair the get operation with the new
> put operation, ensuring the lock is held while the auxiliary device is
> being operated on and released afterwards.

Please explain why the "primary" designation is reliable, and therefore
we can be sure there will be no ABBA deadlock here

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index b6c12460b54a..5761f655f488 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
>  		return err;
>  
>  	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> -	if (actual_adev)
> -		return _mlx5e_resume(actual_adev);
> +	if (actual_adev) {
> +		err = _mlx5e_resume(actual_adev);
> +		mlx5_sd_put_adev(actual_adev, adev);
> +		return err;
> +	}
>  	return 0;

Feels like I recently complained about similar code y'all were trying
to add. Magically and conditionally locking something in a get helper
makes for extremely confusing code.
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary
  2026-03-30 19:34 ` [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
@ 2026-04-02  3:09   ` Jakub Kicinski
  2026-04-02 19:50     ` Shay Drori
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-02  3:09 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	mbloch, leon, shayd, horms, kees, parav, phaddad, gal, netdev,
	linux-rdma, linux-kernel

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: SD, Keep multi-pf debugfs entries on primary

This commit fixes an issue where the "multi-pf" debugfs directory was stored
in the calling device's sd struct but created under the primary device's root.
It explicitly stores the debugfs dentry in the primary device's sd struct.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index 060649645012..4c80b9d25283 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
[ ... ]
> @@ -491,6 +499,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>  {
>  	struct mlx5_sd *sd = mlx5_get_sd(dev);
>  	struct mlx5_core_dev *primary, *pos;
> +	struct mlx5_sd *primary_sd;
>  	int i;
>  
>  	if (!sd)
> @@ -500,10 +509,12 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>  		goto out;
>  
>  	primary = mlx5_sd_get_primary(dev);
> +	primary_sd = mlx5_get_sd(primary);
>  	mlx5_sd_for_each_secondary(i, primary, pos)
>  		sd_cmd_unset_secondary(pos);
>  	sd_cmd_unset_primary(primary);
> -	debugfs_remove_recursive(sd->dfs);
> +	debugfs_remove_recursive(primary_sd->dfs);
> +	primary_sd->dfs = NULL;

Can concurrent unbind operations cause a use-after-free and double-free
regression here?

In mlx5_sd_cleanup(), the check !mlx5_devcom_comp_is_ready(sd->devcom) is
performed locklessly. If multiple physical functions in the Socket-Direct
group are unbound concurrently, multiple threads can observe ready == true
and enter this teardown block simultaneously.

Since the cleanup code now operates on the shared primary_sd pointer instead
of the local device's sd, the thread for the primary physical function might
complete its unbind and execute sd_cleanup(dev), freeing its sd object
(which is primary_sd).

Concurrently, the thread for a secondary physical function will dereference
the freed primary_sd to access primary_sd->dfs, causing a use-after-free.
Both threads could also pass the same pointer to debugfs_remove_recursive(),
resulting in a double-free.

Note: This was fixed later in the patch series by commit 92528cec4e0b91deb
("net/mlx5: SD: Serialize init/cleanup").
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup
  2026-03-30 19:34 ` [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
@ 2026-04-02  3:09   ` Jakub Kicinski
  2026-04-02 19:49     ` Shay Drori
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-02  3:09 UTC (permalink / raw)
  To: tariqt
  Cc: Jakub Kicinski, edumazet, pabeni, andrew+netdev, davem, saeedm,
	mbloch, leon, shayd, horms, kees, parav, phaddad, gal, netdev,
	linux-rdma, linux-kernel

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: SD: Serialize init/cleanup

This commit introduces a state tracker and locking around the Socket-Direct
init and cleanup routines. It prevents multiple physical functions from
concurrently bringing up or tearing down the SD group.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index 4c80b9d25283..374f27b78fbe 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c

[ ... ]

> @@ -505,11 +520,16 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>  	if (!sd)
>  		return;
>  
> +	mlx5_devcom_comp_lock(sd->devcom);
>  	if (!mlx5_devcom_comp_is_ready(sd->devcom))
> -		goto out;
> +		goto out_unlock;
>  
>  	primary = mlx5_sd_get_primary(dev);
>  	primary_sd = mlx5_get_sd(primary);
> +
> +	if (primary_sd->state != MLX5_SD_STATE_UP)
> +		goto out_unlock;
> +
>  	mlx5_sd_for_each_secondary(i, primary, pos)
>  		sd_cmd_unset_secondary(pos);
>  	sd_cmd_unset_primary(primary);
> @@ -517,7 +537,9 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>  	primary_sd->dfs = NULL;
>  
>  	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
> -out:
> +	primary_sd->state = MLX5_SD_STATE_DOWN;
> +out_unlock:
> +	mlx5_devcom_comp_unlock(sd->devcom);
>  	sd_unregister(dev);
>  	sd_cleanup(dev);
>  }

Is there a race condition if a secondary PF calls mlx5_sd_init() while the
primary PF is running mlx5_sd_cleanup()?

In mlx5_sd_cleanup(), the code sets the state to MLX5_SD_STATE_DOWN and drops
the devcom lock before calling sd_unregister(). At this point, the primary
PF is still registered in the devcom group, so mlx5_devcom_comp_is_ready()
remains true.

If a secondary PF is concurrently probed, could it acquire the devcom lock
and observe this intermediate state?

CPU1
mlx5_sd_cleanup()
    primary_sd->state = MLX5_SD_STATE_DOWN;
    mlx5_devcom_comp_unlock(sd->devcom);
    // Preempted before sd_unregister(dev)

CPU2 would see this in mlx5_sd_init():
    mlx5_devcom_comp_lock(sd->devcom);
    if (!mlx5_devcom_comp_is_ready(sd->devcom)) // returns true
        goto out;
    
    primary = mlx5_sd_get_primary(dev);
    primary_sd = mlx5_get_sd(primary);
    
    if (primary_sd->state == MLX5_SD_STATE_UP) // returns false
        goto out;
    ...
    debugfs_create_file("primary", 0400, primary_sd->dfs, primary, &dev_fops);

Would CPU2 proceed to initialize the group using the dying primary PF?

If CPU1 then resumes and finishes sd_cleanup(), it would free the primary_sd
structure and the primary device. This might leave the secondary PF with
dangling pointers in the debugfs files, potentially leading to a
use-after-free on access.

Furthermore, when the secondary is later removed, mlx5_sd_cleanup(secondary)
might see is_ready as false and skip the teardown entirely, leaving the
dangling debugfs files permanently.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup
  2026-04-02  3:09   ` Jakub Kicinski
@ 2026-04-02 19:49     ` Shay Drori
  0 siblings, 0 replies; 12+ messages in thread
From: Shay Drori @ 2026-04-02 19:49 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, mbloch, leon,
	horms, kees, parav, phaddad, gal, netdev, linux-rdma,
	linux-kernel



On 02/04/2026 6:09, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: SD: Serialize init/cleanup
> 
> This commit introduces a state tracker and locking around the Socket-Direct
> init and cleanup routines. It prevents multiple physical functions from
> concurrently bringing up or tearing down the SD group.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> index 4c80b9d25283..374f27b78fbe 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> 
> [ ... ]
> 
>> @@ -505,11 +520,16 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>>        if (!sd)
>>                return;
>>
>> +     mlx5_devcom_comp_lock(sd->devcom);
>>        if (!mlx5_devcom_comp_is_ready(sd->devcom))
>> -             goto out;
>> +             goto out_unlock;
>>
>>        primary = mlx5_sd_get_primary(dev);
>>        primary_sd = mlx5_get_sd(primary);
>> +
>> +     if (primary_sd->state != MLX5_SD_STATE_UP)
>> +             goto out_unlock;
>> +
>>        mlx5_sd_for_each_secondary(i, primary, pos)
>>                sd_cmd_unset_secondary(pos);
>>        sd_cmd_unset_primary(primary);
>> @@ -517,7 +537,9 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>>        primary_sd->dfs = NULL;
>>
>>        sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
>> -out:
>> +     primary_sd->state = MLX5_SD_STATE_DOWN;
>> +out_unlock:
>> +     mlx5_devcom_comp_unlock(sd->devcom);
>>        sd_unregister(dev);
>>        sd_cleanup(dev);
>>   }
> 
> Is there a race condition if a secondary PF calls mlx5_sd_init() while the
> primary PF is running mlx5_sd_cleanup()?
> 
> In mlx5_sd_cleanup(), the code sets the state to MLX5_SD_STATE_DOWN and drops
> the devcom lock before calling sd_unregister(). At this point, the primary
> PF is still registered in the devcom group, so mlx5_devcom_comp_is_ready()
> remains true.
> 
> If a secondary PF is concurrently probed, could it acquire the devcom lock
> and observe this intermediate state?
> 
> CPU1
> mlx5_sd_cleanup()
>      primary_sd->state = MLX5_SD_STATE_DOWN;
>      mlx5_devcom_comp_unlock(sd->devcom);
>      // Preempted before sd_unregister(dev)
> 
> CPU2 would see this in mlx5_sd_init():
>      mlx5_devcom_comp_lock(sd->devcom);
>      if (!mlx5_devcom_comp_is_ready(sd->devcom)) // returns true
>          goto out;
> 
>      primary = mlx5_sd_get_primary(dev);
>      primary_sd = mlx5_get_sd(primary);
> 
>      if (primary_sd->state == MLX5_SD_STATE_UP) // returns false
>          goto out;
>      ...
>      debugfs_create_file("primary", 0400, primary_sd->dfs, primary, &dev_fops);
> 
> Would CPU2 proceed to initialize the group using the dying primary PF?
> 
> If CPU1 then resumes and finishes sd_cleanup(), it would free the primary_sd
> structure and the primary device. This might leave the secondary PF with
> dangling pointers in the debugfs files, potentially leading to a
> use-after-free on access.
> 
> Furthermore, when the secondary is later removed, mlx5_sd_cleanup(secondary)
> might see is_ready as false and skip the teardown entirely, leaving the
> dangling debugfs files permanently.

Thank, Good catch. will fix in V2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary
  2026-04-02  3:09   ` Jakub Kicinski
@ 2026-04-02 19:50     ` Shay Drori
  0 siblings, 0 replies; 12+ messages in thread
From: Shay Drori @ 2026-04-02 19:50 UTC (permalink / raw)
  To: Jakub Kicinski, tariqt
  Cc: edumazet, pabeni, andrew+netdev, davem, saeedm, mbloch, leon,
	horms, kees, parav, phaddad, gal, netdev, linux-rdma,
	linux-kernel



On 02/04/2026 6:09, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: SD, Keep multi-pf debugfs entries on primary
> 
> This commit fixes an issue where the "multi-pf" debugfs directory was stored
> in the calling device's sd struct but created under the primary device's root.
> It explicitly stores the debugfs dentry in the primary device's sd struct.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> index 060649645012..4c80b9d25283 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> [ ... ]
>> @@ -491,6 +499,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>>   {
>>        struct mlx5_sd *sd = mlx5_get_sd(dev);
>>        struct mlx5_core_dev *primary, *pos;
>> +     struct mlx5_sd *primary_sd;
>>        int i;
>>
>>        if (!sd)
>> @@ -500,10 +509,12 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
>>                goto out;
>>
>>        primary = mlx5_sd_get_primary(dev);
>> +     primary_sd = mlx5_get_sd(primary);
>>        mlx5_sd_for_each_secondary(i, primary, pos)
>>                sd_cmd_unset_secondary(pos);
>>        sd_cmd_unset_primary(primary);
>> -     debugfs_remove_recursive(sd->dfs);
>> +     debugfs_remove_recursive(primary_sd->dfs);
>> +     primary_sd->dfs = NULL;
> 
> Can concurrent unbind operations cause a use-after-free and double-free
> regression here?
> 
> In mlx5_sd_cleanup(), the check !mlx5_devcom_comp_is_ready(sd->devcom) is
> performed locklessly. If multiple physical functions in the Socket-Direct
> group are unbound concurrently, multiple threads can observe ready == true
> and enter this teardown block simultaneously.
> 
> Since the cleanup code now operates on the shared primary_sd pointer instead
> of the local device's sd, the thread for the primary physical function might
> complete its unbind and execute sd_cleanup(dev), freeing its sd object
> (which is primary_sd).
> 
> Concurrently, the thread for a secondary physical function will dereference
> the freed primary_sd to access primary_sd->dfs, causing a use-after-free.
> Both threads could also pass the same pointer to debugfs_remove_recursive(),
> resulting in a double-free.
> 
> Note: This was fixed later in the patch series by commit 92528cec4e0b91deb
> ("net/mlx5: SD: Serialize init/cleanup").

This is correct, will re-arrange the patches so that this patch will be
after "net/mlx5: SD: Serialize init/cleanup".

> --
> pw-bot: cr


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-02  3:08   ` Jakub Kicinski
@ 2026-04-02 20:03     ` Shay Drori
  2026-04-03  0:45       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Shay Drori @ 2026-04-02 20:03 UTC (permalink / raw)
  To: Jakub Kicinski, Tariq Toukan
  Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	Saeed Mahameed, Mark Bloch, Leon Romanovsky, Simon Horman,
	Kees Cook, Parav Pandit, Patrisious Haddad, Gal Pressman, netdev,
	linux-rdma, linux-kernel



On 02/04/2026 6:08, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:
>> From: Shay Drory <shayd@nvidia.com>
>>
>> When utilizing Socket-Direct single netdev functionality the driver
>> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
>> the current implementation returns the primary ETH auxiliary device
>> without holding the device lock, leading to a potential race condition
>> where the ETH device could be unbound or removed concurrently during
>> probe, suspend, resume, or remove operations.[1]
>>
>> Fix this by introducing mlx5_sd_put_adev() and updating
>> mlx5_sd_get_adev() so that secondaries devices would acquire the device
>> lock of the returned auxiliary device. After the lock is acquired, a
>> second devcom check is needed[2].
>> In addition, update The callers to pair the get operation with the new
>> put operation, ensuring the lock is held while the auxiliary device is
>> being operated on and released afterwards.
> 
> Please explain why the "primary" designation is reliable, and therefore
> we can be sure there will be no ABBA deadlock here

The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock → primary_lock. The
reverse never happens, so ABBA deadlock is impossible.

Does the above is the explanation you looked for?
If not, can you elaborate?
If yes, to add it to the commit message in V2?

> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index b6c12460b54a..5761f655f488 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
>>                return err;
>>
>>        actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
>> -     if (actual_adev)
>> -             return _mlx5e_resume(actual_adev);
>> +     if (actual_adev) {
>> +             err = _mlx5e_resume(actual_adev);
>> +             mlx5_sd_put_adev(actual_adev, adev);
>> +             return err;
>> +     }
>>        return 0;
> 
> Feels like I recently complained about similar code y'all were trying
> to add. Magically and conditionally locking something in a get helper
> makes for extremely confusing code.

Do you think explicit locking API is preferred here?
something like:
new_locking_api()

mlx5_sd_get_adev()

new_unlocking_api()

thanks for the review

> --
> pw-bot: cr


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-02 20:03     ` Shay Drori
@ 2026-04-03  0:45       ` Jakub Kicinski
  2026-04-05 19:05         ` Shay Drori
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-03  0:45 UTC (permalink / raw)
  To: Shay Drori
  Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
	David S. Miller, Saeed Mahameed, Mark Bloch, Leon Romanovsky,
	Simon Horman, Kees Cook, Parav Pandit, Patrisious Haddad,
	Gal Pressman, netdev, linux-rdma, linux-kernel

On Thu, 2 Apr 2026 23:03:10 +0300 Shay Drori wrote:
> On 02/04/2026 6:08, Jakub Kicinski wrote:
> > On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:  
> >> From: Shay Drory <shayd@nvidia.com>
> >>
> >> When utilizing Socket-Direct single netdev functionality the driver
> >> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
> >> the current implementation returns the primary ETH auxiliary device
> >> without holding the device lock, leading to a potential race condition
> >> where the ETH device could be unbound or removed concurrently during
> >> probe, suspend, resume, or remove operations.[1]
> >>
> >> Fix this by introducing mlx5_sd_put_adev() and updating
> >> mlx5_sd_get_adev() so that secondaries devices would acquire the device
> >> lock of the returned auxiliary device. After the lock is acquired, a
> >> second devcom check is needed[2].
> >> In addition, update The callers to pair the get operation with the new
> >> put operation, ensuring the lock is held while the auxiliary device is
> >> being operated on and released afterwards.  
> > 
> > Please explain why the "primary" designation is reliable, and therefore
> > we can be sure there will be no ABBA deadlock here  
> 
> The "primary" designation is determined once in sd_register(). It's set
> before devcom is marked ready, and it never changes after that.
> In Addition, The primary path never locks a secondary: When the primary
> device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
> no additional lock is taken.
> Therefore lock ordering is always: secondary_lock → primary_lock. The
> reverse never happens, so ABBA deadlock is impossible.

And the device_lock instances have separate lockdep classes?
So lockdep will also understand this?

> Does the above is the explanation you looked for?
> If not, can you elaborate?
> If yes, to add it to the commit message in V2?

Sounds good, please add to the msg.

> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> >> index b6c12460b54a..5761f655f488 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> >> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
> >>                return err;
> >>
> >>        actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> >> -     if (actual_adev)
> >> -             return _mlx5e_resume(actual_adev);
> >> +     if (actual_adev) {
> >> +             err = _mlx5e_resume(actual_adev);
> >> +             mlx5_sd_put_adev(actual_adev, adev);
> >> +             return err;
> >> +     }
> >>        return 0;  
> > 
> > Feels like I recently complained about similar code y'all were trying
> > to add. Magically and conditionally locking something in a get helper
> > makes for extremely confusing code.  
> 
> Do you think explicit locking API is preferred here?
> something like:
> new_locking_api()
> 
> mlx5_sd_get_adev()
> 
> new_unlocking_api()

Readability is hard, I'd just push the locking up into the callers TBH.
Looks like there's only 4, the LoC delta isn't going to be huge.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
  2026-04-03  0:45       ` Jakub Kicinski
@ 2026-04-05 19:05         ` Shay Drori
  0 siblings, 0 replies; 12+ messages in thread
From: Shay Drori @ 2026-04-05 19:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn,
	David S. Miller, Saeed Mahameed, Mark Bloch, Leon Romanovsky,
	Simon Horman, Kees Cook, Parav Pandit, Patrisious Haddad,
	Gal Pressman, netdev, linux-rdma, linux-kernel



On 03/04/2026 3:45, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 2 Apr 2026 23:03:10 +0300 Shay Drori wrote:
>> On 02/04/2026 6:08, Jakub Kicinski wrote:
>>> On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:
>>>> From: Shay Drory <shayd@nvidia.com>
>>>>
>>>> When utilizing Socket-Direct single netdev functionality the driver
>>>> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
>>>> the current implementation returns the primary ETH auxiliary device
>>>> without holding the device lock, leading to a potential race condition
>>>> where the ETH device could be unbound or removed concurrently during
>>>> probe, suspend, resume, or remove operations.[1]
>>>>
>>>> Fix this by introducing mlx5_sd_put_adev() and updating
>>>> mlx5_sd_get_adev() so that secondaries devices would acquire the device
>>>> lock of the returned auxiliary device. After the lock is acquired, a
>>>> second devcom check is needed[2].
>>>> In addition, update The callers to pair the get operation with the new
>>>> put operation, ensuring the lock is held while the auxiliary device is
>>>> being operated on and released afterwards.
>>>
>>> Please explain why the "primary" designation is reliable, and therefore
>>> we can be sure there will be no ABBA deadlock here
>>
>> The "primary" designation is determined once in sd_register(). It's set
>> before devcom is marked ready, and it never changes after that.
>> In Addition, The primary path never locks a secondary: When the primary
>> device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
>> no additional lock is taken.
>> Therefore lock ordering is always: secondary_lock → primary_lock. The
>> reverse never happens, so ABBA deadlock is impossible.
> 
> And the device_lock instances have separate lockdep classes?
> So lockdep will also understand this?

I tested this patches with lockdep enable and didn't get a splat,
so it seems lockdep understand.

> 
>> Does the above is the explanation you looked for?
>> If not, can you elaborate?
>> If yes, to add it to the commit message in V2?
> 
> Sounds good, please add to the msg.
> 
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> index b6c12460b54a..5761f655f488 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
>>>>                 return err;
>>>>
>>>>         actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
>>>> -     if (actual_adev)
>>>> -             return _mlx5e_resume(actual_adev);
>>>> +     if (actual_adev) {
>>>> +             err = _mlx5e_resume(actual_adev);
>>>> +             mlx5_sd_put_adev(actual_adev, adev);
>>>> +             return err;
>>>> +     }
>>>>         return 0;
>>>
>>> Feels like I recently complained about similar code y'all were trying
>>> to add. Magically and conditionally locking something in a get helper
>>> makes for extremely confusing code.
>>
>> Do you think explicit locking API is preferred here?
>> something like:
>> new_locking_api()
>>
>> mlx5_sd_get_adev()
>>
>> new_unlocking_api()
> 
> Readability is hard, I'd just push the locking up into the callers TBH.
> Looks like there's only 4, the LoC delta isn't going to be huge.

I though about it, but AFAIU your suggestion, a race is still possible:
your suggestion is to lock the adev returned from mlx5_sd_get_adev().
but between mlx5_sd_get_adev() and the lock, the adev can be free...

Therefore, an SD helper is still needed.
do you have a preferred approach here?

thanks


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-04-05 19:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-30 19:34 [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2026-04-02  3:08   ` Jakub Kicinski
2026-04-02 20:03     ` Shay Drori
2026-04-03  0:45       ` Jakub Kicinski
2026-04-05 19:05         ` Shay Drori
2026-03-30 19:34 ` [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
2026-04-02  3:09   ` Jakub Kicinski
2026-04-02 19:50     ` Shay Drori
2026-03-30 19:34 ` [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
2026-04-02  3:09   ` Jakub Kicinski
2026-04-02 19:49     ` Shay Drori

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox