* [PATCH net V2 0/3] net/mlx5: Fixes for Socket-Direct
@ 2026-04-13 10:53 Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Tariq Toukan @ 2026-04-13 10:53 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel,
Dragos Tatulea
Hi,
This series fixes several race conditions and bugs in the mlx5
Socket-Direct (SD) single netdev flow.
Patch 1 serializes mlx5_sd_init()/mlx5_sd_cleanup() with
mlx5_devcom_comp_lock() and tracks the SD group state on the primary
device, preventing concurrent or duplicate bring-up/tear-down.
Patch 2 fixes the debugfs "multi-pf" directory being stored on the
calling device's sd struct instead of the primary's, which caused
memory leaks and recreation errors when cleanup ran from a different PF.
Patch 3 fixes a race where a secondary PF could access the primary's
auxiliary device after it had been unbound, by holding the primary's
device lock while operating on its auxiliary device.
Regards,
Tariq
V2:
- Link to V1:
https://lore.kernel.org/all/20260330193412.53408-1-tariqt@nvidia.com/
- Reorder the patches so that "net/mlx5: SD: Serialize init/cleanup"
is first.
- Add MLX5_SD_STATE_DESTROYING to the patch above to solve a concurrent
edge case.
- Expend commit message of "net/mlx5e: SD, Fix race condition in
secondary device probe/remove"
Shay Drory (3):
net/mlx5: SD: Serialize init/cleanup
net/mlx5: SD, Keep multi-pf debugfs entries on primary
net/mlx5e: SD, Fix race condition in secondary device probe/remove
.../net/ethernet/mellanox/mlx5/core/en_main.c | 18 +++--
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 70 ++++++++++++++++---
.../net/ethernet/mellanox/mlx5/core/lib/sd.h | 2 +
3 files changed, 77 insertions(+), 13 deletions(-)
base-commit: 2dddb34dd0d07b01fa770eca89480a4da4f13153
--
2.44.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup
2026-04-13 10:53 [PATCH net V2 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
@ 2026-04-13 10:53 ` Tariq Toukan
2026-04-16 11:00 ` Paolo Abeni
2026-04-13 10:53 ` [PATCH net V2 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2 siblings, 1 reply; 6+ messages in thread
From: Tariq Toukan @ 2026-04-13 10:53 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel,
Dragos Tatulea
From: Shay Drory <shayd@nvidia.com>
mlx5_sd_init() / mlx5_sd_cleanup() may run from multiple PFs in the same
Socket-Direct group. This can cause the SD bring-up/tear-down sequence
to be executed more than once or interleaved across PFs.
Protect SD init/cleanup with mlx5_devcom_comp_lock() and track the SD
group state on the primary device. Skip init if the primary is already
UP, and skip cleanup unless the primary is UP.
Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 35 +++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 954942ad93c5..a34860ad231a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -18,6 +18,7 @@ struct mlx5_sd {
u8 host_buses;
struct mlx5_devcom_comp_dev *devcom;
struct dentry *dfs;
+ u8 state;
bool primary;
union {
struct { /* primary */
@@ -31,6 +32,12 @@ struct mlx5_sd {
};
};
+enum mlx5_sd_state {
+ MLX5_SD_STATE_DOWN = 0,
+ MLX5_SD_STATE_UP,
+ MLX5_SD_STATE_DESTROYING,
+};
+
static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
{
struct mlx5_sd *sd = mlx5_get_sd(dev);
@@ -426,6 +433,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
struct mlx5_core_dev *primary, *pos, *to;
struct mlx5_sd *sd = mlx5_get_sd(dev);
u8 alias_key[ACCESS_KEY_LEN];
+ struct mlx5_sd *primary_sd;
int err, i;
err = sd_init(dev);
@@ -440,10 +448,15 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
if (err)
goto err_sd_cleanup;
+ mlx5_devcom_comp_lock(sd->devcom);
if (!mlx5_devcom_comp_is_ready(sd->devcom))
- return 0;
+ goto out;
primary = mlx5_sd_get_primary(dev);
+ primary_sd = mlx5_get_sd(primary);
+
+ if (primary_sd->state != MLX5_SD_STATE_DOWN)
+ goto out;
for (i = 0; i < ACCESS_KEY_LEN; i++)
alias_key[i] = get_random_u8();
@@ -472,6 +485,9 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
sd->group_id, mlx5_devcom_comp_get_size(sd->devcom));
sd_print_group(primary);
+ primary_sd->state = MLX5_SD_STATE_UP;
+out:
+ mlx5_devcom_comp_unlock(sd->devcom);
return 0;
err_unset_secondaries:
@@ -481,6 +497,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
sd_cmd_unset_primary(primary);
debugfs_remove_recursive(sd->dfs);
err_sd_unregister:
+ mlx5_devcom_comp_unlock(sd->devcom);
sd_unregister(dev);
err_sd_cleanup:
sd_cleanup(dev);
@@ -491,23 +508,35 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
{
struct mlx5_sd *sd = mlx5_get_sd(dev);
struct mlx5_core_dev *primary, *pos;
+ struct mlx5_sd *primary_sd = NULL;
int i;
if (!sd)
return;
+ mlx5_devcom_comp_lock(sd->devcom);
if (!mlx5_devcom_comp_is_ready(sd->devcom))
- goto out;
+ goto out_unlock;
primary = mlx5_sd_get_primary(dev);
+ primary_sd = mlx5_get_sd(primary);
+
+ if (primary_sd->state != MLX5_SD_STATE_UP)
+ goto out_unlock;
+
mlx5_sd_for_each_secondary(i, primary, pos)
sd_cmd_unset_secondary(pos);
sd_cmd_unset_primary(primary);
debugfs_remove_recursive(sd->dfs);
sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
-out:
+ primary_sd->state = MLX5_SD_STATE_DESTROYING;
+out_unlock:
+ mlx5_devcom_comp_unlock(sd->devcom);
sd_unregister(dev);
+ if (primary_sd)
+ /* devcom isn't ready, reset the state */
+ primary_sd->state = MLX5_SD_STATE_DOWN;
sd_cleanup(dev);
}
--
2.44.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net V2 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary
2026-04-13 10:53 [PATCH net V2 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
@ 2026-04-13 10:53 ` Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2 siblings, 0 replies; 6+ messages in thread
From: Tariq Toukan @ 2026-04-13 10:53 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel,
Dragos Tatulea
From: Shay Drory <shayd@nvidia.com>
mlx5_sd_init() creates the "multi-pf" debugfs directory under the
primary device debugfs root, but stored the dentry in the calling
device's sd struct. When sd_cleanup() run on a different PF,
this leads to using the wrong sd->dfs for removing entries, which
results in memory leak and an error in when re-creating the SD.[1]
Fix it by explicitly storing the debugfs dentry in the primary
device sd struct and use it for all per-group files.
[1]
debugfs: 'multi-pf' already exists in '0000:08:00.1'
Fixes: 4375130bf527 ("net/mlx5: SD, Add debugfs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 20 ++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index a34860ad231a..a5e2e0a411df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -465,9 +465,13 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
if (err)
goto err_sd_unregister;
- sd->dfs = debugfs_create_dir("multi-pf", mlx5_debugfs_get_dev_root(primary));
- debugfs_create_x32("group_id", 0400, sd->dfs, &sd->group_id);
- debugfs_create_file("primary", 0400, sd->dfs, primary, &dev_fops);
+ primary_sd->dfs =
+ debugfs_create_dir("multi-pf",
+ mlx5_debugfs_get_dev_root(primary));
+ debugfs_create_x32("group_id", 0400, primary_sd->dfs,
+ &primary_sd->group_id);
+ debugfs_create_file("primary", 0400, primary_sd->dfs, primary,
+ &dev_fops);
mlx5_sd_for_each_secondary(i, primary, pos) {
char name[32];
@@ -477,7 +481,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
goto err_unset_secondaries;
snprintf(name, sizeof(name), "secondary_%d", i - 1);
- debugfs_create_file(name, 0400, sd->dfs, pos, &dev_fops);
+ debugfs_create_file(name, 0400, primary_sd->dfs, pos,
+ &dev_fops);
}
@@ -495,7 +500,8 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
mlx5_sd_for_each_secondary_to(i, primary, to, pos)
sd_cmd_unset_secondary(pos);
sd_cmd_unset_primary(primary);
- debugfs_remove_recursive(sd->dfs);
+ debugfs_remove_recursive(primary_sd->dfs);
+ primary_sd->dfs = NULL;
err_sd_unregister:
mlx5_devcom_comp_unlock(sd->devcom);
sd_unregister(dev);
@@ -520,14 +526,14 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
primary = mlx5_sd_get_primary(dev);
primary_sd = mlx5_get_sd(primary);
-
if (primary_sd->state != MLX5_SD_STATE_UP)
goto out_unlock;
mlx5_sd_for_each_secondary(i, primary, pos)
sd_cmd_unset_secondary(pos);
sd_cmd_unset_primary(primary);
- debugfs_remove_recursive(sd->dfs);
+ debugfs_remove_recursive(primary_sd->dfs);
+ primary_sd->dfs = NULL;
sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
primary_sd->state = MLX5_SD_STATE_DESTROYING;
--
2.44.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
2026-04-13 10:53 [PATCH net V2 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
@ 2026-04-13 10:53 ` Tariq Toukan
2026-04-16 11:07 ` Paolo Abeni
2 siblings, 1 reply; 6+ messages in thread
From: Tariq Toukan @ 2026-04-13 10:53 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Shay Drory, Simon Horman, Kees Cook, Parav Pandit,
Patrisious Haddad, Gal Pressman, netdev, linux-rdma, linux-kernel,
Dragos Tatulea
From: Shay Drory <shayd@nvidia.com>
When utilizing Socket-Direct single netdev functionality the driver
resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
the current implementation returns the primary ETH auxiliary device
without holding the device lock, leading to a potential race condition
where the ETH device could be unbound or removed concurrently during
probe, suspend, resume, or remove operations.[1]
Fix this by introducing mlx5_sd_put_adev() and updating
mlx5_sd_get_adev() so that secondaries devices would acquire the device
lock of the returned auxiliary device. After the lock is acquired, a
second devcom check is needed[2].
In addition, update The callers to pair the get operation with the new
put operation, ensuring the lock is held while the auxiliary device is
being operated on and released afterwards.
The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock -> primary_lock. The
reverse never happens, so ABBA deadlock is impossible.
[1]
for example:
BUG: kernel NULL pointer dereference, address: 0000000000000370
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 4 UID: 0 PID: 3945 Comm: bash Not tainted 6.19.0-rc3+ #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5e_dcbnl_dscp_app+0x23/0x100 [mlx5_core]
Call Trace:
<TASK>
mlx5e_remove+0x82/0x12a [mlx5_core]
device_release_driver_internal+0x194/0x1f0
bus_remove_device+0xc6/0x140
device_del+0x159/0x3c0
? devl_param_driverinit_value_get+0x29/0x80
mlx5_rescan_drivers_locked+0x92/0x160 [mlx5_core]
mlx5_unregister_device+0x34/0x50 [mlx5_core]
mlx5_uninit_one+0x43/0xb0 [mlx5_core]
remove_one+0x4e/0xc0 [mlx5_core]
pci_device_remove+0x39/0xa0
device_release_driver_internal+0x194/0x1f0
unbind_store+0x99/0xa0
kernfs_fop_write_iter+0x12e/0x1e0
vfs_write+0x215/0x3d0
ksys_write+0x5f/0xd0
do_syscall_64+0x55/0xe90
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[2]
CPU0 (primary) CPU1 (secondary)
==========================================================================
mlx5e_remove() (device_lock held)
mlx5e_remove() (2nd device_lock held)
mlx5_sd_get_adev()
mlx5_devcom_comp_is_ready() => true
device_lock(primary)
mlx5_sd_get_adev() ==> ret adev
_mlx5e_remove()
mlx5_sd_cleanup()
// mlx5e_remove finished
// releasing device_lock
//need another check here...
mlx5_devcom_comp_is_ready() => false
Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++++++++++++++----
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 17 +++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/lib/sd.h | 2 ++
3 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0b8b44bbcb9e..11f80158e107 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
return err;
actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_resume(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_resume(actual_adev);
+ mlx5_sd_put_adev(actual_adev, adev);
+ return err;
+ }
return 0;
}
@@ -6698,6 +6701,8 @@ static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
err = _mlx5e_suspend(actual_adev, false);
mlx5_sd_cleanup(mdev);
+ if (actual_adev)
+ mlx5_sd_put_adev(actual_adev, adev);
return err;
}
@@ -6795,8 +6800,11 @@ static int mlx5e_probe(struct auxiliary_device *adev,
return err;
actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_probe(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_probe(actual_adev);
+ mlx5_sd_put_adev(actual_adev, adev);
+ return err;
+ }
return 0;
}
@@ -6849,6 +6857,8 @@ static void mlx5e_remove(struct auxiliary_device *adev)
_mlx5e_remove(actual_adev);
mlx5_sd_cleanup(mdev);
+ if (actual_adev)
+ mlx5_sd_put_adev(actual_adev, adev);
}
static const struct auxiliary_device_id mlx5e_id_table[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index a5e2e0a411df..6cece851b102 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -546,6 +546,10 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
sd_cleanup(dev);
}
+/* Cannot take devcom lock as a gate for device lock. ABBA deadlock:
+ * primary: actual_adev_lock -> SD devcom comp lock
+ * secondary: SD devcom comp lock -> actual_adev_lock
+ */
struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
struct auxiliary_device *adev,
int idx)
@@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
if (dev == primary)
return adev;
+ device_lock(&primary->priv.adev[idx]->adev.dev);
+ /* In case primary finish removing its adev */
+ if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
+ device_unlock(&primary->priv.adev[idx]->adev.dev);
+ return NULL;
+ }
return &primary->priv.adev[idx]->adev;
}
+
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+ struct auxiliary_device *adev)
+{
+ if (actual_adev != adev)
+ device_unlock(&actual_adev->dev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
index 137efaf9aabc..9bfd5b9756b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -15,6 +15,8 @@ struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int c
struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
struct auxiliary_device *adev,
int idx);
+void mlx5_sd_put_adev(struct auxiliary_device *actual_adev,
+ struct auxiliary_device *adev);
int mlx5_sd_init(struct mlx5_core_dev *dev);
void mlx5_sd_cleanup(struct mlx5_core_dev *dev);
--
2.44.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup
2026-04-13 10:53 ` [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
@ 2026-04-16 11:00 ` Paolo Abeni
0 siblings, 0 replies; 6+ messages in thread
From: Paolo Abeni @ 2026-04-16 11:00 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Mark Bloch, Leon Romanovsky, Shay Drory,
Simon Horman, Kees Cook, Parav Pandit, Patrisious Haddad,
Gal Pressman, netdev, linux-rdma, linux-kernel, Dragos Tatulea
On 4/13/26 12:53 PM, Tariq Toukan wrote:
> @@ -491,23 +508,35 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
> {
> struct mlx5_sd *sd = mlx5_get_sd(dev);
> struct mlx5_core_dev *primary, *pos;
> + struct mlx5_sd *primary_sd = NULL;
> int i;
>
> if (!sd)
> return;
>
> + mlx5_devcom_comp_lock(sd->devcom);
> if (!mlx5_devcom_comp_is_ready(sd->devcom))
> - goto out;
> + goto out_unlock;
>
> primary = mlx5_sd_get_primary(dev);
> + primary_sd = mlx5_get_sd(primary);
> +
> + if (primary_sd->state != MLX5_SD_STATE_UP)
> + goto out_unlock;
> +
> mlx5_sd_for_each_secondary(i, primary, pos)
> sd_cmd_unset_secondary(pos);
> sd_cmd_unset_primary(primary);
> debugfs_remove_recursive(sd->dfs);
>
> sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
> -out:
> + primary_sd->state = MLX5_SD_STATE_DESTROYING;
> +out_unlock:
> + mlx5_devcom_comp_unlock(sd->devcom);
> sd_unregister(dev);
> + if (primary_sd)
> + /* devcom isn't ready, reset the state */
> + primary_sd->state = MLX5_SD_STATE_DOWN;
Sashiko says:
---
Since primary_sd is only populated if devcom is ready, this condition will
never trigger when devcom isn't ready, contrary to the comment. Instead,
it triggers on the normal path after devcom is ready, immediately
overwriting
MLX5_SD_STATE_DESTROYING with MLX5_SD_STATE_DOWN outside the lock.
Could this allow concurrent mlx5_sd_init() calls to see the down state and
attempt hardware re-initialization while the group is still being torn down?
Also, can this race and cause a use-after-free regression?
During a concurrent Socket-Direct group teardown, the primary PF and
secondary
PF can execute mlx5_sd_cleanup() in parallel.
If the primary PF completes its cleanup first, it will call
sd_cleanup(primary)
which calls kfree() on the sd structure, freeing the primary_sd memory.
If the secondary PF is preempted just after releasing the devcom lock,
it will resume, evaluate its local non-NULL primary_sd pointer, and
locklessly
write to primary_sd->state. Does this dereference the freed memory of the
primary PF?
---
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
2026-04-13 10:53 ` [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
@ 2026-04-16 11:07 ` Paolo Abeni
0 siblings, 0 replies; 6+ messages in thread
From: Paolo Abeni @ 2026-04-16 11:07 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Mark Bloch, Leon Romanovsky, Shay Drory,
Simon Horman, Kees Cook, Parav Pandit, Patrisious Haddad,
Gal Pressman, netdev, linux-rdma, linux-kernel, Dragos Tatulea
On 4/13/26 12:53 PM, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
>
> When utilizing Socket-Direct single netdev functionality the driver
> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
> the current implementation returns the primary ETH auxiliary device
> without holding the device lock, leading to a potential race condition
> where the ETH device could be unbound or removed concurrently during
> probe, suspend, resume, or remove operations.[1]
>
> Fix this by introducing mlx5_sd_put_adev() and updating
> mlx5_sd_get_adev() so that secondaries devices would acquire the device
> lock of the returned auxiliary device. After the lock is acquired, a
> second devcom check is needed[2].
> In addition, update The callers to pair the get operation with the new
> put operation, ensuring the lock is held while the auxiliary device is
> being operated on and released afterwards.
>
> The "primary" designation is determined once in sd_register(). It's set
> before devcom is marked ready, and it never changes after that.
> In Addition, The primary path never locks a secondary: When the primary
> device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
> no additional lock is taken.
> Therefore lock ordering is always: secondary_lock -> primary_lock. The
> reverse never happens, so ABBA deadlock is impossible.
>
> [1]
> for example:
> BUG: kernel NULL pointer dereference, address: 0000000000000370
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP
> CPU: 4 UID: 0 PID: 3945 Comm: bash Not tainted 6.19.0-rc3+ #1 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> RIP: 0010:mlx5e_dcbnl_dscp_app+0x23/0x100 [mlx5_core]
> Call Trace:
> <TASK>
> mlx5e_remove+0x82/0x12a [mlx5_core]
> device_release_driver_internal+0x194/0x1f0
> bus_remove_device+0xc6/0x140
> device_del+0x159/0x3c0
> ? devl_param_driverinit_value_get+0x29/0x80
> mlx5_rescan_drivers_locked+0x92/0x160 [mlx5_core]
> mlx5_unregister_device+0x34/0x50 [mlx5_core]
> mlx5_uninit_one+0x43/0xb0 [mlx5_core]
> remove_one+0x4e/0xc0 [mlx5_core]
> pci_device_remove+0x39/0xa0
> device_release_driver_internal+0x194/0x1f0
> unbind_store+0x99/0xa0
> kernfs_fop_write_iter+0x12e/0x1e0
> vfs_write+0x215/0x3d0
> ksys_write+0x5f/0xd0
> do_syscall_64+0x55/0xe90
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> [2]
> CPU0 (primary) CPU1 (secondary)
> ==========================================================================
> mlx5e_remove() (device_lock held)
> mlx5e_remove() (2nd device_lock held)
> mlx5_sd_get_adev()
> mlx5_devcom_comp_is_ready() => true
> device_lock(primary)
> mlx5_sd_get_adev() ==> ret adev
> _mlx5e_remove()
> mlx5_sd_cleanup()
> // mlx5e_remove finished
> // releasing device_lock
> //need another check here...
> mlx5_devcom_comp_is_ready() => false
>
> Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> .../net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++++++++++++++----
> .../net/ethernet/mellanox/mlx5/core/lib/sd.c | 17 +++++++++++++++++
> .../net/ethernet/mellanox/mlx5/core/lib/sd.h | 2 ++
> 3 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 0b8b44bbcb9e..11f80158e107 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
> return err;
>
> actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> - if (actual_adev)
> - return _mlx5e_resume(actual_adev);
> + if (actual_adev) {
> + err = _mlx5e_resume(actual_adev);
> + mlx5_sd_put_adev(actual_adev, adev);
> + return err;
> + }
> return 0;
> }
>
> @@ -6698,6 +6701,8 @@ static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
> err = _mlx5e_suspend(actual_adev, false);
>
> mlx5_sd_cleanup(mdev);
> + if (actual_adev)
> + mlx5_sd_put_adev(actual_adev, adev);
> return err;
> }
>
> @@ -6795,8 +6800,11 @@ static int mlx5e_probe(struct auxiliary_device *adev,
> return err;
>
> actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> - if (actual_adev)
> - return _mlx5e_probe(actual_adev);
> + if (actual_adev) {
> + err = _mlx5e_probe(actual_adev);
> + mlx5_sd_put_adev(actual_adev, adev);
Sashiko says:
---
If _mlx5e_probe(actual_adev) fails, it frees mlx5e_dev but leaves the
auxiliary device's drvdata pointing to it. Furthermore, mlx5e_probe()
returns the error without calling mlx5_sd_cleanup(mdev), leaving devcom
incorrectly marked as ready.
If the primary device is later unbound, mlx5e_remove() will see that
devcom is ready, call _mlx5e_remove(), and blindly dereference the
dangling mlx5e_dev pointer.
Is there a missing cleanup step here to clear drvdata or reset the sd
state on failure?
---
Please try to address AI comments (i.e. explaining why not relevant)
proactively.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-16 11:07 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-13 10:53 [PATCH net V2 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 1/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
2026-04-16 11:00 ` Paolo Abeni
2026-04-13 10:53 ` [PATCH net V2 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
2026-04-13 10:53 ` [PATCH net V2 3/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2026-04-16 11:07 ` Paolo Abeni
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox