From: Shay Drori <shayd@nvidia.com>
To: Jakub Kicinski <kuba@kernel.org>, Tariq Toukan <tariqt@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Saeed Mahameed <saeedm@nvidia.com>,
Mark Bloch <mbloch@nvidia.com>,
"Leon Romanovsky" <leon@kernel.org>,
Simon Horman <horms@kernel.org>, Kees Cook <kees@kernel.org>,
Parav Pandit <parav@nvidia.com>,
Patrisious Haddad <phaddad@nvidia.com>,
Gal Pressman <gal@nvidia.com>, <netdev@vger.kernel.org>,
<linux-rdma@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
Date: Thu, 2 Apr 2026 23:03:10 +0300 [thread overview]
Message-ID: <53c3bf8c-0b0e-4986-91f3-3eec53fc2b1a@nvidia.com> (raw)
In-Reply-To: <20260401200842.79322a24@kernel.org>
On 02/04/2026 6:08, Jakub Kicinski wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:
>> From: Shay Drory <shayd@nvidia.com>
>>
>> When utilizing Socket-Direct single netdev functionality the driver
>> resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
>> the current implementation returns the primary ETH auxiliary device
>> without holding the device lock, leading to a potential race condition
>> where the ETH device could be unbound or removed concurrently during
>> probe, suspend, resume, or remove operations.[1]
>>
>> Fix this by introducing mlx5_sd_put_adev() and updating
>> mlx5_sd_get_adev() so that secondaries devices would acquire the device
>> lock of the returned auxiliary device. After the lock is acquired, a
>> second devcom check is needed[2].
>> In addition, update The callers to pair the get operation with the new
>> put operation, ensuring the lock is held while the auxiliary device is
>> being operated on and released afterwards.
>
> Please explain why the "primary" designation is reliable, and therefore
> we can be sure there will be no ABBA deadlock here
The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock → primary_lock. The
reverse never happens, so ABBA deadlock is impossible.
Does the above is the explanation you looked for?
If not, can you elaborate?
If yes, to add it to the commit message in V2?
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index b6c12460b54a..5761f655f488 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
>> return err;
>>
>> actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
>> - if (actual_adev)
>> - return _mlx5e_resume(actual_adev);
>> + if (actual_adev) {
>> + err = _mlx5e_resume(actual_adev);
>> + mlx5_sd_put_adev(actual_adev, adev);
>> + return err;
>> + }
>> return 0;
>
> Feels like I recently complained about similar code y'all were trying
> to add. Magically and conditionally locking something in a get helper
> makes for extremely confusing code.
Do you think explicit locking API is preferred here?
something like:
new_locking_api()
mlx5_sd_get_adev()
new_unlocking_api()
thanks for the review
> --
> pw-bot: cr
next prev parent reply other threads:[~2026-04-02 20:03 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-30 19:34 [PATCH net 0/3] net/mlx5: Fixes for Socket-Direct Tariq Toukan
2026-03-30 19:34 ` [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove Tariq Toukan
2026-04-02 3:08 ` Jakub Kicinski
2026-04-02 20:03 ` Shay Drori [this message]
2026-04-03 0:45 ` Jakub Kicinski
2026-04-05 19:05 ` Shay Drori
2026-03-30 19:34 ` [PATCH net 2/3] net/mlx5: SD, Keep multi-pf debugfs entries on primary Tariq Toukan
2026-04-02 3:09 ` Jakub Kicinski
2026-04-02 19:50 ` Shay Drori
2026-03-30 19:34 ` [PATCH net 3/3] net/mlx5: SD: Serialize init/cleanup Tariq Toukan
2026-04-02 3:09 ` Jakub Kicinski
2026-04-02 19:49 ` Shay Drori
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53c3bf8c-0b0e-4986-91f3-3eec53fc2b1a@nvidia.com \
--to=shayd@nvidia.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=gal@nvidia.com \
--cc=horms@kernel.org \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=parav@nvidia.com \
--cc=phaddad@nvidia.com \
--cc=saeedm@nvidia.com \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox