* [PATCH net 1/6] net/mlx5: Fix multiport device check over light SFs
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 22:25 ` Jacob Keller
2026-02-12 10:32 ` [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop Tariq Toukan
` (4 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drory
From: Shay Drory <shayd@nvidia.com>
Driver is using num_vhca_ports capability to distinguish between
multiport master device and multiport slave device. num_vhca_ports is a
capability the driver sets according to the MAX num_vhca_ports
capability reported by FW. On the other hand, light SFs don't set the
above capability.

This leads to wrong results whenever a light SF checks whether it is
a multiport master or slave.

Therefore, use the MAX capability to distinguish between master and
slave devices.
Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
include/linux/mlx5/driver.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 1c54aa6f74fb..1967d1c79139 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1281,12 +1281,12 @@ static inline bool mlx5_rl_is_supported(struct mlx5_core_dev *dev)
static inline int mlx5_core_is_mp_slave(struct mlx5_core_dev *dev)
{
return MLX5_CAP_GEN(dev, affiliate_nic_vport_criteria) &&
- MLX5_CAP_GEN(dev, num_vhca_ports) <= 1;
+ MLX5_CAP_GEN_MAX(dev, num_vhca_ports) <= 1;
}
static inline int mlx5_core_is_mp_master(struct mlx5_core_dev *dev)
{
- return MLX5_CAP_GEN(dev, num_vhca_ports) > 1;
+ return MLX5_CAP_GEN_MAX(dev, num_vhca_ports) > 1;
}
static inline int mlx5_core_mp_enabled(struct mlx5_core_dev *dev)
--
2.44.0
^ permalink raw reply related [flat|nested] 13+ messages in thread

* Re: [PATCH net 1/6] net/mlx5: Fix multiport device check over light SFs
2026-02-12 10:32 ` [PATCH net 1/6] net/mlx5: Fix multiport device check over light SFs Tariq Toukan
@ 2026-02-12 22:25 ` Jacob Keller
0 siblings, 0 replies; 13+ messages in thread
From: Jacob Keller @ 2026-02-12 22:25 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drory
On 2/12/2026 2:32 AM, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
>
> Driver is using num_vhca_ports capability to distinguish between
> multiport master device and multiport slave device. num_vhca_ports is a
> capability the driver sets according to the MAX num_vhca_ports
> capability reported by FW. On the other hand, light SFs don't set the
> above capability.
>
> This leads to wrong results whenever a light SF checks whether it is
> a multiport master or slave.
>
> Therefore, use the MAX capability to distinguish between master and
> slave devices.
>
So we were previously checking the number of VHCA ports, but since SFs
set this to 0, they would always be reported as mp_slave, even though
they should be mp_master.
Makes sense.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
* [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
2026-02-12 10:32 ` [PATCH net 1/6] net/mlx5: Fix multiport device check over light SFs Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 22:36 ` Jacob Keller
2026-02-12 10:32 ` [PATCH net 3/6] net/mlx5e: Fix misidentification of ASO " Tariq Toukan
` (3 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Jianbo Liu
From: Gal Pressman <gal@nvidia.com>
The write combining completion poll loop uses usleep_range() which can
sleep much longer than requested due to scheduler latency. Under load,
we witnessed a 20ms+ delay until the process was rescheduled, causing
the jiffies based timeout to expire while the thread is sleeping.
The original do-while loop structure (poll, sleep, check timeout) would
exit without a final poll when waking after timeout, missing a CQE that
arrived during sleep.
Restructure the loop by moving the poll into the while condition,
ensuring we always poll after sleeping, catching CQEs that arrived
during that time.
While at it, remove the redundant 'err' assignment.
Fixes: d98995b4bf98 ("net/mlx5: Reimplement write combining test")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/wc.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wc.c b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
index 815a7c97d6b0..29db15c4b978 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
@@ -390,12 +390,10 @@ static void mlx5_core_test_wc(struct mlx5_core_dev *mdev)
mlx5_wc_post_nop(sq, &offset, true);
expires = jiffies + TEST_WC_POLLING_MAX_TIME_JIFFIES;
- do {
- err = mlx5_wc_poll_cq(sq);
- if (err)
- usleep_range(2, 10);
- } while (mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED &&
- time_is_after_jiffies(expires));
+ while ((mlx5_wc_poll_cq(sq),
+ mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED) &&
+ time_is_after_jiffies(expires))
+ usleep_range(2, 10);
mlx5_wc_destroy_sq(sq);
--
2.44.0
* Re: [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop
2026-02-12 10:32 ` [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop Tariq Toukan
@ 2026-02-12 22:36 ` Jacob Keller
2026-02-15 12:13 ` Gal Pressman
0 siblings, 1 reply; 13+ messages in thread
From: Jacob Keller @ 2026-02-12 22:36 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Gal Pressman, Moshe Shemesh, Jianbo Liu
On 2/12/2026 2:32 AM, Tariq Toukan wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wc.c b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> index 815a7c97d6b0..29db15c4b978 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> @@ -390,12 +390,10 @@ static void mlx5_core_test_wc(struct mlx5_core_dev *mdev)
> mlx5_wc_post_nop(sq, &offset, true);
>
> expires = jiffies + TEST_WC_POLLING_MAX_TIME_JIFFIES;
> - do {
> - err = mlx5_wc_poll_cq(sq);
> - if (err)
> - usleep_range(2, 10);
> - } while (mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED &&
> - time_is_after_jiffies(expires));
> + while ((mlx5_wc_poll_cq(sq),
> + mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED) &&
> + time_is_after_jiffies(expires))
> + usleep_range(2, 10);
>
This could be written with poll_timeout_us(), but I don't know if it
warrants holding up the fix.
Something like the following:
diff --git i/drivers/net/ethernet/mellanox/mlx5/core/wc.c w/drivers/net/ethernet/mellanox/mlx5/core/wc.c
index 29db15c4b978..6ec9c1a2da78 100644
--- i/drivers/net/ethernet/mellanox/mlx5/core/wc.c
+++ w/drivers/net/ethernet/mellanox/mlx5/core/wc.c
@@ -15,7 +15,7 @@
#define TEST_WC_NUM_WQES 255
#define TEST_WC_LOG_CQ_SZ (order_base_2(TEST_WC_NUM_WQES))
#define TEST_WC_SQ_LOG_WQ_SZ TEST_WC_LOG_CQ_SZ
-#define TEST_WC_POLLING_MAX_TIME_JIFFIES msecs_to_jiffies(100)
+#define TEST_WC_POLLING_MAX_TIME_USEC (100 * USEC_PER_MSEC)
struct mlx5_wc_cq {
/* data path - accessed per cqe */
@@ -359,7 +359,6 @@ static int mlx5_wc_poll_cq(struct mlx5_wc_sq *sq)
static void mlx5_core_test_wc(struct mlx5_core_dev *mdev)
{
unsigned int offset = 0;
- unsigned long expires;
struct mlx5_wc_sq *sq;
int i, err;
@@ -389,11 +388,9 @@ static void mlx5_core_test_wc(struct mlx5_core_dev *mdev)
mlx5_wc_post_nop(sq, &offset, true);
- expires = jiffies + TEST_WC_POLLING_MAX_TIME_JIFFIES;
- while ((mlx5_wc_poll_cq(sq),
- mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED) &&
- time_is_after_jiffies(expires))
- usleep_range(2, 10);
+ poll_timeout_us(mlx5_wc_poll_cq(sq),
+ mdev->wc_state != MLX5_WC_STATE_UNINITIALIZED,
+ 10, TEST_WC_POLLING_MAX_TIME_USEC, false);
mlx5_wc_destroy_sq(sq);
* Re: [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop
2026-02-12 22:36 ` Jacob Keller
@ 2026-02-15 12:13 ` Gal Pressman
0 siblings, 0 replies; 13+ messages in thread
From: Gal Pressman @ 2026-02-15 12:13 UTC (permalink / raw)
To: Jacob Keller, Tariq Toukan, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Moshe Shemesh, Jianbo Liu
On 13/02/2026 0:36, Jacob Keller wrote:
>
>
> On 2/12/2026 2:32 AM, Tariq Toukan wrote:
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wc.c b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
>> index 815a7c97d6b0..29db15c4b978 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/wc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
>> @@ -390,12 +390,10 @@ static void mlx5_core_test_wc(struct mlx5_core_dev *mdev)
>> mlx5_wc_post_nop(sq, &offset, true);
>> expires = jiffies + TEST_WC_POLLING_MAX_TIME_JIFFIES;
>> - do {
>> - err = mlx5_wc_poll_cq(sq);
>> - if (err)
>> - usleep_range(2, 10);
>> - } while (mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED &&
>> - time_is_after_jiffies(expires));
>> + while ((mlx5_wc_poll_cq(sq),
>> + mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED) &&
>> + time_is_after_jiffies(expires))
>> + usleep_range(2, 10);
>>
>
> This could be written with poll_timeout_us(), but I don't know if it
> warrants holding up the fix.
Wasn't aware of iopoll.h, will change, thanks Jacob!
* [PATCH net 3/6] net/mlx5e: Fix misidentification of ASO CQE during poll loop
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
2026-02-12 10:32 ` [PATCH net 1/6] net/mlx5: Fix multiport device check over light SFs Tariq Toukan
2026-02-12 10:32 ` [PATCH net 2/6] net/mlx5: Fix misidentification of write combining CQE during poll loop Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 22:38 ` Jacob Keller
2026-02-12 10:32 ` [PATCH net 4/6] net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event Tariq Toukan
` (2 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Jianbo Liu
From: Gal Pressman <gal@nvidia.com>
The ASO completion poll loop uses usleep_range() which can sleep much
longer than requested due to scheduler latency. Under load, we witnessed
a 20ms+ delay until the process was rescheduled, causing the jiffies
based timeout to expire while the thread is sleeping.
The original do-while loop structure (poll, sleep, check timeout) would
exit without a final poll when waking after timeout, missing a CQE that
arrived during sleep.
Restructure the loop by moving the poll into the while condition,
ensuring we always poll after sleeping, catching CQEs that arrived
during that time.
Fixes: 739cfa34518e ("net/mlx5: Make ASO poll CQ usable in atomic context")
Fixes: 7e3fce82d945 ("net/mlx5e: Overcome slow response for first macsec ASO WQE")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c | 8 +++-----
drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c | 8 +++-----
2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
index 7819fb297280..2ab618e11aad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
@@ -188,11 +188,9 @@ mlx5e_tc_meter_modify(struct mlx5_core_dev *mdev,
/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
expires = jiffies + msecs_to_jiffies(10);
- do {
- err = mlx5_aso_poll_cq(aso, true);
- if (err)
- usleep_range(2, 10);
- } while (err && time_is_after_jiffies(expires));
+ while ((err = mlx5_aso_poll_cq(aso, true)) &&
+ time_is_after_jiffies(expires))
+ usleep_range(2, 10);
mutex_unlock(&flow_meters->aso_lock);
return err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 528b04d4de41..2b3556fbfc42 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -1412,11 +1412,9 @@ static int macsec_aso_query(struct mlx5_core_dev *mdev, struct mlx5e_macsec *mac
mlx5_aso_post_wqe(maso, false, &aso_wqe->ctrl);
expires = jiffies + msecs_to_jiffies(10);
- do {
- err = mlx5_aso_poll_cq(maso, false);
- if (err)
- usleep_range(2, 10);
- } while (err && time_is_after_jiffies(expires));
+ while ((err = mlx5_aso_poll_cq(maso, false)) &&
+ time_is_after_jiffies(expires))
+ usleep_range(2, 10);
if (err)
goto err_out;
--
2.44.0
* Re: [PATCH net 3/6] net/mlx5e: Fix misidentification of ASO CQE during poll loop
2026-02-12 10:32 ` [PATCH net 3/6] net/mlx5e: Fix misidentification of ASO " Tariq Toukan
@ 2026-02-12 22:38 ` Jacob Keller
0 siblings, 0 replies; 13+ messages in thread
From: Jacob Keller @ 2026-02-12 22:38 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Gal Pressman, Moshe Shemesh, Jianbo Liu
On 2/12/2026 2:32 AM, Tariq Toukan wrote:
> From: Gal Pressman <gal@nvidia.com>
>
> The ASO completion poll loop uses usleep_range() which can sleep much
> longer than requested due to scheduler latency. Under load, we witnessed
> a 20ms+ delay until the process was rescheduled, causing the jiffies
> based timeout to expire while the thread is sleeping.
>
> The original do-while loop structure (poll, sleep, check timeout) would
> exit without a final poll when waking after timeout, missing a CQE that
> arrived during sleep.
>
> Restructure the loop by moving the poll into the while condition,
> ensuring we always poll after sleeping, catching CQEs that arrived
> during that time.
>
I would have re-written these to be based on poll_timeout_us() or
read_poll_timeout().
Again as with previous patch, I don't know if that warrants a re-roll.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
* [PATCH net 4/6] net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
` (2 preceding siblings ...)
2026-02-12 10:32 ` [PATCH net 3/6] net/mlx5e: Fix misidentification of ASO " Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 10:32 ` [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks Tariq Toukan
2026-02-12 10:32 ` [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels Tariq Toukan
5 siblings, 0 replies; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Jianbo Liu
From: Gal Pressman <gal@nvidia.com>
The macsec_aso_set_arm_event function calls mlx5_aso_poll_cq once
without a retry loop. If the CQE is not immediately available after
posting the WQE, the function fails unnecessarily.
Add a poll loop with timeout, consistent with other ASO polling code
paths in the driver.
Fixes: 739cfa34518e ("net/mlx5: Make ASO poll CQ usable in atomic context")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 2b3556fbfc42..e64a46be1cbd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -1374,6 +1374,7 @@ static int macsec_aso_set_arm_event(struct mlx5_core_dev *mdev, struct mlx5e_mac
struct mlx5e_macsec_aso *aso;
struct mlx5_aso_wqe *aso_wqe;
struct mlx5_aso *maso;
+ unsigned long expires;
int err;
aso = &macsec->aso;
@@ -1385,7 +1386,10 @@ static int macsec_aso_set_arm_event(struct mlx5_core_dev *mdev, struct mlx5e_mac
MLX5_ACCESS_ASO_OPC_MOD_MACSEC);
macsec_aso_build_ctrl(aso, &aso_wqe->aso_ctrl, in);
mlx5_aso_post_wqe(maso, false, &aso_wqe->ctrl);
- err = mlx5_aso_poll_cq(maso, false);
+ expires = jiffies + msecs_to_jiffies(10);
+ while ((err = mlx5_aso_poll_cq(maso, false)) &&
+ time_is_after_jiffies(expires))
+ usleep_range(2, 10);
mutex_unlock(&aso->aso_lock);
return err;
--
2.44.0
* [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
` (3 preceding siblings ...)
2026-02-12 10:32 ` [PATCH net 4/6] net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 22:41 ` Jacob Keller
2026-02-12 10:32 ` [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels Tariq Toukan
5 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh,
Cosmin Ratiu, Dragos Tatulea
From: Cosmin Ratiu <cratiu@nvidia.com>
In the mentioned "Fixes" commit, various work tasks triggering devlink
health reporter recovery were switched to use netdev_trylock to protect
against concurrent tear down of the channels being recovered. But this
had the side effect of introducing potential deadlocks because of
incorrect lock ordering.
The correct lock order is described by the init flow:
probe_one -> mlx5_init_one (acquires devlink lock)
-> mlx5_init_one_devl_locked -> mlx5_register_device
-> mlx5_rescan_drivers_locked -...-> mlx5e_probe -> _mlx5e_probe
-> register_netdev (acquires rtnl lock)
-> register_netdevice (acquires netdev lock)
=> devlink lock -> rtnl lock -> netdev lock.
But in the current recovery flow, the order is wrong:
mlx5e_tx_err_cqe_work (acquires netdev lock)
-> mlx5e_reporter_tx_err_cqe -> mlx5e_health_report
-> devlink_health_report (acquires devlink lock => boom!)
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_recover -> mlx5e_tx_reporter_recover_from_ctx
-> mlx5e_tx_reporter_err_cqe_recover
The same pattern exists in:
mlx5e_reporter_rx_timeout
mlx5e_reporter_tx_ptpsq_unhealthy
mlx5e_reporter_tx_timeout
Fix these by moving the netdev_trylock calls from the work handlers
lower in the call stack, in the respective recovery functions, where
they are actually necessary.
Fixes: 8f7b00307bf1 ("net/mlx5e: Convert mlx5 netdevs to instance locking")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/en/ptp.c | 14 -----
.../mellanox/mlx5/core/en/reporter_rx.c | 13 +++++
.../mellanox/mlx5/core/en/reporter_tx.c | 52 +++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en_main.c | 40 --------------
4 files changed, 61 insertions(+), 58 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
index 424f8a2728a3..74660e7fe674 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
@@ -457,22 +457,8 @@ static void mlx5e_ptpsq_unhealthy_work(struct work_struct *work)
{
struct mlx5e_ptpsq *ptpsq =
container_of(work, struct mlx5e_ptpsq, report_unhealthy_work);
- struct mlx5e_txqsq *sq = &ptpsq->txqsq;
-
- /* Recovering the PTP SQ means re-enabling NAPI, which requires the
- * netdev instance lock. However, SQ closing has to wait for this work
- * task to finish while also holding the same lock. So either get the
- * lock or find that the SQ is no longer enabled and thus this work is
- * not relevant anymore.
- */
- while (!netdev_trylock(sq->netdev)) {
- if (!test_bit(MLX5E_SQ_STATE_ENABLED, &sq->state))
- return;
- msleep(20);
- }
mlx5e_reporter_tx_ptpsq_unhealthy(ptpsq);
- netdev_unlock(sq->netdev);
}
static int mlx5e_ptp_open_txqsq(struct mlx5e_ptp *c, u32 tisn,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
index 0686fbdd5a05..6efb626b5506 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Mellanox Technologies.
+#include <net/netdev_lock.h>
+
#include "health.h"
#include "params.h"
#include "txrx.h"
@@ -177,6 +179,16 @@ static int mlx5e_rx_reporter_timeout_recover(void *ctx)
rq = ctx;
priv = rq->priv;
+ /* Acquire netdev instance lock to synchronize with channel close and
+ * reopen flows. Either successfully obtain the lock, or detect that
+ * channels are closing for another reason, making this work no longer
+ * necessary.
+ */
+ while (!netdev_trylock(rq->netdev)) {
+ if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &rq->priv->state))
+ return 0;
+ msleep(20);
+ }
mutex_lock(&priv->state_lock);
eq = rq->cq.mcq.eq;
@@ -186,6 +198,7 @@ static int mlx5e_rx_reporter_timeout_recover(void *ctx)
clear_bit(MLX5E_SQ_STATE_ENABLED, &rq->icosq->state);
mutex_unlock(&priv->state_lock);
+ netdev_unlock(rq->netdev);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 9e2cf191ed30..9f6454102cf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -1,6 +1,8 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2019 Mellanox Technologies. */
+#include <net/netdev_lock.h>
+
#include "health.h"
#include "en/ptp.h"
#include "en/devlink.h"
@@ -78,6 +80,18 @@ static int mlx5e_tx_reporter_err_cqe_recover(void *ctx)
if (!test_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state))
return 0;
+ /* Recovering queues means re-enabling NAPI, which requires the netdev
+ * instance lock. However, SQ closing flows have to wait for work tasks
+ * to finish while also holding the netdev instance lock. So either get
+ * the lock or find that the SQ is no longer enabled and thus this work
+ * is not relevant anymore.
+ */
+ while (!netdev_trylock(dev)) {
+ if (!test_bit(MLX5E_SQ_STATE_ENABLED, &sq->state))
+ return 0;
+ msleep(20);
+ }
+
err = mlx5_core_query_sq_state(mdev, sq->sqn, &state);
if (err) {
netdev_err(dev, "Failed to query SQ 0x%x state. err = %d\n",
@@ -113,9 +127,11 @@ static int mlx5e_tx_reporter_err_cqe_recover(void *ctx)
else
mlx5e_trigger_napi_sched(sq->cq.napi);
+ netdev_unlock(dev);
return 0;
out:
clear_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state);
+ netdev_unlock(dev);
return err;
}
@@ -136,10 +152,24 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
sq = to_ctx->sq;
eq = sq->cq.mcq.eq;
priv = sq->priv;
+
+ /* Recovering the TX queues implies re-enabling NAPI, which requires
+ * the netdev instance lock.
+ * However, channel closing flows have to wait for this work to finish
+ * while holding the same lock. So either get the lock or find that
+ * channels are being closed for other reason and this work is not
+ * relevant anymore.
+ */
+ while (!netdev_trylock(sq->netdev)) {
+ if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
+ return 0;
+ msleep(20);
+ }
+
err = mlx5e_health_channel_eq_recover(sq->netdev, eq, sq->cq.ch_stats);
if (!err) {
to_ctx->status = 0; /* this sq recovered */
- return err;
+ goto out;
}
mutex_lock(&priv->state_lock);
@@ -147,7 +177,7 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
mutex_unlock(&priv->state_lock);
if (!err) {
to_ctx->status = 1; /* all channels recovered */
- return err;
+ goto out;
}
to_ctx->status = err;
@@ -155,7 +185,8 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
netdev_err(priv->netdev,
"mlx5e_safe_reopen_channels failed recovering from a tx_timeout, err(%d).\n",
err);
-
+out:
+ netdev_unlock(sq->netdev);
return err;
}
@@ -172,10 +203,22 @@ static int mlx5e_tx_reporter_ptpsq_unhealthy_recover(void *ctx)
return 0;
priv = ptpsq->txqsq.priv;
+ netdev = priv->netdev;
+
+ /* Recovering the PTP SQ means re-enabling NAPI, which requires the
+ * netdev instance lock. However, SQ closing has to wait for this work
+ * task to finish while also holding the same lock. So either get the
+ * lock or find that the SQ is no longer enabled and thus this work is
+ * not relevant anymore.
+ */
+ while (!netdev_trylock(netdev)) {
+ if (!test_bit(MLX5E_SQ_STATE_ENABLED, &ptpsq->txqsq.state))
+ return 0;
+ msleep(20);
+ }
mutex_lock(&priv->state_lock);
chs = &priv->channels;
- netdev = priv->netdev;
carrier_ok = netif_carrier_ok(netdev);
netif_carrier_off(netdev);
@@ -192,6 +235,7 @@ static int mlx5e_tx_reporter_ptpsq_unhealthy_recover(void *ctx)
netif_carrier_on(netdev);
mutex_unlock(&priv->state_lock);
+ netdev_unlock(netdev);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 4b2963bbe7ff..e15e6fb4cd8e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -688,19 +688,7 @@ static void mlx5e_rq_timeout_work(struct work_struct *timeout_work)
struct mlx5e_rq,
rx_timeout_work);
- /* Acquire netdev instance lock to synchronize with channel close and
- * reopen flows. Either successfully obtain the lock, or detect that
- * channels are closing for another reason, making this work no longer
- * necessary.
- */
- while (!netdev_trylock(rq->netdev)) {
- if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &rq->priv->state))
- return;
- msleep(20);
- }
-
mlx5e_reporter_rx_timeout(rq);
- netdev_unlock(rq->netdev);
}
static int mlx5e_alloc_mpwqe_rq_drop_page(struct mlx5e_rq *rq)
@@ -1997,20 +1985,7 @@ void mlx5e_tx_err_cqe_work(struct work_struct *recover_work)
struct mlx5e_txqsq *sq = container_of(recover_work, struct mlx5e_txqsq,
recover_work);
- /* Recovering queues means re-enabling NAPI, which requires the netdev
- * instance lock. However, SQ closing flows have to wait for work tasks
- * to finish while also holding the netdev instance lock. So either get
- * the lock or find that the SQ is no longer enabled and thus this work
- * is not relevant anymore.
- */
- while (!netdev_trylock(sq->netdev)) {
- if (!test_bit(MLX5E_SQ_STATE_ENABLED, &sq->state))
- return;
- msleep(20);
- }
-
mlx5e_reporter_tx_err_cqe(sq);
- netdev_unlock(sq->netdev);
}
static struct dim_cq_moder mlx5e_get_def_tx_moderation(u8 cq_period_mode)
@@ -5121,19 +5096,6 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
struct net_device *netdev = priv->netdev;
int i;
- /* Recovering the TX queues implies re-enabling NAPI, which requires
- * the netdev instance lock.
- * However, channel closing flows have to wait for this work to finish
- * while holding the same lock. So either get the lock or find that
- * channels are being closed for other reason and this work is not
- * relevant anymore.
- */
- while (!netdev_trylock(netdev)) {
- if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
- return;
- msleep(20);
- }
-
for (i = 0; i < netdev->real_num_tx_queues; i++) {
struct netdev_queue *dev_queue =
netdev_get_tx_queue(netdev, i);
@@ -5146,8 +5108,6 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
/* break if tried to reopened channels */
break;
}
-
- netdev_unlock(netdev);
}
static void mlx5e_tx_timeout(struct net_device *dev, unsigned int txqueue)
--
2.44.0
* Re: [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks
2026-02-12 10:32 ` [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks Tariq Toukan
@ 2026-02-12 22:41 ` Jacob Keller
0 siblings, 0 replies; 13+ messages in thread
From: Jacob Keller @ 2026-02-12 22:41 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Gal Pressman, Moshe Shemesh, Cosmin Ratiu,
Dragos Tatulea
On 2/12/2026 2:32 AM, Tariq Toukan wrote:
> From: Cosmin Ratiu <cratiu@nvidia.com>
>
> In the mentioned "Fixes" commit, various work tasks triggering devlink
> health reporter recovery were switched to use netdev_trylock to protect
> against concurrent tear down of the channels being recovered. But this
> had the side effect of introducing potential deadlocks because of
> incorrect lock ordering.
>
> The correct lock order is described by the init flow:
> probe_one -> mlx5_init_one (acquires devlink lock)
> -> mlx5_init_one_devl_locked -> mlx5_register_device
> -> mlx5_rescan_drivers_locked -...-> mlx5e_probe -> _mlx5e_probe
> -> register_netdev (acquires rtnl lock)
> -> register_netdevice (acquires netdev lock)
> => devlink lock -> rtnl lock -> netdev lock.
>
> But in the current recovery flow, the order is wrong:
> mlx5e_tx_err_cqe_work (acquires netdev lock)
> -> mlx5e_reporter_tx_err_cqe -> mlx5e_health_report
> -> devlink_health_report (acquires devlink lock => boom!)
> -> devlink_health_reporter_recover
> -> mlx5e_tx_reporter_recover -> mlx5e_tx_reporter_recover_from_ctx
> -> mlx5e_tx_reporter_err_cqe_recover
>
> The same pattern exists in:
> mlx5e_reporter_rx_timeout
> mlx5e_reporter_tx_ptpsq_unhealthy
> mlx5e_reporter_tx_timeout
>
> Fix these by moving the netdev_trylock calls from the work handlers
> lower in the call stack, in the respective recovery functions, where
> they are actually necessary.
>
> Fixes: 8f7b00307bf1 ("net/mlx5e: Convert mlx5 netdevs to instance locking")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
* [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
* [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
2026-02-12 10:32 [PATCH net 0/6] mlx5 misc fixes 2026-02-12 Tariq Toukan
` (4 preceding siblings ...)
2026-02-12 10:32 ` [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks Tariq Toukan
@ 2026-02-12 10:32 ` Tariq Toukan
2026-02-12 22:41 ` Jacob Keller
5 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2026-02-12 10:32 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh,
Cosmin Ratiu, Dragos Tatulea
From: Cosmin Ratiu <cratiu@nvidia.com>
The max number of channels is always an unsigned int. Use the correct
type to fix compilation errors seen with strict type checking, e.g.:
error: call to ‘__compiletime_assert_1110’ declared with attribute
error: min(mlx5e_get_devlink_param_num_doorbells(mdev),
mlx5e_get_max_num_channels(mdev)) signedness error
Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ff4ab4691baf..a06d08576fd4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -179,7 +179,8 @@ static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
}
/* Use this function to get max num channels (rxqs/txqs) only to create netdev */
-static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
+static inline unsigned int
+mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
{
return is_kdump_kernel() ?
MLX5E_MIN_NUM_CHANNELS :
--
2.44.0
* Re: [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
2026-02-12 10:32 ` [PATCH net 6/6] net/mlx5e: Use unsigned for mlx5e_get_max_num_channels Tariq Toukan
@ 2026-02-12 22:41 ` Jacob Keller
0 siblings, 0 replies; 13+ messages in thread
From: Jacob Keller @ 2026-02-12 22:41 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, netdev, linux-rdma,
linux-kernel, Gal Pressman, Moshe Shemesh, Cosmin Ratiu,
Dragos Tatulea
On 2/12/2026 2:32 AM, Tariq Toukan wrote:
> From: Cosmin Ratiu <cratiu@nvidia.com>
>
> The max number of channels is always an unsigned int. Use the correct
> type to fix compilation errors seen with strict type checking, e.g.:
>
> error: call to ‘__compiletime_assert_1110’ declared with attribute
> error: min(mlx5e_get_devlink_param_num_doorbells(mdev),
> mlx5e_get_max_num_channels(mdev)) signedness error
>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>