Netdev List
* [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover
@ 2026-04-08 18:44 Matt Fleming
  2026-05-01 10:03 ` Matt Fleming
  2026-05-12 11:08 ` Cosmin Ratiu
  0 siblings, 2 replies; 4+ messages in thread
From: Matt Fleming @ 2026-04-08 18:44 UTC
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-rdma, linux-kernel, kernel-team,
	Matt Fleming

From: Matt Fleming <mfleming@cloudflare.com>

mlx5e_tx_reporter_timeout_recover() accesses sq->netdev after
mlx5e_safe_reopen_channels() has torn down and freed the channel (and
its embedded SQs). Replace the three sq->netdev references with
priv->netdev, which is safe because priv outlives channel teardown.

The netdev_err() call already used priv->netdev for this reason; make
the trylock/unlock and health_channel_eq_recover calls consistent.

This fixes the following KASAN splat:

  BUG: KASAN: use-after-free in mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
  Read of size 8 at addr ffff889860ed0b28 by task kworker/u113:2/5277

  Call Trace:
   mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
   devlink_health_reporter_recover+0xa2/0x150
   devlink_health_report+0x254/0x7c0
   mlx5e_reporter_tx_timeout+0x297/0x380 [mlx5_core]
   mlx5e_tx_timeout_work+0x109/0x170 [mlx5_core]
   process_one_work+0x677/0xf20
   worker_thread+0x51f/0xd90
   kthread+0x3a5/0x810
   ret_from_fork+0x208/0x400
   ret_from_fork_asm+0x1a/0x30

Fixes: 83ac0304a2d7 ("net/mlx5e: Fix deadlocks between devlink and netdev instance locks")
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index afdeb1b3d425..8409ae73768f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -160,13 +160,13 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
 	 * channels are being closed for other reason and this work is not
 	 * relevant anymore.
 	 */
-	while (!netdev_trylock(sq->netdev)) {
+	while (!netdev_trylock(priv->netdev)) {
 		if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
 			return 0;
 		msleep(20);
 	}
 
-	err = mlx5e_health_channel_eq_recover(sq->netdev, eq, sq->cq.ch_stats);
+	err = mlx5e_health_channel_eq_recover(priv->netdev, eq, sq->cq.ch_stats);
 	if (!err) {
 		to_ctx->status = 0; /* this sq recovered */
 		goto out;
@@ -186,7 +186,7 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
 		   "mlx5e_safe_reopen_channels failed recovering from a tx_timeout, err(%d).\n",
 		   err);
 out:
-	netdev_unlock(sq->netdev);
+	netdev_unlock(priv->netdev);
 	return err;
 }
 
-- 
2.43.0



* Re: [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover
  2026-04-08 18:44 [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover Matt Fleming
@ 2026-05-01 10:03 ` Matt Fleming
  2026-05-12 11:08 ` Cosmin Ratiu
  1 sibling, 0 replies; 4+ messages in thread
From: Matt Fleming @ 2026-05-01 10:03 UTC
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-rdma, linux-kernel, kernel-team

Hey there, any thoughts on this?

Thanks,
Matt


* Re: [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover
  2026-04-08 18:44 [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover Matt Fleming
  2026-05-01 10:03 ` Matt Fleming
@ 2026-05-12 11:08 ` Cosmin Ratiu
  2026-05-12 11:12   ` Tariq Toukan
  1 sibling, 1 reply; 4+ messages in thread
From: Cosmin Ratiu @ 2026-05-12 11:08 UTC
  To: Tariq Toukan, Mark Bloch, Saeed Mahameed, matt@readmodwrite.com,
	leon@kernel.org
  Cc: linux-rdma@vger.kernel.org, andrew+netdev@lunn.ch,
	davem@davemloft.net, linux-kernel@vger.kernel.org,
	kernel-team@cloudflare.com, kuba@kernel.org,
	netdev@vger.kernel.org, edumazet@google.com, pabeni@redhat.com,
	mfleming@cloudflare.com

On Wed, 2026-04-08 at 19:44 +0100, Matt Fleming wrote:
> From: Matt Fleming <mfleming@cloudflare.com>

First of all, apologies for the delay; I missed this, and it seems
nobody else reacted for more than a month.

Next time, you will probably get more immediate reactions if you
directly CC the people involved in the patch which introduced the bug.
This will also make the patchwork checkers happier.

> 
> mlx5e_tx_reporter_timeout_recover() accesses sq->netdev after
> mlx5e_safe_reopen_channels() has torn down and freed the channel (and
> its embedded SQs). Replace the three sq->netdev references with
> priv->netdev which is safe because priv outlives channel teardown.
> 
> The netdev_err() call already used priv->netdev for this reason; make
> the trylock/unlock and health_channel_eq_recover calls consistent.
> 
> This fixes the following KASAN splat:
> 
>   BUG: KASAN: use-after-free in mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
>   Read of size 8 at addr ffff889860ed0b28 by task kworker/u113:2/5277
> 
>   Call Trace:
>    mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
>    devlink_health_reporter_recover+0xa2/0x150
>    devlink_health_report+0x254/0x7c0
>    mlx5e_reporter_tx_timeout+0x297/0x380 [mlx5_core]
>    mlx5e_tx_timeout_work+0x109/0x170 [mlx5_core]
>    process_one_work+0x677/0xf20
>    worker_thread+0x51f/0xd90
>    kthread+0x3a5/0x810
>    ret_from_fork+0x208/0x400
>    ret_from_fork_asm+0x1a/0x30
> 
> Fixes: 83ac0304a2d7 ("net/mlx5e: Fix deadlocks between devlink and netdev instance locks")
> Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> index afdeb1b3d425..8409ae73768f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> @@ -160,13 +160,13 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
>  	 * channels are being closed for other reason and this work is not
>  	 * relevant anymore.
>  	 */
> -	while (!netdev_trylock(sq->netdev)) {
> +	while (!netdev_trylock(priv->netdev)) {
>  		if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
>  			return 0;
>  		msleep(20);
>  	}
>  
> -	err = mlx5e_health_channel_eq_recover(sq->netdev, eq, sq->cq.ch_stats);
> +	err = mlx5e_health_channel_eq_recover(priv->netdev, eq, sq->cq.ch_stats);
>  	if (!err) {
>  		to_ctx->status = 0; /* this sq recovered */
>  		goto out;
> @@ -186,7 +186,7 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
>  		   "mlx5e_safe_reopen_channels failed recovering from a tx_timeout, err(%d).\n",
>  		   err);
>  out:
> -	netdev_unlock(sq->netdev);
> +	netdev_unlock(priv->netdev);
>  	return err;
>  }
>  

Thank you for the fix. It is a real problem that can happen if direct
SQ recovery fails and all channels need to be reopened, which is
apparently what happened in your KASAN report.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
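
For readers following the thread, the failure mode described above can be sketched in userspace C. This is only an illustrative model under stated assumptions, not the driver code: every demo_* name is hypothetical, the "lock" is a plain flag, and the structures keep just enough fields to show that the SQ is embedded in a channel that a reopen frees, while the netdev pointer held by priv stays valid.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; these are not the mlx5 structures. */
struct demo_netdev { int locked; };
struct demo_sq { struct demo_netdev *netdev; };
struct demo_channel { struct demo_sq sq; };	/* the SQ is embedded in the channel */

struct demo_priv {
	struct demo_netdev *netdev;	/* valid for the lifetime of priv */
	struct demo_channel *chan;	/* replaced by every reopen */
};

/* Models a full channel reopen: the old channel and its SQ are freed. */
static int demo_reopen_channels(struct demo_priv *priv)
{
	struct demo_channel *fresh = malloc(sizeof(*fresh));

	if (!fresh)
		return -1;
	fresh->sq.netdev = priv->netdev;
	free(priv->chan);	/* pointers into the old channel now dangle */
	priv->chan = fresh;
	return 0;
}

/* Recovery flow shaped like the patched function; returns 0 on success. */
static int demo_timeout_recover(struct demo_priv *priv, struct demo_sq *sq,
				int direct_recovery_ok)
{
	int err = 0;

	assert(sq->netdev == priv->netdev);	/* sq is still alive here */
	priv->netdev->locked = 1;

	if (!direct_recovery_ok)
		err = demo_reopen_channels(priv);	/* sq may be freed from here on */

	/* The pre-patch shape read sq->netdev here, i.e. freed memory. */
	priv->netdev->locked = 0;
	return err;
}

/* Runs one recovery; returns 0 if it succeeded and the lock was released. */
int demo_run(int direct_recovery_ok)
{
	struct demo_netdev nd = { 0 };
	struct demo_priv priv = { .netdev = &nd, .chan = NULL };
	int err;

	priv.chan = malloc(sizeof(*priv.chan));
	if (!priv.chan)
		return -1;
	priv.chan->sq.netdev = &nd;

	err = demo_timeout_recover(&priv, &priv.chan->sq, direct_recovery_ok);
	free(priv.chan);
	return err ? err : nd.locked;
}
```

Calling demo_run(0) takes the path where direct recovery fails and the channels are reopened; demo_run(1) takes the path where the SQ recovers directly. In both cases the unlock goes through priv->netdev, so neither path touches the freed channel.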


* Re: [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover
  2026-05-12 11:08 ` Cosmin Ratiu
@ 2026-05-12 11:12   ` Tariq Toukan
  0 siblings, 0 replies; 4+ messages in thread
From: Tariq Toukan @ 2026-05-12 11:12 UTC
  To: Cosmin Ratiu, Mark Bloch, Saeed Mahameed, matt@readmodwrite.com,
	leon@kernel.org
  Cc: linux-rdma@vger.kernel.org, andrew+netdev@lunn.ch,
	davem@davemloft.net, linux-kernel@vger.kernel.org,
	kernel-team@cloudflare.com, kuba@kernel.org,
	netdev@vger.kernel.org, edumazet@google.com, pabeni@redhat.com,
	mfleming@cloudflare.com



On 12/05/2026 14:08, Cosmin Ratiu wrote:
> On Wed, 2026-04-08 at 19:44 +0100, Matt Fleming wrote:
>> From: Matt Fleming <mfleming@cloudflare.com>
> 
> First of all, apologies for the delay; I missed this, and it seems
> nobody else reacted for more than a month.
> 
> Next time, you will probably get more immediate reactions if you
> directly CC the people involved in the patch which introduced the bug.
> This will also make the patchwork checkers happier.
> 
> 
> Thank you for the fix. It is a real problem that can happen if direct
> SQ recovery fails and all channels need to be reopened, which is
> apparently what happened in your KASAN report.
> 
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>

Thanks for your patch.
I think that due to our delayed response you'll have to resend.

You can add our tags:
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

