linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 0/3] mlx5 misc fixes 2025-09-28
@ 2025-09-28 21:02 Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 1/3] net/mlx5: Stop polling for command response if interface goes down Tariq Toukan
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Tariq Toukan @ 2025-09-28 21:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drori

Hi,

This patchset provides misc bug fixes from the team to the mlx5 core
driver.

Thanks,
Tariq.


Moshe Shemesh (2):
  net/mlx5: Stop polling for command response if interface goes down
  net/mlx5: fw reset, add reset timeout work

Shay Drory (1):
  net/mlx5: pagealloc: Fix reclaim race during command interface
    teardown

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  6 ++++-
 .../ethernet/mellanox/mlx5/core/fw_reset.c    | 24 +++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/pagealloc.c   |  7 ++++--
 3 files changed, 34 insertions(+), 3 deletions(-)


base-commit: 012ea489aedab1a4c08efbd936bb7be91a06d236
-- 
2.31.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH net 1/3] net/mlx5: Stop polling for command response if interface goes down
  2025-09-28 21:02 [PATCH net 0/3] mlx5 misc fixes 2025-09-28 Tariq Toukan
@ 2025-09-28 21:02 ` Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 2/3] net/mlx5: pagealloc: Fix reclaim race during command interface teardown Tariq Toukan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Tariq Toukan @ 2025-09-28 21:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drori

From: Moshe Shemesh <moshe@nvidia.com>

Stop polling on firmware response to command in polling mode if the
command interface got down. This situation can occur, for example, if a
firmware fatal error is detected during polling.

This change halts the polling process when the command interface goes
down, preventing unnecessary waits.

Fixes: b898ce7bccf1 ("net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index e395ef5f356e..722282cebce9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -294,6 +294,10 @@ static void poll_timeout(struct mlx5_cmd_work_ent *ent)
 			return;
 		}
 		cond_resched();
+		if (mlx5_cmd_is_down(dev)) {
+			ent->ret = -ENXIO;
+			return;
+		}
 	} while (time_before(jiffies, poll_end));
 
 	ent->ret = -ETIMEDOUT;
@@ -1070,7 +1074,7 @@ static void cmd_work_handler(struct work_struct *work)
 		poll_timeout(ent);
 		/* make sure we read the descriptor after ownership is SW */
 		rmb();
-		mlx5_cmd_comp_handler(dev, 1ULL << ent->idx, (ent->ret == -ETIMEDOUT));
+		mlx5_cmd_comp_handler(dev, 1ULL << ent->idx, !!ent->ret);
 	}
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH net 2/3] net/mlx5: pagealloc: Fix reclaim race during command interface teardown
  2025-09-28 21:02 [PATCH net 0/3] mlx5 misc fixes 2025-09-28 Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 1/3] net/mlx5: Stop polling for command response if interface goes down Tariq Toukan
@ 2025-09-28 21:02 ` Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 3/3] net/mlx5: fw reset, add reset timeout work Tariq Toukan
  2025-09-30  2:00 ` [PATCH net 0/3] mlx5 misc fixes 2025-09-28 patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Tariq Toukan @ 2025-09-28 21:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drori

From: Shay Drory <shayd@nvidia.com>

The reclaim_pages_cmd() function sends a command to the firmware to
reclaim pages if the command interface is active.

A race condition can occur if the command interface goes down (e.g., due
to a PCI error) while the mlx5_cmd_do() call is in flight. In this
case, mlx5_cmd_do() will return an error. The original code would
propagate this error immediately, bypassing the software-based page
reclamation logic that is supposed to run when the command interface is
down.

Fix this by checking whether mlx5_cmd_do() returns -ENXIO, which mark
that command interface is down. If this is the case, fall through to
the software reclamation path. If the command failed for any another
reason, or finished successfully, return as before.

Fixes: b898ce7bccf1 ("net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index 9bc9bd83c232..cd68c4b2c0bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -489,9 +489,12 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
 	u32 func_id;
 	u32 npages;
 	u32 i = 0;
+	int err;
 
-	if (!mlx5_cmd_is_down(dev))
-		return mlx5_cmd_do(dev, in, in_size, out, out_size);
+	err = mlx5_cmd_do(dev, in, in_size, out, out_size);
+	/* If FW is gone (-ENXIO), proceed to forceful reclaim */
+	if (err != -ENXIO)
+		return err;
 
 	/* No hard feelings, we want our pages back! */
 	npages = MLX5_GET(manage_pages_in, in, input_num_entries);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH net 3/3] net/mlx5: fw reset, add reset timeout work
  2025-09-28 21:02 [PATCH net 0/3] mlx5 misc fixes 2025-09-28 Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 1/3] net/mlx5: Stop polling for command response if interface goes down Tariq Toukan
  2025-09-28 21:02 ` [PATCH net 2/3] net/mlx5: pagealloc: Fix reclaim race during command interface teardown Tariq Toukan
@ 2025-09-28 21:02 ` Tariq Toukan
  2025-09-30  2:00 ` [PATCH net 0/3] mlx5 misc fixes 2025-09-28 patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Tariq Toukan @ 2025-09-28 21:02 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Moshe Shemesh, Shay Drori

From: Moshe Shemesh <moshe@nvidia.com>

Add sync reset timeout to stop poll_sync_reset in case there was no
reset done or abort event within timeout. Otherwise poll sync reset will
just continue and in case of fw fatal error no health reporting will be
done.

Fixes: 38b9f903f22b ("net/mlx5: Handle sync reset request event")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/fw_reset.c    | 24 +++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
index 22995131824a..89e399606877 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -27,6 +27,7 @@ struct mlx5_fw_reset {
 	struct work_struct reset_reload_work;
 	struct work_struct reset_now_work;
 	struct work_struct reset_abort_work;
+	struct delayed_work reset_timeout_work;
 	unsigned long reset_flags;
 	u8 reset_method;
 	struct timer_list timer;
@@ -259,6 +260,8 @@ static int mlx5_sync_reset_clear_reset_requested(struct mlx5_core_dev *dev, bool
 		return -EALREADY;
 	}
 
+	if (current_work() != &fw_reset->reset_timeout_work.work)
+		cancel_delayed_work(&fw_reset->reset_timeout_work);
 	mlx5_stop_sync_reset_poll(dev);
 	if (poll_health)
 		mlx5_start_health_poll(dev);
@@ -330,6 +333,11 @@ static int mlx5_sync_reset_set_reset_requested(struct mlx5_core_dev *dev)
 	}
 	mlx5_stop_health_poll(dev, true);
 	mlx5_start_sync_reset_poll(dev);
+
+	if (!test_bit(MLX5_FW_RESET_FLAGS_DROP_NEW_REQUESTS,
+		      &fw_reset->reset_flags))
+		schedule_delayed_work(&fw_reset->reset_timeout_work,
+			msecs_to_jiffies(mlx5_tout_ms(dev, PCI_SYNC_UPDATE)));
 	return 0;
 }
 
@@ -739,6 +747,19 @@ static void mlx5_sync_reset_events_handle(struct mlx5_fw_reset *fw_reset, struct
 	}
 }
 
+static void mlx5_sync_reset_timeout_work(struct work_struct *work)
+{
+	struct delayed_work *dwork = container_of(work, struct delayed_work,
+						  work);
+	struct mlx5_fw_reset *fw_reset =
+		container_of(dwork, struct mlx5_fw_reset, reset_timeout_work);
+	struct mlx5_core_dev *dev = fw_reset->dev;
+
+	if (mlx5_sync_reset_clear_reset_requested(dev, true))
+		return;
+	mlx5_core_warn(dev, "PCI Sync FW Update Reset Timeout.\n");
+}
+
 static int fw_reset_event_notifier(struct notifier_block *nb, unsigned long action, void *data)
 {
 	struct mlx5_fw_reset *fw_reset = mlx5_nb_cof(nb, struct mlx5_fw_reset, nb);
@@ -822,6 +843,7 @@ void mlx5_drain_fw_reset(struct mlx5_core_dev *dev)
 	cancel_work_sync(&fw_reset->reset_reload_work);
 	cancel_work_sync(&fw_reset->reset_now_work);
 	cancel_work_sync(&fw_reset->reset_abort_work);
+	cancel_delayed_work(&fw_reset->reset_timeout_work);
 }
 
 static const struct devlink_param mlx5_fw_reset_devlink_params[] = {
@@ -865,6 +887,8 @@ int mlx5_fw_reset_init(struct mlx5_core_dev *dev)
 	INIT_WORK(&fw_reset->reset_reload_work, mlx5_sync_reset_reload_work);
 	INIT_WORK(&fw_reset->reset_now_work, mlx5_sync_reset_now_event);
 	INIT_WORK(&fw_reset->reset_abort_work, mlx5_sync_reset_abort_event);
+	INIT_DELAYED_WORK(&fw_reset->reset_timeout_work,
+			  mlx5_sync_reset_timeout_work);
 
 	init_completion(&fw_reset->done);
 	return 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net 0/3] mlx5 misc fixes 2025-09-28
  2025-09-28 21:02 [PATCH net 0/3] mlx5 misc fixes 2025-09-28 Tariq Toukan
                   ` (2 preceding siblings ...)
  2025-09-28 21:02 ` [PATCH net 3/3] net/mlx5: fw reset, add reset timeout work Tariq Toukan
@ 2025-09-30  2:00 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-09-30  2:00 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: edumazet, kuba, pabeni, andrew+netdev, davem, saeedm, leon,
	mbloch, netdev, linux-rdma, linux-kernel, gal, moshe, shayd

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 29 Sep 2025 00:02:06 +0300 you wrote:
> Hi,
> 
> This patchset provides misc bug fixes from the team to the mlx5 core
> driver.
> 
> Thanks,
> Tariq.
> 
> [...]

Here is the summary with links:
  - [net,1/3] net/mlx5: Stop polling for command response if interface goes down
    https://git.kernel.org/netdev/net/c/b1f0349bd6d3
  - [net,2/3] net/mlx5: pagealloc: Fix reclaim race during command interface teardown
    https://git.kernel.org/netdev/net/c/79a0e32b32ac
  - [net,3/3] net/mlx5: fw reset, add reset timeout work
    https://git.kernel.org/netdev/net/c/5cfbe7ebfa42

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-09-30  2:00 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-28 21:02 [PATCH net 0/3] mlx5 misc fixes 2025-09-28 Tariq Toukan
2025-09-28 21:02 ` [PATCH net 1/3] net/mlx5: Stop polling for command response if interface goes down Tariq Toukan
2025-09-28 21:02 ` [PATCH net 2/3] net/mlx5: pagealloc: Fix reclaim race during command interface teardown Tariq Toukan
2025-09-28 21:02 ` [PATCH net 3/3] net/mlx5: fw reset, add reset timeout work Tariq Toukan
2025-09-30  2:00 ` [PATCH net 0/3] mlx5 misc fixes 2025-09-28 patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).