netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Saeed Mahameed <saeed@kernel.org>
To: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org, Shay Drory <shayd@nvidia.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: [net 06/15] net/mlx5: Fix health error state handling
Date: Thu, 11 Feb 2021 18:56:32 -0800	[thread overview]
Message-ID: <20210212025641.323844-7-saeed@kernel.org> (raw)
In-Reply-To: <20210212025641.323844-1-saeed@kernel.org>

From: Shay Drory <shayd@nvidia.com>

Currently, when we discover a fatal error, we are queueing a work that
will wait for a lock in order to enter the device to error state.
Meanwhile, FW commands are still being processed, and gets timeouts.
This can block the driver for few minutes before the work will manage
to get the lock and enter to error state.

Setting the device to error state before queueing health work, in order
to avoid FW commands being processed while the work is waiting for the
lock.

Fixes: c1d4d2e92ad6 ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/health.c  | 22 ++++++++++++-------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index 54523bed16cd..0c32c485eb58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -190,6 +190,16 @@ static bool reset_fw_if_needed(struct mlx5_core_dev *dev)
 	return true;
 }
 
+static void enter_error_state(struct mlx5_core_dev *dev, bool force)
+{
+	if (mlx5_health_check_fatal_sensors(dev) || force) { /* protected state setting */
+		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
+		mlx5_cmd_flush(dev);
+	}
+
+	mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_SYS_ERROR, (void *)1);
+}
+
 void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 {
 	bool err_detected = false;
@@ -208,12 +218,7 @@ void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 		goto unlock;
 	}
 
-	if (mlx5_health_check_fatal_sensors(dev) || force) { /* protected state setting */
-		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
-		mlx5_cmd_flush(dev);
-	}
-
-	mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_SYS_ERROR, (void *)1);
+	enter_error_state(dev, force);
 unlock:
 	mutex_unlock(&dev->intf_state_mutex);
 }
@@ -613,7 +618,7 @@ static void mlx5_fw_fatal_reporter_err_work(struct work_struct *work)
 	priv = container_of(health, struct mlx5_priv, health);
 	dev = container_of(priv, struct mlx5_core_dev, priv);
 
-	mlx5_enter_error_state(dev, false);
+	enter_error_state(dev, false);
 	if (IS_ERR_OR_NULL(health->fw_fatal_reporter)) {
 		if (mlx5_health_try_recover(dev))
 			mlx5_core_err(dev, "health recovery failed\n");
@@ -707,8 +712,9 @@ static void poll_health(struct timer_list *t)
 		mlx5_core_err(dev, "Fatal error %u detected\n", fatal_error);
 		dev->priv.health.fatal_error = fatal_error;
 		print_health_info(dev);
+		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
 		mlx5_trigger_health_work(dev);
-		goto out;
+		return;
 	}
 
 	count = ioread32be(health->health_counter);
-- 
2.29.2


  parent reply	other threads:[~2021-02-12  2:59 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-12  2:56 [pull request][net 00/15] mlx5 fixes 2021-02-11 Saeed Mahameed
2021-02-12  2:56 ` [net 01/15] net/mlx5e: E-switch, Fix rate calculation for overflow Saeed Mahameed
2021-02-27 12:14   ` Arnd Bergmann
2021-03-02  0:52     ` Saeed Mahameed
2021-03-02  9:01       ` Arnd Bergmann
2021-02-12  2:56 ` [net 02/15] net/mlx5e: Enable striding RQ for Connect-X IPsec capable devices Saeed Mahameed
2021-02-12  2:56 ` [net 03/15] net/mlx5e: Enable XDP " Saeed Mahameed
2021-02-12  2:56 ` [net 04/15] net/mlx5e: Don't change interrupt moderation params when DIM is enabled Saeed Mahameed
2021-02-12  2:56 ` [net 05/15] net/mlx5e: Change interrupt moderation channel params also when channels are closed Saeed Mahameed
2021-02-12  2:56 ` Saeed Mahameed [this message]
2021-02-12  2:56 ` [net 07/15] net/mlx5e: Replace synchronize_rcu with synchronize_net Saeed Mahameed
2021-02-12  2:56 ` [net 08/15] net/mlx5e: Fix CQ params of ICOSQ and async ICOSQ Saeed Mahameed
2021-02-12  2:56 ` [net 09/15] net/mlx5e: kTLS, Use refcounts to free kTLS RX priv context Saeed Mahameed
2021-02-12  2:56 ` [net 10/15] net/mlx5: Disable devlink reload for multi port slave device Saeed Mahameed
2021-02-12  2:56 ` [net 11/15] net/mlx5: Disallow RoCE on " Saeed Mahameed
2021-02-12  2:56 ` [net 12/15] net/mlx5: Disallow RoCE on lag device Saeed Mahameed
2021-02-12  2:56 ` [net 13/15] net/mlx5: Disable devlink reload for lag devices Saeed Mahameed
2021-02-12  2:56 ` [net 14/15] net/mlx5e: CT: manage the lifetime of the ct entry object Saeed Mahameed
2021-02-12  2:56 ` [net 15/15] net/mlx5e: Check tunnel offload is required before setting SWP Saeed Mahameed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210212025641.323844-7-saeed@kernel.org \
    --to=saeed@kernel.org \
    --cc=davem@davemloft.net \
    --cc=kuba@kernel.org \
    --cc=moshe@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).