All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oded Gabbay <ogabbay@kernel.org>
To: linux-kernel@vger.kernel.org
Cc: farah kassabri <fkassabri@habana.ai>
Subject: [PATCH 4/7] habanalabs/gaudi2: log critical events with no rate limit
Date: Mon, 29 Aug 2022 12:22:51 +0300	[thread overview]
Message-ID: <20220829092254.930753-4-ogabbay@kernel.org> (raw)
In-Reply-To: <20220829092254.930753-1-ogabbay@kernel.org>

From: farah kassabri <fkassabri@habana.ai>

When we have a storm of errors of HBM ECC SERR we can reach a situation
where driver start hard reset flow without logging the error cause
that caused the hard reset due to logs rate limiting.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/gaudi2/gaudi2.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi2/gaudi2.c b/drivers/misc/habanalabs/gaudi2/gaudi2.c
index ff0f9e9db1b5..6bebd5eb0294 100644
--- a/drivers/misc/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/misc/habanalabs/gaudi2/gaudi2.c
@@ -8200,10 +8200,17 @@ static bool gaudi2_handle_hbm_mc_sei_err(struct hl_device *hdev, u16 event_type,
 		return true;
 	}
 
-	dev_err_ratelimited(hdev->dev,
-		"System Error Interrupt - HBM(%u) MC(%u) MC_CH(%u) MC_PC(%u). Critical(%u). Error cause: %s\n",
-		hbm_id, mc_id, sei_data->hdr.mc_channel, sei_data->hdr.mc_pseudo_channel,
-		sei_data->hdr.is_critical, hbm_mc_sei_cause[cause_idx]);
+	if (sei_data->hdr.is_critical)
+		dev_err(hdev->dev,
+			"System Critical Error Interrupt - HBM(%u) MC(%u) MC_CH(%u) MC_PC(%u). Error cause: %s\n",
+			hbm_id, mc_id, sei_data->hdr.mc_channel, sei_data->hdr.mc_pseudo_channel,
+			hbm_mc_sei_cause[cause_idx]);
+
+	else
+		dev_err_ratelimited(hdev->dev,
+			"System Non-Critical Error Interrupt - HBM(%u) MC(%u) MC_CH(%u) MC_PC(%u). Error cause: %s\n",
+			hbm_id, mc_id, sei_data->hdr.mc_channel, sei_data->hdr.mc_pseudo_channel,
+			hbm_mc_sei_cause[cause_idx]);
 
 	/* Print error-specific info */
 	switch (cause_idx) {
-- 
2.25.1


  parent reply	other threads:[~2022-08-29  9:23 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-29  9:22 [PATCH 1/7] habanalabs: set command buffer host VA dynamically Oded Gabbay
2022-08-29  9:22 ` [PATCH 2/7] habanalabs: perform context switch flow only if needed Oded Gabbay
2022-08-29  9:22 ` [PATCH 3/7] habanalabs: ignore EEPROM errors during boot Oded Gabbay
2022-08-29  9:22 ` Oded Gabbay [this message]
2022-08-29  9:22 ` [PATCH 5/7] habanalabs: allow control device open during reset Oded Gabbay
2022-08-29  9:22 ` [PATCH 6/7] habanalabs/gaudi2: dump detailed information upon RAZWI Oded Gabbay
2022-08-29  9:22 ` [PATCH 7/7] habanalabs: send device active message to f/w Oded Gabbay

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220829092254.930753-4-ogabbay@kernel.org \
    --to=ogabbay@kernel.org \
    --cc=fkassabri@habana.ai \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.