From: Oded Gabbay <ogabbay@kernel.org>
To: dri-devel@lists.freedesktop.org
Cc: Ofir Bitton <obitton@habana.ai>
Subject: [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat
Date: Mon, 1 May 2023 12:47:50 +0300 [thread overview]
Message-ID: <20230501094754.100030-2-ogabbay@kernel.org> (raw)
In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org>
From: Ofir Bitton <obitton@habana.ai>
Currently upon a heartbeat failure, we don't know if the failure
is due to firmware hang or due to a bad PCI link. Hence, we
are reading a PCI config space register with a known value (vendor ID)
so we will know which of the two possibilities caused the heartbeat
failure.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/device.c | 15 ++++++++++++++-
drivers/accel/habanalabs/common/habanalabs.h | 2 ++
drivers/accel/habanalabs/common/habanalabs_drv.c | 2 --
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index c8f4a34d2831..98b46b2e1898 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -981,6 +981,18 @@ static void device_early_fini(struct hl_device *hdev)
hdev->asic_funcs->early_fini(hdev);
}
+static bool is_pci_link_healthy(struct hl_device *hdev)
+{
+ u16 vendor_id;
+
+ if (!hdev->pdev)
+ return false;
+
+ pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id);
+
+ return (vendor_id == PCI_VENDOR_ID_HABANALABS);
+}
+
static void hl_device_heartbeat(struct work_struct *work)
{
struct hl_device *hdev = container_of(work, struct hl_device,
@@ -995,7 +1007,8 @@ static void hl_device_heartbeat(struct work_struct *work)
goto reschedule;
if (hl_device_operational(hdev, NULL))
- dev_err(hdev->dev, "Device heartbeat failed!\n");
+ dev_err(hdev->dev, "Device heartbeat failed! PCI link is %s\n",
+ is_pci_link_healthy(hdev) ? "healthy" : "broken");
info.err_type = HL_INFO_FW_HEARTBEAT_ERR;
info.event_mask = &event_mask;
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index f83ea96c6530..ad412cc01aba 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -36,6 +36,8 @@
struct hl_device;
struct hl_fpriv;
+#define PCI_VENDOR_ID_HABANALABS 0x1da3
+
/* Use upper bits of mmap offset to store habana driver specific information.
* bits[63:59] - Encode mmap type
* bits[45:0] - mmap offset value
diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c
index a4b3f50f1cba..ea80638fc6b9 100644
--- a/drivers/accel/habanalabs/common/habanalabs_drv.c
+++ b/drivers/accel/habanalabs/common/habanalabs_drv.c
@@ -54,8 +54,6 @@ module_param(boot_error_status_mask, ulong, 0444);
MODULE_PARM_DESC(boot_error_status_mask,
"Mask of the error status during device CPU boot (If bitX is cleared then error X is masked. Default all 1's)");
-#define PCI_VENDOR_ID_HABANALABS 0x1da3
-
#define PCI_IDS_GOYA 0x0001
#define PCI_IDS_GAUDI 0x1000
#define PCI_IDS_GAUDI_SEC 0x1010
--
2.40.1
next prev parent reply other threads:[~2023-05-01 9:48 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-01 9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
2023-05-01 9:47 ` Oded Gabbay [this message]
2023-05-01 9:47 ` [PATCH 3/6] accel/habanalabs: expose debugfs files later Oded Gabbay
2023-05-01 9:47 ` [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd Oded Gabbay
2023-05-01 9:47 ` [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y' Oded Gabbay
2023-05-01 9:47 ` [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info Oded Gabbay
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230501094754.100030-2-ogabbay@kernel.org \
--to=ogabbay@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=obitton@habana.ai \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.