public inbox for linux-kernel@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Ofir Bitton <obitton@habana.ai>, Oded Gabbay <ogabbay@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	gregkh@linuxfoundation.org, ttayar@habana.ai, osharabi@habana.ai,
	oshpigelman@habana.ai, mhaimovski@habana.ai, dliberman@habana.ai,
	fkassabri@habana.ai, bjauhari@habana.ai, talcohen@habana.ai,
	dhirschfeld@habana.ai, rostedt@goodmis.org
Subject: [PATCH AUTOSEL 6.1 23/41] accel/habanalabs: add pci health check during heartbeat
Date: Sun, 23 Jul 2023 21:20:56 -0400
Message-ID: <20230724012118.2316073-23-sashal@kernel.org>
In-Reply-To: <20230724012118.2316073-1-sashal@kernel.org>

From: Ofir Bitton <obitton@habana.ai>

[ Upstream commit d8b9cea584661b30305cf341bf9f675dc0a25471 ]

Currently, upon a heartbeat failure, we don't know whether the failure
is due to a firmware hang or to a bad PCI link. Hence, read a PCI
config space register with a known value (the vendor ID) so that we
can tell which of the two possibilities caused the heartbeat failure.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
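Note (illustration only, not part of the patch): the check described
above can be sketched in user space. This is a minimal sketch, assuming
that a config read over a dead link returns all ones (0xffff), which is
typical but platform dependent. struct mock_pci_dev and every mock_*
name below are hypothetical stand-ins, not kernel APIs; only the vendor
ID value 0x1da3 and the "healthy"/"broken" strings come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ID_HABANALABS	0x1da3u	/* value taken from the patch */
#define CFG_READ_ALL_ONES	0xffffu	/* assumption: typical read result on a dead link */

/* Hypothetical stand-in for a PCI device; not a kernel type. */
struct mock_pci_dev {
	bool link_up;
	uint16_t vendor_id;
};

/* Hypothetical stand-in for pci_read_config_word() at PCI_VENDOR_ID. */
void mock_read_vendor_id(const struct mock_pci_dev *dev, uint16_t *val)
{
	assert(val != NULL);
	*val = dev->link_up ? dev->vendor_id : CFG_READ_ALL_ONES;
}

/* Mirrors the shape of is_pci_link_healthy() in the patch. */
bool mock_is_pci_link_healthy(const struct mock_pci_dev *dev)
{
	uint16_t vendor_id;

	if (!dev)
		return false;

	mock_read_vendor_id(dev, &vendor_id);

	return vendor_id == VENDOR_ID_HABANALABS;
}

/* Word that the patch appends to the heartbeat failure message. */
const char *mock_link_state(const struct mock_pci_dev *dev)
{
	return mock_is_pci_link_healthy(dev) ? "healthy" : "broken";
}
```

If the vendor ID reads back correctly, the link is usable and a firmware
hang is the more likely cause of the missed heartbeat; any other value
points at the PCI link itself.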
 drivers/misc/habanalabs/common/device.c         | 15 ++++++++++++++-
 drivers/misc/habanalabs/common/habanalabs.h     |  2 ++
 drivers/misc/habanalabs/common/habanalabs_drv.c |  2 --
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index e0dca445abf14..9ee1b6abd8a05 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -870,6 +870,18 @@ static void device_early_fini(struct hl_device *hdev)
 		hdev->asic_funcs->early_fini(hdev);
 }
 
+static bool is_pci_link_healthy(struct hl_device *hdev)
+{
+	u16 vendor_id;
+
+	if (!hdev->pdev)
+		return false;
+
+	pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id);
+
+	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
+}
+
 static void hl_device_heartbeat(struct work_struct *work)
 {
 	struct hl_device *hdev = container_of(work, struct hl_device,
@@ -882,7 +894,8 @@ static void hl_device_heartbeat(struct work_struct *work)
 		goto reschedule;
 
 	if (hl_device_operational(hdev, NULL))
-		dev_err(hdev->dev, "Device heartbeat failed!\n");
+		dev_err(hdev->dev, "Device heartbeat failed! PCI link is %s\n",
+			is_pci_link_healthy(hdev) ? "healthy" : "broken");
 
 	hl_device_reset(hdev, HL_DRV_RESET_HARD | HL_DRV_RESET_HEARTBEAT);
 
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 58c95b13be69a..257b94cec6248 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -34,6 +34,8 @@
 struct hl_device;
 struct hl_fpriv;
 
+#define PCI_VENDOR_ID_HABANALABS	0x1da3
+
 /* Use upper bits of mmap offset to store habana driver specific information.
  * bits[63:59] - Encode mmap type
  * bits[45:0]  - mmap offset value
diff --git a/drivers/misc/habanalabs/common/habanalabs_drv.c b/drivers/misc/habanalabs/common/habanalabs_drv.c
index 112632afe7d53..ae3cab3f4aa55 100644
--- a/drivers/misc/habanalabs/common/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/common/habanalabs_drv.c
@@ -54,8 +54,6 @@ module_param(boot_error_status_mask, ulong, 0444);
 MODULE_PARM_DESC(boot_error_status_mask,
 	"Mask of the error status during device CPU boot (If bitX is cleared then error X is masked. Default all 1's)");
 
-#define PCI_VENDOR_ID_HABANALABS	0x1da3
-
 #define PCI_IDS_GOYA			0x0001
 #define PCI_IDS_GAUDI			0x1000
 #define PCI_IDS_GAUDI_SEC		0x1010
-- 
2.39.2
