From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9FDDEC001DE for ; Mon, 24 Jul 2023 01:24:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231290AbjGXBYg (ORCPT ); Sun, 23 Jul 2023 21:24:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52098 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231348AbjGXBXT (ORCPT ); Sun, 23 Jul 2023 21:23:19 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0A2C019AA; Sun, 23 Jul 2023 18:22:49 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 57B4860DFE; Mon, 24 Jul 2023 01:22:43 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 46BDEC433C8; Mon, 24 Jul 2023 01:22:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1690161762; bh=OtuaxbPf17KiLovSIlaUXgyLjCOeuhCpNENhvVXzWog=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=d5lO0Dte/mzVatYr1jn6FOl/VIO3eKBk8mgRoXDF59qYNMXaDowKr56A3bX8LC66x O6mx+ZbAPyeTi4+vVjqpzBwXJwccNCx7wKpqw+2jord908eZ1j5tif1BYbflZka9cX oXwIm6XtiYWkEZ57avtuXh81jAhasYzIZKBlhcH1Vwvge0Ul34NauCEjXpWy7RLDyz oIZep++CLtu59XwhUpoG2emgHDGcQsdAyhSQcORla6TAzYOsI1ig72WqIXN8wOAK4a 4jQkYgU9nxDEl52GqKyeWNuY52WlfkDUMRsaigMZUFwvbNe38rPMBpVGCmNXiQTdfb GS7rUUXccysXg== From: Sasha Levin To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Ofir Bitton , Oded Gabbay , Sasha Levin , gregkh@linuxfoundation.org, ttayar@habana.ai, osharabi@habana.ai, oshpigelman@habana.ai, mhaimovski@habana.ai, dliberman@habana.ai, fkassabri@habana.ai, bjauhari@habana.ai, talcohen@habana.ai, dhirschfeld@habana.ai, rostedt@goodmis.org Subject: [PATCH AUTOSEL 6.1 23/41] accel/habanalabs: add pci health check during heartbeat Date: Sun, 23 Jul 2023 21:20:56 -0400 Message-Id: <20230724012118.2316073-23-sashal@kernel.org> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20230724012118.2316073-1-sashal@kernel.org> References: <20230724012118.2316073-1-sashal@kernel.org> MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.1.40 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Ofir Bitton [ Upstream commit d8b9cea584661b30305cf341bf9f675dc0a25471 ] Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay Signed-off-by: Sasha Levin --- drivers/misc/habanalabs/common/device.c | 15 ++++++++++++++- drivers/misc/habanalabs/common/habanalabs.h | 2 ++ drivers/misc/habanalabs/common/habanalabs_drv.c | 2 -- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c index e0dca445abf14..9ee1b6abd8a05 100644 --- a/drivers/misc/habanalabs/common/device.c +++ b/drivers/misc/habanalabs/common/device.c @@ -870,6 +870,18 @@ static void device_early_fini(struct hl_device *hdev) hdev->asic_funcs->early_fini(hdev); } +static bool is_pci_link_healthy(struct hl_device *hdev) +{ + u16 vendor_id; + + if (!hdev->pdev) + return false; + + pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id); + + return (vendor_id == PCI_VENDOR_ID_HABANALABS); +} + static void hl_device_heartbeat(struct work_struct *work) { struct hl_device *hdev = container_of(work, struct hl_device, @@ -882,7 +894,8 @@ static void hl_device_heartbeat(struct work_struct *work) goto reschedule; if (hl_device_operational(hdev, NULL)) - dev_err(hdev->dev, "Device heartbeat failed!\n"); + dev_err(hdev->dev, "Device heartbeat failed! PCI link is %s\n", + is_pci_link_healthy(hdev) ? "healthy" : "broken"); hl_device_reset(hdev, HL_DRV_RESET_HARD | HL_DRV_RESET_HEARTBEAT); diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index 58c95b13be69a..257b94cec6248 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -34,6 +34,8 @@ struct hl_device; struct hl_fpriv; +#define PCI_VENDOR_ID_HABANALABS 0x1da3 + /* Use upper bits of mmap offset to store habana driver specific information. * bits[63:59] - Encode mmap type * bits[45:0] - mmap offset value diff --git a/drivers/misc/habanalabs/common/habanalabs_drv.c b/drivers/misc/habanalabs/common/habanalabs_drv.c index 112632afe7d53..ae3cab3f4aa55 100644 --- a/drivers/misc/habanalabs/common/habanalabs_drv.c +++ b/drivers/misc/habanalabs/common/habanalabs_drv.c @@ -54,8 +54,6 @@ module_param(boot_error_status_mask, ulong, 0444); MODULE_PARM_DESC(boot_error_status_mask, "Mask of the error status during device CPU boot (If bitX is cleared then error X is masked. Default all 1's)"); -#define PCI_VENDOR_ID_HABANALABS 0x1da3 - #define PCI_IDS_GOYA 0x0001 #define PCI_IDS_GAUDI 0x1000 #define PCI_IDS_GAUDI_SEC 0x1010 -- 2.39.2