From: Oded Gabbay <oded.gabbay@gmail.com>
To: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org
Subject: [PATCH 03/15] habanalabs: disable CPU access on timeouts
Date: Thu, 28 Feb 2019 10:46:12 +0200
Message-ID: <20190228084624.25288-4-oded.gabbay@gmail.com>
In-Reply-To: <20190228084624.25288-1-oded.gabbay@gmail.com>

This patch works around a F/W bug where the response to a request from the
KMD may take more than 100ms. Such a delay can cause the queue between the
KMD and the F/W to get out of sync.

The workaround is to:
1. Increase the timeout of ALL requests to 1s.
2. If a request isn't answered in time, mark the device CPU as disabled
(device_cpu_disabled) and prevent the KMD from sending further requests to
the F/W. This will eventually lead to a heartbeat failure and a hard reset
of the device, which clears the flag.
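
For illustration only, the latch described in step 2 can be modelled in
user space roughly as below. This is a sketch, not the driver code:
struct fake_dev, fw_reply_fn, fw_replies(), fw_hangs() and hard_reset()
are hypothetical names made up for the example, and the F/W is reduced to
a callback that either replies or times out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's device state (not struct hl_device). */
struct fake_dev {
	bool cpu_disabled;	/* latched after the first timeout */
	unsigned int sent;	/* requests actually handed to the F/W */
};

/* Hypothetical F/W model: returns 0 on a reply, -ETIMEDOUT on no reply. */
typedef int (*fw_reply_fn)(void);

static int fw_replies(void) { return 0; }
static int fw_hangs(void) { return -ETIMEDOUT; }

/* Send one request unless an earlier timeout already latched the flag. */
static int send_cpu_message(struct fake_dev *dev, fw_reply_fn fw)
{
	int rc;

	assert(dev && fw);

	if (dev->cpu_disabled)
		return -EIO;	/* do not touch the queue again */

	dev->sent++;
	rc = fw();
	if (rc == -ETIMEDOUT)
		dev->cpu_disabled = true;	/* block all later requests */

	return rc;
}

/* In this model, a hard reset re-enables communication with the device CPU. */
static void hard_reset(struct fake_dev *dev)
{
	dev->cpu_disabled = false;
}
```

In this model, one call with fw_hangs returns -ETIMEDOUT, every later call
returns -EIO without incrementing sent, and only hard_reset() lets
requests through again.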

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/debugfs.c    | 6 ++++--
 drivers/misc/habanalabs/device.c     | 2 ++
 drivers/misc/habanalabs/goya/goya.c  | 9 +++++++--
 drivers/misc/habanalabs/habanalabs.h | 2 ++
 drivers/misc/habanalabs/hwmon.c      | 2 +-
 drivers/misc/habanalabs/sysfs.c      | 4 ++--
 6 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/habanalabs/debugfs.c b/drivers/misc/habanalabs/debugfs.c
index f472b572faea..1d2bbcf90f16 100644
--- a/drivers/misc/habanalabs/debugfs.c
+++ b/drivers/misc/habanalabs/debugfs.c
@@ -723,7 +723,7 @@ static ssize_t hl_device_read(struct file *f, char __user *buf,
 		return 0;
 
 	sprintf(tmp_buf,
-		"Valid values are: disable, enable, suspend, resume\n");
+		"Valid values: disable, enable, suspend, resume, cpu_timeout\n");
 	rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
 			strlen(tmp_buf) + 1);
 
@@ -751,9 +751,11 @@ static ssize_t hl_device_write(struct file *f, const char __user *buf,
 		hdev->asic_funcs->suspend(hdev);
 	} else if (strncmp("resume", data, strlen("resume")) == 0) {
 		hdev->asic_funcs->resume(hdev);
+	} else if (strncmp("cpu_timeout", data, strlen("cpu_timeout")) == 0) {
+		hdev->device_cpu_disabled = true;
 	} else {
 		dev_err(hdev->dev,
-			"Valid values are: disable, enable, suspend, resume\n");
+			"Valid values: disable, enable, suspend, resume, cpu_timeout\n");
 		count = -EINVAL;
 	}
 
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 120d30a13afb..de46aa6ed154 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -636,6 +636,8 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	/* Finished tear-down, starting to re-initialize */
 
 	if (hard_reset) {
+		hdev->device_cpu_disabled = false;
+
 		/* Allocate the kernel context */
 		hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx),
 						GFP_KERNEL);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 7c2edabe20bd..5780041abe32 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -3232,6 +3232,11 @@ int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 	if (hdev->disabled)
 		goto out;
 
+	if (hdev->device_cpu_disabled) {
+		rc = -EIO;
+		goto out;
+	}
+
 	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
 			pkt_dma_addr);
 	if (rc) {
@@ -3245,8 +3250,8 @@ int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
 
 	if (rc == -ETIMEDOUT) {
-		dev_err(hdev->dev,
-			"Timeout while waiting for CPU packet fence\n");
+		dev_err(hdev->dev, "Timeout while waiting for device CPU\n");
+		hdev->device_cpu_disabled = true;
 		goto out;
 	}
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 59b25c6fae00..a7c95e9f9b9a 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -1079,6 +1079,7 @@ struct hl_device_reset_work {
  * @dram_default_page_mapping: is DRAM default page mapping enabled.
  * @init_done: is the initialization of the device done.
  * @mmu_enable: is MMU enabled.
+ * @device_cpu_disabled: is the device CPU disabled (due to timeouts)
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -1146,6 +1147,7 @@ struct hl_device {
 	u8				dram_supports_virtual_memory;
 	u8				dram_default_page_mapping;
 	u8				init_done;
+	u8				device_cpu_disabled;
 
 	/* Parameters for bring-up */
 	u8				mmu_enable;
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
index 9c359a1dd868..7eec21f9b96e 100644
--- a/drivers/misc/habanalabs/hwmon.c
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -10,7 +10,7 @@
 #include <linux/pci.h>
 #include <linux/hwmon.h>
 
-#define SENSORS_PKT_TIMEOUT		100000	/* 100ms */
+#define SENSORS_PKT_TIMEOUT		1000000	/* 1s */
 #define HWMON_NR_SENSOR_TYPES		(hwmon_pwm + 1)
 
 int hl_build_hwmon_channel_info(struct hl_device *hdev,
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index 6d80e7e0885c..12c782112a8c 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -9,8 +9,8 @@
 
 #include <linux/pci.h>
 
-#define SET_CLK_PKT_TIMEOUT	200000	/* 200ms */
-#define SET_PWR_PKT_TIMEOUT	400000	/* 400ms */
+#define SET_CLK_PKT_TIMEOUT	1000000	/* 1s */
+#define SET_PWR_PKT_TIMEOUT	1000000	/* 1s */
 
 long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
 {
-- 
2.17.1

