From: Oded Gabbay <oded.gabbay@gmail.com>
To: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org
Subject: [PATCH 03/15] habanalabs: disable CPU access on timeouts
Date: Thu, 28 Feb 2019 10:46:12 +0200 [thread overview]
Message-ID: <20190228084624.25288-4-oded.gabbay@gmail.com> (raw)
In-Reply-To: <20190228084624.25288-1-oded.gabbay@gmail.com>
This patch provides a workaround for a bug in the F/W where the response
time for a request from KMD may take more then 100ms. This could cause the
queue between KMD and the F/W to get out of sync.
The WA is to:
1. Increase the timeout of ALL requests to 1s.
2. In case a request isn't answered in time, mark the state as
"cpu_disabled" and prevent sending further requests from KMD to the F/W.
This will eventually lead to a heartbeat failure and hard reset of the
device.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
drivers/misc/habanalabs/debugfs.c | 6 ++++--
drivers/misc/habanalabs/device.c | 2 ++
drivers/misc/habanalabs/goya/goya.c | 9 +++++++--
drivers/misc/habanalabs/habanalabs.h | 2 ++
drivers/misc/habanalabs/hwmon.c | 2 +-
drivers/misc/habanalabs/sysfs.c | 4 ++--
6 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/drivers/misc/habanalabs/debugfs.c b/drivers/misc/habanalabs/debugfs.c
index f472b572faea..1d2bbcf90f16 100644
--- a/drivers/misc/habanalabs/debugfs.c
+++ b/drivers/misc/habanalabs/debugfs.c
@@ -723,7 +723,7 @@ static ssize_t hl_device_read(struct file *f, char __user *buf,
return 0;
sprintf(tmp_buf,
- "Valid values are: disable, enable, suspend, resume\n");
+ "Valid values: disable, enable, suspend, resume, cpu_timeout\n");
rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
strlen(tmp_buf) + 1);
@@ -751,9 +751,11 @@ static ssize_t hl_device_write(struct file *f, const char __user *buf,
hdev->asic_funcs->suspend(hdev);
} else if (strncmp("resume", data, strlen("resume")) == 0) {
hdev->asic_funcs->resume(hdev);
+ } else if (strncmp("cpu_timeout", data, strlen("cpu_timeout")) == 0) {
+ hdev->device_cpu_disabled = true;
} else {
dev_err(hdev->dev,
- "Valid values are: disable, enable, suspend, resume\n");
+ "Valid values: disable, enable, suspend, resume, cpu_timeout\n");
count = -EINVAL;
}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 120d30a13afb..de46aa6ed154 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -636,6 +636,8 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
/* Finished tear-down, starting to re-initialize */
if (hard_reset) {
+ hdev->device_cpu_disabled = false;
+
/* Allocate the kernel context */
hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx),
GFP_KERNEL);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 7c2edabe20bd..5780041abe32 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -3232,6 +3232,11 @@ int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
if (hdev->disabled)
goto out;
+ if (hdev->device_cpu_disabled) {
+ rc = -EIO;
+ goto out;
+ }
+
rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
pkt_dma_addr);
if (rc) {
@@ -3245,8 +3250,8 @@ int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
if (rc == -ETIMEDOUT) {
- dev_err(hdev->dev,
- "Timeout while waiting for CPU packet fence\n");
+ dev_err(hdev->dev, "Timeout while waiting for device CPU\n");
+ hdev->device_cpu_disabled = true;
goto out;
}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 59b25c6fae00..a7c95e9f9b9a 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -1079,6 +1079,7 @@ struct hl_device_reset_work {
* @dram_default_page_mapping: is DRAM default page mapping enabled.
* @init_done: is the initialization of the device done.
* @mmu_enable: is MMU enabled.
+ * @device_cpu_disabled: is the device CPU disabled (due to timeouts)
*/
struct hl_device {
struct pci_dev *pdev;
@@ -1146,6 +1147,7 @@ struct hl_device {
u8 dram_supports_virtual_memory;
u8 dram_default_page_mapping;
u8 init_done;
+ u8 device_cpu_disabled;
/* Parameters for bring-up */
u8 mmu_enable;
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
index 9c359a1dd868..7eec21f9b96e 100644
--- a/drivers/misc/habanalabs/hwmon.c
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -10,7 +10,7 @@
#include <linux/pci.h>
#include <linux/hwmon.h>
-#define SENSORS_PKT_TIMEOUT 100000 /* 100ms */
+#define SENSORS_PKT_TIMEOUT 1000000 /* 1s */
#define HWMON_NR_SENSOR_TYPES (hwmon_pwm + 1)
int hl_build_hwmon_channel_info(struct hl_device *hdev,
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index 6d80e7e0885c..12c782112a8c 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -9,8 +9,8 @@
#include <linux/pci.h>
-#define SET_CLK_PKT_TIMEOUT 200000 /* 200ms */
-#define SET_PWR_PKT_TIMEOUT 400000 /* 400ms */
+#define SET_CLK_PKT_TIMEOUT 1000000 /* 1s */
+#define SET_PWR_PKT_TIMEOUT 1000000 /* 1s */
long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
{
--
2.17.1
next prev parent reply other threads:[~2019-02-28 8:46 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-28 8:46 [PATCH 00/15] habanalabs fixes for merge window Oded Gabbay
2019-02-28 8:46 ` [PATCH 01/15] habanalabs: Dissociate RAZWI info from event types Oded Gabbay
2019-02-28 8:46 ` [PATCH 02/15] habanalabs: add MMU DRAM default page mapping Oded Gabbay
2019-02-28 8:46 ` Oded Gabbay [this message]
2019-02-28 8:46 ` [PATCH 04/15] habanalabs: fix mmu cache registers init Oded Gabbay
2019-02-28 8:46 ` [PATCH 05/15] habanalabs: fix validation of WREG32 to DMA completion Oded Gabbay
2019-02-28 8:46 ` [PATCH 06/15] habanalabs: set DMA0 completion to SOB 1007 Oded Gabbay
2019-02-28 8:46 ` [PATCH 07/15] habanalabs: extend QMAN0 job timeout Oded Gabbay
2019-02-28 8:46 ` [PATCH 08/15] habanalabs: add comments in uapi/misc/habanalabs.h Oded Gabbay
2019-02-28 8:46 ` [PATCH 09/15] habanalabs: return correct error code on MMU mapping failure Oded Gabbay
2019-02-28 8:46 ` [PATCH 10/15] habanalabs: fix memory leak with CBs with unaligned size Oded Gabbay
2019-02-28 8:46 ` [PATCH 11/15] habanalabs: print pointer using %p Oded Gabbay
2019-02-28 9:31 ` Greg KH
2019-02-28 9:47 ` Oded Gabbay
2019-02-28 8:46 ` [PATCH 12/15] habanalabs: soft-reset device if context-switch fails Oded Gabbay
2019-02-28 8:46 ` [PATCH 13/15] habanalabs: fix little-endian<->cpu conversion warnings Oded Gabbay
2019-02-28 8:46 ` [PATCH 14/15] habanalabs: use NULL to initialize array of pointers Oded Gabbay
2019-02-28 8:46 ` [PATCH 15/15] habanalabs: fix little-endian<->cpu conversion warnings Oded Gabbay
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190228084624.25288-4-oded.gabbay@gmail.com \
--to=oded.gabbay@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.