From: Oded Gabbay <oded.gabbay@gmail.com>
To: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org
Subject: [PATCH 12/15] habanalabs: soft-reset device if context-switch fails
Date: Thu, 28 Feb 2019 10:46:21 +0200 [thread overview]
Message-ID: <20190228084624.25288-13-oded.gabbay@gmail.com> (raw)
In-Reply-To: <20190228084624.25288-1-oded.gabbay@gmail.com>
This patch fix a bug in the driver, where if the TPC or MME remains in
non-IDLE even after all the command submissions are done (due to user bug
or malicious user), then future command submissions will fail in the
context-switch stage and the driver will remain in "stuck" mode.
The fix is to do a soft-reset of the device in case the context-switch
fails, because the device should be IDLE during context-switch. If it is
not IDLE, then something is wrong and we should reset the compute engines.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
drivers/misc/habanalabs/command_submission.c | 16 +++++++++-------
drivers/misc/habanalabs/goya/goya.c | 2 +-
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
index 25ad9d805cfa..3525236ed8d9 100644
--- a/drivers/misc/habanalabs/command_submission.c
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -622,13 +622,15 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
"Failed to switch to context %d, rejecting CS! %d\n",
ctx->asid, rc);
/*
- * If we timedout, we need to soft-reset because
- * QMAN is probably stuck. However, we can't
- * call to reset here directly because of
- * deadlock, so need to do it at the very end
- * of this function
+ * If we timedout, or if the device is not IDLE
+ * while we want to do context-switch (-EBUSY),
+ * we need to soft-reset because QMAN is
+ * probably stuck. However, we can't call to
+ * reset here directly because of deadlock, so
+ * need to do it at the very end of this
+ * function
*/
- if (rc == -ETIMEDOUT)
+ if ((rc == -ETIMEDOUT) || (rc == -EBUSY))
need_soft_reset = true;
mutex_unlock(&hpriv->restore_phase_mutex);
goto out;
@@ -706,7 +708,7 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
args->out.seq = cs_seq;
}
- if ((rc == -ETIMEDOUT) && (need_soft_reset))
+ if (((rc == -ETIMEDOUT) || (rc == -EBUSY)) && (need_soft_reset))
hl_device_reset(hdev, false, false);
return rc;
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 39824214ce61..11597432f519 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -3138,7 +3138,7 @@ static int goya_send_job_on_qman0(struct hl_device *hdev, struct hl_cs_job *job)
if (!hdev->asic_funcs->is_device_idle(hdev)) {
dev_err_ratelimited(hdev->dev,
"Can't send KMD job on QMAN0 if device is not idle\n");
- return -EFAULT;
+ return -EBUSY;
}
fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
--
2.17.1
next prev parent reply other threads:[~2019-02-28 8:47 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-28 8:46 [PATCH 00/15] habanalabs fixes for merge window Oded Gabbay
2019-02-28 8:46 ` [PATCH 01/15] habanalabs: Dissociate RAZWI info from event types Oded Gabbay
2019-02-28 8:46 ` [PATCH 02/15] habanalabs: add MMU DRAM default page mapping Oded Gabbay
2019-02-28 8:46 ` [PATCH 03/15] habanalabs: disable CPU access on timeouts Oded Gabbay
2019-02-28 8:46 ` [PATCH 04/15] habanalabs: fix mmu cache registers init Oded Gabbay
2019-02-28 8:46 ` [PATCH 05/15] habanalabs: fix validation of WREG32 to DMA completion Oded Gabbay
2019-02-28 8:46 ` [PATCH 06/15] habanalabs: set DMA0 completion to SOB 1007 Oded Gabbay
2019-02-28 8:46 ` [PATCH 07/15] habanalabs: extend QMAN0 job timeout Oded Gabbay
2019-02-28 8:46 ` [PATCH 08/15] habanalabs: add comments in uapi/misc/habanalabs.h Oded Gabbay
2019-02-28 8:46 ` [PATCH 09/15] habanalabs: return correct error code on MMU mapping failure Oded Gabbay
2019-02-28 8:46 ` [PATCH 10/15] habanalabs: fix memory leak with CBs with unaligned size Oded Gabbay
2019-02-28 8:46 ` [PATCH 11/15] habanalabs: print pointer using %p Oded Gabbay
2019-02-28 9:31 ` Greg KH
2019-02-28 9:47 ` Oded Gabbay
2019-02-28 8:46 ` Oded Gabbay [this message]
2019-02-28 8:46 ` [PATCH 13/15] habanalabs: fix little-endian<->cpu conversion warnings Oded Gabbay
2019-02-28 8:46 ` [PATCH 14/15] habanalabs: use NULL to initialize array of pointers Oded Gabbay
2019-02-28 8:46 ` [PATCH 15/15] habanalabs: fix little-endian<->cpu conversion warnings Oded Gabbay
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190228084624.25288-13-oded.gabbay@gmail.com \
--to=oded.gabbay@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox