From: Oded Gabbay <oded.gabbay@gmail.com>
To: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org
Subject: [PATCH 12/15] habanalabs: soft-reset device if context-switch fails
Date: Thu, 28 Feb 2019 10:46:21 +0200
Message-ID: <20190228084624.25288-13-oded.gabbay@gmail.com>
In-Reply-To: <20190228084624.25288-1-oded.gabbay@gmail.com>

This patch fixes a bug in the driver: if the TPC or MME remains in a
non-IDLE state even after all command submissions are done (due to a user
bug or a malicious user), future command submissions fail in the
context-switch stage and the driver remains in "stuck" mode.

The fix is to do a soft-reset of the device in case the context-switch
fails, because the device should be IDLE during context-switch. If it is
not IDLE, then something is wrong and we should reset the compute engines.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/command_submission.c | 16 +++++++++-------
 drivers/misc/habanalabs/goya/goya.c          |  2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)
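
For readers who want the control flow in isolation, the deferred-reset
pattern described in the commit message can be sketched as a small
standalone program. This is only an illustration and not driver code:
struct mock_dev and the mock_* functions are hypothetical stand-ins, and
the lock_held flag assumes, purely for the sake of the example, that the
deadlock mentioned in the code comment involves the lock held around the
context switch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not the driver's struct hl_device. */
struct mock_dev {
	bool idle;		/* are the compute engines idle? */
	bool lock_held;		/* models the lock held during context switch */
	int soft_resets;	/* number of soft resets performed so far */
};

/* Models the context switch: refuses to run while the device is busy. */
static int mock_context_switch(struct mock_dev *dev)
{
	if (!dev->idle)
		return -EBUSY;
	return 0;
}

/* Models the soft reset; assumed unsafe to call with the lock held. */
static void mock_soft_reset(struct mock_dev *dev)
{
	assert(!dev->lock_held);
	dev->idle = true;
	dev->soft_resets++;
}

/*
 * Models the submission path: remember that a reset is needed while the
 * lock is held, and perform the reset only after the lock is released.
 */
static int mock_submit(struct mock_dev *dev)
{
	bool need_soft_reset = false;
	int rc;

	dev->lock_held = true;
	rc = mock_context_switch(dev);
	if (rc == -ETIMEDOUT || rc == -EBUSY)
		need_soft_reset = true;
	dev->lock_held = false;

	if (need_soft_reset)
		mock_soft_reset(dev);

	return rc;
}
```

In this sketch, a submission against a non-idle mock device still returns
-EBUSY to the caller, but it leaves the device reset so that the next
submission can proceed instead of failing forever.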

diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
index 25ad9d805cfa..3525236ed8d9 100644
--- a/drivers/misc/habanalabs/command_submission.c
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -622,13 +622,15 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
 					"Failed to switch to context %d, rejecting CS! %d\n",
 					ctx->asid, rc);
 				/*
-				 * If we timedout, we need to soft-reset because
-				 * QMAN is probably stuck. However, we can't
-				 * call to reset here directly because of
-				 * deadlock, so need to do it at the very end
-				 * of this function
+				 * If we timedout, or if the device is not IDLE
+				 * while we want to do context-switch (-EBUSY),
+				 * we need to soft-reset because QMAN is
+				 * probably stuck. However, we can't call to
+				 * reset here directly because of deadlock, so
+				 * need to do it at the very end of this
+				 * function
 				 */
-				if (rc == -ETIMEDOUT)
+				if ((rc == -ETIMEDOUT) || (rc == -EBUSY))
 					need_soft_reset = true;
 				mutex_unlock(&hpriv->restore_phase_mutex);
 				goto out;
@@ -706,7 +708,7 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
 		args->out.seq = cs_seq;
 	}
 
-	if ((rc == -ETIMEDOUT) && (need_soft_reset))
+	if (((rc == -ETIMEDOUT) || (rc == -EBUSY)) && (need_soft_reset))
 		hl_device_reset(hdev, false, false);
 
 	return rc;
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 39824214ce61..11597432f519 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -3138,7 +3138,7 @@ static int goya_send_job_on_qman0(struct hl_device *hdev, struct hl_cs_job *job)
 	if (!hdev->asic_funcs->is_device_idle(hdev)) {
 		dev_err_ratelimited(hdev->dev,
 			"Can't send KMD job on QMAN0 if device is not idle\n");
-		return -EFAULT;
+		return -EBUSY;
 	}
 
 	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
-- 
2.17.1

