Intel-XE Archive on lore.kernel.org
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: <intel-xe@lists.freedesktop.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Matthew Brost <matthew.brost@intel.com>
Subject: [PATCH 3/6] drm/xe: Add extra busted protection for the no GuC reset
Date: Fri, 15 Mar 2024 10:01:05 -0400
Message-ID: <20240315140108.217862-3-rodrigo.vivi@intel.com>
In-Reply-To: <20240315140108.217862-1-rodrigo.vivi@intel.com>

When GuC doesn't reset the GPU on our behalf, we need to be
extra cautious on timeout and skip rescheduling jobs or
manually forcing a gt_reset. Otherwise we get into an infinite
loop of timeout and reschedule.

So, this is preparation for introducing the busted mode,
where the device is declared busted on any single timeout/hang
without allowing GuC to reset.

XXX: This is enough to get a clean stop for the software
validation teams to debug the memory. However, the device
unbind will splat some WARNs because memory is not entirely
freed, since the hw_fences were not released.

Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
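As a reading aid (not part of the patch), the control flow this change
aims for can be modeled in user-space C. This is a minimal sketch under
stated assumptions: model_device, model_queue, queue_killed_or_banned()
and timedout_job() are hypothetical stand-ins, not the xe driver's
structures or functions, and the real timeout handler does considerably
more work than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for scheduler return codes; not the DRM enum. */
enum model_stat { STAT_NOMINAL, STAT_ENODEV };

/* Hypothetical device: only the "busted" flag matters for this sketch. */
struct model_device {
	bool busted;
};

/* Hypothetical exec queue with just enough state to show the gating. */
struct model_queue {
	struct model_device *xe;
	bool killed;
	bool banned;
	int resubmits;	/* how often a timed-out job was re-queued */
};

/* A busted device makes every queue behave as if killed or banned. */
static bool queue_killed_or_banned(const struct model_queue *q)
{
	return q->xe->busted || q->killed || q->banned;
}

/*
 * Model of the timeout path: re-queue the job only while the device is
 * not busted, and report "no device" once it is, so the caller stops
 * scheduling further work instead of looping on timeout and reschedule.
 */
static enum model_stat timedout_job(struct model_queue *q)
{
	assert(q && q->xe);

	if (!q->xe->busted)
		q->resubmits++;

	return q->xe->busted ? STAT_ENODEV : STAT_NOMINAL;
}
```

With busted clear, each call to timedout_job() re-queues the job and
returns STAT_NOMINAL. Once busted is set, the counter stops growing and
the call returns STAT_ENODEV, which is the loop-breaking behaviour the
commit message describes.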
 drivers/gpu/drm/xe/xe_guc_submit.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 82c955a2a15c..ee663683e9eb 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -176,7 +176,8 @@ static void set_exec_queue_killed(struct xe_exec_queue *q)
 
 static bool exec_queue_killed_or_banned(struct xe_exec_queue *q)
 {
-	return exec_queue_killed(q) || exec_queue_banned(q);
+	return xe_device_busted(gt_to_xe(q->gt)) ||
+		exec_queue_killed(q) || exec_queue_banned(q);
 }
 
 #ifdef CONFIG_PROVE_LOCKING
@@ -960,7 +961,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 */
 	if (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
 	    (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q))) {
-		if (!xe_sched_invalidate_job(job, 2)) {
+		if (!xe_sched_invalidate_job(job, 2) && !xe_device_busted(xe)) {
 			xe_sched_add_pending_job(sched, job);
 			xe_sched_submission_start(sched);
 			xe_gt_reset_async(q->gt);
@@ -969,7 +970,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	}
 
 	/* Engine state now stable, disable scheduling if needed */
-	if (exec_queue_registered(q)) {
+	if (exec_queue_registered(q) && !xe_device_busted(xe)) {
 		struct xe_guc *guc = exec_queue_to_guc(q);
 		int ret;
 
@@ -1010,8 +1011,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * Fence state now stable, stop / start scheduler which cleans up any
 	 * fences that are complete
 	 */
-	xe_sched_add_pending_job(sched, job);
-	xe_sched_submission_start(sched);
+	if (!xe_device_busted(xe)) {
+		xe_sched_add_pending_job(sched, job);
+		xe_sched_submission_start(sched);
+	}
+
 	xe_guc_exec_queue_trigger_cleanup(q);
 
 	/* Mark all outstanding jobs as bad, thus completing them */
@@ -1024,7 +1028,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	xe_hw_fence_irq_start(q->fence_irq);
 
 out:
-	return DRM_GPU_SCHED_STAT_NOMINAL;
+	return xe_device_busted(xe) ? DRM_GPU_SCHED_STAT_ENODEV :
+		DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
 static void __guc_exec_queue_fini_async(struct work_struct *w)
-- 
2.44.0

