From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: <intel-xe@lists.freedesktop.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>,
Matthew Brost <matthew.brost@intel.com>
Subject: [PATCH 3/6] drm/xe: Add extra busted protection for the no GuC reset
Date: Fri, 15 Mar 2024 10:01:05 -0400
Message-ID: <20240315140108.217862-3-rodrigo.vivi@intel.com>
In-Reply-To: <20240315140108.217862-1-rodrigo.vivi@intel.com>
When GuC doesn't reset the GPU on our behalf, we need to be
extra cautious on timeout: skip rescheduling the job and skip
manually forcing a gt_reset. Otherwise we end up in an infinite
loop of timeout and reschedule.

So, this is a preparation for introducing the busted mode,
where the device gets busted on any single timeout/hang without
allowing GuC to reset.

XXX: This is enough to get a clean stop for the software
validation teams to debug the memory. However, the device
unbind will splat some WARNs because memory is not entirely
freed, since the hw_fences were not released.

Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 82c955a2a15c..ee663683e9eb 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -176,7 +176,8 @@ static void set_exec_queue_killed(struct xe_exec_queue *q)
 
 static bool exec_queue_killed_or_banned(struct xe_exec_queue *q)
 {
-	return exec_queue_killed(q) || exec_queue_banned(q);
+	return xe_device_busted(gt_to_xe(q->gt)) ||
+		exec_queue_killed(q) || exec_queue_banned(q);
 }
 
 #ifdef CONFIG_PROVE_LOCKING
@@ -960,7 +961,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 */
 	if (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
 	    (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q))) {
-		if (!xe_sched_invalidate_job(job, 2)) {
+		if (!xe_sched_invalidate_job(job, 2) && !xe_device_busted(xe)) {
 			xe_sched_add_pending_job(sched, job);
 			xe_sched_submission_start(sched);
 			xe_gt_reset_async(q->gt);
@@ -969,7 +970,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	}
 
 	/* Engine state now stable, disable scheduling if needed */
-	if (exec_queue_registered(q)) {
+	if (exec_queue_registered(q) && !xe_device_busted(xe)) {
 		struct xe_guc *guc = exec_queue_to_guc(q);
 		int ret;
 
@@ -1010,8 +1011,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * Fence state now stable, stop / start scheduler which cleans up any
 	 * fences that are complete
 	 */
-	xe_sched_add_pending_job(sched, job);
-	xe_sched_submission_start(sched);
+	if (!xe_device_busted(xe)) {
+		xe_sched_add_pending_job(sched, job);
+		xe_sched_submission_start(sched);
+	}
+
 	xe_guc_exec_queue_trigger_cleanup(q);
 
 	/* Mark all outstanding jobs as bad, thus completing them */
@@ -1024,7 +1028,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	xe_hw_fence_irq_start(q->fence_irq);
 
 out:
-	return DRM_GPU_SCHED_STAT_NOMINAL;
+	return xe_device_busted(xe) ? DRM_GPU_SCHED_STAT_ENODEV :
+		DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
 static void __guc_exec_queue_fini_async(struct work_struct *w)
--
2.44.0