From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Michael Cheng <michael.cheng@intel.com>,
Alan Previn <alan.previn.teres.alexis@intel.com>,
intel-gfx@lists.freedesktop.org,
Lucas De Marchi <lucas.demarchi@intel.com>,
Chris Wilson <chris@chris-wilson.co.uk>,
DRI-Devel@Lists.FreeDesktop.Org,
Andrzej Hajda <andrzej.hajda@intel.com>,
Rodrigo Vivi <rodrigo.vivi@intel.com>,
Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>,
Matthew Auld <matthew.auld@intel.com>
Subject: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
Date: Tue, 17 Jan 2023 13:36:26 -0800 [thread overview]
Message-ID: <20230117213630.2897570-2-John.C.Harrison@Intel.com> (raw)
In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com>
From: John Harrison <John.C.Harrison@Intel.com>
When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.
The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.
The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updaing as well as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.
In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
drivers/gpu/drm/i915/gt/intel_context.c | 1 +
drivers/gpu/drm/i915/gt/intel_engine_cs.c | 7 ++++++-
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
drivers/gpu/drm/i915/i915_gpu_error.c | 5 ++---
4 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
active = rq;
}
+ active = i915_request_get_rcu(active);
spin_unlock_irqrestore(&parent->guc_state.lock, flags);
return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
guc = intel_uc_uses_guc_submission(&engine->gt->uc);
if (guc) {
ce = intel_engine_get_hung_context(engine);
- if (ce)
+ if (ce) {
+ /* This will reference count the request (if found) */
hung_rq = intel_context_find_active_request(ce);
+ }
} else {
hung_rq = intel_engine_execlist_find_hung_request(engine);
+ hung_rq = i915_request_get_rcu(hung_rq);
}
if (hung_rq)
@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
else
intel_engine_dump_active_requests(&engine->sched_engine->requests,
hung_rq, m);
+ if (hung_rq)
+ i915_request_put(hung_rq);
}
void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
xa_lock_irqsave(&guc->context_lookup, flags);
xa_for_each(&guc->context_lookup, index, ce) {
+ bool found;
+
if (!kref_get_unless_zero(&ce->ref))
continue;
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
goto next;
}
+ found = false;
+ spin_lock(&ce->guc_state.lock);
list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
continue;
+ found = true;
+ break;
+ }
+ spin_unlock(&ce->guc_state.lock);
+
+ if (found) {
intel_engine_set_hung_context(engine, ce);
/* Can only cope with one hang at a time... */
@@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
xa_lock(&guc->context_lookup);
goto done;
}
+
next:
intel_context_put(ce);
xa_lock(&guc->context_lookup);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..4107a0dfcca7d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
ce = intel_engine_get_hung_context(engine);
if (ce) {
intel_engine_clear_hung_context(engine);
+ /* This will reference count the request (if found) */
rq = intel_context_find_active_request(ce);
if (!rq || !i915_request_started(rq))
goto no_request_capture;
@@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
spin_lock_irqsave(&engine->sched_engine->lock, flags);
rq = intel_engine_execlist_find_hung_request(engine);
+ rq = i915_request_get_rcu(rq);
spin_unlock_irqrestore(&engine->sched_engine->lock,
flags);
}
}
- if (rq)
- rq = i915_request_get_rcu(rq);
-
if (!rq)
goto no_request_capture;
--
2.39.0
next prev parent reply other threads:[~2023-01-17 21:36 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-17 21:36 [Intel-gfx] [PATCH v2 0/5] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-17 21:36 ` John.C.Harrison [this message]
2023-01-18 8:29 ` [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump Andy Shevchenko
2023-01-18 17:34 ` John Harrison
2023-01-18 17:54 ` Andy Shevchenko
2023-01-18 18:18 ` John Harrison
2023-01-18 16:22 ` Tvrtko Ursulin
2023-01-18 17:55 ` John Harrison
2023-01-17 21:36 ` [Intel-gfx] [PATCH v2 2/5] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-18 16:34 ` Tvrtko Ursulin
2023-01-17 21:36 ` [Intel-gfx] [PATCH v2 3/5] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-18 16:35 ` Tvrtko Ursulin
2023-01-17 21:36 ` [Intel-gfx] [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-18 16:37 ` Tvrtko Ursulin
2023-01-17 21:36 ` [Intel-gfx] [PATCH v2 5/5] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-17 22:55 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Allow error capture without a request / on reset failure (rev3) Patchwork
2023-01-17 22:55 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-01-17 23:05 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230117213630.2897570-2-John.C.Harrison@Intel.com \
--to=john.c.harrison@intel.com \
--cc=DRI-Devel@Lists.FreeDesktop.Org \
--cc=Intel-GFX@Lists.FreeDesktop.Org \
--cc=alan.previn.teres.alexis@intel.com \
--cc=andriy.shevchenko@linux.intel.com \
--cc=andrzej.hajda@intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=lucas.demarchi@intel.com \
--cc=matthew.auld@intel.com \
--cc=michael.cheng@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=tejaskumarx.surendrakumar.upadhyay@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox