From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH v3] drm/i915/execlists: Reclaim the hanging virtual request
Date: Tue, 21 Jan 2020 13:55:29 +0000 [thread overview]
Message-ID: <524735a8-dc0c-fdfc-941a-5cc3afaac40e@linux.intel.com> (raw)
In-Reply-To: <20200121130411.267092-1-chris@chris-wilson.co.uk>
On 21/01/2020 13:04, Chris Wilson wrote:
> If we encounter a hang on a virtual engine, as we process the hang the
> request may already have been moved back to the virtual engine (we are
> processing the hang on the physical engine). We need to reclaim the
> request from the virtual engine so that the locking is consistent and
> local to the real engine on which we will hold the request for error
> state capturing.
>
> v2: Pull the reclamation into execlists_hold() and assert that cannot be
> called from outside of the reset (i.e. with the tasklet disabled).
> v3: Added selftest
>
> Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> Testcase: igt/gem_exec_balancer/hang
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_lrc.c | 29 +++++
> drivers/gpu/drm/i915/gt/selftest_lrc.c | 158 ++++++++++++++++++++++++-
> 2 files changed, 186 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 3a30767ff0c4..11dfee4fe9ea 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2396,6 +2396,35 @@ static void __execlists_hold(struct i915_request *rq)
> static void execlists_hold(struct intel_engine_cs *engine,
> struct i915_request *rq)
> {
> + if (rq->engine != engine) { /* preempted virtual engine */
> + struct virtual_engine *ve = to_virtual_engine(rq->engine);
> + unsigned long flags;
> +
> + /*
> + * intel_context_inflight() is only protected by virtue
> + * of process_csb() being called only by the tasklet (or
> + * directly from inside reset while the tasklet is suspended).
> + * Assert that neither of those are allowed to run while we
> + * poke at the request queues.
> + */
I was thinking of execlists_dequeue, but I forgot there is no direct
submission any more.
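For my own notes, the serialisation argument in the comment above can be modelled in a few lines: the tasklet keeps a disable count, and process_csb() only runs when that count is zero, so the reset path (which bumps the count) has the queues to itself. This is a hypothetical userspace sketch of that invariant, not the driver code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the tasklet disable count that serialises
 * process_csb() against the reset path. */
static atomic_int tasklet_disable_count;

static void tasklet_disable_model(void)
{
	atomic_fetch_add(&tasklet_disable_count, 1);
}

static void tasklet_enable_model(void)
{
	atomic_fetch_sub(&tasklet_disable_count, 1);
}

/* The tasklet body refuses to run while disabled, so whoever holds the
 * disable count (here, the reset path) gets exclusive access to the
 * request queues. */
static bool process_csb_may_run(void)
{
	return atomic_load(&tasklet_disable_count) == 0;
}
```

So the GEM_BUG_ON(!reset_in_progress(...)) below is asserting exactly that we are on the "count held" side of this model.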
> + GEM_BUG_ON(!reset_in_progress(&engine->execlists));
> +
> + /*
> + * An unsubmitted request along a virtual engine will
> + * remain on the active (this) engine until we are able
> + * to process the context switch away (and so mark the
> + * context as no longer in flight). That cannot have happened
> + * yet, otherwise we would not be hanging!
> + */
> + spin_lock_irqsave(&ve->base.active.lock, flags);
> + GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
> + GEM_BUG_ON(ve->request != rq);
> + ve->request = NULL;
> + spin_unlock_irqrestore(&ve->base.active.lock, flags);
> +
> + rq->engine = engine;
Let's see if I understand this... the tasklet has been disabled and the
ring paused, but we find an uncompleted request in the ELSP context with
rq->engine == virtual engine. Therefore this cannot be the first request
on this timeline but has to be a later one: either one which has been
put on the runqueue but not yet submitted to hw (because only one is in
flight at a time), or one which has already been unsubmitted by
__unwind_incomplete_requests. In the former case why move it to the
physical engine? And in the latter case as well, it would overwrite
rq->engine with the physical one.
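To check my reading of the hunk, here is a tiny userspace model of the reclaim it performs (hypothetical stand-in types, not the driver's): take the virtual engine's active lock, drop its reference to the hanging request, and repoint rq->engine at the physical engine that will hold it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct request;

/* Minimal stand-ins for the driver structures (hypothetical). */
struct engine {
	pthread_mutex_t active_lock;
	struct request *request; /* the ve->request slot on a virtual engine */
};

struct request {
	struct engine *engine; /* engine the request is currently tracked on */
};

/*
 * Model of the v3 hunk: if the hanging request was unwound back to the
 * virtual engine, take the ve lock, clear ve->request and move the
 * request onto the physical engine before holding it there.
 */
static void reclaim_virtual_request(struct engine *physical,
				    struct engine *ve,
				    struct request *rq)
{
	if (rq->engine != physical) { /* preempted virtual engine */
		pthread_mutex_lock(&ve->active_lock);
		assert(ve->request == rq);
		ve->request = NULL;
		pthread_mutex_unlock(&ve->active_lock);

		rq->engine = physical;
	}
}
```

If that matches the intent, my question above stands: in the "on the runqueue but never submitted" case, the same overwrite of rq->engine happens, and I am not sure that is wanted.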
> + }
> +
> spin_lock_irq(&engine->active.lock);
>
> /*
> diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> index b208c2176bbd..1beb47c3ba9e 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> @@ -335,7 +335,6 @@ static int live_hold_reset(void *arg)
>
> if (test_and_set_bit(I915_RESET_ENGINE + id,
> &gt->reset.flags)) {
> - spin_unlock_irq(&engine->active.lock);
> intel_gt_set_wedged(gt);
> err = -EBUSY;
> goto out;
> @@ -3411,6 +3410,162 @@ static int live_virtual_bond(void *arg)
> return 0;
> }
>
> +static int reset_virtual_engine(struct intel_gt *gt,
> + struct intel_engine_cs **siblings,
> + unsigned int nsibling)
> +{
> + struct intel_engine_cs *engine;
> + struct intel_context *ve;
> + unsigned long *heartbeat;
> + struct igt_spinner spin;
> + struct i915_request *rq;
> + unsigned int n;
> + int err = 0;
> +
> + /*
> + * In order to support offline error capture for fast preempt reset,
> + * we need to decouple the guilty request and ensure that it and its
> + * descendents are not executed while the capture is in progress.
> + */
> +
> + if (!intel_has_reset_engine(gt))
> + return 0;
> +
> + heartbeat = kmalloc_array(nsibling, sizeof(*heartbeat), GFP_KERNEL);
> + if (!heartbeat)
> + return -ENOMEM;
> +
> + if (igt_spinner_init(&spin, gt)) {
> + err = -ENOMEM;
> + goto out_heartbeat;
> + }
> +
> + ve = intel_execlists_create_virtual(siblings, nsibling);
> + if (IS_ERR(ve)) {
> + err = PTR_ERR(ve);
> + goto out_spin;
> + }
> +
> + for (n = 0; n < nsibling; n++)
> + engine_heartbeat_disable(siblings[n], &heartbeat[n]);
> +
> + rq = igt_spinner_create_request(&spin, ve, MI_ARB_CHECK);
> + if (IS_ERR(rq)) {
> + err = PTR_ERR(rq);
> + goto out;
> + }
> + i915_request_add(rq);
> +
> + if (!igt_wait_for_spinner(&spin, rq)) {
> + intel_gt_set_wedged(gt);
> + err = -ETIME;
> + goto out;
> + }
> +
> + engine = rq->engine;
> + GEM_BUG_ON(engine == ve->engine);
> +
> + /* Take ownership of the reset and tasklet */
> + if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
> + &gt->reset.flags)) {
> + intel_gt_set_wedged(gt);
> + err = -EBUSY;
> + goto out;
> + }
> + tasklet_disable(&engine->execlists.tasklet);
> +
> + engine->execlists.tasklet.func(engine->execlists.tasklet.data);
> + GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
> +
> + /* Fake a preemption event; failed of course */
> + spin_lock_irq(&engine->active.lock);
> + __unwind_incomplete_requests(engine);
> + spin_unlock_irq(&engine->active.lock);
> +
> + /* Reset the engine while keeping our active request on hold */
> + execlists_hold(engine, rq);
> + GEM_BUG_ON(!i915_request_on_hold(rq));
> +
> + intel_engine_reset(engine, NULL);
> + GEM_BUG_ON(rq->fence.error != -EIO);
> +
> + /* Release our grasp on the engine, letting CS flow again */
> + tasklet_enable(&engine->execlists.tasklet);
> + clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id, &gt->reset.flags);
> +
> + /* Check that we do not resubmit the held request */
> + i915_request_get(rq);
> + if (!i915_request_wait(rq, 0, HZ / 5)) {
> + pr_err("%s: on hold request completed!\n",
> + engine->name);
> + i915_request_put(rq);
> + err = -EIO;
> + goto out;
> + }
> + GEM_BUG_ON(!i915_request_on_hold(rq));
> +
> + /* But is resubmitted on release */
> + execlists_unhold(engine, rq);
> + if (i915_request_wait(rq, 0, HZ / 5) < 0) {
> + pr_err("%s: held request did not complete!\n",
> + engine->name);
> + intel_gt_set_wedged(gt);
> + err = -ETIME;
> + }
> + i915_request_put(rq);
> +
> +out:
> + for (n = 0; n < nsibling; n++)
> + engine_heartbeat_enable(siblings[n], heartbeat[n]);
> +
> + intel_context_put(ve);
> +out_spin:
> + igt_spinner_fini(&spin);
> +out_heartbeat:
> + kfree(heartbeat);
> + return err;
> +}
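The expectation the selftest encodes above is simple enough to state as a model: a request on hold must not be picked up by submission until it is released. A hypothetical sketch (not the driver's scheduler):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request with just the state the selftest cares about. */
struct req {
	bool on_hold;
	bool completed;
};

/* Sketch of the submission step: held requests stay off the runqueue,
 * so a held request must never complete. */
static void submit_ready(struct req *rq)
{
	if (!rq->on_hold)
		rq->completed = true;
}

static void hold(struct req *rq)
{
	rq->on_hold = true;
}

/* Release resubmits, mirroring execlists_unhold() kicking the tasklet. */
static void unhold(struct req *rq)
{
	rq->on_hold = false;
	submit_ready(rq);
}
```

That is exactly the pair of i915_request_wait() checks in the test: the first wait must time out while held, the second must complete after unhold.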
> +
> +static int live_virtual_reset(void *arg)
> +{
> + struct intel_gt *gt = arg;
> + struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
> + unsigned int class, inst;
> +
> + /*
> + * Check that the context image retains non-privileged (user) registers
> + * from one engine to the next. For this we check that the CS_GPR
> + * are preserved.
> + */
Zap.
> +
> + if (USES_GUC_SUBMISSION(gt->i915))
> + return 0;
Who will remember to turn these on hm...
> +
> + /* As we use CS_GPR we cannot run before they existed on all engines. */
Zap.
> + if (INTEL_GEN(gt->i915) < 9)
> + return 0;
> +
> + for (class = 0; class <= MAX_ENGINE_CLASS; class++) {
> + int nsibling, err;
> +
> + nsibling = 0;
> + for (inst = 0; inst <= MAX_ENGINE_INSTANCE; inst++) {
> + if (!gt->engine_class[class][inst])
> + continue;
> +
> + siblings[nsibling++] = gt->engine_class[class][inst];
> + }
> + if (nsibling < 2)
> + continue;
> +
> + err = reset_virtual_engine(gt, siblings, nsibling);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
> int intel_execlists_live_selftests(struct drm_i915_private *i915)
> {
> static const struct i915_subtest tests[] = {
> @@ -3436,6 +3591,7 @@ int intel_execlists_live_selftests(struct drm_i915_private *i915)
> SUBTEST(live_virtual_mask),
> SUBTEST(live_virtual_preserved),
> SUBTEST(live_virtual_bond),
> + SUBTEST(live_virtual_reset),
> };
>
> if (!HAS_EXECLISTS(i915))
>
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx