From: Jeff McGee <jeff.mcgee@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere
Date: Wed, 13 Sep 2017 07:23:18 -0700 [thread overview]
Message-ID: <20170913142318.GF11255@jeffdesk> (raw)
In-Reply-To: <150473504045.7166.12400100881932844996@mail.alporthouse.com>
On Wed, Sep 06, 2017 at 10:57:20PM +0100, Chris Wilson wrote:
> Quoting Jeff McGee (2017-08-29 18:01:47)
> > On Tue, Aug 29, 2017 at 04:17:46PM +0100, Chris Wilson wrote:
> > > Quoting Jeff McGee (2017-08-29 16:04:17)
> > > > On Tue, Aug 29, 2017 at 10:07:18AM +0100, Chris Wilson wrote:
> > > > > Quoting Jeff McGee (2017-08-28 21:18:44)
> > > > > > On Mon, Aug 28, 2017 at 08:44:48PM +0100, Chris Wilson wrote:
> > > > > > > Quoting jeff.mcgee@intel.com (2017-08-28 20:25:30)
> > > > > > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > > > > >
> > > > > > > > If someone else is resetting the engine we should clear our own bit as
> > > > > > > > part of skipping that engine. Otherwise we will later believe that it
> > > > > > > > has not been reset successfully and then trigger full gpu reset. If the
> > > > > > > > other guy's reset actually fails, he will trigger the full gpu reset.
> > > > > > >
> > > > > > > The reason we did continue on to the global reset was to serialise
> > > > > > > i915_handle_error() with the other thread. Not a huge issue, but a
> > > > > > > reasonable property to keep -- and we definitely want a to explain why
> > > > > > > only one reset at a time is important.
> > > > > > >
> > > > > > > bool intel_engine_lock_reset() {
> > > > > > > if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > > > > > > &engine->i915->gpu_error.flags))
> > > > > > > return true;
> > > > > > >
> > > > > > > intel_engine_wait_for_reset(engine);
> > > > > > The current code doesn't wait for the other thread to finish the reset, but
> > > > > > this would add that wait.
> > > > >
> > > > > Pardon? If we can't reset the engine, we go to the full reset which is
> > > > > serialised, both with individual engine resets and other globals.
> > > > >
> > > > > > Did you intend that as an additional change to
> > > > > > the current code? I don't think it is necessary. Each thread wants to
> > > > > > reset some subset of engines, so it seems the thread can safely exit as soon
> > > > > > as it knows each of those engines has been reset or is being reset as part
> > > > > > of another thread that got the lock first. If any of the threads fail to
> > > > > > reset an engine they "own", then full gpu reset is assured.
> > > > >
> > > > > It's unexpected for this function to return before the reset.
> > > > > -Chris
> > > >
> > > > I'm a bit confused, so let's go back to the original code that I was trying
> > > > to fix:
> > > >
> > > >
> > > > /*
> > > > * Try engine reset when available. We fall back to full reset if
> > > > * single reset fails.
> > > > */
> > > > if (intel_has_reset_engine(dev_priv)) {
> > > > for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> > > > BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
> > > > if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > > > &dev_priv->gpu_error.flags))
> > > > continue;
> > > >
> > > > if (i915_reset_engine(engine, 0) == 0)
> > > > engine_mask &= ~intel_engine_flag(engine);
> > > >
> > > > clear_bit(I915_RESET_ENGINE + engine->id,
> > > > &dev_priv->gpu_error.flags);
> > > > wake_up_bit(&dev_priv->gpu_error.flags,
> > > > I915_RESET_ENGINE + engine->id);
> > > > }
> > > > }
> > > >
> > > > if (!engine_mask)
> > > > goto out;
> > > >
> > > > /* Full reset needs the mutex, stop any other user trying to do so. */
> > > >
> > > > Let's say that 2 threads are here intending to reset render. #1 gets the lock
> > > > and starts the render engine-only reset. #2 fails to get the lock which implies
> > > > that someone else is in the process of resetting the render engine (with single
> > > > engine reset or full gpu reset). #2 continues on without waiting but doesn't
> > > > clear the render bit in engine_mask. So #2 will proceed to initiate a full
> > > > gpu reset when it may not be necessary. That's the problem I was trying
> > > > to address with my initial patch. Do you agree that #2 must clear this bit
> > > > to avoid always triggering full gpu reset? If the engine-only reset done by
> > > > #1 fails, #1 will do the fallback to full gpu reset, so there is no risk that
> > > > we would miss the full gpu reset if it is really needed.
> > > >
> > > > Then there is the question of whether #2 should wait around for the
> > > > render engine reset by #1 to complete. It doesn't in current code and I don't
> > > > see why it needs to. But that can be a separate discussion from the above.
> > >
> > > It very much does in the current code. If we can not do the per-engine
> > > reset, it falls back to the global reset.
> >
> > So are you saying that it is by design in this scenario that #2 will resort
> > to full gpu reset just because it wasn't the thread that actually performed
> > the engine reset, even though it can clearly infer based on the engine lock
> > being held that #1 is performing that reset for him?
>
> Yes, that wait was intentional.
>
So couldn't we preserve the wait without resorting to full GPU reset to do it?
It is the unnecessary triggering of full GPU reset that concerns me the most,
not that thread #2 has to wait.
I admit it's not a major concern at the moment because the code runs
concurrently only if debugfs wedged is invoked along with hangcheck (if I read
it correctly). I've added another delayed work that can run the code to do
"forced" preemption to clear the way for high-priority requests, and unnecessary
fallback to slow full GPU reset is a bad thing.
> > > The global reset is serialised
> > > with itself and the per-engine resets. Ergo it waits, and that was
> > > intentional.
> > > -Chris
> >
> > Yes, the wait happens because #2 goes on to start full gpu reset which
> > requires all engine bits to be grabbed. My contention is that it should not
> > start full gpu reset.
>
> And that I am not disputing. Just that returning before the reset is
> complete changes the current contract.
> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2017-09-13 14:26 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-28 19:25 [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere jeff.mcgee
2017-08-28 19:41 ` Michel Thierry
2017-08-28 19:46 ` Jeff McGee
2017-08-29 15:22 ` Chris Wilson
2017-08-28 19:44 ` Chris Wilson
2017-08-28 20:18 ` Jeff McGee
2017-08-29 9:07 ` Chris Wilson
2017-08-29 15:04 ` Jeff McGee
2017-08-29 15:17 ` Chris Wilson
2017-08-29 17:01 ` Jeff McGee
2017-09-06 21:57 ` Chris Wilson
2017-09-13 14:23 ` Jeff McGee [this message]
2017-08-28 19:48 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-08-28 20:59 ` ✗ Fi.CI.IGT: warning " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170913142318.GF11255@jeffdesk \
--to=jeff.mcgee@intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.