public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Jeff McGee <jeff.mcgee@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere
Date: Wed, 13 Sep 2017 07:23:18 -0700	[thread overview]
Message-ID: <20170913142318.GF11255@jeffdesk> (raw)
In-Reply-To: <150473504045.7166.12400100881932844996@mail.alporthouse.com>

On Wed, Sep 06, 2017 at 10:57:20PM +0100, Chris Wilson wrote:
> Quoting Jeff McGee (2017-08-29 18:01:47)
> > On Tue, Aug 29, 2017 at 04:17:46PM +0100, Chris Wilson wrote:
> > > Quoting Jeff McGee (2017-08-29 16:04:17)
> > > > On Tue, Aug 29, 2017 at 10:07:18AM +0100, Chris Wilson wrote:
> > > > > Quoting Jeff McGee (2017-08-28 21:18:44)
> > > > > > On Mon, Aug 28, 2017 at 08:44:48PM +0100, Chris Wilson wrote:
> > > > > > > Quoting jeff.mcgee@intel.com (2017-08-28 20:25:30)
> > > > > > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > > > > > 
> > > > > > > > If someone else is resetting the engine we should clear our own bit as
> > > > > > > > part of skipping that engine. Otherwise we will later believe that it
> > > > > > > > has not been reset successfully and then trigger full gpu reset. If the
> > > > > > > > other guy's reset actually fails, he will trigger the full gpu reset.
> > > > > > > 
> > > > > > > The reason we did continue on to the global reset was to serialise
> > > > > > > i915_handle_error() with the other thread. Not a huge issue, but a
> > > > > > > reasonable property to keep -- and we definitely want a to explain why
> > > > > > > only one reset at a time is important.
> > > > > > > 
> > > > > > > bool intel_engine_lock_reset() {
> > > > > > >       if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > > > > > >                             &engine->i915->gpu_error.flags))
> > > > > > >               return true;
> > > > > > > 
> > > > > > >       intel_engine_wait_for_reset(engine);
> > > > > > The current code doesn't wait for the other thread to finish the reset, but
> > > > > > this would add that wait. 
> > > > > 
> > > > > Pardon? If we can't reset the engine, we go to the full reset which is
> > > > > serialised, both with individual engine resets and other globals.
> > > > > 
> > > > > > Did you intend that as an additional change to
> > > > > > the current code? I don't think it is necessary. Each thread wants to
> > > > > > reset some subset of engines, so it seems the thread can safely exit as soon
> > > > > > as it knows each of those engines has been reset or is being reset as part
> > > > > > of another thread that got the lock first. If any of the threads fail to
> > > > > > reset an engine they "own", then full gpu reset is assured.
> > > > > 
> > > > > It's unexpected for this function to return before the reset.
> > > > > -Chris
> > > > 
> > > > I'm a bit confused, so let's go back to the original code that I was trying
> > > > to fix:
> > > > 
> > > > 
> > > >         /*
> > > >          * Try engine reset when available. We fall back to full reset if
> > > >          * single reset fails.
> > > >          */
> > > >         if (intel_has_reset_engine(dev_priv)) {
> > > >                 for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> > > >                         BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
> > > >                         if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > > >                                              &dev_priv->gpu_error.flags))
> > > >                                 continue;
> > > > 
> > > >                         if (i915_reset_engine(engine, 0) == 0)
> > > >                                 engine_mask &= ~intel_engine_flag(engine);
> > > > 
> > > >                         clear_bit(I915_RESET_ENGINE + engine->id,
> > > >                                   &dev_priv->gpu_error.flags);
> > > >                         wake_up_bit(&dev_priv->gpu_error.flags,
> > > >                                     I915_RESET_ENGINE + engine->id);
> > > >                 }
> > > >         }
> > > > 
> > > >         if (!engine_mask)
> > > >                 goto out;
> > > > 
> > > >         /* Full reset needs the mutex, stop any other user trying to do so. */
> > > > 
> > > > Let's say that 2 threads are here intending to reset render. #1 gets the lock
> > > > and starts the render engine-only reset. #2 fails to get the lock which implies
> > > > that someone else is in the process of resetting the render engine (with single
> > > > engine reset or full gpu reset). #2 continues on without waiting but doesn't
> > > > clear the render bit in engine_mask. So #2 will proceed to initiate a full
> > > > gpu reset when it may not be necessary. That's the problem I was trying
> > > > to address with my initial patch. Do you agree that #2 must clear this bit
> > > > to avoid always triggering full gpu reset? If the engine-only reset done by
> > > > #1 fails, #1 will do the fallback to full gpu reset, so there is no risk that
> > > > we would miss the full gpu reset if it is really needed.
> > > > 
> > > > Then there is the question of whether #2 should wait around for the
> > > > render engine reset by #1 to complete. It doesn't in current code and I don't
> > > > see why it needs to. But that can be a separate discussion from the above.
> > > 
> > > It very much does in the current code. If we can not do the per-engine
> > > reset, it falls back to the global reset.
> > 
> > So are you saying that it is by design in this scenario that #2 will resort
> > to full gpu reset just because it wasn't the thread that actually performed
> > the engine reset, even though it can clearly infer based on the engine lock
> > being held that #1 is performing that reset for him?
> 
> Yes, that wait was intentional.
>  

So couldn't we preserve the wait without resorting to full GPU reset to do it?
It is the unnecessary triggering of full GPU reset that concerns me the most,
not that thread #2 has to wait.

I admit it's not a major concern at the moment because the code runs
concurrently only if debugfs wedged is invoked along with hangcheck (if I read
it correctly). I've added another delayed work that can run the code to do
"forced" preemption to clear the way for high-priority requests, and unnecessary
fallback to slow full GPU reset is a bad thing.

> > > The global reset is serialised
> > > with itself and the per-engine resets. Ergo it waits, and that was
> > > intentional.
> > > -Chris
> > 
> > Yes, the wait happens because #2 goes on to start full gpu reset which
> > requires all engine bits to be grabbed. My contention is that it should not
> > start full gpu reset.
> 
> And that I am not disputing. Just that returning before the reset is
> complete changes the current contract.
> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2017-09-13 14:26 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-28 19:25 [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere jeff.mcgee
2017-08-28 19:41 ` Michel Thierry
2017-08-28 19:46   ` Jeff McGee
2017-08-29 15:22     ` Chris Wilson
2017-08-28 19:44 ` Chris Wilson
2017-08-28 20:18   ` Jeff McGee
2017-08-29  9:07     ` Chris Wilson
2017-08-29 15:04       ` Jeff McGee
2017-08-29 15:17         ` Chris Wilson
2017-08-29 17:01           ` Jeff McGee
2017-09-06 21:57             ` Chris Wilson
2017-09-13 14:23               ` Jeff McGee [this message]
2017-08-28 19:48 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-08-28 20:59 ` ✗ Fi.CI.IGT: warning " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170913142318.GF11255@jeffdesk \
    --to=jeff.mcgee@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox