From: Michel Thierry <michel.thierry@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
intel-gfx@lists.freedesktop.org, mika.kuoppala@linux.intel.com
Subject: Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
Date: Thu, 23 Feb 2017 13:21:03 -0800
Message-ID: <c26c590d-8ddd-8bea-7b8c-a23c26efe176@intel.com>
In-Reply-To: <20170223205754.GF19577@nuc-i3427.alporthouse.com>
On 23/02/17 12:57, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
>> *** General ***
>>
>> Watchdog timeout (or "media engine reset") is a feature that allows
>> userland applications to enable hang detection on individual batch buffers.
>> The detection mechanism itself is mostly bound to the hardware and the only
>> thing that the driver needs to do to support this form of hang detection
>> is to implement the interrupt handling support as well as watchdog command
>> emission before and after the emitted batch buffer start instruction in the
>> ring buffer.
>>
>> The principle of the hang detection mechanism is as follows:
>>
>> 1. Once the decision has been made to enable watchdog timeout for a
>> particular batch buffer and the driver is in the process of emitting the
>> batch buffer start instruction into the ring buffer it also emits a
>> watchdog timer start instruction before and a watchdog timer cancellation
>> instruction after the batch buffer start instruction in the ring buffer.
>>
>> 2. Once the GPU execution reaches the watchdog timer start instruction
>> the hardware watchdog counter is started by the hardware. The counter
>> keeps counting until either reaching a previously configured threshold
>> value or the timer cancellation instruction is executed.
>>
>> 2a. If the counter reaches the threshold value the hardware fires a
>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>> This means that a hang has been detected and the driver needs to deal with
>> it the same way it would deal with an engine hang detected by the periodic
>> hang checker. The only difference between the two is that we already blamed
>> the active request (to ensure an engine reset).
>>
>> 2b. If the batch buffer completes and the execution reaches the watchdog
>> cancellation instruction before the watchdog counter reaches its
>> threshold value the watchdog is cancelled and nothing more comes of it.
>> No hang is detected.
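
A minimal sketch of the counter behaviour described in steps 1 to 2b above, written as a stand-alone C model rather than driver code. The struct wd_model type and the wd_* helpers are hypothetical names invented for illustration; the real counter lives in hardware and is driven by the commands emitted into the ring buffer. Treating the timeout as a one-shot event is a simplifying assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the hardware watchdog counter. */
struct wd_model {
	uint32_t threshold;	/* previously configured timeout, in ticks */
	uint32_t count;		/* ticks elapsed since the timer was started */
	bool running;		/* true between start and cancel */
	bool irq_fired;		/* true once the threshold has been reached */
};

static void wd_init(struct wd_model *wd, uint32_t threshold)
{
	assert(threshold > 0);
	wd->threshold = threshold;
	wd->count = 0;
	wd->running = false;
	wd->irq_fired = false;
}

/* Step 2: the timer start instruction emitted before the batch start. */
static void wd_start(struct wd_model *wd)
{
	wd->count = 0;
	wd->irq_fired = false;
	wd->running = true;
}

/* Step 2b: the timer cancellation instruction emitted after the batch. */
static void wd_cancel(struct wd_model *wd)
{
	wd->running = false;
}

/* Advance time; step 2a fires the interrupt when the threshold is hit. */
static void wd_tick(struct wd_model *wd, uint32_t ticks)
{
	if (!wd->running || wd->irq_fired)
		return;

	if (ticks >= wd->threshold - wd->count) {
		wd->count = wd->threshold;	/* saturate at the threshold */
		wd->irq_fired = true;
	} else {
		wd->count += ticks;
	}
}

/*
 * Model one batch that needs batch_ticks to complete. Returns true if
 * the watchdog expired first. A batch that really hung would never
 * reach the cancel; the model calls it only to return to idle.
 */
static bool wd_run_batch(struct wd_model *wd, uint32_t batch_ticks)
{
	wd_start(wd);
	wd_tick(wd, batch_ticks);
	wd_cancel(wd);
	return wd->irq_fired;
}
```

With a threshold of 100 ticks, wd_run_batch() reports no hang for a 40-tick batch and reports a hang for a 250-tick batch.
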
>>
>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence prior to watchdog counter getting disabled,
>> resulting in watchdog being triggered following preemption. The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
>>
>> *** This patch introduces: ***
>>
>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>> based on hardware-driven hang detection, which then integrates directly
>> with the hang recovery path. This is independent of having per-engine reset
>> or just full gpu reset.
>>
>> 2. Watchdog specific register information.
>>
>> Currently the render engine and all available media engines support
>> watchdog timeout (VECS is only supported in GEN9). The specifications allude
>> to the BCS engine being supported, but this commit does not support it.
>>
>> Note that the value to stop the counter is different between render and
>> non-render engines.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Ian Lister <ian.lister@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 4 ++++
>> drivers/gpu/drm/i915/i915_irq.c | 31 ++++++++++++++++++++++++++++++-
>> drivers/gpu/drm/i915/i915_reg.h | 6 ++++++
>> drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
>> drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
>> 5 files changed, 65 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index eed9ead1b592..0e4f4cc3c6de 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
>> * recovery. All waiters on the reset_queue will be woken when
>> * that happens.
>> *
>> + * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
>> + * report the hang detection cause accurately.
>> + *
>> * This counter is used by the wait_seqno code to notice that reset
>> * event happened and it needs to restart the entire ioctl (since most
>> * likely the seqno it waited for won't ever signal anytime soon).
>> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>>
>> unsigned long flags;
>> #define I915_RESET_IN_PROGRESS 0
>> +#define I915_RESET_WATCHDOG 2 /* looking at the future */
>> #define I915_WEDGED (BITS_PER_LONG - 1)
>>
>> /**
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index bc70e2c451b2..4ef73363bbe9 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>> set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>> tasklet_hi_schedule(&engine->irq_tasklet);
>> }
>> +
>> + if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> + struct drm_i915_private *dev_priv = engine->i915;
>> + u32 watchdog_disable;
>> +
>> + if (engine->id == RCS)
>> + watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
>> + else
>> + watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
>> +
>> + /* Stop the counter to prevent further timeout interrupts */
>> + I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
>
> There's no guarantee you hold forcewake, you need to use I915_WRITE.
> Better yet would be to avoid having to wait for forcewake within the
> hardirq handler.
>
>> +
>> + /* Make sure the active request will be marked as guilty */
>> + engine->hangcheck.stalled = true;
>> + engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>
> Just set a flag saying the engine->hangcheck.watchdog = true. Don't
> confuse us. engine->hangcheck.seqno does not give the guilty seqno!
>
> Also there is no guarantee here that seqno is the guilty party. That's
> a nasty bug. Servicing the interrupt will be running in parallel with
> the GPU that may complete the request before we read the HWS.
>
> Please tell me we can use a PID along with the watchdog timer...
A 'watchdog' PID and 'running' PID in the HWSP would sound ok?
There's also the question of whether we want different thresholds per engine.
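
One possible reading of the 'watchdog' PID / 'running' PID idea, as a rough stand-alone C model. Everything here is hypothetical: struct hwsp_model, its two slots, and the helper names are not part of the posted patch or of any existing HWSP layout. The assumption is that the GPU would write an ID when a request starts executing and another when that request arms the watchdog, so that the interrupt handler can compare the two instead of trusting the current seqno.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status-page slots; neither exists in the posted patch. */
struct hwsp_model {
	uint32_t watchdog_id;	/* ID of the request that last armed the watchdog */
	uint32_t running_id;	/* ID of the request currently executing, 0 if idle */
};

/* Modeled GPU write at request start; arm says whether the watchdog is used. */
static void request_begin(struct hwsp_model *hwsp, uint32_t id, bool arm)
{
	assert(id != 0);	/* 0 is reserved to mean "idle" in this model */
	hwsp->running_id = id;
	if (arm)
		hwsp->watchdog_id = id;
}

/* Modeled GPU write when the request completes. */
static void request_end(struct hwsp_model *hwsp)
{
	hwsp->running_id = 0;
}

/*
 * Called from the modeled watchdog interrupt. Returns the ID to blame,
 * or 0 if the request that armed the watchdog is no longer the one
 * running (for example because it completed before the handler ran).
 */
static uint32_t watchdog_guilty_id(const struct hwsp_model *hwsp)
{
	if (hwsp->watchdog_id != 0 && hwsp->watchdog_id == hwsp->running_id)
		return hwsp->watchdog_id;
	return 0;
}
```

The model reads both slots at a single instant; a real handler would still have to consider the ordering of the two reads against GPU writes, so this only illustrates the comparison, not a complete fix for the race.
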