From: Michel Thierry <michel.thierry@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
Date: Wed, 19 Apr 2017 10:11:37 -0700
Message-ID: <a7ded571-a9da-3fea-1908-fbdda72db577@intel.com>
In-Reply-To: <20170419102026.GH9029@nuc-i3427.alporthouse.com>



On 19/04/17 03:20, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
>> *** General ***
>>
>> Watchdog timeout (or "media engine reset") is a feature that allows
>> userland applications to enable hang detection on individual batch buffers.
>> The detection mechanism itself is mostly bound to the hardware and the only
>> thing that the driver needs to do to support this form of hang detection
>> is to implement the interrupt handling support as well as watchdog command
>> emission before and after the emitted batch buffer start instruction in the
>> ring buffer.
>>
>> The principle of the hang detection mechanism is as follows:
>>
>> 1. Once the decision has been made to enable watchdog timeout for a
>> particular batch buffer and the driver is in the process of emitting the
>> batch buffer start instruction into the ring buffer it also emits a
>> watchdog timer start instruction before and a watchdog timer cancellation
>> instruction after the batch buffer start instruction in the ring buffer.
>>
>> 2. Once the GPU execution reaches the watchdog timer start instruction
>> the hardware watchdog counter is started by the hardware. The counter
>> keeps counting until either reaching a previously configured threshold
>> value or the timer cancellation instruction is executed.
>>
>> 2a. If the counter reaches the threshold value the hardware fires a
>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>> This means that a hang has been detected and the driver needs to deal with
>> it the same way it would deal with an engine hang detected by the periodic
>> hang checker. The only difference between the two is that we already blamed
>> the active request (to ensure an engine reset).
>>
>> 2b. If the batch buffer completes and the execution reaches the watchdog
>> cancellation instruction before the watchdog counter reaches its
>> threshold value the watchdog is cancelled and nothing more comes of it.
>> No hang is detected.
>>
>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence prior to watchdog counter getting disabled,
>> resulting in watchdog being triggered following preemption. The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
>
> Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
> preemption on for the user buffers as it just causes hassle if we allow
> preemption in our preamble + breadcrumb. (And there's little point in
> preempting in the flushes.)
>

Mid-batch?
The watchdog counter is not aware of MI_ARB_ON_OFF (or any other command)
and would keep running until it expires. We could call emit_stop_watchdog
unconditionally to prevent this.
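
To illustrate the point, here is a minimal software model, not driver code.
Every name in it (wd_model, wd_start, wd_stop, wd_set_arb, wd_tick,
wd_simulate_preemption) is hypothetical, and the tick counts are arbitrary.
It only assumes what is stated above: the counter ignores the arbitration
state and stops only when it is explicitly cancelled or when it expires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the hardware watchdog counter. */
struct wd_model {
	bool running;
	bool arb_enabled;	/* models MI_ARB_ON_OFF; the counter never reads it */
	uint32_t count;
	uint32_t threshold;
	bool expired;
};

static void wd_start(struct wd_model *wd, uint32_t threshold)
{
	assert(threshold > 0);
	wd->running = true;
	wd->count = 0;
	wd->threshold = threshold;
	wd->expired = false;
}

/* Stands in for an unconditional stop emitted in the preemption path. */
static void wd_stop(struct wd_model *wd)
{
	wd->running = false;
}

static void wd_set_arb(struct wd_model *wd, bool on)
{
	wd->arb_enabled = on;	/* deliberately has no effect on the counter */
}

static void wd_tick(struct wd_model *wd, uint32_t ticks)
{
	if (!wd->running)
		return;

	wd->count += ticks;
	if (wd->count >= wd->threshold) {
		wd->expired = true;	/* the watchdog interrupt would fire here */
		wd->running = false;
	}
}

/* Returns true if the watchdog fired while the batch was preempted. */
static bool wd_simulate_preemption(bool stop_on_preempt)
{
	struct wd_model wd = {0};

	wd_start(&wd, 1000);
	wd_tick(&wd, 400);		/* batch runs for a while */
	wd_set_arb(&wd, false);		/* arbitration off: counter keeps going */
	if (stop_on_preempt)
		wd_stop(&wd);
	wd_tick(&wd, 800);		/* time passes while preempted */

	return wd.expired;
}
```

In this model wd_simulate_preemption(false) reports an expiry even though
arbitration was turned off, while wd_simulate_preemption(true) does not.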

>> *** This patch introduces: ***
>>
>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>> based on hardware-driven hang detection, which then integrates directly
>> with the hang recovery path. This is independent of having per-engine reset
>> or just full gpu reset.
>>
>> 2. Watchdog specific register information.
>>
>> Currently the render engine and all available media engines support
>> watchdog timeout (VECS is only supported in GEN9). The specifications allude
>> to the BCS engine being supported but that is currently not supported by
>> this commit.
>>
>> Note that the value to stop the counter is different between render and
>> non-render engines in GEN8; from GEN9 onwards it's the same.
>
> Should mention the choice to piggyback the current hangcheck + capture
> scheme.
>
>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> +		tasklet_schedule(&engine->watchdog_tasklet);
>> +	}
>
> Kill unwanted braces.
>
>> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
>> +static void gen8_watchdog_irq_handler(unsigned long data)
>> +{
>> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	u32 current_seqno;
>> +
>> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
>> +
>> +	/* Stop the counter to prevent further timeout interrupts */
>> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
>> +
>> +	current_seqno = intel_engine_get_seqno(engine);
>> +
>> +	/* did the request complete after the timer expired? */
>> +	if (intel_engine_last_submit(engine) == current_seqno)
>> +		goto fw_put;
>> +
>> +	if (engine->hangcheck.watchdog == current_seqno) {
>> +		/* Make sure the active request will be marked as guilty */
>> +		engine->hangcheck.stalled = true;
>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>
> Use current_seqno again. intel_engine_get_seqno() may have just changed.
> -Chris
>
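
For reference, a self-contained sketch of the seqno handling with the
comment above applied: the seqno is sampled once and that sample is reused
for both comparisons and for hangcheck.seqno. This is a model, not the
patch. engine_model, model_get_seqno, model_watchdog_irq, the hw_step and
reads fields and the wd_result values are all hypothetical stand-ins, and
the branch that the quoted hunk does not show is left unmodelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the engine state the quoted handler touches. */
struct engine_model {
	uint32_t hw_seqno;	/* value a seqno read would return */
	uint32_t hw_step;	/* how far hw_seqno advances after each read */
	unsigned int reads;	/* number of seqno reads performed */
	uint32_t last_submit;	/* last seqno submitted to the engine */
	struct {
		uint32_t watchdog;
		uint32_t seqno;
		bool stalled;
	} hangcheck;
};

/* Models a seqno read that can race with the GPU making progress. */
static uint32_t model_get_seqno(struct engine_model *e)
{
	uint32_t v = e->hw_seqno;

	e->reads++;
	e->hw_seqno += e->hw_step;
	return v;
}

enum wd_result { WD_COMPLETED, WD_GUILTY, WD_UNMODELLED };

static enum wd_result model_watchdog_irq(struct engine_model *e)
{
	/* Sample once; every later decision uses this same value. */
	const uint32_t current_seqno = model_get_seqno(e);

	/* Did the request complete after the timer expired? */
	if (e->last_submit == current_seqno)
		return WD_COMPLETED;

	if (e->hangcheck.watchdog == current_seqno) {
		/* Make sure the active request will be marked as guilty. */
		e->hangcheck.stalled = true;
		e->hangcheck.seqno = current_seqno;	/* not a second read */
		return WD_GUILTY;
	}

	/* The quoted excerpt ends before this branch; intentionally empty. */
	return WD_UNMODELLED;
}
```

With hw_step set to a non-zero value, a second read would return a
different seqno than the one that was compared, which is the mismatch the
single sample avoids.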