public inbox for intel-gfx@lists.freedesktop.org
From: Dave Gordon <david.s.gordon@intel.com>
To: intel-gfx@lists.freedesktop.org
Subject: Re: [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset.
Date: Thu, 18 Jun 2015 12:12:00 +0100	[thread overview]
Message-ID: <5582A780.7080506@intel.com> (raw)
In-Reply-To: <20150616135449.GE11933@nuc-i3427.alporthouse.com>

On 16/06/15 14:54, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
>> On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
>>> On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
>>>> In preparation for per-engine reset, add a way of setting context reset stats.
>>>>
>>>> OPEN QUESTIONS:
>>>> 1. How do we deal with get_reset_stats and the GL robustness interface when
>>>> introducing per-engine resets?
>>>>
>>>> 	a. Do we mark contexts that cause per-engine resets as guilty? If so, how
>>>> 	does this affect context banning?
>>>
>>> Yes. If the reset works quicker, then we can set a higher threshold for
>>> DoS detection, but we still do need DoS detection?
>>>  
>>>> 	b. Do we extend the publicly available reset stats to also contain
>>>> 	per-engine reset statistics? If so, would this break the ABI?
>>>
>>> No. The get_reset_stats is targeted at the GL API and describing it in
>>> terms of whether my context is guilty or has been affected. That is
>>> orthogonal to whether the reset was on a single ring or the entire GPU -
>>> the question is how broad we want the "affected" to be. Ideally a
>>> per-context reset wouldn't necessarily impact others, except for the
>>> surfaces shared between them...
>>
>> GL computes sharing sets itself; the kernel only tells it whether a given
>> context has been victimized, i.e. one of its batches was not properly
>> executed due to reset after a hang.
> 
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?
> -Chris

John Harrison & I discussed this yesterday; he's against doing so (even
though the scheduler is ideally placed to do it, if that were actually
the preferred policy). The primary argument (as I see it) is that you
don't, and can't, know the nature of an apparent dependency
between batches that share a buffer object. There are at least three cases:

1. "tightly-coupled": the dependent batch is going to rely on data
produced by the earlier batch. In this case, GIGO applies and the
results will be undefined, possibly including a further hang. Subsequent
batches presumably belong to the same or a closely-related
(co-operating) task, and killing them might be a reasonable strategy here.

2. "loosely-coupled": the dependent batch is going to access the data,
but not in any way that depends on the content (for example, blitting a
rectangle into a composition buffer). The result will be wrong, but only
in a limited way (e.g. the window belonging to the faulty application will
appear corrupted). The dependent batches may well belong to unrelated
system tasks (e.g. X or surfaceflinger) and killing them is probably not
justified.

3. "uncoupled": the dependent batch wants the /buffer/, not the data in
it (most likely a framebuffer or similar object). Any incorrect data in
the buffer is irrelevant. Killing off subsequent batches would be wrong.
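Purely as illustration (hypothetical names, not actual i915 code), the
three cases and the per-case kill decision could be sketched as:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of how a dependent batch relates to a
 * hung batch through a shared buffer object (illustrative only). */
enum dep_coupling {
	DEP_TIGHT,	/* consumes data produced by the hung batch */
	DEP_LOOSE,	/* uses the data, but content-independently */
	DEP_UNCOUPLED,	/* wants the buffer object, not its contents */
};

/* Only the tightly-coupled case even arguably justifies killing the
 * dependent batch; the loose and uncoupled cases never do. */
static bool may_kill_dependent(enum dep_coupling c)
{
	return c == DEP_TIGHT;
}
```

The catch, of course, is that the kernel cannot reliably tell which of
the three cases it is looking at.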

Buffer access mode (readonly, read/write, writeonly) might allow us to
distinguish these somewhat, but probably not enough to help make the
right decision. So the default must be *not* to kill off dependants
automatically, but if the failure does propagate in such a way as to
cause further consequent hangs, then the context-banning mechanism
should eventually catch and block all the downstream effects.
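A minimal sketch of that fallback, with an invented threshold and
struct (not the real i915 context-ban bookkeeping): each hang a context
causes is counted, and crossing the threshold bans it from submitting
further work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-context hang accounting (illustrative only; the
 * real driver's bookkeeping and threshold differ). */
#define HANG_BAN_THRESHOLD 3

struct ctx_hang_stats {
	int guilty_hangs;	/* hangs this context has caused */
	bool banned;		/* no further batches accepted */
};

static void note_guilty_hang(struct ctx_hang_stats *s)
{
	if (++s->guilty_hangs >= HANG_BAN_THRESHOLD)
		s->banned = true;
}
```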

.Dave.
