Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
Date: Tue, 3 Jan 2017 13:17:19 +0000	[thread overview]
Message-ID: <1943b5da-fd46-9fce-d79b-0ba28ba89bd8@linux.intel.com> (raw)
In-Reply-To: <20170103123824.GL16295@nuc-i3427.alporthouse.com>


On 03/01/2017 12:38, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 12:13, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:46, Chris Wilson wrote:
>>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>>>> whether or not the request completed successfully, or in our case if we
>>>>>>> detected a hang during the request (signaled via -EIO in
>>>>>>> SYNC_IOC_FILE_INFO).
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>>>> 		ring_hung = false;
>>>>>>> 	}
>>>>>>>
>>>>>>> -	if (ring_hung)
>>>>>>> +	if (ring_hung) {
>>>>>>> 		i915_gem_context_mark_guilty(request->ctx);
>>>>>>> -	else
>>>>>>> +		request->fence.status = -EIO;
>>>>>>> +	} else {
>>>>>>> 		i915_gem_context_mark_innocent(request->ctx);
>>>>>>> +	}
>>>>>>>
>>>>>>> 	if (!ring_hung)
>>>>>>> 		return;
>>>>>>>
>>>>>>
>>>>>> Reading what happens later in this function, should we set the
>>>>>> status of all the other requests we are about to clear?
>>>>>>
>>>>>> However one thing I don't understand is how this scheme interacts
>>>>>> with the current userspace. We will clear/no-nop some of the
>>>>>> submitted requests since the state is corrupt. But how will
>>>>>> userspace notice this before it submits more requets?
>>>>>
>>>>> There is no mechanism currently for user space to be able to detect a
>>>>> hung request. (It can use the uevent for async notification of the
>>>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>>>> can track the number of hangs it caused, but the delay makes any
>>>>> roundtripping impractical (i.e. you have to synchronise before all
>>>>> rendering if you must detect the event immediately). Note also that we
>>>>> do not want to give out interprocess information (i.e. to allow one
>>>>> client to spy on another), which makes things harder to get right.
>>>>
>>>> So idea is to clear already submitted requests _if_ the userspace is
>>>> synchronising before all rendering and looking at reset stats, to
>>>> make it theoretically possible to detect the corrupt state?
>>>
>>> No, I'm just don't see a way that userspace can detect the hang without
>>> testing after seeing the request signaled (either by waiting on the
>>> batch or by waiting on the fence), i.e. by being completely synchronous
>>> (or at least chosing its synchronous points very carefully, such as
>>> around IPC). It can either poll reset-count or use sync_file (which
>>> requires fence exporting).
>>>
>>> The current robustness interfaces is a basic query on whether any reset
>>> occurred within the context, not when.
>>
>> Why do we bother with clearing the submitted requests then?
>
> The same reason we ban processes from submitting new requests if they
> cause repeated hangs. If before we ban that client, it has already
> submitted 1000 hanging requests, it has successfully locked the machine
> up for a couple of hours.

So we would need to gate clearing on the transition to banned state I 
think. Because currently it does in unconditionally.

Regards,

Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2017-01-03 13:17 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson
2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
2017-01-03 11:34   ` Tvrtko Ursulin
2017-01-03 11:46     ` [Intel-gfx] " Chris Wilson
2017-01-03 11:57       ` Tvrtko Ursulin
2017-01-03 12:13         ` Chris Wilson
2017-01-03 12:34           ` [Intel-gfx] " Tvrtko Ursulin
2017-01-03 12:38             ` Chris Wilson
2017-01-03 13:17               ` Tvrtko Ursulin [this message]
2017-01-03 13:25                 ` Chris Wilson
2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork
2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin
2017-01-04  9:15   ` [Intel-gfx] " Daniel Vetter
2017-01-04  9:24     ` Chris Wilson
2017-01-04  9:37       ` [Intel-gfx] " Daniel Vetter
2017-01-04  9:43         ` Chris Wilson
2017-01-04 10:18           ` Daniel Vetter
2017-01-04 10:26             ` Chris Wilson
2017-01-04 11:31               ` [Intel-gfx] " Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1943b5da-fd46-9fce-d79b-0ba28ba89bd8@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox