All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period
Date: Tue, 15 Nov 2016 19:23:26 +0200	[thread overview]
Message-ID: <87bmxgwtbl.fsf@gaia.fi.intel.com> (raw)
In-Reply-To: <20161115163856.GD30410@nuc-i3427.alporthouse.com>

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Nov 15, 2016 at 04:36:32PM +0200, Mika Kuoppala wrote:
>> +static bool
>> +hangcheck_engine_stall(struct intel_engine_cs *engine,
>> +		       struct intel_engine_hangcheck *hc)
>> +{
>> +	const unsigned long last_action = engine->hangcheck.action_timestamp;
>>  
>> -		memset(&engine->hangcheck.instdone, 0,
>> -		       sizeof(engine->hangcheck.instdone));
>> -		break;
>> +	if (hc->action == HANGCHECK_ACTIVE_SEQNO ||
>> +	    hc->action == HANGCHECK_IDLE)
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_HUNG_JIFFIES))
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_STUCK_JIFFIES))
>> +		if (hc->action != HANGCHECK_HUNG)
>> +			return false;
>
> This deserves a lot more explanation. Why two values? What are their
> significance?
>

#define DRM_I915_STUCK_PERIOD_SEC 24 /* No observed seqno progress */
#define DRM_I915_HUNG_PERIOD_SEC 4 /* No observed seqno nor head progress */

We allow bigger timeout if head is moving inside a request. THus two
values. I tried to mimic the behaviour from old code.

> Do we want to be using jiffies? (Values are in seconds, so jiffie
> resoluation makes sense, but add that as a comment somewhere.)

The defines as seconds and the subsequent jiffie counterparts in
i915_drv.h not enough?

>
>> +	return true;
>> +}
>> +
>> +static struct intel_engine_cs *find_lra_engine(struct drm_i915_private *i915,
>> +					       const unsigned int mask)
>
> What is lra?
>

Least recently active. I will opencode the name.

>> +{
>> +	struct intel_engine_cs *engine = NULL, *c;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine_masked(c, i915, mask, id) {
>> +		if (engine == NULL) {
>> +			engine = c;
>> +			continue;
>> +		}
>> +
>> +		if (time_before(c->hangcheck.action_timestamp,
>> +				engine->hangcheck.action_timestamp))
>> +			engine = c;
>> +		else if (c->hangcheck.action_timestamp ==
>> +			 engine->hangcheck.action_timestamp &&
>> +			 c->hangcheck.seqno < engine->hangcheck.seqno)
>> +			engine = c;
>
> Why is the last engine significant?

We blame the engine which had work and was inactive for the
longest amount of time. 

> Why is just engine guilty?

Just one? As the stamps are valid cross reset, the next
advacement after this hang will make progress show
on this engine, and the next engine will get caught.

The hangcheck action is a filter to weed out those engines
that can't possibly be culprit for hang.

So we aim to pinpoint only one engine with accuracy and
reset that to lean towards per engine resets.

-Mika

>
>
> Too many whys in this one.
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2016-11-15 17:24 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-15 14:36 [PATCH 1/6] drm/i915: Split up hangcheck phases Mika Kuoppala
2016-11-15 14:36 ` [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period Mika Kuoppala
2016-11-15 16:38   ` Chris Wilson
2016-11-15 17:23     ` Mika Kuoppala [this message]
2016-11-15 21:32       ` Chris Wilson
2016-11-16  8:07   ` Chris Wilson
2016-11-15 14:36 ` [PATCH 3/6] drm/i915: Use request retirement as context progress Mika Kuoppala
2016-11-15 16:34   ` Chris Wilson
2016-11-15 14:36 ` [PATCH 4/6] drm/i915: Add bannable context parameter Mika Kuoppala
2016-11-15 16:31   ` Chris Wilson
2016-11-15 14:36 ` [PATCH 5/6] drm/i915: Add per client max context ban limit Mika Kuoppala
2016-11-15 16:27   ` Chris Wilson
2016-11-15 17:12     ` Mika Kuoppala
2016-11-15 14:36 ` [PATCH 6/6] drm/i915: Wipe hang stats as an embedded struct Mika Kuoppala
2016-11-15 15:45 ` ✓ Fi.CI.BAT: success for series starting with [1/6] drm/i915: Split up hangcheck phases Patchwork
2016-11-15 16:47   ` Chris Wilson
2016-11-15 17:27     ` Mika Kuoppala
2016-11-15 16:39 ` [PATCH 1/6] " Chris Wilson
  -- strict thread matches above, loose matches on Subject: below --
2016-11-16 15:20 Mika Kuoppala
2016-11-16 15:20 ` [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period Mika Kuoppala
2016-11-16 17:05   ` Chris Wilson
2016-11-18 13:09   ` Mika Kuoppala

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bmxgwtbl.fsf@gaia.fi.intel.com \
    --to=mika.kuoppala@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.