Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Carlos Santa <carlos.santa@intel.com>,
	Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org
Cc: Michel Thierry <michel.thierry@intel.com>
Subject: Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
Date: Mon, 11 Mar 2019 10:39:20 +0000
Message-ID: <62d65486-a45a-7cf9-1574-1c961be2ee9a@linux.intel.com>
In-Reply-To: <fe1aea1587da489e06d3c850891e3560c4b8d412.camel@intel.com>

On 08/03/2019 03:16, Carlos Santa wrote:
> On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
>>>
>> Quoting Carlos Santa (2019-02-21 02:58:16)
>>> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>> +{
>>> +       struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>>> +       struct drm_i915_private *dev_priv = engine->i915;
>>> +       unsigned int hung = 0;
>>> +       u32 current_seqno=0;
>>> +       char msg[80];
>>> +       unsigned int tmp;
>>> +       int len;
>>> +
>>> +       /* Stop the counter to prevent further timeout interrupts */
>>> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
>>> +
>>> +       /* Read the heartbeat seqno once again to check if we are stuck? */
>>> +       current_seqno = intel_engine_get_hangcheck_seqno(engine);
>>
>> I have said this before, but this doesn't exist either, it's just a
>> temporary glitch in the matrix.
>>
> 
> Chris, Tvrtko, I need some guidance on how to find the guilty seqno
> during a hang. Can you please advise on what to do here?

When the interrupt fires, you need to ascertain whether the request
that enabled the watchdog is still running, correct?

So I think you would need something like this, with the disclaimer that
I haven't really thought through the details:

1. Take a reference to the timeline HWSP when setting up the watchdog
for a request.

2. Store the initial seqno associated with this request.

3. Force enable user interrupts.

4. When the timeout fires, inspect the HWSP seqno to see whether the
request has completed.

5. Reset the engine if it has not completed.

6. Put the timeline/HWSP reference.

If the user interrupt fires with the request completed, cancel the
above operations.
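
To make the steps above a bit more concrete, here is a minimal
userspace sketch of the bookkeeping. It is not i915 code: struct
wd_state, wd_arm(), wd_user_interrupt() and wd_timeout() are
hypothetical names, a plain u32 stands in for the HWSP seqno slot, and
the reference counting and the actual engine reset are only marked in
comments. The seqno comparison uses the same wrap-safe idea as
i915_seqno_passed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "has seq1 reached seq2?" comparison. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical per-request watchdog state, not an i915 structure. */
struct wd_state {
	const volatile uint32_t *hwsp_seqno; /* step 1: HWSP slot we hold a reference on */
	uint32_t seqno;                      /* step 2: seqno of the watched request */
	bool armed;
};

enum wd_action { WD_NONE, WD_RESET_ENGINE };

/* Steps 1-3: remember where to look and what to look for. */
static void wd_arm(struct wd_state *wd, const volatile uint32_t *hwsp,
		   uint32_t seqno)
{
	assert(wd && hwsp);
	wd->hwsp_seqno = hwsp; /* a real driver would take the HWSP reference here */
	wd->seqno = seqno;
	wd->armed = true;      /* ... and force enable user interrupts */
}

/* User interrupt: if the request has completed, cancel the watchdog. */
static void wd_user_interrupt(struct wd_state *wd)
{
	if (wd->armed && seqno_passed(*wd->hwsp_seqno, wd->seqno))
		wd->armed = false; /* the HWSP reference would be put here */
}

/* Steps 4-6: on timeout, decide whether the engine needs a reset. */
static enum wd_action wd_timeout(struct wd_state *wd)
{
	enum wd_action action = WD_NONE;

	if (!wd->armed)
		return WD_NONE; /* already cancelled by the user interrupt */

	if (!seqno_passed(*wd->hwsp_seqno, wd->seqno))
		action = WD_RESET_ENGINE; /* step 5: request did not complete */

	wd->armed = false; /* step 6: the HWSP reference would be put here */
	return action;
}
```

Arming with seqno 10 while the HWSP still reads 9 and then calling
wd_timeout() yields WD_RESET_ENGINE. If the HWSP advances to 10 and
wd_user_interrupt() runs first, the later timeout is a no-op.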

There could be an inherent race between inspecting the seqno and
deciding to reset, since the request can still complete right after the
check. I'm not sure at the moment what to do about it. Maybe just call
it bad luck?

I also think that for the software implementation you need to disable
request coalescing for contexts with a timeout set, because you want
fully defined boundaries for request in and out, since the timeout is
defined per request.

In this case you don't need the user interrupt for the trailing-edge
signal and can use context complete instead. Maybe putting hooks into
context_in/out in intel_lrc.c would work under these circumstances.

Also, if the request is preempted, you need to cancel the timer and
store the elapsed execution time.
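
As a rough illustration of that accounting, here is a small
self-contained sketch of charging execution time against a per-request
budget across context-in and context-out (including a preemption in
between). Everything here is an assumption for illustration: struct
wd_budget and the two hooks are hypothetical, time is a plain
microsecond counter passed in by the caller, and arming or cancelling
the actual timer is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-request watchdog budget, not an i915 structure. */
struct wd_budget {
	uint64_t remaining_us; /* execution time the request may still use */
	uint64_t start_us;     /* timestamp of the last context-in */
	bool running;
};

/*
 * Context-in hook: the request starts (or resumes) executing.
 * Returns the timeout the caller would program into the timer.
 */
static uint64_t wd_context_in(struct wd_budget *b, uint64_t now_us)
{
	assert(b && !b->running);
	b->start_us = now_us;
	b->running = true;
	return b->remaining_us;
}

/*
 * Context-out hook: the request completed or was preempted. The caller
 * would cancel the timer here; we charge the elapsed time to the budget.
 * Returns true if the budget is exhausted.
 */
static bool wd_context_out(struct wd_budget *b, uint64_t now_us)
{
	uint64_t elapsed;

	assert(b && b->running);
	elapsed = now_us - b->start_us;
	b->remaining_us = elapsed >= b->remaining_us ?
			  0 : b->remaining_us - elapsed;
	b->running = false;
	return b->remaining_us == 0;
}
```

With a 1000us budget, running from t=100 to t=400 leaves 700us, so the
next context-in would arm the timer with 700us rather than the full
timeout.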

Or it may make sense to just disable preemption for these contexts. 
Otherwise there is no point in trying to mandate the timeout?

But that is also kind of bad, since non-privileged contexts could then
make themselves non-preemptable by setting the watchdog timeout.

Maybe as a compromise we need to automatically apply an elevated
priority level, but not so high as to be completely non-preemptable.
Sounds like a hard question.

Regards,

Tvrtko