public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Carlos Santa <carlos.santa@intel.com>,
	Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org
Cc: Michel Thierry <michel.thierry@intel.com>
Subject: Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
Date: Tue, 19 Mar 2019 12:46:06 +0000	[thread overview]
Message-ID: <617e8e5f-ca57-ef8f-e085-79dbf70dcb2d@linux.intel.com> (raw)
In-Reply-To: <25e7c81e-18fc-0676-dd27-d0a7823aaa41@linux.intel.com>


On 19/03/2019 12:39, Tvrtko Ursulin wrote:
> 
> On 18/03/2019 00:15, Carlos Santa wrote:
>> On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
>>> On 08/03/2019 03:16, Carlos Santa wrote:
>>>> On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
>>>>>>
>>>>>
>>>>> Quoting Carlos Santa (2019-02-21 02:58:16)
>>>>>> +#define GEN8_WATCHDOG_1000US(dev_priv)
>>>>>> watchdog_to_clock_counts(dev_priv, 1000)
>>>>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>>>>> +{
>>>>>> +       struct intel_engine_cs *engine = (struct
>>>>>> intel_engine_cs
>>>>>> *)data;
>>>>>> +       struct drm_i915_private *dev_priv = engine->i915;
>>>>>> +       unsigned int hung = 0;
>>>>>> +       u32 current_seqno=0;
>>>>>> +       char msg[80];
>>>>>> +       unsigned int tmp;
>>>>>> +       int len;
>>>>>> +
>>>>>> +       /* Stop the counter to prevent further timeout
>>>>>> interrupts
>>>>>> */
>>>>>> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
>>>>>> get_watchdog_disable(engine));
>>>>>> +
>>>>>> +       /* Read the heartbeat seqno once again to check if we
>>>>>> are
>>>>>> stuck? */
>>>>>> +       current_seqno =
>>>>>> intel_engine_get_hangcheck_seqno(engine);
>>>>>
>>>>> I have said this before, but this doesn't exist either, it's just
>>>>> a
>>>>> temporary glitch in the matrix.
>>>>>
>>>>
>>>> Chris, Tvrtko, I need some guidance on how to find the quilty seqno
>>>> during a hang, can you please advice here what to do?
>>>
>>> When an interrupt fires you need to ascertain whether the same
>>> request
>>> which enabled the watchdog is running, correct?
>>>
>>> So I think you would need this, with a disclaimer that I haven't
>>> thought
>>> about the details really:
>>>
>>> 1. Take a reference to timeline hwsp when setting up the watchdog for
>>> a
>>> request.
>>>
>>> 2. Store the initial seqno associated with this request.
>>>
>>> 3. Force enable user interrupts.
>>>
>>> 4. When timeout fires, inspect the HWSP seqno to see if the request
>>> completed or not.
>>>
>>> 5. Reset the engine if not completed.
>>>
>>> 6. Put the timeline/hwsp reference.
>>
>>
>> static int gen8_emit_bb_start(struct i915_request *rq,
>>                             u64 offset, u32
>> len,
>>                             const unsigned
>> int flags)
>> {
>>     struct i915_timeline *tl;
>>     u32 seqno;
>>
>>     if (enable_watchdog) {
>>         /* Start watchdog timer */
>>         cs = gen8_emit_start_watchdog(rq, cs);
>>         tl = ce->ring->timeline;
>>         i915_timeline_get_seqno(tl, rq, &seqno);
>>         /*Store initial hwsp seqno associated with this request
>>         engine->watchdog_hwsp_seqno = tl->hwsp_seqno;
> 
> You should not need to allocate a new seqno and also having something 
> stored per engine does not make clear how will you solve out of order.
> 
> Maybe you just set up the timer, then lets see below..
> 
> Also, are you not trying to do the software implementation to start with?
> 
>>     }
>>
>> }
>>
>> static void gen8_watchdog_tasklet(unsigned long data)
>> {
>>         struct i915_request *rq;
>>
>>         rq = intel_engine_find_active_request(engine);
>>
>>         /* Inspect the watchdog seqno once again for
>> completion? */
>>         if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq-
>>> fence.seqno)) {
>>             //Reset Engine
>>         }
>> }
> 
> What happens if you simply reset without checking anything? You know hw 
> timer wouldn't have fired if the context wasn't running, correct?
> 
> (Ignoring the race condition between interrupt raised -> hw interrupt 
> delivered -> serviced -> tasklet scheduled -> tasklet running. Which may 
> mean request has completed in the meantime and you reset the engine for 
> nothing. But this is probably not 100% solvable.)

Good idea would be to write some tests to exercise some normal and more 
edge case scenarios like coalesced requests, preemption etc. Checking 
which request got reset etc.

Regards,

Tvrtko

> Regards,
> 
> Tvrtko
> 
>> Tvrtko, is the above acceptable to inspect whether the seqno has
>> completed?
>>
>> I noticed there's a helper function i915_request_completed(struct
>> i915_request *rq) but it will require me to modify it in order to pass
>> 2 different seqnos.
>>
>> Regards,
>> Carlos
>>
>>>
>>> If the user interrupt fires with the request completed cancel the
>>> above
>>> operations.
>>>
>>> There could be an inherent race between inspecting the seqno and
>>> deciding to reset. Not sure at the moment what to do. Maybe just call
>>> it
>>> bad luck?
>>>
>>> I also think for the software implementation you need to force no
>>> request coalescing for contexts with timeout set. Because you want
>>> to
>>> have 100% defined borders for request in and out - since the timeout
>>> is
>>> defined per request.
>>>
>>> In this case you don't need the user interrupt for the trailing edge
>>> signal but can use context complete. Maybe putting hooks into
>>> context_in/out in intel_lrc.c would work under these circumstances.
>>>
>>> Also if preempted you need to cancel the timer setup and store
>>> elapsed
>>> execution time.
>>>
>>> Or it may make sense to just disable preemption for these contexts.
>>> Otherwise there is no point in trying to mandate the timeout?
>>>
>>> But it is also kind of bad since non-privileged contexts can make
>>> themselves non-preemptable by setting the watchdog timeout.
>>>
>>> Maybe as a compromise we need to automatically apply an elevated
>>> priority level, but not as high to be completely non-preemptable.
>>> Sounds
>>> like a hard question.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
>>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2019-03-19 12:46 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
2019-02-25 13:34   ` Tvrtko Ursulin
2019-03-06 23:08     ` Carlos Santa
2019-03-07  7:27       ` Tvrtko Ursulin
2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
2019-02-28 17:38   ` Tvrtko Ursulin
2019-03-01  1:51     ` Carlos Santa
2019-03-01  9:36   ` Chris Wilson
2019-03-02  2:08     ` Carlos Santa
2019-03-08  3:16     ` Carlos Santa
2019-03-11 10:39       ` Tvrtko Ursulin
2019-03-18  0:15         ` Carlos Santa
2019-03-19 12:39           ` Tvrtko Ursulin
2019-03-19 12:46             ` Tvrtko Ursulin [this message]
2019-03-19 17:52               ` Carlos Santa
2019-02-21  2:58 ` [PATCH v4 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
2019-02-21  2:58 ` [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
2019-02-28 17:22   ` Tvrtko Ursulin
2019-02-21  2:58 ` [PATCH v4 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
2019-02-21  2:58 ` drm/i915: Replace global_seqno with a hangcheck heartbeat seqno Carlos Santa
2019-02-21  3:24 ` ✗ Fi.CI.BAT: failure for drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3) Patchwork
2019-03-11 11:54 ` [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=617e8e5f-ca57-ef8f-e085-79dbf70dcb2d@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=carlos.santa@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=michel.thierry@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox