All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH v3] drm/i915/execlists: Reclaim the hanging virtual request
Date: Wed, 22 Jan 2020 13:32:31 +0000	[thread overview]
Message-ID: <1d605d9a-dff4-2ff5-30bb-6c5a7350a9cf@linux.intel.com> (raw)
In-Reply-To: <157962947004.6241.16387329374520796728@skylake-alporthouse-com>


On 21/01/2020 17:57, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-01-21 17:43:37)
>>
>> On 21/01/2020 17:32, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2020-01-21 17:19:52)
>>>>
>>>> On 21/01/2020 14:07, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2020-01-21 13:55:29)
>>>>>>
>>>>>>
>>>>>> On 21/01/2020 13:04, Chris Wilson wrote:
>>>>>>> +             GEM_BUG_ON(!reset_in_progress(&engine->execlists));
>>>>>>> +
>>>>>>> +             /*
>>>>>>> +              * An unsubmitted request along a virtual engine will
>>>>>>> +              * remain on the active (this) engine until we are able
>>>>>>> +              * to process the context switch away (and so mark the
>>>>>>> +              * context as no longer in flight). That cannot have happened
>>>>>>> +              * yet, otherwise we would not be hanging!
>>>>>>> +              */
>>>>>>> +             spin_lock_irqsave(&ve->base.active.lock, flags);
>>>>>>> +             GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
>>>>>>> +             GEM_BUG_ON(ve->request != rq);
>>>>>>> +             ve->request = NULL;
>>>>>>> +             spin_unlock_irqrestore(&ve->base.active.lock, flags);
>>>>>>> +
>>>>>>> +             rq->engine = engine;
>>>>>>
>>>>>> Lets see I understand this... tasklet has been disabled and ring paused.
>>>>>> But we find an uncompleted request in the ELSP context, with rq->engine
>>>>>> == virtual engine. Therefore this cannot be the first request on this
>>>>>> timeline but has to be later.
>>>>>
>>>>> Not quite.
>>>>>
>>>>> engine->execlists.active[] tracks the HW, it get's updated only upon
>>>>> receiving HW acks (or we reset).
>>>>>
>>>>> So if execlists_active()->engine == virtual, it can only mean that the
>>>>> inflight _hanging_ request has already been unsubmitted by an earlier
>>>>> preemption in execlists_dequeue(), but that preemption has not yet been
>>>>> processed by the HW. (Hence the preemption-reset underway.)
>>>>>
>>>>> Now while we coalesce the requests for a context into a single ELSP[]
>>>>> slot, and only record the last request submitted for a context, we have
>>>>> to walk back along that context's timeline to find the earliest
>>>>> incomplete request and blame the hang upon it.
>>>>>
>>>>> For a virtual engine, it's much simpler as there is only ever one
>>>>> request in flight, but I don't think that has any impact here other
>>>>> than that we only need to repair the single unsubmitted request that was
>>>>> returned to the virtual engine.
>>>>>
>>>>>> One which has been put on the runqueue but
>>>>>> not yet submitted to hw. (Because one at a time.) Or it has been
>>>>>> unsubmitted by __unwind_incomplete_request already. In the former case
>>>>>> why move it to the physical engine? Also in the latter actually, it
>>>>>> would overwrite rq->engine with the physical one.
>>>>>
>>>>> Yes. For incomplete preemption event, the request is *still* on this
>>>>> engine and has not been released (rq->context->inflight == engine, so it
>>>>> cannot be submitted to any other engine, until after we acknowledge the
>>>>> context has been saved and is no longer being accessed by HW.) It is
>>>>> legal for us to process the hanging request along this engine; we have a
>>>>> suboptimal decision to return the request to the same engine after the
>>>>> reset, but since we have replaced the hanging payload, the request is a
>>>>> mere signaling placeholder (and I do not think will overly burden the
>>>>> system and negatively impact other virtual engines).
>>>>
>>>> What if the request in elsp actually completed in the meantime eg.
>>>> preemption timeout was a false positive?
>>>>
>>>> In execlists_capture we do:
>>>>
>>>>           cap->rq = execlists_active(&engine->execlists);
>>>>
>>>> This gets a completed request, then we do:
>>>>
>>>>           cap->rq = active_request(cap->rq->context->timeline, cap->rq);
>>>>
>>>> This walks along the virtual timeline and finds a next virtual request.
>>>> It then binds this request to a physical engine and sets ve->request to
>>>> NULL.
>>>
>>> If we miss the completion event, then active_request() returns the
>>> original request and we blame it for a having a 650ms preemption-off
>>> shader with a 640ms preemption timeout.
>>
>> I am thinking of this sequence of interleaved events:
>>
>>          preempt_timeout
>>                                  tasklet_disable
>>                                  ring_pause
>>                                  execlist_active
>>          seqno write visible
>>                                  active_request - walks the tl to next
> 
> ... tries to walk to next, sees no incomplete request, returns original
> request.
> 
> static struct i915_request *
> active_request(const struct intel_timeline * const tl, struct i915_request *rq)
> {
>          struct i915_request *active = rq;
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this sneaky line
> 
>          rcu_read_lock();
>          list_for_each_entry_continue_reverse(rq, &tl->requests, link) {
>                  if (i915_request_completed(rq))
>                          break;
> 
>                  active = rq;
> 		^^^^^^^^^^^^ these too may complete at any moment after
> 		our inspection
> 
> 
>          }
>          rcu_read_unlock();
> 
>          return active;
> }

Brain fart on my part, sorry. I was confused.

Regards,

Tvrtko

>>                                  execlist_hold
>>                                  schedule_worker
>>                                  tasklet_enable
>>          process_csb completed
>>
>> This is not possible? Seqno write happening needs only to be roughly
>> there since as long as tasklet has been disabled execlist->active
>> remains fixed.
> 
> It's certainly possible, the requests do keep going on the HW up until
> the next semaphore (which is after the seqno write). That is taken into
> account in that we may end up trying to reset a completed request, which
> should be avoided in execlists_reset() [after the HW has processed the
> reset request], but we capture the request anyway and put it back for
> execution (which is avoided in execlists_dequeue). Isn't preempt-to-busy
> fun?
> -Chris
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2020-01-22 13:32 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-21 10:09 [Intel-gfx] [PATCH] drm/i915/execlists: Reclaim hanging virtual request Chris Wilson
2020-01-21 11:04 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2020-01-21 11:20 ` [Intel-gfx] [PATCH] " Tvrtko Ursulin
2020-01-21 11:33   ` Chris Wilson
2020-01-21 11:44     ` Chris Wilson
2020-01-21 11:50 ` [Intel-gfx] [PATCH] drm/i915/execlists: Reclaim the " Chris Wilson
2020-01-21 13:04 ` [Intel-gfx] [PATCH v3] " Chris Wilson
2020-01-21 13:55   ` Tvrtko Ursulin
2020-01-21 14:07     ` Chris Wilson
2020-01-21 17:19       ` Tvrtko Ursulin
2020-01-21 17:32         ` Chris Wilson
2020-01-21 17:43           ` Tvrtko Ursulin
2020-01-21 17:57             ` Chris Wilson
2020-01-22 13:32               ` Tvrtko Ursulin [this message]
2020-01-21 13:48 ` [Intel-gfx] [PATCH v4] " Chris Wilson
2020-01-21 15:06 ` [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915/execlists: Reclaim hanging virtual request (rev4) Patchwork
2020-01-22 16:57 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2020-01-23 17:54 ` [Intel-gfx] ✓ Fi.CI.IGT: success " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1d605d9a-dff4-2ff5-30bb-6c5a7350a9cf@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.