From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
"Hellstrom, Thomas" <thomas.hellstrom@intel.com>,
"Intel-GFX@Lists.FreeDesktop.Org"
<Intel-GFX@Lists.FreeDesktop.Org>
Cc: "Auld, Matthew" <matthew.auld@intel.com>,
"DRI-Devel@Lists.FreeDesktop.Org"
<DRI-Devel@Lists.FreeDesktop.Org>
Subject: Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
Date: Tue, 17 Jan 2023 11:40:57 -0800
Message-ID: <4bec6af7-90f5-46de-1df0-9af034662475@intel.com>
In-Reply-To: <75379a08-1a87-d3a3-01ee-781c73d40d6f@linux.intel.com>
On 1/16/2023 04:38, Tvrtko Ursulin wrote:
> On 13/01/2023 21:29, John Harrison wrote:
>> On 1/13/2023 09:46, Hellstrom, Thomas wrote:
>>> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 20:40, John Harrison wrote:
>>>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>
> [snip]
>
>>>>>>> +                                           engine->name);
>>>>>>> +                   rq = NULL;
>>>>>>> +           }
>>>>>>>     } else {
>>>>>>>             /*
>>>>>>>              * Getting here with GuC enabled means it is a forced error capture
>>>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs *engine,
>>>>>>>                                                             flags);
>>>>>>>             }
>>>>>>>     }
>>>>>>> -   if (rq)
>>>>>>> +   if (rq) {
>>>>>>>             rq = i915_request_get_rcu(rq);
>>>>>>> +           capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>>>>>>> +   } else if (ce) {
>>>>>>> +           capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>>>>>>> +   }
>>>>>>> -   if (!rq)
>>>>>>> -           goto no_request_capture;
>>>>>>> -
>>>>>>> -   capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>>>>>>>     if (!capture) {
>>>>>>> -           i915_request_put(rq);
>>>>>>> +           if (rq)
>>>>>>> +                   i915_request_put(rq);
>>>>>>>             goto no_request_capture;
>>>>>>>     }
>>>>>>>     if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>>>             intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>>>>>> This step requires non-NULL ce, so if you move it under the "else if
>>>>>> (ce)" above then I *think* exit from the function can be consolidated
>>>>>> to just:
>>>>>>
>>>>>> if (capture) {
>>>>>>     intel_engine_coredump_add_vma(ee, capture, compress);
>>>>>>     if (rq)
>>>>>>         i915_request_put(rq);
>>>>> Is there any reason the rq ref needs to be held during the add_vma call?
>>>>> Can it now just be moved earlier to be:
>>>>>     if (rq) {
>>>>>         rq = i915_request_get_rcu(rq);
>>>>>         capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>>>>>         i915_request_put(rq);
>>>>>     }
>>>>>
>>>>> The internals of the request object are only touched in the above
>>>>> _add_request() code. The later _add_vma() call fiddles around with vmas
>>>>> that are pulled from the request, but the capture_vma code inside
>>>>> _add_request() has already copied everything, hasn't it? Or rather, it
>>>>> has grabbed its own private vma resource locks. So there is no
>>>>> requirement to keep the request itself around still?
>>> That sounds correct. It was some time ago since I worked with this code
>>> but when i started IIRC KASAN told me the request along with the whole
>>> capture list could disappear under us due to a parallel capture.
>>>
>>> So the request reference added then might cover a bit too much now that
>>> we also hold references on vma resources, which it looks like we do in
>>> intel_engine_coredump_add_vma().
>> So that means we end up with:
>>     rq = intel_context_find_active_request(ce);
>>     ...
>>     [test stuff like i915_request_started(rq)]
>>     ...
>>     if (rq) {
>>         rq = i915_request_get_rcu(rq);
>>         capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>>         i915_request_put(rq);
>>     }
>>
>> What is special about coredump_add_request() that it needs the
>> request to be extra locked for that call and only that call? If the
>> request can magically vanish after being found then what protects the
>> _started() query? For that matter, what stops the request_get_rcu()
>> itself being called on a pointer that is no longer valid? And if we
>> do actually have sufficient locking in place to prevent that, why
>> doesn't that cover the coredump_add_request() usage?
>
> There is definitely a red flag there with the difference between the
> if and else blocks at the top of capture_engine(). And funnily enough,
> the first block appears to be GuC only. That is not obvious from the
> code and should probably have a comment, or function names made
> self-documenting.
In terms of 'red flag', you mean the apparent difference in locking in
this section?
    ce = intel_engine_get_hung_context(engine);
    if (ce) {
        intel_engine_clear_hung_context(engine);
        rq = intel_context_find_active_request(ce);
        if (!rq || !i915_request_started(rq))
            goto no_request_capture;
    } else {
        /*
         * Getting here with GuC enabled means it is a forced error capture
         * with no actual hang. So, no need to attempt the execlist search.
         */
        if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
            spin_lock_irqsave(&engine->sched_engine->lock, flags);
            rq = intel_engine_execlist_find_hung_request(engine);
            spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
        }
    }
There is no actual locking difference. The first thing
intel_context_find_active_request() does is acquire the relevant spinlock
for the list it is about to traverse. I assume
intel_engine_execlist_find_hung_request() must also be called from other
places that already hold the appropriate lock, and hence the locking must
be done externally for all callers.
Technically, the first part does not have to be GuC only. It is entirely
possible that a future improvement to execlists would add support for
tagging the hanging context up front, without requiring a fresh search
here (based on pre-emption timeouts, heartbeat timeouts, or whatever it
was that decided to call the error capture code in the first place). So
there is no reason to leak unnecessary backend implementation details
into this higher level by marking that path as GuC only. Whereas the
search for an individual request after the hang has happened is an
execlist-only implementation detail, hence the enforcement that we don't
do that for GuC (with a comment to explain).
>
> I guess the special thing about intel_engine_coredump_add_request() is
> that it dereferences the rq. So it is possibly 573ba126aef3
> ("drm/i915/guc: Capture error state on context reset") which added a
> bug where rq can be dereferenced with a reference held. Or perhaps
> with the GuC backend there is a guarantee request cannot be retired
> from elsewhere while error capture is examining it.
"added a bug where rq can be dereferenced with a reference held." <--
did you mean 'with' or 'without'? Dereferencing with a reference held
sounds correct to me.
You mean the loop in intel_context_find_active_request()? It takes the
lock on the list of requests that it is scanning through. Presumably
requests can't vanish while they are on that list; they must have been
reference counted when added. So dereferencing within the list has to be
safe, yes? The issue is that there needs to be an extra get on the
reference that is returned, taken before dropping the list lock, and that
reference would have to be released by the caller. Yes? And likewise, the
request_get that currently exists in capture_engine() needs to be inside
the execlist spinlock. Which is how it used to be before the GuC support
was added in the above patch. Or rather, there was no explicit
request_get at all, but all the request processing was done inside the
execlist spinlock (which seems bad, given that it would be allocating
memory and such within the spinlock?!).
Presumably engine_dump_active_requests() is also broken. It asserts that
the execlist spinlock is held but does nothing about the GuC's equivalent
spinlock, despite pulling a request off the GuC state's list and then
doing lots of dereferencing on it. It will also need
intel_context_find_active_request() to return an extra reference count on
the request (taken before dropping the GuC lock, as described above) and
then manually put the request iff it got it via the hung-context route
rather than the hung-request route! Or, more simply, also get a local
reference in the hung-request path and then just unconditionally put at
the end.
Sound plausible?
>
> To unravel the error entry points into error capture, from execlists,
> debugfs, ringbuffer, I don't have the time to remind myself how all
> that works right now. Quite possibly at least some of those run async
> to the GPU so must be safe against parallel request retirement. So I
> don't know if the i915_request_get_rcu safe in all those cases without
> spending some time to refresh my knowledge a bit.
>
> Sounds like the best plan is not to change this too much - just leave
> the scope of reference held as is and ideally eliminate the necessary
> goto labels. AFAIR that should be doable without changing anything
> real and unblock these improvements.
Hmm. If you want it all left unchanged, I don't think you can eliminate
the goto. But it seems like the fix is to move the existing get into the
execlist only spinlock and add a new get to
intel_context_find_active_request(). Then everything should be
guaranteed protected at all times.
John.
>
> Regards,
>
> Tvrtko