Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Hellstrom, Thomas" <thomas.hellstrom@intel.com>,
	"Harrison, John C" <john.c.harrison@intel.com>,
	"Intel-GFX@Lists.FreeDesktop.Org"
	<Intel-GFX@Lists.FreeDesktop.Org>
Cc: "Auld, Matthew" <matthew.auld@intel.com>,
	"DRI-Devel@Lists.FreeDesktop.Org"
	<DRI-Devel@Lists.FreeDesktop.Org>
Subject: Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
Date: Mon, 16 Jan 2023 12:13:44 +0000	[thread overview]
Message-ID: <3ed42f96-8965-7872-709d-2daa6c48dcb9@linux.intel.com> (raw)
In-Reply-To: <7316954666eecb39aef79067bc590e58bee48389.camel@intel.com>


On 13/01/2023 17:46, Hellstrom, Thomas wrote:
> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>>
>> On 12/01/2023 20:40, John Harrison wrote:
>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> There was a report of error captures occurring without any hung
>>>>> context being indicated despite the capture being initiated by
>>>>> a 'hung
>>>>> context notification' from GuC. The problem was not
>>>>> reproducible.
>>>>> However, it is possible to happen if the context in question
>>>>> has no
>>>>> active requests. For example, if the hang was in the context
>>>>> switch
>>>>> itself then the breadcrumb write would have occurred and the
>>>>> KMD would
>>>>> see an idle context.
>>>>>
>>>>> In the interests of attempting to provide as much information
>>>>> as
>>>>> possible about a hang, it seems wise to include the engine info
>>>>> regardless of whether a request was found or not. As opposed to
>>>>> just
>>>>> prentending there was no hang at all.
>>>>>
>>>>> So update the error capture code to always record engine
>>>>> information
>>>>> if an engine is given. Which means updating record_context() to
>>>>> take a
>>>>> context instead of a request (which it only ever used to find
>>>>> the
>>>>> context anyway). And split the request agnostic parts of
>>>>> intel_engine_coredump_add_request() out into a seaprate
>>>>> function.
>>>>>
>>>>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of
>>>>> a null
>>>>> pointer.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> Reviewed-by: Umesh Nerlige Ramappa
>>>>> <umesh.nerlige.ramappa@intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/i915_gpu_error.c | 61
>>>>> +++++++++++++++++++--------
>>>>>    1 file changed, 43 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> index 9d5d5a397b64e..bd2cf7d235df0 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> @@ -1370,14 +1370,14 @@ static void
>>>>> engine_record_execlists(struct
>>>>> intel_engine_coredump *ee)
>>>>>    }
>>>>>      static bool record_context(struct i915_gem_context_coredump
>>>>> *e,
>>>>> -               const struct i915_request *rq)
>>>>> +               struct intel_context *ce)
>>>>>    {
>>>>>        struct i915_gem_context *ctx;
>>>>>        struct task_struct *task;
>>>>>        bool simulated;
>>>>>          rcu_read_lock();
>>>>> -    ctx = rcu_dereference(rq->context->gem_context);
>>>>> +    ctx = rcu_dereference(ce->gem_context);
>>>>>        if (ctx && !kref_get_unless_zero(&ctx->ref))
>>>>>            ctx = NULL;
>>>>>        rcu_read_unlock();
>>>>> @@ -1396,8 +1396,8 @@ static bool record_context(struct
>>>>> i915_gem_context_coredump *e,
>>>>>        e->guilty = atomic_read(&ctx->guilty_count);
>>>>>        e->active = atomic_read(&ctx->active_count);
>>>>>    -    e->total_runtime =
>>>>> intel_context_get_total_runtime_ns(rq->context);
>>>>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq-
>>>>>> context);
>>>>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>>>>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>>>>          simulated = i915_gem_context_no_error_capture(ctx);
>>>>>    @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct
>>>>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>>>>        return ee;
>>>>>    }
>>>>>    +static struct intel_engine_capture_vma *
>>>>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>>>>> +                struct intel_context *ce,
>>>>> +                gfp_t gfp)
>>>>> +{
>>>>> +    struct intel_engine_capture_vma *vma = NULL;
>>>>> +
>>>>> +    ee->simulated |= record_context(&ee->context, ce);
>>>>> +    if (ee->simulated)
>>>>> +        return NULL;
>>>>> +
>>>>> +    /*
>>>>> +     * We need to copy these to an anonymous buffer
>>>>> +     * as the simplest method to avoid being overwritten
>>>>> +     * by userspace.
>>>>> +     */
>>>>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>>>>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>>>>> +
>>>>> +    return vma;
>>>>> +}
>>>>> +
>>>>>    struct intel_engine_capture_vma *
>>>>>    intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>                      struct i915_request *rq,
>>>>>                      gfp_t gfp)
>>>>>    {
>>>>> -    struct intel_engine_capture_vma *vma = NULL;
>>>>> +    struct intel_engine_capture_vma *vma;
>>>>>    -    ee->simulated |= record_context(&ee->context, rq);
>>>>> -    if (ee->simulated)
>>>>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>>>>> +    if (!vma)
>>>>>            return NULL;
>>>>>          /*
>>>>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>         */
>>>>>        vma = capture_vma_snapshot(vma, rq->batch_res, gfp,
>>>>> "batch");
>>>>>        vma = capture_user(vma, rq, gfp);
>>>>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>>>>> -    vma = capture_vma(vma, rq->context->state, "HW context",
>>>>> gfp);
>>>>>          ee->rq_head = rq->head;
>>>>>        ee->rq_post = rq->postfix;
>>>>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>        if (ce) {
>>>>>            intel_engine_clear_hung_context(engine);
>>>>>            rq = intel_context_find_active_request(ce);
>>>>> -        if (!rq || !i915_request_started(rq))
>>>>> -            goto no_request_capture;
>>>>> +        if (rq && !i915_request_started(rq)) {
>>>>> +            drm_info(&engine->gt->i915->drm, "Got hung context
>>>>> on %s
>>>>> with no active request!\n",
>>>>
>>>> Suggest s/active/started/ since we have both i915_request_active
>>>> and
>>>> i915_request_started, so to align the terminology.
>>> The message text was based on the intent of the activity not the
>>> naming
>>> of some internal helper function. Can change it if you really want
>>> but
>>> "with no started request" just reads like bad English to me. Plus
>>> it
>>> gets removed in the next patch anyway...
>>>
>>>
>>>>
>>>>> +                 engine->name);
>>>>> +            rq = NULL;
>>>>> +        }
>>>>>        } else {
>>>>>            /*
>>>>>             * Getting here with GuC enabled means it is a forced
>>>>> error
>>>>> capture
>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>                               flags);
>>>>>            }
>>>>>        }
>>>>> -    if (rq)
>>>>> +    if (rq) {
>>>>>            rq = i915_request_get_rcu(rq);
>>>>> +        capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    } else if (ce) {
>>>>> +        capture = engine_coredump_add_context(ee, ce,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    }
>>>>>    -    if (!rq)
>>>>> -        goto no_request_capture;
>>>>> -
>>>>> -    capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>>        if (!capture) {
>>>>> -        i915_request_put(rq);
>>>>> +        if (rq)
>>>>> +            i915_request_put(rq);
>>>>>            goto no_request_capture;
>>>>>        }
>>>>>        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>            intel_guc_capture_get_matching_node(engine->gt, ee,
>>>>> ce);
>>>>
>>>> This step requires non-NULL ce, so if you move it under the "else
>>>> if
>>>> (ce)" above then I *think* exit from the function can be
>>>> consolidated
>>>> to just:
>>>>
>>>> if (capture) {
>>>>      intel_engine_coredump_add_vma(ee, capture, compress);
>>>>      if (rq)
>>>>          i915_request_put(rq);
>>> Is there any reason the rq ref needs to be held during the add_vma
>>> call?
>>> Can it now just be moved earlier to be:
>>>       if (rq) {
>>>           rq = i915_request_get_rcu(rq);
>>>           capture = intel_engine_coredump_add_request(ee, rq,
>>> ATOMIC_MAYFAIL);
>>>           i915_request_put(rq);
>>>       }
>>>
>>> The internals of the request object are only touched in the above
>>> _add_request() code. The later _add_vma() call fiddles around with
>>> vmas
>>> that pulled from the request but the capture_vma code inside
>>> _add_request() has already copied everything, hasn't it? Or rather,
>>> it
>>> has grabbed its own private vma resource locks. So there is no
>>> requirement to keep the request itself around still?
> 
> That sounds correct. It was some time ago since I worked with this code
> but when i started IIRC KASAN told me the request along with the whole
> capture list could disappear under us due to a parallel capture.
> 
> So the request reference added then might cover a bit too much now that
> we also hold references on vma resources, which it looks like we do in
> intel_engine_coredump_add_vma().

If you are not sure maybe just should leave the reference covering as it 
does? I don't think there is any harm in covering too much. Whether or 
not it is immediately released after the call to 
intel_engine_coredump_add_request(), or at the exit of capture_engine() 
I mean, we can skip that clean up if in doubt.

> Another thing which is crappy with the current error capture code is
> that the request capture list needs to be freed with the request and
> not when the request signals (We can't block request signalling in the
> capture code to keep the capture list around). There might be many
> signaled requests hanging around in non-pruned dma_resv objects and
> thus many unused capture lists with many unused vma resources. :/

This last part sounds vaguely familiar - is it really a problem with the 
error capture code and it wasn't some other refactoring which removed 
the pruning of signaled fences from dma_resv?

Regards,

Tvrtko

> 
> /Thomas
> 
> 
>>
>> Don't know.. it is a question if changes from 60dc43d1190d
>> ("drm/i915:
>> Use struct vma_resource instead of struct vma_snapshot") removed the
>> need for holding the rq reference that "long" I guess? Adding Thomas
>> and
>> Matt to perhaps comment.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>> John.
>>>
>>>
>>>> } else {
>>>>      kfree(ee);
>>>>      ee = NULL;
>>>> }
>>>>
>>>> return ee;
>>>>
>>>> No "if (rq) i915_request_put()" twice, and goto label can be
>>>> completely removed.
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>>          intel_engine_coredump_add_vma(ee, capture, compress);
>>>>> -    i915_request_put(rq);
>>>>> +    if (rq)
>>>>> +        i915_request_put(rq);
>>>>>          return ee;
>>>
> 

  parent reply	other threads:[~2023-01-16 12:13 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-12  2:53 [Intel-gfx] [PATCH 0/4] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-12 10:01   ` Tvrtko Ursulin
2023-01-12 20:40     ` John Harrison
2023-01-13  9:51       ` Tvrtko Ursulin
2023-01-13 17:46         ` Hellstrom, Thomas
2023-01-13 21:29           ` John Harrison
2023-01-16 12:38             ` Tvrtko Ursulin
2023-01-17 19:40               ` John Harrison
2023-01-16 12:13           ` Tvrtko Ursulin [this message]
2023-01-12  2:53 ` [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-12 10:06   ` Tvrtko Ursulin
2023-01-12 20:46     ` John Harrison
2023-01-13  9:10       ` Tvrtko Ursulin
2023-01-12  2:53 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-12 10:15   ` Tvrtko Ursulin
2023-01-12 20:59     ` John Harrison
2023-01-13  9:22       ` Tvrtko Ursulin
2023-01-14  1:27         ` John Harrison
2023-01-16 12:43           ` Tvrtko Ursulin
2023-01-17 21:14             ` John Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-12 10:11   ` Tvrtko Ursulin
2023-01-12  3:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2) Patchwork
2023-01-12  3:36 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-01-12  5:36 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ed42f96-8965-7872-709d-2daa6c48dcb9@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=john.c.harrison@intel.com \
    --cc=matthew.auld@intel.com \
    --cc=thomas.hellstrom@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox