Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request

public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed

From: "Hellstrom, Thomas" <thomas.hellstrom@intel.com>
To: "Harrison, John C" <john.c.harrison@intel.com>,
	"tvrtko.ursulin@linux.intel.com" <tvrtko.ursulin@linux.intel.com>,
	"Intel-GFX@Lists.FreeDesktop.Org"
	<Intel-GFX@Lists.FreeDesktop.Org>
Cc: "Auld, Matthew" <matthew.auld@intel.com>,
	"DRI-Devel@Lists.FreeDesktop.Org"
	<DRI-Devel@Lists.FreeDesktop.Org>
Subject: Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
Date: Fri, 13 Jan 2023 17:46:11 +0000	[thread overview]
Message-ID: <7316954666eecb39aef79067bc590e58bee48389.camel@intel.com> (raw)
In-Reply-To: <4acf4db5-97e9-34dc-2b89-517296ce4c3c@linux.intel.com>

On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
> 
> On 12/01/2023 20:40, John Harrison wrote:
> > On 1/12/2023 02:01, Tvrtko Ursulin wrote:
> > > On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> > > > From: John Harrison <John.C.Harrison@Intel.com>
> > > > 
> > > > There was a report of error captures occurring without any hung
> > > > context being indicated despite the capture being initiated by
> > > > a 'hung
> > > > context notification' from GuC. The problem was not
> > > > reproducible.
> > > > However, it is possible to happen if the context in question
> > > > has no
> > > > active requests. For example, if the hang was in the context
> > > > switch
> > > > itself then the breadcrumb write would have occurred and the
> > > > KMD would
> > > > see an idle context.
> > > > 
> > > > In the interests of attempting to provide as much information
> > > > as
> > > > possible about a hang, it seems wise to include the engine info
> > > > regardless of whether a request was found or not. As opposed to
> > > > just
> > > > prentending there was no hang at all.
> > > > 
> > > > So update the error capture code to always record engine
> > > > information
> > > > if an engine is given. Which means updating record_context() to
> > > > take a
> > > > context instead of a request (which it only ever used to find
> > > > the
> > > > context anyway). And split the request agnostic parts of
> > > > intel_engine_coredump_add_request() out into a seaprate
> > > > function.
> > > > 
> > > > v2: Remove a duplicate 'if' statement (Umesh) and fix a put of
> > > > a null
> > > > pointer.
> > > > 
> > > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> > > > Reviewed-by: Umesh Nerlige Ramappa
> > > > <umesh.nerlige.ramappa@intel.com>
> > > > ---
> > > >   drivers/gpu/drm/i915/i915_gpu_error.c | 61
> > > > +++++++++++++++++++--------
> > > >   1 file changed, 43 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
> > > > b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > index 9d5d5a397b64e..bd2cf7d235df0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > @@ -1370,14 +1370,14 @@ static void
> > > > engine_record_execlists(struct 
> > > > intel_engine_coredump *ee)
> > > >   }
> > > >     static bool record_context(struct i915_gem_context_coredump
> > > > *e,
> > > > -               const struct i915_request *rq)
> > > > +               struct intel_context *ce)
> > > >   {
> > > >       struct i915_gem_context *ctx;
> > > >       struct task_struct *task;
> > > >       bool simulated;
> > > >         rcu_read_lock();
> > > > -    ctx = rcu_dereference(rq->context->gem_context);
> > > > +    ctx = rcu_dereference(ce->gem_context);
> > > >       if (ctx && !kref_get_unless_zero(&ctx->ref))
> > > >           ctx = NULL;
> > > >       rcu_read_unlock();
> > > > @@ -1396,8 +1396,8 @@ static bool record_context(struct 
> > > > i915_gem_context_coredump *e,
> > > >       e->guilty = atomic_read(&ctx->guilty_count);
> > > >       e->active = atomic_read(&ctx->active_count);
> > > >   -    e->total_runtime = 
> > > > intel_context_get_total_runtime_ns(rq->context);
> > > > -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq-
> > > > >context);
> > > > +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
> > > > +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
> > > >         simulated = i915_gem_context_no_error_capture(ctx);
> > > >   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
> > > > intel_engine_cs *engine, gfp_t gfp, u32 dump_
> > > >       return ee;
> > > >   }
> > > >   +static struct intel_engine_capture_vma *
> > > > +engine_coredump_add_context(struct intel_engine_coredump *ee,
> > > > +                struct intel_context *ce,
> > > > +                gfp_t gfp)
> > > > +{
> > > > +    struct intel_engine_capture_vma *vma = NULL;
> > > > +
> > > > +    ee->simulated |= record_context(&ee->context, ce);
> > > > +    if (ee->simulated)
> > > > +        return NULL;
> > > > +
> > > > +    /*
> > > > +     * We need to copy these to an anonymous buffer
> > > > +     * as the simplest method to avoid being overwritten
> > > > +     * by userspace.
> > > > +     */
> > > > +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> > > > +    vma = capture_vma(vma, ce->state, "HW context", gfp);
> > > > +
> > > > +    return vma;
> > > > +}
> > > > +
> > > >   struct intel_engine_capture_vma *
> > > >   intel_engine_coredump_add_request(struct
> > > > intel_engine_coredump *ee,
> > > >                     struct i915_request *rq,
> > > >                     gfp_t gfp)
> > > >   {
> > > > -    struct intel_engine_capture_vma *vma = NULL;
> > > > +    struct intel_engine_capture_vma *vma;
> > > >   -    ee->simulated |= record_context(&ee->context, rq);
> > > > -    if (ee->simulated)
> > > > +    vma = engine_coredump_add_context(ee, rq->context, gfp);
> > > > +    if (!vma)
> > > >           return NULL;
> > > >         /*
> > > > @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
> > > > intel_engine_coredump *ee,
> > > >        */
> > > >       vma = capture_vma_snapshot(vma, rq->batch_res, gfp,
> > > > "batch");
> > > >       vma = capture_user(vma, rq, gfp);
> > > > -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> > > > -    vma = capture_vma(vma, rq->context->state, "HW context",
> > > > gfp);
> > > >         ee->rq_head = rq->head;
> > > >       ee->rq_post = rq->postfix;
> > > > @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs
> > > > *engine,
> > > >       if (ce) {
> > > >           intel_engine_clear_hung_context(engine);
> > > >           rq = intel_context_find_active_request(ce);
> > > > -        if (!rq || !i915_request_started(rq))
> > > > -            goto no_request_capture;
> > > > +        if (rq && !i915_request_started(rq)) {
> > > > +            drm_info(&engine->gt->i915->drm, "Got hung context
> > > > on %s 
> > > > with no active request!\n",
> > > 
> > > Suggest s/active/started/ since we have both i915_request_active
> > > and 
> > > i915_request_started, so to align the terminology.
> > The message text was based on the intent of the activity not the
> > naming 
> > of some internal helper function. Can change it if you really want
> > but 
> > "with no started request" just reads like bad English to me. Plus
> > it 
> > gets removed in the next patch anyway...
> > 
> > 
> > > 
> > > > +                 engine->name);
> > > > +            rq = NULL;
> > > > +        }
> > > >       } else {
> > > >           /*
> > > >            * Getting here with GuC enabled means it is a forced
> > > > error 
> > > > capture
> > > > @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
> > > > *engine,
> > > >                              flags);
> > > >           }
> > > >       }
> > > > -    if (rq)
> > > > +    if (rq) {
> > > >           rq = i915_request_get_rcu(rq);
> > > > +        capture = intel_engine_coredump_add_request(ee, rq, 
> > > > ATOMIC_MAYFAIL);
> > > > +    } else if (ce) {
> > > > +        capture = engine_coredump_add_context(ee, ce,
> > > > ATOMIC_MAYFAIL);
> > > > +    }
> > > >   -    if (!rq)
> > > > -        goto no_request_capture;
> > > > -
> > > > -    capture = intel_engine_coredump_add_request(ee, rq, 
> > > > ATOMIC_MAYFAIL);
> > > >       if (!capture) {
> > > > -        i915_request_put(rq);
> > > > +        if (rq)
> > > > +            i915_request_put(rq);
> > > >           goto no_request_capture;
> > > >       }
> > > >       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
> > > >           intel_guc_capture_get_matching_node(engine->gt, ee,
> > > > ce);
> > > 
> > > This step requires non-NULL ce, so if you move it under the "else
> > > if 
> > > (ce)" above then I *think* exit from the function can be
> > > consolidated 
> > > to just:
> > > 
> > > if (capture) {
> > >     intel_engine_coredump_add_vma(ee, capture, compress);
> > >     if (rq)
> > >         i915_request_put(rq);
> > Is there any reason the rq ref needs to be held during the add_vma
> > call? 
> > Can it now just be moved earlier to be:
> >      if (rq) {
> >          rq = i915_request_get_rcu(rq);
> >          capture = intel_engine_coredump_add_request(ee, rq, 
> > ATOMIC_MAYFAIL);
> >          i915_request_put(rq);
> >      }
> > 
> > The internals of the request object are only touched in the above 
> > _add_request() code. The later _add_vma() call fiddles around with
> > vmas 
> > that pulled from the request but the capture_vma code inside 
> > _add_request() has already copied everything, hasn't it? Or rather,
> > it 
> > has grabbed its own private vma resource locks. So there is no 
> > requirement to keep the request itself around still?

That sounds correct. It was some time ago since I worked with this code
but when i started IIRC KASAN told me the request along with the whole
capture list could disappear under us due to a parallel capture.

So the request reference added then might cover a bit too much now that
we also hold references on vma resources, which it looks like we do in
intel_engine_coredump_add_vma().

Another thing which is crappy with the current error capture code is
that the request capture list needs to be freed with the request and
not when the request signals (We can't block request signalling in the
capture code to keep the capture list around). There might be many
signaled requests hanging around in non-pruned dma_resv objects and
thus many unused capture lists with many unused vma resources. :/

/Thomas


> 
> Don't know.. it is a question if changes from 60dc43d1190d
> ("drm/i915: 
> Use struct vma_resource instead of struct vma_snapshot") removed the 
> need for holding the rq reference that "long" I guess? Adding Thomas
> and 
> Matt to perhaps comment.
> 
> Regards,
> 
> Tvrtko
> 
> 
> > John.
> > 
> > 
> > > } else {
> > >     kfree(ee);
> > >     ee = NULL;
> > > }
> > > 
> > > return ee;
> > > 
> > > No "if (rq) i915_request_put()" twice, and goto label can be 
> > > completely removed.
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > >         intel_engine_coredump_add_vma(ee, capture, compress);
> > > > -    i915_request_put(rq);
> > > > +    if (rq)
> > > > +        i915_request_put(rq);
> > > >         return ee;
> >

next prev parent reply	other threads:[~2023-01-13 17:46 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-12  2:53 [Intel-gfx] [PATCH 0/4] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-12 10:01   ` Tvrtko Ursulin
2023-01-12 20:40     ` John Harrison
2023-01-13  9:51       ` Tvrtko Ursulin
2023-01-13 17:46         ` Hellstrom, Thomas [this message]
2023-01-13 21:29           ` John Harrison
2023-01-16 12:38             ` Tvrtko Ursulin
2023-01-17 19:40               ` John Harrison
2023-01-16 12:13           ` Tvrtko Ursulin
2023-01-12  2:53 ` [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-12 10:06   ` Tvrtko Ursulin
2023-01-12 20:46     ` John Harrison
2023-01-13  9:10       ` Tvrtko Ursulin
2023-01-12  2:53 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-12 10:15   ` Tvrtko Ursulin
2023-01-12 20:59     ` John Harrison
2023-01-13  9:22       ` Tvrtko Ursulin
2023-01-14  1:27         ` John Harrison
2023-01-16 12:43           ` Tvrtko Ursulin
2023-01-17 21:14             ` John Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-12 10:11   ` Tvrtko Ursulin
2023-01-12  3:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2) Patchwork
2023-01-12  3:36 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-01-12  5:36 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7316954666eecb39aef79067bc590e58bee48389.camel@intel.com \
    --to=thomas.hellstrom@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=john.c.harrison@intel.com \
    --cc=matthew.auld@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox