Re: [Intel-gfx] [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset

From: Matthew Brost <matthew.brost@intel.com>
To: John Harrison <john.c.harrison@intel.com>
Cc: thomas.hellstrom@linux.intel.com,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
Date: Wed, 19 Jan 2022 13:05:28 -0800
Message-ID: <20220119210528.GA32998@jons-linux-dev-box>
In-Reply-To: <4f6e2027-bf15-7c84-7573-74c6d5daae42@intel.com>

On Wed, Jan 19, 2022 at 01:07:22PM -0800, John Harrison wrote:
> On 1/19/2022 12:54, Matthew Brost wrote:
> > On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
> > > On 1/18/2022 13:43, Matthew Brost wrote:
> > > > The G2H handler needs to be flushed during a GT reset but a G2H
> > > > indicating engine reset failure can trigger a GT reset. Add a worker to
> > > > trigger the GT when a engine reset failure is received to break this
> > > s/a/an/
> > > 
> > Yep.
> > 
> > > > circular dependency.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
> > > >    2 files changed, 24 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 9d26a86fe557a..60ea8deef5392 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -119,6 +119,11 @@ struct intel_guc {
> > > >    		 * function as it might be in an atomic context (no sleeping)
> > > >    		 */
> > > >    		struct work_struct destroyed_worker;
> > > > +		/**
> > > > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > > > +		 * reset fails
> > > > +		 */
> > > > +		struct work_struct reset_worker;
> > > >    	} submission_state;
> > > >    	/**
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 23a40f10d376d..cdd8d691251ff 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> > > >    }
> > > >    static void destroyed_worker_func(struct work_struct *w);
> > > > +static void reset_worker_func(struct work_struct *w);
> > > >    /*
> > > >     * Set up the memory resources to be shared with the GuC (via the GGTT)
> > > > @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >    	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > > >    	INIT_WORK(&guc->submission_state.destroyed_worker,
> > > >    		  destroyed_worker_func);
> > > > +	INIT_WORK(&guc->submission_state.reset_worker,
> > > > +		  reset_worker_func);
> > > >    	guc->submission_state.guc_ids_bitmap =
> > > >    		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > > > @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> > > >    	return gt->engine_class[engine_class][instance];
> > > >    }
> > > > +static void reset_worker_func(struct work_struct *w)
> > > > +{
> > > > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > > > +					     submission_state.reset_worker);
> > > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > > +
> > > > +	intel_gt_handle_error(gt, ALL_ENGINES,
> > > > +			      I915_ERROR_CAPTURE,
> > > > +			      "GuC failed to reset a engine\n");
> > > s/a/an/
> > > 
> > Yep.
> > 
> > > > +}
> > > > +
> > > >    int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> > > >    					 const u32 *msg, u32 len)
> > > >    {
> > > > @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> > > >    	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> > > >    		guc_class, instance, engine->name, reason);
> > > > -	intel_gt_handle_error(gt, engine->mask,
> > > > -			      I915_ERROR_CAPTURE,
> > > > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > > > -			      engine->name, reason);
> > > The engine name and reason code are lost from the error capture? I guess we
> > > still get it in the drm_err above, though. So probably not an issue. We
> > > shouldn't be getting these from end users and any internal CI run is only
> > > likely to give us the dmesg, not the error capture anyway! However, still
> > That was my reasoning on the msg too.
> > 
> > > seems like it is worth saving engine->mask in the submission_state structure
> > > (ORing in, in case there are multiple resets). Clearing it should be safe
> > > because once a GT reset has happened, we aren't getting any more G2Hs. And
> > > we can't have multiple message handlers running concurrently, right? So no
> > > need to protect the OR either.
> > > 
> > I could do that but the engine->mask is really only used for the error
> > capture with GuC submission as any i915 based reset with GuC submission
> > is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
> > all engine state before doing a GT reset which probably isn't a bad
> > thing, right?
> > 
> > I can update the commit message explaining this if that helps.
> Except that a failure to reset is notionally a hardware bug. As recently
> demonstrated, it could be a software bug due to timeouts being broken. But
> officially, it is something that should never happen. So in the rare case
> where one does show up, we would want to know as much as possible about the
> issue. Most especially - which engine it was that failed. And if all we get
> is a customer bug report with an error capture but no dmesg then we will
> have no idea which. It just seems wrong to be throwing away potentially
> important information for no real reason.
> 

OK, I will add an engine mask to the submission state that gets
engine->mask OR'd in on every engine reset failure and is cleared on
every GT reset in the worker. To be really safe, I should probably
protect this field with the submission state lock too.
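
A rough userspace sketch of that plan, not the actual follow-up patch.
The struct, the field name reset_fail_mask, and both helper names are
hypothetical, and a pthread mutex stands in for the submission state
lock. The handler side ORs the failing engine's mask in; the worker side
reads and clears the accumulated mask under the same lock before it
triggers the GT reset.

```c
#include <pthread.h>
#include <stdint.h>

typedef uint32_t engine_mask_t;

/*
 * Hypothetical stand-in for guc->submission_state. The field names are
 * assumptions made for this sketch, not taken from the driver.
 */
struct submission_state_model {
	pthread_mutex_t lock;           /* models the submission state lock */
	engine_mask_t reset_fail_mask;  /* engines whose reset has failed */
};

/*
 * G2H handler side: remember which engine failed to reset. In the real
 * handler this would be followed by queueing the reset worker.
 */
void record_reset_failure(struct submission_state_model *s,
			  engine_mask_t engine_mask)
{
	pthread_mutex_lock(&s->lock);
	s->reset_fail_mask |= engine_mask;
	pthread_mutex_unlock(&s->lock);
}

/*
 * Worker side: fetch the accumulated mask and clear it in one critical
 * section, so a failure reported concurrently is never lost. The caller
 * would pass the returned mask to the GT reset / error capture path.
 */
engine_mask_t take_reset_failures(struct submission_state_model *s)
{
	engine_mask_t mask;

	pthread_mutex_lock(&s->lock);
	mask = s->reset_fail_mask;
	s->reset_fail_mask = 0;
	pthread_mutex_unlock(&s->lock);

	return mask;
}
```

With this shape the error capture would still name the failing engine(s)
even when several failures arrive before the worker runs.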

Matt 

> John.
> 
> 
> > 
> > Matt
> > 
> > > John.
> > > 
> > > 
> > > > +	/*
> > > > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > > > +	 * another worker to trigger a GT reset.
> > > > +	 */
> > > > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> > > >    	return 0;
> > > >    }
>
