Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails

Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	<Intel-GFX@Lists.FreeDesktop.Org>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
Date: Fri, 13 Jan 2023 17:27:11 -0800	[thread overview]
Message-ID: <f3749fe9-56c9-dcb7-d1bc-bfd57bee668c@intel.com> (raw)
In-Reply-To: <ed669153-0c07-dfdd-58c9-8146ed966366@linux.intel.com>

On 1/13/2023 01:22, Tvrtko Ursulin wrote:
> On 12/01/2023 20:59, John Harrison wrote:
>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> Engine resets are supposed to never fail. But in the case when one
>>>> does (due to unknown reasons that normally come down to a missing
>>>> w/a), it is useful to get as much information out of the system as
>>>> possible. Given that the GuC effectively dies on such a situation, it
>>>> is not possible to get a guilty context notification back. So do a
>>>> manual search instead. Given that GuC is dead, this is safe because
>>>> GuC won't be changing the engine state asynchronously.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>> +++++++++++++++--
>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index b436dd7f12e42..99d09e3394597 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>> work_struct *w)
>>>>       guc->submission_state.reset_fail_mask = 0;
>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>   -    if (likely(reset_fail_mask))
>>>> +    if (likely(reset_fail_mask)) {
>>>> +        struct intel_engine_cs *engine;
>>>> +        enum intel_engine_id id;
>>>> +
>>>> +        /*
>>>> +         * GuC is toast at this point - it dead loops after 
>>>> sending the failed
>>>> +         * reset notification. So need to manually determine the 
>>>> guilty context.
>>>> +         * Note that it should be safe/reliable to do this here 
>>>> because the GuC
>>>> +         * is toast and will not be scheduling behind the KMD's back.
>>>> +         */
>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>> +            intel_guc_find_hung_context(engine);
>>>> +
>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>                         I915_ERROR_CAPTURE,
>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>                         reset_fail_mask);
>>>> +    }
>>>>   }
>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>
>>> This one I don't feel "at home" enough to r-b. Just a question - can 
>>> we be sure at this point that GuC is 100% stuck and there isn't a 
>>> chance it somehow comes alive and starts running in parallel (being 
>>> driven in parallel by a different "thread" in i915), interfering 
>>> with the assumption made in the comment?
>> The GuC API definition for the engine reset failure notification is 
>> that GuC will dead loop itself after sending - to quote "This is a 
>> catastrophic failure that requires a full GT reset, or FLR to 
>> recover.". So yes, GuC is 100% stuck and is not going to self 
>> recover. Guaranteed. If that changes in the future then that would be 
>> a backwards breaking API change and would require a corresponding 
>> driver update to go with supporting the new GuC firmware version.
>>
>> There is the potential for a GT reset to maybe occur in parallel and 
>> resurrect the GuC that way. Not sure how that could happen though. 
>> The heartbeat timeout is significantly longer than the GuC's 
>> pre-emption timeout + engine reset timeout. That just leaves manual 
>> resets from the user or maybe from a selftest. If the user is 
>> manually poking reset debugfs files then it is already known that all 
>> bets are off in terms of getting an accurate error capture. And if a 
>> selftest is triggering GT resets in parallel with engine resets then 
>> either it is a broken test or it is attempting to test an evil corner 
>> case in which it is expected that error capture results will be 
>> unreliable. Having said all that, given that the submission_state 
>> lock is held here, such a GT reset would not get very far in bring 
>> the GuC back up anyway. Certainly, it would not be able to get as far 
>> as submitting new work and thus potentially changing the engine state.
>>
>> So yes, if multiple impossible events occur back to back then the 
>> error capture may be wonky. Where wonky means a potentially innocent 
>> context/request gets blamed for breaking the hardware. Oh dear. I can 
>> live with that.
>
> Okay, so I was triggered by the "safe/reliable" qualification from the 
> comment. I agree "reliable" does not have to be and was mostly worried 
> about the "safe" part.
>
> From what you explain if short heartbeat, or manual reset invocation, 
> could actually mess up any of the data structures which added 
> intel_guc_find_hung_context walks and so crash the kernel.
>
> Looking inside, there is some lock dropping going on (and undocumented 
> irqsave games), and walking the list while unlocked. So whether or not 
> that can go bang if a full reset happens in parallel and re-activates 
> the normal driver flows.
There is no walking of unlocked lists. The xa_lock is held whenever it 
looks at the xa structure itself. The release is only while analysing 
the context that was retrieved. And the context retrieval itself starts 
with a kref_get_unless_zero. So everything is only ever accessed while 
locked or reference counted. The unlock of the xa while analysing a 
context is because the xa object can be accessed from interrupt code and 
so we don't want to hold it locked unnecessarily while scanning through 
requests within a context (all code which has no connection to the GuC 
backend at all).

I can drop the word 'safe' if it makes you nervous. That was only meant 
to refer to the possibility of such a scan returning bogus results due 
to contexts switching in/out of the hardware before/during/after the 
scan. There is no way for it to go bang.

John.


>
> Regards,
>
> Tvrtko

next prev parent reply	other threads:[~2023-01-14  1:27 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-12  2:53 [Intel-gfx] [PATCH 0/4] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-12 10:01   ` Tvrtko Ursulin
2023-01-12 20:40     ` John Harrison
2023-01-13  9:51       ` Tvrtko Ursulin
2023-01-13 17:46         ` Hellstrom, Thomas
2023-01-13 21:29           ` John Harrison
2023-01-16 12:38             ` Tvrtko Ursulin
2023-01-17 19:40               ` John Harrison
2023-01-16 12:13           ` Tvrtko Ursulin
2023-01-12  2:53 ` [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-12 10:06   ` Tvrtko Ursulin
2023-01-12 20:46     ` John Harrison
2023-01-13  9:10       ` Tvrtko Ursulin
2023-01-12  2:53 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-12 10:15   ` Tvrtko Ursulin
2023-01-12 20:59     ` John Harrison
2023-01-13  9:22       ` Tvrtko Ursulin
2023-01-14  1:27         ` John Harrison [this message]
2023-01-16 12:43           ` Tvrtko Ursulin
2023-01-17 21:14             ` John Harrison
2023-01-12  2:53 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-12 10:11   ` Tvrtko Ursulin
2023-01-12  3:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2) Patchwork
2023-01-12  3:36 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-01-12  5:36 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f3749fe9-56c9-dcb7-d1bc-bfd57bee668c@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox