From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
Date: Mon, 16 Jan 2023 12:43:31 +0000 [thread overview]
Message-ID: <195b373c-f98e-306f-40aa-1f028d27a77f@linux.intel.com> (raw)
In-Reply-To: <f3749fe9-56c9-dcb7-d1bc-bfd57bee668c@intel.com>
On 14/01/2023 01:27, John Harrison wrote:
> On 1/13/2023 01:22, Tvrtko Ursulin wrote:
>> On 12/01/2023 20:59, John Harrison wrote:
>>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Engine resets are supposed to never fail. But in the case when one
>>>>> does (due to unknown reasons that normally come down to a missing
>>>>> w/a), it is useful to get as much information out of the system as
>>>>> possible. Given that the GuC effectively dies in such a situation, it
>>>>> is not possible to get a guilty context notification back. So do a
>>>>> manual search instead. Given that GuC is dead, this is safe because
>>>>> GuC won't be changing the engine state asynchronously.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++--
>>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index b436dd7f12e42..99d09e3394597 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>>>>>  	guc->submission_state.reset_fail_mask = 0;
>>>>>  	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>  
>>>>> -	if (likely(reset_fail_mask))
>>>>> +	if (likely(reset_fail_mask)) {
>>>>> +		struct intel_engine_cs *engine;
>>>>> +		enum intel_engine_id id;
>>>>> +
>>>>> +		/*
>>>>> +		 * GuC is toast at this point - it dead loops after sending the failed
>>>>> +		 * reset notification. So need to manually determine the guilty context.
>>>>> +		 * Note that it should be safe/reliable to do this here because the GuC
>>>>> +		 * is toast and will not be scheduling behind the KMD's back.
>>>>> +		 */
>>>>> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>>> +			intel_guc_find_hung_context(engine);
>>>>> +
>>>>>  		intel_gt_handle_error(gt, reset_fail_mask,
>>>>>  				      I915_ERROR_CAPTURE,
>>>>> -				      "GuC failed to reset engine mask=0x%x\n",
>>>>> +				      "GuC failed to reset engine mask=0x%x",
>>>>>  				      reset_fail_mask);
>>>>> +	}
>>>>>  }
>>>>> int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>>
>>>> This one I don't feel "at home" enough to r-b. Just a question - can
>>>> we be sure at this point that GuC is 100% stuck and there isn't a
>>>> chance it somehow comes alive and starts running in parallel (being
>>>> driven in parallel by a different "thread" in i915), interfering
>>>> with the assumption made in the comment?
>>> The GuC API definition for the engine reset failure notification is
>>> that GuC will dead loop itself after sending - to quote "This is a
>>> catastrophic failure that requires a full GT reset, or FLR to
>>> recover.". So yes, GuC is 100% stuck and is not going to self
>>> recover. Guaranteed. If that changes in the future then that would be
>>> a backwards breaking API change and would require a corresponding
>>> driver update to go with supporting the new GuC firmware version.
>>>
>>> There is the potential for a GT reset to maybe occur in parallel and
>>> resurrect the GuC that way. Not sure how that could happen though.
>>> The heartbeat timeout is significantly longer than the GuC's
>>> pre-emption timeout + engine reset timeout. That just leaves manual
>>> resets from the user or maybe from a selftest. If the user is
>>> manually poking reset debugfs files then it is already known that all
>>> bets are off in terms of getting an accurate error capture. And if a
>>> selftest is triggering GT resets in parallel with engine resets then
>>> either it is a broken test or it is attempting to test an evil corner
>>> case in which it is expected that error capture results will be
>>> unreliable. Having said all that, given that the submission_state
>>> lock is held here, such a GT reset would not get very far in bringing
>>> the GuC back up anyway. Certainly, it would not be able to get as far
>>> as submitting new work and thus potentially changing the engine state.
>>>
>>> So yes, if multiple impossible events occur back to back then the
>>> error capture may be wonky. Where wonky means a potentially innocent
>>> context/request gets blamed for breaking the hardware. Oh dear. I can
>>> live with that.
>>
>> Okay, so I was triggered by the "safe/reliable" qualification from the
>> comment. I agree "reliable" does not have to be and was mostly worried
>> about the "safe" part.
>>
>> From what you explain, the question was whether a short heartbeat, or a
>> manual reset invocation, could actually mess up any of the data
>> structures which the added intel_guc_find_hung_context walks, and so
>> crash the kernel.
>>
>> Looking inside, there is some lock dropping going on (and undocumented
>> irqsave games), and walking of the list while unlocked. So the question
>> is whether or not that can go bang if a full reset happens in parallel
>> and re-activates the normal driver flows.
> There is no walking of unlocked lists. The xa_lock is held whenever it
> looks at the xa structure itself. The release is only while analysing
> the context that was retrieved. And the context retrieval itself starts
> with a kref_get_unless_zero. So everything is only ever accessed while
> locked or reference counted. The unlock of the xa while analysing a
> context is because the xa object can be accessed from interrupt code and
> so we don't want to hold it locked unnecessarily while scanning through
> requests within a context (all code which has no connection to the GuC
> backend at all).
AFAICS intel_guc_find_hung_context walks &ce->guc_state.requests with no
locks held. Other places in the code appear to use &ce->guc_state.lock,
or maybe &sched_engine->lock, not sure. Then we have request submission,
retirement and a few other places modifying that list. So *if* a hung
GuC can indeed get resurrected by a parallel full reset while
reset_fail_worker_func is running, why couldn't that list walk explode?
Regards,
Tvrtko
> I can drop the word 'safe' if it makes you nervous. That was only meant
> to refer to the possibility of such a scan returning bogus results due
> to contexts switching in/out of the hardware before/during/after the
> scan. There is no way for it to go bang.
>
> John.
>
>
>>
>> Regards,
>>
>> Tvrtko
>
Thread overview: 26+ messages
2023-01-12 2:53 [Intel-gfx] [PATCH 0/4] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-12 2:53 ` [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-12 10:01 ` Tvrtko Ursulin
2023-01-12 20:40 ` John Harrison
2023-01-13 9:51 ` Tvrtko Ursulin
2023-01-13 17:46 ` Hellstrom, Thomas
2023-01-13 21:29 ` John Harrison
2023-01-16 12:38 ` Tvrtko Ursulin
2023-01-17 19:40 ` John Harrison
2023-01-16 12:13 ` Tvrtko Ursulin
2023-01-12 2:53 ` [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-12 10:06 ` Tvrtko Ursulin
2023-01-12 20:46 ` John Harrison
2023-01-13 9:10 ` Tvrtko Ursulin
2023-01-12 2:53 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-12 10:15 ` Tvrtko Ursulin
2023-01-12 20:59 ` John Harrison
2023-01-13 9:22 ` Tvrtko Ursulin
2023-01-14 1:27 ` John Harrison
2023-01-16 12:43 ` Tvrtko Ursulin [this message]
2023-01-17 21:14 ` John Harrison
2023-01-12 2:53 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-12 10:11 ` Tvrtko Ursulin
2023-01-12 3:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2) Patchwork
2023-01-12 3:36 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-01-12 5:36 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork