From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
Matthew Brost <matthew.brost@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest
Date: Wed, 27 Oct 2021 22:47:20 +0200 [thread overview]
Message-ID: <aef72bc9-323d-7ef4-1f34-c9256a0c6f6b@linux.intel.com> (raw)
In-Reply-To: <fa2815fc-5adc-8df6-af19-93dea1edb5d1@intel.com>
On 10/27/21 22:34, John Harrison wrote:
> On 10/26/2021 23:36, Thomas Hellström wrote:
>> Hi, John,
>>
>> On 10/26/21 21:55, John Harrison wrote:
>>> On 10/21/2021 23:23, Thomas Hellström wrote:
>>>> On 10/21/21 22:37, Matthew Brost wrote:
>>>>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>>>>> Hi, Matthew,
>>>>>>
>>>>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>>>>> The hangcheck selftest blocks per engine resets by setting magic
>>>>>>> bits
>>>>>>> in
>>>>>>> the reset flags. This is incorrect for GuC submission because if
>>>>>>> the
>>>>>>> GuC
>>>>>>> fails to reset an engine we would like to do a full GT reset. Do no
>>>>>>> set
>>>>>>> these magic bits when using GuC submission.
>>>>>>>
>>>>>>> Side note this lockless algorithm with magic bits to block resets
>>>>>>> really
>>>>>>> should be ripped out.
>>>>>>>
>>>>>> Lockless algorithm aside, from a quick look at the code in
>>>>>> intel_reset.c it appears to me like the interface that falls back
>>>>>> to a
>>>>>> full GT reset is intel_gt_handle_error() whereas
>>>>>> intel_engine_reset()
>>>>>> is explicitly intended to not do that, so is there a discrepancy
>>>>>> between GuC and non-GuC here?
>>>>>>
>>>>> With GuC submission when an engine reset fails, we get an engine
>>>>> reset
>>>>> failure notification which triggers a full GT reset
>>>>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c).
>>>>> That
>>>>> reset is blocking by setting these magic bits. Clearing the bits
>>>>> in this
>>>>> function doesn't seem to unblock that reset either, the driver
>>>>> tries to
>>>>> unload with a worker blocked, and results in the blow up.
>>>>> Something with
>>>>> this lockless algorithm could be wrong as clear of the bit should
>>>>> unlblock the reset but it is doesn't. We can look into that but in
>>>>> the
>>>>> meantime we need to fix this test to be able to fail gracefully
>>>>> and not
>>>>> crash CI.
>>>>
>>>> Yeah, for that lockless algorithm if needed, we might want to use a
>>>> ww_mutex per engine or something,
>>>> but point was that AFAICT at least one of the tests that set those
>>>> flags explicitly tested the functionality that no other engines
>>>> than the intended one was reset when the intel_engine_reset()
>>>> function was used, and then if GuC submission doesn't honor that,
>>>> wouldn't a better approach be to make a code comment around
>>>> intel_engine_reset() to explain the differences and disable that
>>>> particular test for GuC?. Also wouldn't we for example we see a
>>>> duplicated full GT reset with GuC if intel_engine_reset() fails as
>>>> part of the intel_gt_handle_error() function?
>>> Re-reading this thread, I think there is a misunderstanding.
>>>
>>> The selftests themselves have already been updated to support GuC
>>> based engine resets. That is done by submitting a hanging context
>>> and letting the GuC detect the hang and issue a reset. There is no
>>> mechanism available for i915 to directly issue or request an engine
>>> based reset (because i915 does not know what is running on any given
>>> engine at any given time, being disconnected from the scheduler).
>>>
>>> So the tests are already correctly testing per engine resets and do
>>> not go anywhere near either intel_engine_reset() or
>>> intel_gt_handle_error() when GuC submission is used. The problem is
>>> what happens if the engine reset fails (which supposedly can only
>>> happen with broken hardware). In that scenario, there is an
>>> asynchronous message from GuC to i915 to notify us of the failure.
>>> The KMD receives that notification and then (eventually) calls
>>> intel_gt_handle_error() to issue a full GT reset. However, that is
>>> blocked because the selftest is not expecting it and has vetoed the
>>> possibility.
>>
>> This is where my understanding of the discussion differs. According
>> to Matthew, the selftest actually proceeds to clear the bits, but the
>> worker that calls into intel_gt_handle_error() never wakes up. (and
>> that's probably due to clear_bit() being used instead of
>> clear_and_wake_up_bit()).
> Hmm, missed that point. Yeah, sounds like the missing wake_up suffix
> is what is causing the deadlock. I can't see any other reason why the
> reset handler would not proceed once the flags are cleared. And it
> looks like the selftest should timeout out waiting for the request and
> continue on to clear the bits just fine.
>
>
>>
>> And my problem with this particular patch is that it adds even more
>> "if (!guc_submission)" which is already sprinkled all over the place
>> in the selftests to the point that it becomes difficult to see what
>> (if anything) the tests are really testing.
> I agree with this. Fixing the problem at source seems like a better
> solution than hacking lots of different bits in different tests.
>
OK, so if we can fix this in intel_gt_handle_error() that'd be great.
Thanks,
Thomas
next prev parent reply other threads:[~2021-10-27 20:47 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-11 23:47 [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest Matthew Brost
2021-10-12 0:52 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2021-10-12 4:46 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-10-21 6:15 ` [Intel-gfx] [PATCH] " Thomas Hellström
2021-10-21 20:37 ` Matthew Brost
2021-10-22 6:23 ` Thomas Hellström
2021-10-22 17:03 ` Matthew Brost
2021-10-22 18:09 ` John Harrison
2021-10-23 17:46 ` Thomas Hellström
2021-10-23 18:18 ` Matthew Brost
2021-10-23 18:36 ` Thomas Hellström
2021-10-25 17:32 ` John Harrison
2021-10-26 19:55 ` John Harrison
2021-10-27 6:36 ` Thomas Hellström
2021-10-27 20:34 ` John Harrison
2021-10-27 20:47 ` Thomas Hellström [this message]
2021-10-26 8:22 ` Thomas Hellström
2021-10-26 19:48 ` John Harrison
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aef72bc9-323d-7ef4-1f34-c9256a0c6f6b@linux.intel.com \
--to=thomas.hellstrom@linux.intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=john.c.harrison@intel.com \
--cc=matthew.brost@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox