public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Matthew Brost <matthew.brost@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest
Date: Wed, 27 Oct 2021 22:47:20 +0200	[thread overview]
Message-ID: <aef72bc9-323d-7ef4-1f34-c9256a0c6f6b@linux.intel.com> (raw)
In-Reply-To: <fa2815fc-5adc-8df6-af19-93dea1edb5d1@intel.com>


On 10/27/21 22:34, John Harrison wrote:
> On 10/26/2021 23:36, Thomas Hellström wrote:
>> Hi, John,
>>
>> On 10/26/21 21:55, John Harrison wrote:
>>> On 10/21/2021 23:23, Thomas Hellström wrote:
>>>> On 10/21/21 22:37, Matthew Brost wrote:
>>>>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>>>>> Hi, Matthew,
>>>>>>
>>>>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>>>>> The hangcheck selftest blocks per engine resets by setting magic 
>>>>>>> bits
>>>>>>> in
>>>>>>> the reset flags. This is incorrect for GuC submission because if 
>>>>>>> the
>>>>>>> GuC
>>>>>>> fails to reset an engine we would like to do a full GT reset. Do no
>>>>>>> set
>>>>>>> these magic bits when using GuC submission.
>>>>>>>
>>>>>>> Side note this lockless algorithm with magic bits to block resets
>>>>>>> really
>>>>>>> should be ripped out.
>>>>>>>
>>>>>> Lockless algorithm aside, from a quick look at the code in
>>>>>> intel_reset.c it appears to me like the interface that falls back 
>>>>>> to a
>>>>>> full GT reset is intel_gt_handle_error() whereas 
>>>>>> intel_engine_reset()
>>>>>> is explicitly intended to not do that, so is there a discrepancy
>>>>>> between GuC and non-GuC here?
>>>>>>
>>>>> With GuC submission when an engine reset fails, we get an engine 
>>>>> reset
>>>>> failure notification which triggers a full GT reset
>>>>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c). 
>>>>> That
>>>>> reset is blocking by setting these magic bits. Clearing the bits 
>>>>> in this
>>>>> function doesn't seem to unblock that reset either, the driver 
>>>>> tries to
>>>>> unload with a worker blocked, and results in the blow up. 
>>>>> Something with
>>>>> this lockless algorithm could be wrong as clear of the bit should
>>>>> unlblock the reset but it is doesn't. We can look into that but in 
>>>>> the
>>>>> meantime we need to fix this test to be able to fail gracefully 
>>>>> and not
>>>>> crash CI.
>>>>
>>>> Yeah, for that lockless algorithm if needed, we might want to use a 
>>>> ww_mutex per engine or something,
>>>> but point was that AFAICT at least one of the tests that set those 
>>>> flags explicitly tested the functionality that no other engines 
>>>> than the intended one was reset when the intel_engine_reset() 
>>>> function was used, and then if GuC submission doesn't honor that, 
>>>> wouldn't a better approach be to make a code comment around 
>>>> intel_engine_reset() to explain the differences and disable that 
>>>> particular test for GuC?. Also wouldn't we for example we see a 
>>>> duplicated full GT reset with GuC if intel_engine_reset() fails as 
>>>> part of the intel_gt_handle_error() function?
>>> Re-reading this thread, I think there is a misunderstanding.
>>>
>>> The selftests themselves have already been updated to support GuC 
>>> based engine resets. That is done by submitting a hanging context 
>>> and letting the GuC detect the hang and issue a reset. There is no 
>>> mechanism available for i915 to directly issue or request an engine 
>>> based reset (because i915 does not know what is running on any given 
>>> engine at any given time, being disconnected from the scheduler).
>>>
>>> So the tests are already correctly testing per engine resets and do 
>>> not go anywhere near either intel_engine_reset() or 
>>> intel_gt_handle_error() when GuC submission is used. The problem is 
>>> what happens if the engine reset fails (which supposedly can only 
>>> happen with broken hardware). In that scenario, there is an 
>>> asynchronous message from GuC to i915 to notify us of the failure. 
>>> The KMD receives that notification and then (eventually) calls 
>>> intel_gt_handle_error() to issue a full GT reset. However, that is 
>>> blocked because the selftest is not expecting it and has vetoed the 
>>> possibility.
>>
>> This is where my understanding of the discussion differs. According 
>> to Matthew, the selftest actually proceeds to clear the bits, but the 
>> worker that calls into intel_gt_handle_error() never wakes up. (and 
>> that's probably due to clear_bit() being used instead of 
>> clear_and_wake_up_bit()).
> Hmm, missed that point. Yeah, sounds like the missing wake_up suffix 
> is what is causing the deadlock. I can't see any other reason why the 
> reset handler would not proceed once the flags are cleared. And it 
> looks like the selftest should timeout out waiting for the request and 
> continue on to clear the bits just fine.
>
>
>>
>> And my problem with this particular patch is that it adds even more 
>> "if (!guc_submission)" which is already sprinkled all over the place 
>> in the selftests to the point that it becomes difficult to see what 
>> (if anything) the tests are really testing.
> I agree with this. Fixing the problem at source seems like a better 
> solution than hacking lots of different bits in different tests.
>
OK, so if we can fix this in intel_gt_handle_error() that'd be great.

Thanks,

Thomas



  reply	other threads:[~2021-10-27 20:47 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-11 23:47 [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest Matthew Brost
2021-10-12  0:52 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2021-10-12  4:46 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-10-21  6:15 ` [Intel-gfx] [PATCH] " Thomas Hellström
2021-10-21 20:37   ` Matthew Brost
2021-10-22  6:23     ` Thomas Hellström
2021-10-22 17:03       ` Matthew Brost
2021-10-22 18:09         ` John Harrison
2021-10-23 17:46           ` Thomas Hellström
2021-10-23 18:18             ` Matthew Brost
2021-10-23 18:36               ` Thomas Hellström
2021-10-25 17:32                 ` John Harrison
2021-10-26 19:55       ` John Harrison
2021-10-27  6:36         ` Thomas Hellström
2021-10-27 20:34           ` John Harrison
2021-10-27 20:47             ` Thomas Hellström [this message]
2021-10-26  8:22     ` Thomas Hellström
2021-10-26 19:48 ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aef72bc9-323d-7ef4-1f34-c9256a0c6f6b@linux.intel.com \
    --to=thomas.hellstrom@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=john.c.harrison@intel.com \
    --cc=matthew.brost@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox