Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed

From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Matthew Brost <matthew.brost@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest
Date: Wed, 27 Oct 2021 08:36:54 +0200	[thread overview]
Message-ID: <5af5389a-2f42-16ad-90d2-d329801a882f@linux.intel.com> (raw)
In-Reply-To: <5deddbeb-328c-62b3-38e2-d855fc99668b@intel.com>

Hi, John,

On 10/26/21 21:55, John Harrison wrote:
> On 10/21/2021 23:23, Thomas Hellström wrote:
>> On 10/21/21 22:37, Matthew Brost wrote:
>>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>>> Hi, Matthew,
>>>>
>>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>>> The hangcheck selftest blocks per engine resets by setting magic bits
>>>>> in
>>>>> the reset flags. This is incorrect for GuC submission because if the
>>>>> GuC
>>>>> fails to reset an engine we would like to do a full GT reset. Do no
>>>>> set
>>>>> these magic bits when using GuC submission.
>>>>>
>>>>> Side note this lockless algorithm with magic bits to block resets
>>>>> really
>>>>> should be ripped out.
>>>>>
>>>> Lockless algorithm aside, from a quick look at the code in
>>>> intel_reset.c it appears to me like the interface that falls back to a
>>>> full GT reset is intel_gt_handle_error() whereas intel_engine_reset()
>>>> is explicitly intended to not do that, so is there a discrepancy
>>>> between GuC and non-GuC here?
>>>>
>>> With GuC submission when an engine reset fails, we get an engine reset
>>> failure notification which triggers a full GT reset
>>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c). That
>>> reset is blocking by setting these magic bits. Clearing the bits in 
>>> this
>>> function doesn't seem to unblock that reset either, the driver tries to
>>> unload with a worker blocked, and results in the blow up. Something 
>>> with
>>> this lockless algorithm could be wrong as clear of the bit should
>>> unlblock the reset but it is doesn't. We can look into that but in the
>>> meantime we need to fix this test to be able to fail gracefully and not
>>> crash CI.
>>
>> Yeah, for that lockless algorithm if needed, we might want to use a 
>> ww_mutex per engine or something,
>> but point was that AFAICT at least one of the tests that set those 
>> flags explicitly tested the functionality that no other engines than 
>> the intended one was reset when the intel_engine_reset() function was 
>> used, and then if GuC submission doesn't honor that, wouldn't a 
>> better approach be to make a code comment around intel_engine_reset() 
>> to explain the differences and disable that particular test for GuC?. 
>> Also wouldn't we for example we see a duplicated full GT reset with 
>> GuC if intel_engine_reset() fails as part of the 
>> intel_gt_handle_error() function?
> Re-reading this thread, I think there is a misunderstanding.
>
> The selftests themselves have already been updated to support GuC 
> based engine resets. That is done by submitting a hanging context and 
> letting the GuC detect the hang and issue a reset. There is no 
> mechanism available for i915 to directly issue or request an engine 
> based reset (because i915 does not know what is running on any given 
> engine at any given time, being disconnected from the scheduler).
>
> So the tests are already correctly testing per engine resets and do 
> not go anywhere near either intel_engine_reset() or 
> intel_gt_handle_error() when GuC submission is used. The problem is 
> what happens if the engine reset fails (which supposedly can only 
> happen with broken hardware). In that scenario, there is an 
> asynchronous message from GuC to i915 to notify us of the failure. The 
> KMD receives that notification and then (eventually) calls 
> intel_gt_handle_error() to issue a full GT reset. However, that is 
> blocked because the selftest is not expecting it and has vetoed the 
> possibility.

This is where my understanding of the discussion differs. According to 
Matthew, the selftest actually proceeds to clear the bits, but the 
worker that calls into intel_gt_handle_error() never wakes up. (and 
that's probably due to clear_bit() being used instead of 
clear_and_wake_up_bit()).

And my problem with this particular patch is that it adds even more "if 
(!guc_submission)" which is already sprinkled all over the place in the 
selftests to the point that it becomes difficult to see what (if 
anything) the tests are really testing. For example 
igt_reset_nop_engine() from a cursory look looks like it's doing 
something but inside the engine loop it becomes clear that the test 
doesn't do *anything* except iterate over engines. Same for 
igt_reset_engines() in the !TEST_ACTIVE case and for 
igt_reset_idle_engine(). For some other tests the reset_count checks are 
gone, leaving only a test that we actually do a reset.

So if possible, as previously mentioned, I think a solution without 
adding more of this in the selftests is preferrable. To me the best 
option is probably be the one you suggest in your previous email: Don't 
wait on the I915_RESET_ENGINE bits with GuC in intel_gt_handle_error(), 
(or perhaps extract what's left in a separate function called from the 
GuC handler).

Thanks,

Thomas

next prev parent reply	other threads:[~2021-10-27  6:37 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-11 23:47 [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest Matthew Brost
2021-10-12  0:52 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2021-10-12  4:46 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-10-21  6:15 ` [Intel-gfx] [PATCH] " Thomas Hellström
2021-10-21 20:37   ` Matthew Brost
2021-10-22  6:23     ` Thomas Hellström
2021-10-22 17:03       ` Matthew Brost
2021-10-22 18:09         ` John Harrison
2021-10-23 17:46           ` Thomas Hellström
2021-10-23 18:18             ` Matthew Brost
2021-10-23 18:36               ` Thomas Hellström
2021-10-25 17:32                 ` John Harrison
2021-10-26 19:55       ` John Harrison
2021-10-27  6:36         ` Thomas Hellström [this message]
2021-10-27 20:34           ` John Harrison
2021-10-27 20:47             ` Thomas Hellström
2021-10-26  8:22     ` Thomas Hellström
2021-10-26 19:48 ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5af5389a-2f42-16ad-90d2-d329801a882f@linux.intel.com \
    --to=thomas.hellstrom@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=john.c.harrison@intel.com \
    --cc=matthew.brost@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox