From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Matthew Brost <matthew.brost@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest
Date: Sat, 23 Oct 2021 19:46:48 +0200
Message-ID: <42cb2c7c-ce69-1cae-6e0c-a1f2b3cd5a67@linux.intel.com>
In-Reply-To: <070ab480-6306-653c-514a-6648ac495253@intel.com>


On 10/22/21 20:09, John Harrison wrote:
> And to be clear, the engine reset is not supposed to fail. Whether 
> issued by GuC or i915, the GDRST register is supposed to self clear 
> according to the bspec. If we are being sent the G2H notification for 
> an engine reset failure then the assumption is that the hardware is 
> broken. This is not a situation that is ever intended to occur in a 
> production system. Therefore, it is not something we should spend huge 
> amounts of effort on making a perfect selftest for.

I don't agree. Selftests are there to verify that the assumptions made 
and the contracts in the code hold, and that the hardware behaves as 
intended / assumed. Ideally, no selftest should trigger in a production 
driver / system. That doesn't mean we can remove all selftests or skip 
updating them for altered assumptions / contracts. I think it's important 
here to acknowledge that this and the perf selftest have found two 
problems that we need to consider fixing for a production system.

>
> The current theory is that the timeout in GuC is not quite long enough 
> for DG1. Given that the bspec does not specify any kind of timeout, it 
> is only a best guess anyway! Once that has been tuned correctly, we 
> should never hit this case again. Not ever, Not in a selftest, not in 
> an end user use case, just not ever.

..until we introduce new hardware for which the tuning doesn't hold 
anymore, or somebody in two years wants to lower the timeout, wondering 
why it was set so long?

/Thomas


