From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: John Harrison <john.c.harrison@intel.com>,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest
Date: Sat, 23 Oct 2021 11:18:40 -0700
Message-ID: <20211023181838.GA35211@jons-linux-dev-box>
In-Reply-To: <42cb2c7c-ce69-1cae-6e0c-a1f2b3cd5a67@linux.intel.com>
On Sat, Oct 23, 2021 at 07:46:48PM +0200, Thomas Hellström wrote:
>
> On 10/22/21 20:09, John Harrison wrote:
> > And to be clear, the engine reset is not supposed to fail. Whether
> > issued by GuC or i915, the GDRST register is supposed to self clear
> > according to the bspec. If we are being sent the G2H notification for an
> > engine reset failure then the assumption is that the hardware is broken.
> > This is not a situation that is ever intended to occur in a production
> > system. Therefore, it is not something we should spend huge amounts of
> > effort on making a perfect selftest for.
>
> I don't agree. Selftests are there to verify that assumptions made and
> contracts in the code hold and that hardware behaves as intended / assumed.
> No selftest should ideally trigger in a production driver / system. That
> doesn't mean we can remove all selftests or ignore updating them for altered
> assumptions / contracts. I think it's important here to acknowledge the fact
> that this and the perf selftest have found two problems that need
> consideration for fixing for a production system.
>
I'm confused - we are going down the rabbit hole here.
Back to this patch. This test was written for very specific execlists
behavior. It was updated to also support the GuC. In that update we
missed fixing the failure path because, well, it always passed. Now that
it has failed, we see that it doesn't fail gracefully and takes down the
machine. This patch fixes that. It also opened my eyes to the horror
show that is the reset locking, which needs to be fixed long term.
> >
> > The current theory is that the timeout in GuC is not quite long enough
> > for DG1. Given that the bspec does not specify any kind of timeout, it
> > is only a best guess anyway! Once that has been tuned correctly, we
> > should never hit this case again. Not ever. Not in a selftest, not in an
> > end user use case, just not ever.
>
> ..until we introduce new hardware for which the tuning doesn't hold anymore
> or somebody in two years wants to lower the timeout wondering why it was
> set so long?
If an engine reset fails in the GuC, the GuC signals the i915 via a G2H
that the engine reset has failed and i915 initiates a full GT reset.
After this patch (which removes hacky behavior to block foreign,
relative to the test, resets) we can see the i915 behaving correctly and
the GPU recovering. This path in the code is working as designed. Do we
have a test for that behavior? No. Can we? Not at the moment, as we would
need a way for the GuC to intentionally fail an engine reset. Right now
all we have is either flaky HW or a GuC that isn't waiting long enough.
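
For illustration only, the escalation described above can be sketched in
plain C. This is a toy model, not i915 or GuC code: every name in it
(fake_gt, fake_guc_engine_reset, fake_handle_engine_reset_failure,
fake_recover_hang) is hypothetical, and the engine_reset_will_fail flag
merely stands in for flaky HW or a too-short timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the real i915 structures. */
enum reset_outcome { RESET_ENGINE_OK, RESET_FULL_GT };

struct fake_gt {
	/* Assumption: models flaky HW or a GuC timeout that is too short. */
	bool engine_reset_will_fail;
	/* Counts how many times the host escalated to a full GT reset. */
	unsigned int full_gt_resets;
};

/* Models the firmware attempting an engine reset; false means it failed. */
static bool fake_guc_engine_reset(const struct fake_gt *gt)
{
	return !gt->engine_reset_will_fail;
}

/*
 * Models the host side reacting to the "engine reset failed"
 * notification by performing a full GT reset.
 */
static void fake_handle_engine_reset_failure(struct fake_gt *gt)
{
	gt->full_gt_resets++;
}

/* Try the engine reset first; escalate to a full GT reset on failure. */
static enum reset_outcome fake_recover_hang(struct fake_gt *gt)
{
	assert(gt);

	if (fake_guc_engine_reset(gt))
		return RESET_ENGINE_OK;

	fake_handle_engine_reset_failure(gt);
	return RESET_FULL_GT;
}
```

With engine_reset_will_fail set, fake_recover_hang() returns
RESET_FULL_GT and bumps full_gt_resets once; otherwise it returns
RESET_ENGINE_OK and leaves the counter untouched.
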
As far as why the engine reset fails - I am currently working with the
GuC team to get a firmware with a configurable timeout period so we can
root-cause the engine reset failure. Fingers crossed that we are just not
waiting long enough, rather than this being a HW issue.
Regardless of everything above, this patch is correct to unblock CI in
the short term and is correct in the long term too as this test
shouldn't bring down CI when it fails.
Matt
>
> /Thomas
>
>