From: Andi Shyti <andi.shyti@linux.intel.com>
To: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Matt Roper <matthew.d.roper@intel.com>
Cc: Andi Shyti <andi.shyti@kernel.org>
Subject: [Intel-gfx] [PATCH 0/2] Report MMIO communication problems more clearly
Date: Mon, 20 Mar 2023 21:23:24 +0100
Message-ID: <20230320202326.296498-1-andi.shyti@linux.intel.com>
Hi,
I'm just copy-pasting Matt's original cover letter:
We're periodically facing problems in CI where all registers read back
as 0xFFFFFFFF. In general this is what happens when the CPU is unable
to communicate with a PCI device, so the transaction autocompletes with
all F's as a placeholder. Sometimes the device will recover on its own,
sometimes it will never come back.
We already have some attempts to detect when this happens (e.g., when
checking FPGA_DBG), but let's add a couple more checks with descriptive
error messages to identify the problem in other cases:
 - When the device is first probed, we'll do an initial check of the GT
   forcewake register. As a masked register, the upper bits should
   always come back as 0's if device access is behaving properly, so if
   we see all F's, we can conclude that the device is already in a bad
   state. We'll wait two seconds to see if it recovers on its own, then
   give up on the device.
 - When we encounter a 'forcewake timed out while waiting for clear'
   error, we'll do one more read of the register to see if it's because
   we're just reading back all F's. If so, we'll print a more
   meaningful message clarifying that it isn't the forcewake itself
   that's the problem, but rather communication with the device.
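
The two checks above can be modelled in plain userspace C. This is only a
sketch under stated assumptions, not the code from the patches: the
mmio_read_fn accessor, the fake_dev test double, and all function and enum
names are hypothetical, and polling is bounded by a read count instead of
the two-second timeout. masked_write() models why a healthy masked register
can never read back with its upper 16 bits set.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMIO_ALL_ONES UINT32_MAX

/* Hypothetical register accessor: returns the current 32-bit value. */
typedef uint32_t (*mmio_read_fn)(void *ctx);

enum fw_result {
	FW_OK,          /* ack bit cleared in time */
	FW_TIMEOUT,     /* ack bit stuck, but the register is readable */
	FW_MMIO_BROKEN, /* register reads back as all F's */
};

/*
 * Model of a masked register: a write carries a 16-bit mask in the upper
 * half and the new bit values in the lower half. Only the lower half is
 * stored, so a healthy read always has the upper 16 bits clear.
 */
static uint32_t masked_write(uint32_t current, uint32_t write)
{
	uint32_t mask = write >> 16;
	uint32_t bits = write & 0xffff;

	return (current & ~mask & 0xffff) | (bits & mask);
}

/* Probe-time check: poll until a read is not all F's, or give up. */
static int wait_for_mmio(mmio_read_fn read_reg, void *ctx,
			 unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (read_reg(ctx) != MMIO_ALL_ONES)
			return 0;
		/* A real driver would sleep between polls; omitted here. */
	}
	return -1;
}

/* Wait for the ack bit to clear; on timeout, read once more to classify. */
static enum fw_result wait_for_ack_clear(mmio_read_fn read_ack, void *ctx,
					 uint32_t ack_bit,
					 unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!(read_ack(ctx) & ack_bit))
			return FW_OK;
	}
	return read_ack(ctx) == MMIO_ALL_ONES ? FW_MMIO_BROKEN : FW_TIMEOUT;
}

/* Test double: returns all F's for the first N reads, then a fixed value. */
struct fake_dev {
	unsigned int bad_reads_left;
	uint32_t value;
};

static uint32_t fake_read(void *ctx)
{
	struct fake_dev *dev = ctx;

	if (dev->bad_reads_left > 0) {
		dev->bad_reads_left--;
		return MMIO_ALL_ONES;
	}
	return dev->value;
}
```

The extra read in wait_for_ack_clear() is what separates "the ack bit never
cleared" from "the device stopped answering": an all-F's value also has the
ack bit set, so without that read both cases look like a forcewake timeout.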
Note that this only captures the failure case where accessing the device
is problematic (resulting in registers giving all F's). There's a
separate class of problems where the device is okay, but the GT inside
the device is busted and all GT registers read back as 0's (other
registers like sgunit registers are usually still readable). This
series does not address that class of errors.
This is just a quick change to get some better CI error messages. Some
ideas for future enhancements:
 - Try something to reset the device if we detect a problem at driver
   load (e.g., PCI FLR, toggling the PCI power state, etc.)?
 - Use something more standard like pci_read_config_dword() instead of a
   device register read to determine when we're not communicating
   properly? Generally the PCI config space is also giving all F's at
   this point.
 - Also handle the "device OK, GT dead" case by finding some GT
   register(s) that should never be 0 on a functioning system. Maybe
   one of the fuse registers would work for this?
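
As a rough illustration of how the last two ideas could combine, the
userspace sketch below classifies a device from two already-read values.
It is an assumption-laden model, not part of this series: in the kernel the
first value would come from something like pci_read_config_dword(), the
second from a GT register that is expected to be non-zero, and no such
register has been identified here. The enum and function names are
hypothetical.

```c
#include <stdint.h>

enum dev_health {
	DEV_HEALTHY,     /* config space and GT both look sane */
	DEV_UNREACHABLE, /* config space reads back as all F's */
	DEV_GT_DEAD,     /* device answers, but the GT register reads 0 */
};

/*
 * pci_config_dword: a dword read from PCI config space.
 * gt_nonzero_reg:   value of a (hypothetical) GT register that should
 *                   never be 0 on a functioning system.
 */
static enum dev_health classify_device(uint32_t pci_config_dword,
				       uint32_t gt_nonzero_reg)
{
	/* Check device reachability first; GT values mean nothing without it. */
	if (pci_config_dword == UINT32_MAX)
		return DEV_UNREACHABLE;
	if (gt_nonzero_reg == 0)
		return DEV_GT_DEAD;
	return DEV_HEALTHY;
}
```

The order of the checks matters: when the device is unreachable, a GT
register would read back as all F's rather than 0, so only the config space
test can catch that case.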
Matt Roper (2):
  drm/i915: Sanitycheck MMIO access early in driver load
  drm/i915: Check for unreliable MMIO during forcewake

 drivers/gpu/drm/i915/intel_uncore.c | 46 +++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 3 deletions(-)
--
2.39.2