From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: John Harrison <John.C.Harrison@Intel.com>,
Stuart Summers <stuart.summers@intel.com>,
Sasha Levin <sashal@kernel.org>,
lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
rodrigo.vivi@intel.com, michal.wajdeczko@intel.com,
alexandre.f.demers@gmail.com, alexander.deucher@amd.com,
intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Add more GuC load error status codes
Date: Sat, 25 Oct 2025 12:01:16 -0400
Message-ID: <20251025160905.3857885-445-sashal@kernel.org>
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
From: John Harrison <John.C.Harrison@Intel.com>

[ Upstream commit 45fbb51050e72723c2bdcedc1ce32305256c70ed ]

The GuC load process will abort if certain status codes (which are
indicative of a fatal error) are reported. Otherwise, it keeps waiting
until the 'success' code is returned. New error codes have been added
in recent GuC releases, so add support for aborting on those as well.

v2: Shuffle HWCONFIG_START to the front of the switch to keep the
ordering as per the enum define for clarity (review feedback by
Jonathan). Also add a description for the basic 'invalid init data'
code which was missing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Why this is a good stable backport
- Avoids waiting out the load timeout: New GuC firmware versions can
report additional fatal load status codes. Without this patch, the Xe
driver does not recognize them as fatal and may keep waiting until the
generic timeout, which delays the failure and gives poorer
diagnostics. Recognizing these codes as terminal failures is a
correctness and robustness fix, not a feature.
- Small and contained: Changes are limited to two Xe files, only
touching enums and switch cases that read GuC status. No architectural
changes, no API/UAPI changes, no behavior change unless the new error
codes are actually returned.
- Forward-compatibility with newer GuC: Distros often update GuC via
linux-firmware independently of the kernel. This patch keeps older
kernels robust when paired with newer GuC blobs.
- Low regression risk: Older GuC won’t emit the new codes, so behavior
is unchanged there. New codes are explicitly fatal, so aborting
earlier is the correct action. Additional logging improves triage.
What changes and why they matter
- Add new GuC load error codes in the ABI header
- drivers/gpu/drm/xe/abi/guc_errors_abi.h:49 defines `enum
xe_guc_load_status`. This patch adds:
- `XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH = 0x08` (fatal)
- `XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR = 0x75` (fatal)
- `XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG = 0x76` (fatal)
- In current tree, the relevant region is at
drivers/gpu/drm/xe/abi/guc_errors_abi.h:49–72. Adding these entries
fills previously unused values (0x08, 0x75, 0x76) and keeps them in
the “invalid init data” range where appropriate, preserving ordering
and ABI clarity.
- Treat the new codes as terminal failures in the load state machine
- drivers/gpu/drm/xe/xe_guc.c:517 `guc_load_done()` is the terminal-
state detector for the load loop.
- Existing fatal cases are in the switch at
drivers/gpu/drm/xe/xe_guc.c:526–535.
- The patch adds the new codes to this fatal set, so `guc_load_done()`
returns -1 immediately instead of waiting for a timeout. This
prevents long waits and aligns behavior with the intended semantics
of these GuC codes.
- Improve diagnostics for new failure modes during load
- drivers/gpu/drm/xe/xe_guc.c:593 `guc_wait_ucode()` logs the reason
for failure.
- New message cases are added to the `ukernel` switch (today at
drivers/gpu/drm/xe/xe_guc.c:672–685):
- The logging case for `HWCONFIG_START` is moved to the front of the
switch to match the enum ordering; its message is unchanged
(“still extracting hwconfig table.”).
- New diagnostics for:
- `INIT_DATA_INVALID`: “illegal init/ADS data”
- `KLV_WORKAROUND_INIT_ERROR`: “illegal workaround KLV data”
- `INVALID_FTR_FLAG`: “illegal feature flag specified”
- These improve visibility into what went wrong without altering
control flow beyond early abort on fatal codes.
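As a rough, self-contained illustration of the two switch statements
described above, the following userspace C sketch mirrors their shape.
It is not the driver source: the `SKETCH_`/`sketch_` names are
hypothetical, only the enum values visible in the diff are included, and
the return convention other than -1 for fatal codes (1 for ready, 0 for
still loading) is an assumption made for the example.

```c
#include <stddef.h>

/* Subset of load status values, copied from the diff in this patch. */
enum sketch_load_status {
	SKETCH_HWCONFIG_START                 = 0x05,
	SKETCH_HWCONFIG_ERROR                 = 0x07,
	SKETCH_BOOTROM_VERSION_MISMATCH       = 0x08, /* added by this patch */
	SKETCH_MPU_DATA_INVALID               = 0x73,
	SKETCH_INIT_MMIO_SAVE_RESTORE_INVALID = 0x74,
	SKETCH_KLV_WORKAROUND_INIT_ERROR      = 0x75, /* added by this patch */
	SKETCH_INVALID_FTR_FLAG               = 0x76, /* added by this patch */
	SKETCH_READY                          = 0xF0,
};

/* Assumed convention: 1 = ready, -1 = fatal (abort now), 0 = keep waiting. */
static int sketch_load_done(unsigned int status)
{
	switch (status) {
	case SKETCH_READY:
		return 1;
	case SKETCH_HWCONFIG_ERROR:
	case SKETCH_BOOTROM_VERSION_MISMATCH:
	case SKETCH_MPU_DATA_INVALID:
	case SKETCH_INIT_MMIO_SAVE_RESTORE_INVALID:
	case SKETCH_KLV_WORKAROUND_INIT_ERROR:
	case SKETCH_INVALID_FTR_FLAG:
		return -1;
	default:
		return 0;
	}
}

/* Message strings as they appear in the diff; NULL if none is defined here. */
static const char *sketch_load_message(unsigned int status)
{
	switch (status) {
	case SKETCH_HWCONFIG_START:
		return "still extracting hwconfig table.";
	case SKETCH_INIT_MMIO_SAVE_RESTORE_INVALID:
		return "illegal register in save/restore workaround list";
	case SKETCH_KLV_WORKAROUND_INIT_ERROR:
		return "illegal workaround KLV data";
	case SKETCH_INVALID_FTR_FLAG:
		return "illegal feature flag specified";
	default:
		return NULL;
	}
}
```

The classification and the message lookup are kept separate here, as in
the patch: the first decides whether the wait loop may stop, the second
only affects what is logged after a failure.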
Cross-check with i915 (parity and precedent)
- i915 already handles one of these newer codes:
- `INTEL_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR` is defined and
handled in i915 (drivers/gpu/drm/i915/gt/uc/abi/guc_errors_abi.h:24
and :39; drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c:118, 245),
confirming this class of additions is standard and low risk.
- Bringing Xe up to parity on load error handling is consistent with
upstream direction and improves stability for GuC firmware evolution.
Stable criteria assessment
- Bug fix that affects users: Yes — avoids long waits and wedges with
clearer diagnostics when GuC reports new fatal statuses.
- Minimal and contained: Yes — a handful of enum entries and switch
cases in two Xe files.
- No architectural changes: Correct — only error-code recognition and
messaging.
- Critical subsystem: It’s a GPU driver; impact is localized to GuC
bring-up, not core kernel.
- Explicit stable tags: Not present, but the change is a standard, low-
risk, forward-compat fix consistent with stable rules.
- Dependencies: None apparent; the new constants are self-contained.
Note: in some branches the header’s response enum is named
`xe_guc_response_status` (drivers/gpu/drm/xe/abi/guc_errors_abi.h:9),
not `xe_guc_response` as in the posted diff context. This patch does
not alter that enum and the backport simply adds entries to
`xe_guc_load_status`, so this naming difference does not block the
backport.
Potential risks and why they’re acceptable
- Earlier abort on these statuses vs. timing out: That is intended;
these codes are designated fatal by GuC. For older GuC which never
emit them, behavior is unchanged.
- No ABI or userspace exposure: The enums are internal to the
driver/firmware interface.
Conclusion
- This is a targeted robustness fix for GuC load error handling,
consistent with established patterns in i915, with minimal risk and
clear user benefit. It should be backported to stable.
drivers/gpu/drm/xe/abi/guc_errors_abi.h | 3 +++
drivers/gpu/drm/xe/xe_guc.c | 19 +++++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/abi/guc_errors_abi.h b/drivers/gpu/drm/xe/abi/guc_errors_abi.h
index ecf748fd87df3..ad76b4baf42e9 100644
--- a/drivers/gpu/drm/xe/abi/guc_errors_abi.h
+++ b/drivers/gpu/drm/xe/abi/guc_errors_abi.h
@@ -63,6 +63,7 @@ enum xe_guc_load_status {
XE_GUC_LOAD_STATUS_HWCONFIG_START = 0x05,
XE_GUC_LOAD_STATUS_HWCONFIG_DONE = 0x06,
XE_GUC_LOAD_STATUS_HWCONFIG_ERROR = 0x07,
+ XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH = 0x08,
XE_GUC_LOAD_STATUS_GDT_DONE = 0x10,
XE_GUC_LOAD_STATUS_IDT_DONE = 0x20,
XE_GUC_LOAD_STATUS_LAPIC_DONE = 0x30,
@@ -75,6 +76,8 @@ enum xe_guc_load_status {
XE_GUC_LOAD_STATUS_INVALID_INIT_DATA_RANGE_START,
XE_GUC_LOAD_STATUS_MPU_DATA_INVALID = 0x73,
XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID = 0x74,
+ XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR = 0x75,
+ XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG = 0x76,
XE_GUC_LOAD_STATUS_INVALID_INIT_DATA_RANGE_END,
XE_GUC_LOAD_STATUS_READY = 0xF0,
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index 270fc37924936..9e0ed8fabcd54 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -990,11 +990,14 @@ static int guc_load_done(u32 status)
case XE_GUC_LOAD_STATUS_GUC_PREPROD_BUILD_MISMATCH:
case XE_GUC_LOAD_STATUS_ERROR_DEVID_INVALID_GUCTYPE:
case XE_GUC_LOAD_STATUS_HWCONFIG_ERROR:
+ case XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH:
case XE_GUC_LOAD_STATUS_DPC_ERROR:
case XE_GUC_LOAD_STATUS_EXCEPTION:
case XE_GUC_LOAD_STATUS_INIT_DATA_INVALID:
case XE_GUC_LOAD_STATUS_MPU_DATA_INVALID:
case XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
+ case XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR:
+ case XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG:
return -1;
}
@@ -1134,17 +1137,29 @@ static void guc_wait_ucode(struct xe_guc *guc)
}
switch (ukernel) {
+ case XE_GUC_LOAD_STATUS_HWCONFIG_START:
+ xe_gt_err(gt, "still extracting hwconfig table.\n");
+ break;
+
case XE_GUC_LOAD_STATUS_EXCEPTION:
xe_gt_err(gt, "firmware exception. EIP: %#x\n",
xe_mmio_read32(mmio, SOFT_SCRATCH(13)));
break;
+ case XE_GUC_LOAD_STATUS_INIT_DATA_INVALID:
+ xe_gt_err(gt, "illegal init/ADS data\n");
+ break;
+
case XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
xe_gt_err(gt, "illegal register in save/restore workaround list\n");
break;
- case XE_GUC_LOAD_STATUS_HWCONFIG_START:
- xe_gt_err(gt, "still extracting hwconfig table.\n");
+ case XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR:
+ xe_gt_err(gt, "illegal workaround KLV data\n");
+ break;
+
+ case XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG:
+ xe_gt_err(gt, "illegal feature flag specified\n");
break;
}
--
2.51.0