[PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: John Harrison <John.C.Harrison@Intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Matt Atwood <matthew.s.atwood@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
	intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails
Date: Sat, 25 Oct 2025 11:59:31 -0400	[thread overview]
Message-ID: <20251025160905.3857885-340-sashal@kernel.org> (raw)
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>

From: John Harrison <John.C.Harrison@Intel.com>

[ Upstream commit 3b09b11805bfee32d5a0000f5ede42c07237a6c4 ]

Due to multiple explosion issues in the early days of the Xe driver,
the GuC load was hacked to never return a failure. That prevented
kernel panics and such initially, but now all it achieves is creating
more confusing errors when the driver tries to submit commands to a
GuC it already knows is not there. So fix that up.

As a stop-gap and to help with debug of load failures due to invalid
GuC init params, a wedge call had been added to the inner GuC load
function. The reason being that it leaves the GuC log accessible via
debugfs. However, for an end user, simply aborting the module load is
much cleaner than wedging and trying to continue. The wedge blocks
user submissions but it seems that various bits of the driver itself
still try to submit to a dead GuC and lots of subsequent errors occur.
And with regards to developers debugging why their particular code
change is being rejected by the GuC, it is trivial to either add the
wedge back in and hack the return code to zero again or to just do a
GuC log dump to dmesg.

v2: Add support for error injection testing and drop the now redundant
wedge call.

CC: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES – this is a pure bug-fix that should go to stable.

- `guc_wait_ucode()` now returns `-EPROTO` when the firmware never
  reaches the “done” state and stops calling
  `xe_device_declare_wedged()` (`drivers/gpu/drm/xe/xe_guc.c:1058` and
  `drivers/gpu/drm/xe/xe_guc.c:1165`). Combined with
  `ALLOW_ERROR_INJECTION(guc_wait_ucode, ERRNO)`
  (`drivers/gpu/drm/xe/xe_guc.c:1180`), the driver can finally detect
  and test the failure path instead of pretending the load succeeded.
- `__xe_guc_upload()` propagates that failure
  (`drivers/gpu/drm/xe/xe_guc.c:1211` and
  `drivers/gpu/drm/xe/xe_guc.c:1221`), so both the early init
  (`drivers/gpu/drm/xe/xe_uc.c:81`) and the regular load/reset flow
  (`drivers/gpu/drm/xe/xe_uc.c:195`) bail out immediately when
  authentication fails. Previously the hard-coded `return 0 /* FIXME */`
  let probe continue, leaving the module “wedged” while still trying to
  talk to a GuC that never booted—exactly the noisy, misleading
  behaviour the commit message describes.
- The change is tightly scoped to the Xe GuC bring-up path; no ABI or
  architectural behaviour changes elsewhere. Failing a GuC load already
  leaves the GPU unusable, so probing failure instead of a half-alive
  wedged device is the safer outcome for end users.
- Dependencies are limited to the existing GuC timing/logging helpers
  that have been in mainline since mid-2024, so current stable trees
  that carry Xe already have the required context.

The only observable difference for users is that a fatal firmware load
failure now aborts driver probe instead of letting the system thrash a
dead GuC, which matches expectations and avoids secondary errors.

 drivers/gpu/drm/xe/xe_guc.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index 62c76760fd26f..ab5b69cee3bff 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -1056,7 +1056,7 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
 #endif
 #define GUC_LOAD_TIME_WARN_MS      200
 
-static void guc_wait_ucode(struct xe_guc *guc)
+static int guc_wait_ucode(struct xe_guc *guc)
 {
 	struct xe_gt *gt = guc_to_gt(guc);
 	struct xe_mmio *mmio = &gt->mmio;
@@ -1163,7 +1163,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
 			break;
 		}
 
-		xe_device_declare_wedged(gt_to_xe(gt));
+		return -EPROTO;
 	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
 		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
 			   delta_ms, status, count);
@@ -1175,7 +1175,10 @@ static void guc_wait_ucode(struct xe_guc *guc)
 			  delta_ms, xe_guc_pc_get_act_freq(guc_pc), guc_pc_get_cur_freq(guc_pc),
 			  before_freq, status, count);
 	}
+
+	return 0;
 }
+ALLOW_ERROR_INJECTION(guc_wait_ucode, ERRNO);
 
 static int __xe_guc_upload(struct xe_guc *guc)
 {
@@ -1207,14 +1210,16 @@ static int __xe_guc_upload(struct xe_guc *guc)
 		goto out;
 
 	/* Wait for authentication */
-	guc_wait_ucode(guc);
+	ret = guc_wait_ucode(guc);
+	if (ret)
+		goto out;
 
 	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_RUNNING);
 	return 0;
 
 out:
 	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_LOAD_FAIL);
-	return 0	/* FIXME: ret, don't want to stop load currently */;
+	return ret;
 }
 
 static int vf_guc_min_load_for_hwconfig(struct xe_guc *guc)
-- 
2.51.0

next prev parent reply	other threads:[~2025-10-25 16:24 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/pcode: Initialize data0 for pcode read routine Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: improve dma-resv handling for backup object Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: Extend wa_13012615864 to additional Xe2 and Xe3 platforms Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/ptl: Apply Wa_16026007364 Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe: Set GT as wedged before sending wedged uevent Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/i2c: Enable bus mastering Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/configfs: Enforce canonical device names Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Extend Wa_22021007897 to Xe3 platforms Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Increase GuC crash dump buffer size Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/wcl: Extend L3bank mask workaround Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Set upper limit of H2G retries over CTB Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe: Make page size consistent in loop Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Add devm release action to safely tear down CT Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Program LMTT directory pointer on all GTs within a tile Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Always add CT disable action during second init step Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Don't resume device from restart worker Sasha Levin
2025-10-25 15:59 ` Sasha Levin [this message]
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: Ensure GT is in C0 during resumes Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: rework PDE PAT index selection Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Add more GuC load error status codes Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:62c76760fd26 dfblob:ab5b69cee3bf )
 OR (
bs:"[PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251025160905.3857885-340-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=John.C.Harrison@Intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.s.atwood@intel.com \
    --cc=patches@lists.linux.dev \
    --cc=rodrigo.vivi@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox