* [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs
[not found] <20260420132314.1023554-1-sashal@kernel.org>
@ 2026-04-20 13:16 ` Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
` (2 subsequent siblings)
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-04-20 13:16 UTC (permalink / raw)
To: patches, stable
Cc: Tomasz Lis, Matthew Brost, Michal Wajdeczko, Sasha Levin,
thomas.hellstrom, rodrigo.vivi, airlied, simona, intel-xe,
dri-devel, linux-kernel
From: Tomasz Lis <tomasz.lis@intel.com>
[ Upstream commit f3fb5f1ebbf39e685dd2885c9dbc8bb0a80be7c6 ]
When a context is being created during save/restore, the LRC creation
needs to wait for GGTT address space to be shifted. But it also needs
to have fixed default LRCs. This is mandatory to avoid the situation
where LRC will be created based on data from before the fixups, but
reference within exec queue will be set too late for fixups.
This fixes an issue where contexts created during save/restore have
a large chance of having one unfixed LRC, due to the xe_lrc_create()
being synced for equal start to race with default LRC fixups.
v2: Move the fixups confirmation further, behind all fixups.
Revert some renames.
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20260226212701.2937065-4-tomasz.lis@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
- Subsystem: `drm/xe/vf` (Intel Xe GPU driver, VF/SR-IOV path)
- Action verb: "Wait for" (= ensure, prevent — indicates fixing
incorrect behavior)
- Summary: Wait for all fixups to complete before using default LRCs
during VF migration recovery
Record: [drm/xe/vf] [Wait for / ensure] [Delays ggtt_need_fixes
completion signal until after all fixups, not just GGTT shift]
**Step 1.2: Tags**
- Signed-off-by: Tomasz Lis (author, Intel contributor with 30 commits
in Xe driver)
- Reviewed-by: Matthew Brost (co-author of the original buggy commit
3c1fa4aa60b14)
- Signed-off-by: Michal Wajdeczko (maintainer-level committer, 15
commits to this file)
- Link: patch.msgid.link/20260226212701.2937065-4-tomasz.lis@intel.com
- No Fixes: tag, no Reported-by:, no Cc: stable (expected for autosel
candidate)
Record: Reviewed by subsystem expert (Brost), committed by subsystem
lead (Wajdeczko). No explicit stable nomination.
**Step 1.3: Commit Body Analysis**
- Bug: LRC creation during save/restore can race with default LRC fixups
- Symptom: "contexts created during save/restore have a large chance of
having one unfixed LRC"
- Root cause: `xe_lrc_create()` synced for "equal start to race with
default LRC fixups" — meaning `ggtt_need_fixes` is cleared too early
(after GGTT shift only, before default LRC hwsp rebase)
- Cover letter (from b4 dig): "Tests which create a lot of exec queues
were sporadically failing due to one of LRCs having its state within
VRAM damaged"
Record: Real race condition causing VRAM state corruption in LRC during
VF migration. Sporadic test failures observed.
**Step 1.4: Hidden Bug Fix Detection**
This IS explicitly described as a fix ("This fixes an issue where...").
Not hidden.
Record: Explicit bug fix for a race condition.
## PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
- `xe_gt_sriov_vf.c`: ~15 lines changed. Removes 5 lines from
`vf_get_ggtt_info()`, adds new function
`vf_post_migration_mark_fixups_done()` (5 lines), adds 1 call in
`vf_post_migration_recovery()`, updates 1 comment.
- `xe_gt_sriov_vf_types.h`: 1 line comment update.
- Scope: Single-file surgical fix (functionally), trivial doc change in
header.
Record: 2 files changed, ~15 net lines modified. Functions:
`vf_get_ggtt_info()` (code removed), new
`vf_post_migration_mark_fixups_done()`, `vf_post_migration_recovery()`
(call added). Scope: surgical.
**Step 2.2: Code Flow Change**
- **Before**: `ggtt_need_fixes` set to `false` + `wake_up_all()` in
`vf_get_ggtt_info()`, which is the FIRST step of
`vf_post_migration_fixups()`. This means waiters (LRC creators) are
released while `xe_sriov_vf_ccs_rebase()`,
`xe_gt_sriov_vf_default_lrcs_hwsp_rebase()`, and
`xe_guc_contexts_hwsp_rebase()` are still pending.
- **After**: `ggtt_need_fixes` cleared and waiters woken ONLY after
`vf_post_migration_fixups()` returns, meaning ALL fixups (GGTT shift,
CCS rebase, default LRC hwsp rebase, contexts hwsp rebase) are
complete before `xe_lrc_create()` can proceed.
Record: Moves the "fixups done" signal from midway through fixups to
after ALL fixups complete. Eliminates a race window where LRC creation
proceeds with stale default LRC data.
**Step 2.3: Bug Mechanism**
Category: Race condition. Specifically:
1. Migration triggers recovery, sets `ggtt_need_fixes = true`
2. `vf_post_migration_fixups()` calls `xe_gt_sriov_vf_query_config()` →
`vf_get_ggtt_info()`, which sets `ggtt_need_fixes = false` and wakes
waiters
3. Concurrent `xe_lrc_create()` (in `__xe_exec_queue_init()`) was
waiting on `ggtt_need_fixes` via `xe_gt_sriov_vf_wait_valid_ggtt()` —
now it proceeds
4. But default LRC hwsp rebase hasn't happened yet — `xe_lrc_create()`
uses unfixed default LRC data
5. Result: LRC created with stale VRAM state
Record: [Race condition] The `ggtt_need_fixes` flag is cleared after
GGTT shift but before default LRC fixups, allowing `xe_lrc_create()` to
use stale default LRC data.
**Step 2.4: Fix Quality**
- Obviously correct: moves signaling to logically correct location
(after ALL fixups)
- Minimal/surgical: only moves existing code, creates a small helper
function
- Regression risk: Very low. The only change is that waiters wait
slightly longer (for all fixups instead of just GGTT shift). This
cannot cause deadlock since the fixups are sequential and bounded.
Record: Fix is obviously correct, minimal, and has negligible regression
risk.
## PHASE 3: GIT HISTORY INVESTIGATION
**Step 3.1: Blame**
The buggy code (clearing `ggtt_need_fixes` in `vf_get_ggtt_info`) was
introduced by commit `3c1fa4aa60b14` (Matthew Brost, 2025-10-08,
"drm/xe: Move queue init before LRC creation"). This commit first
appeared in v6.19.
Record: Buggy code introduced in 3c1fa4aa60b14, first present in v6.19.
Exists in 6.19.y and 7.0.y stable trees.
**Step 3.2: Fixes tag**
No explicit Fixes: tag in this commit. However, the series cover letter
and patch 1/4 have `Fixes: 3c1fa4aa60b1`.
Record: Implicitly fixes 3c1fa4aa60b14 which is in v6.19 and v7.0.
**Step 3.3: File History / Related Commits**
20+ commits to this file between v6.19 and v7.0. The VF migration
infrastructure is actively developed. Patch 1/4 of the same series
(99f9b5343cae8) is already in the tree.
Record: Active development area. Patch 1/4 already merged. Patches 2/4
and 4/4 not yet in tree.
**Step 3.4: Author**
Tomasz Lis has 30 commits in the xe driver, is an active contributor.
Matthew Brost (reviewer) authored the original buggy commit and is a key
xe/VF contributor. Michal Wajdeczko (committer) has 15 commits to this
specific file.
Record: Author and reviewers are all established subsystem contributors.
**Step 3.5: Dependencies**
This patch is standalone. It does NOT depend on patches 2/4 or 4/4:
- Patch 2/4 adds lrc_lookup_lock wrappers (separate race protection)
- Patch 4/4 adds LRC re-creation logic (a further improvement)
- This patch (3/4) only moves existing code and adds one call
Record: Standalone fix. No dependencies on other unmerged patches.
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
**Step 4.1: Original Discussion**
Found via `b4 dig`. Series went through 4 versions (v1 through v4).
Cover letter title: "drm/xe/vf: Fix exec queue creation during post-
migration recovery". The series description confirms sporadic test
failures with VRAM-damaged LRC state.
Record: lore URL:
patch.msgid.link/20260226212701.2937065-4-tomasz.lis@intel.com. 4
revisions. Applied version is v4 (latest).
**Step 4.2: Reviewers**
CC'd: intel-xe@lists.freedesktop.org, Michał Winiarski, Michał
Wajdeczko, Piotr Piórkowski, Matthew Brost. All Intel Xe subsystem
experts.
Record: Appropriate subsystem experts were all involved in review.
**Step 4.3-4.5: Bug Reports / Stable Discussion**
No explicit syzbot or external bug reports. The issue was found
internally through testing. No explicit stable discussion found.
Record: Internal testing found the bug. No external bug reports or
stable nominations.
## PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1-5.2: Functions Modified**
- `vf_get_ggtt_info()`: Called from `xe_gt_sriov_vf_query_config()`,
which is called from `vf_post_migration_fixups()` during migration
recovery.
- New `vf_post_migration_mark_fixups_done()`: Called from
`vf_post_migration_recovery()`.
- `xe_gt_sriov_vf_wait_valid_ggtt()`: Called from
`__xe_exec_queue_init()` which is called during exec queue creation —
a common GPU path.
Record: The wait function is called during exec queue creation, which is
a common user-triggered path. The fix ensures correctness of this common
path during VF migration.
**Step 5.4: Call Chain**
User creates exec queue → `xe_exec_queue_create()` →
`__xe_exec_queue_init()` → `xe_gt_sriov_vf_wait_valid_ggtt()` →
`xe_lrc_create()`. The buggy path is directly reachable from userspace
GPU operations during VF migration.
Record: Path is reachable from userspace GPU operations.
## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS
**Step 6.1: Buggy Code in Stable Trees**
The buggy commit 3c1fa4aa60b14 exists in v6.19 and v7.0. The bug does
NOT exist in v6.18 or earlier (the VF migration wait mechanism was added
in that commit).
Record: Bug exists in 6.19.y and 7.0.y stable trees only.
**Step 6.2: Backport Complications**
The patch should apply cleanly to the 7.0 tree. For 6.19, there may be
minor context differences but the code structure is the same.
Record: Expected clean apply for 7.0.y. Minor conflicts possible for
6.19.y.
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
**Step 7.1: Subsystem Criticality**
- Subsystem: `drivers/gpu/drm/xe` (Intel discrete GPU driver), VF/SR-IOV
migration
- Criticality: PERIPHERAL — affects SR-IOV VF GPU users
(cloud/virtualization deployments with Intel GPUs)
Record: PERIPHERAL criticality, but important for Intel GPU
virtualization users.
## PHASE 8: IMPACT AND RISK ASSESSMENT
**Step 8.1: Who is Affected**
Users running Intel Xe GPU in SR-IOV VF mode with live migration
support. This is relevant for cloud/virtualization environments.
**Step 8.2: Trigger Conditions**
Triggered when exec queue creation (GPU workload submission setup)
happens concurrently with VF post-migration recovery. The cover letter
says "tests which create a lot of exec queues were sporadically
failing."
**Step 8.3: Failure Mode Severity**
LRC created with stale VRAM state → corrupted GPU context → GPU errors,
potential hangs, incorrect rendering. Severity: HIGH for affected users
(data corruption in GPU state).
**Step 8.4: Risk-Benefit**
- BENEFIT: Fixes sporadic GPU state corruption during VF migration.
Important for virtualized GPU workloads.
- RISK: Very low. The fix moves 5 lines of signaling code to a later
point. No new locking, no API changes, no functional changes beyond
delaying the wake-up.
- Ratio: High benefit / Very low risk.
Record: [HIGH benefit for VF migration users] [VERY LOW risk] [Favorable
ratio]
## PHASE 9: FINAL SYNTHESIS
**Step 9.1: Evidence Compilation**
FOR backporting:
- Fixes a real race condition causing GPU state corruption during VF
migration
- Small, surgical fix (~15 lines, moves existing code)
- Obviously correct (signals fixups done after ALL fixups, not just one)
- Reviewed by the original code author (Brost) and committed by
subsystem lead (Wajdeczko)
- 4 revisions of review before merge
- Standalone fix (does not require other patches from the series)
- Buggy code exists in 6.19.y and 7.0.y stable trees
AGAINST backporting:
- Part of a 4-patch series (but standalone as analyzed)
- Niche use case (SR-IOV VF migration on Intel Xe GPUs)
- No explicit Fixes: tag or Cc: stable (expected for autosel candidates)
- No syzbot or external bug reports
**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? YES — logically obvious, reviewed, went
through CI, 4 revisions
2. Fixes a real bug? YES — race condition causing LRC corruption during
migration
3. Important issue? YES — GPU state corruption
4. Small and contained? YES — ~15 lines in one functional file
5. No new features or APIs? YES — no new features
6. Can apply to stable? YES — should apply cleanly to 7.0
**Step 9.3: Exception Categories**
Not an exception category — this is a standard bug fix.
**Step 9.4: Decision**
This is a clear, small, well-reviewed race condition fix that prevents
GPU state corruption during VF migration. It is standalone, obviously
correct, and meets all stable kernel criteria.
## Verification
- [Phase 1] Parsed tags: Reviewed-by Matthew Brost, Signed-off-by Michal
Wajdeczko (committer), Link to lore. No Fixes: tag (expected).
- [Phase 2] Diff analysis: Removes 5 lines from `vf_get_ggtt_info()`
(ggtt_need_fixes clearing), adds new 5-line helper
`vf_post_migration_mark_fixups_done()`, adds 1 call in
`vf_post_migration_recovery()` after `vf_post_migration_fixups()`.
Updates 2 comments.
- [Phase 3] git blame: Buggy code introduced in 3c1fa4aa60b14 (Oct 2025,
v6.19), confirmed via `git blame` and `git tag --contains`.
- [Phase 3] git show 3c1fa4aa60b14: Confirmed this commit added the
`ggtt_need_fixes` mechanism in `vf_get_ggtt_info()` with the premature
clearing.
- [Phase 3] File history: 20+ commits between v6.19 and v7.0, active
development area.
- [Phase 3] Patch 1/4 (99f9b5343cae8) already in tree. Patches 2/4 and
4/4 not in tree. Verified patch 3/4 is standalone by reading diffs.
- [Phase 4] b4 dig: Found series at
patch.msgid.link/20260226212701.2937065-2-tomasz.lis@intel.com. Series
went v1→v4.
- [Phase 4] b4 dig -w: CC'd to intel-xe list, 4 Intel engineers.
- [Phase 4] Cover letter confirms: "sporadic failures due to one of LRCs
having its state within VRAM damaged."
- [Phase 5] `xe_gt_sriov_vf_wait_valid_ggtt()` called from
`__xe_exec_queue_init()` in `xe_exec_queue.c:318`, confirming the wait
is in the LRC creation path.
- [Phase 5] `vf_post_migration_fixups()` confirmed to call
`xe_gt_sriov_vf_query_config()` (which calls `vf_get_ggtt_info()`)
FIRST, then `xe_sriov_vf_ccs_rebase()`,
`xe_gt_sriov_vf_default_lrcs_hwsp_rebase()`,
`xe_guc_contexts_hwsp_rebase()` — confirming the early clearing race.
- [Phase 6] Bug exists in v6.19 and v7.0 (verified via `git tag
--contains 3c1fa4aa60b14`).
- [Phase 8] Failure mode: GPU state corruption in LRC during VF
migration, severity HIGH.
- UNVERIFIED: Exact backport applicability to 6.19.y (context may differ
slightly due to intermediate commits).
**YES**
drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 16 +++++++++-------
drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 2 +-
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 30e8c2cf5f09a..b50f7181ce7a9 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -529,12 +529,6 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
xe_tile_sriov_vf_fixup_ggtt_nodes_locked(gt_to_tile(gt), shift);
}
- if (xe_sriov_vf_migration_supported(gt_to_xe(gt))) {
- WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
- smp_wmb(); /* Ensure above write visible before wake */
- wake_up_all(&gt->sriov.vf.migration.wq);
- }
-
return 0;
}
@@ -839,6 +833,13 @@ static void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
xe_default_lrc_update_memirq_regs_with_address(hwe);
}
+static void vf_post_migration_mark_fixups_done(struct xe_gt *gt)
+{
+ WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
+ smp_wmb(); /* Ensure above write visible before wake */
+ wake_up_all(&gt->sriov.vf.migration.wq);
+}
+
static void vf_start_migration_recovery(struct xe_gt *gt)
{
bool started;
@@ -1373,6 +1374,7 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
if (err)
goto fail;
+ vf_post_migration_mark_fixups_done(gt);
vf_post_migration_rearm(gt);
err = vf_post_migration_resfix_done(gt, marker);
@@ -1507,7 +1509,7 @@ static bool vf_valid_ggtt(struct xe_gt *gt)
}
/**
- * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
+ * xe_gt_sriov_vf_wait_valid_ggtt() - wait for valid GGTT nodes and address refs
* @gt: the &xe_gt
*/
void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 4ef881b9b6623..fca18be589db9 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -73,7 +73,7 @@ struct xe_gt_sriov_vf_migration {
bool recovery_queued;
/** @recovery_inprogress: VF post migration recovery in progress */
bool recovery_inprogress;
- /** @ggtt_need_fixes: VF GGTT needs fixes */
+ /** @ggtt_need_fixes: VF GGTT and references to it need fixes */
bool ggtt_need_fixes;
};
--
2.53.0
* [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
@ 2026-04-20 13:17 ` Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-04-20 13:17 UTC (permalink / raw)
To: patches, stable
Cc: Vinay Belgaumkar, Tangudu Tilak Tirumalesh, Rodrigo Vivi,
Sasha Levin, matthew.brost, thomas.hellstrom, airlied, simona,
John.C.Harrison, daniele.ceraolospurio, matthew.d.roper, intel-xe,
dri-devel, linux-kernel
From: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
[ Upstream commit 7596459f3c93d8d45a1bf12d4d7526b50c15baa2 ]
We only need to convert to picosecond units before writing to RING_IDLEDLY.
Fixes: 7c53ff050ba8 ("drm/xe: Apply Wa_16023105232")
Cc: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com>
Acked-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patch.msgid.link/20260401012710.4165547-1-vinay.belgaumkar@intel.com
(cherry picked from commit 13743bd628bc9d9a0e2fe53488b2891aedf7cc74)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Error: Failed to generate final synthesis
drivers/gpu/drm/xe/xe_hw_engine.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
index 1cf623b4a5bcc..d8f16e25b817d 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.c
+++ b/drivers/gpu/drm/xe/xe_hw_engine.c
@@ -587,9 +587,8 @@ static void adjust_idledly(struct xe_hw_engine *hwe)
maxcnt *= maxcnt_units_ns;
if (xe_gt_WARN_ON(gt, idledly >= maxcnt || inhibit_switch)) {
- idledly = DIV_ROUND_CLOSEST(((maxcnt - 1) * maxcnt_units_ns),
+ idledly = DIV_ROUND_CLOSEST(((maxcnt - 1) * 1000),
idledly_units_ps);
- idledly = DIV_ROUND_CLOSEST(idledly, 1000);
xe_mmio_write32(&gt->mmio, RING_IDLEDLY(hwe->mmio_base), idledly);
}
}
--
2.53.0
* [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
@ 2026-04-20 13:17 ` Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-04-20 13:17 UTC (permalink / raw)
To: patches, stable
Cc: Brian Nguyen, Matthew Brost, Matt Roper, Sasha Levin,
thomas.hellstrom, rodrigo.vivi, airlied, simona, intel-xe,
dri-devel, linux-kernel
From: Brian Nguyen <brian3.nguyen@intel.com>
[ Upstream commit 1b12096b4bc5177d685ae098fdb90260ffd5db6b ]
NULL VMAs have no corresponding PTE, so skip adding a PRL entry to avoid
an unnecessary PRL abort during unbind.
Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260305171546.67691-8-brian3.nguyen@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem**: drm/xe (Intel Xe GPU driver)
- **Action verb**: "Skip" - implies avoiding an incorrect/unnecessary
operation
- **Summary**: Skip adding Page Reclaim List (PRL) entry for NULL VMAs
### Step 1.2: Tags
- **Signed-off-by**: Brian Nguyen (author), Matt Roper (xe maintainer
applied it)
- **Reviewed-by**: Matthew Brost (xe subsystem maintainer) - strong
quality signal
- **Link**: `https://patch.msgid.link/20260305171546.67691-8-
brian3.nguyen@intel.com` - patch 8 of a series
- No Fixes: tag (expected for manual review candidates)
- No Reported-by: (indicates developer-found issue during
development/testing)
- No Cc: stable (expected)
### Step 1.3: Commit Body
- Bug: NULL VMAs have no corresponding PTE, so they shouldn't have PRL
entries
- Consequence: "an unnecessary PRL abort during unbind"
- When PRL aborts, it invalidates the entire PRL batch and falls back to
full PPC (Page-Private Cache) invalidation
### Step 1.4: Hidden Bug Fix Detection
This is a correctness fix disguised as optimization. The word "skip" and
"unnecessary" might sound like optimization, but the actual issue is:
NULL VMAs being processed through page reclaim creates incorrect PRL
entries with bogus physical addresses (address 0), which triggers PRL
abort for the entire unbind batch.
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Single file**: `drivers/gpu/drm/xe/xe_page_reclaim.c`
- **+8 lines / -0 lines** (3 doc comment lines, 3 code lines including
blank, 2 context lines)
- **Function modified**: `xe_page_reclaim_skip()`
- **Scope**: Single-file surgical fix
### Step 2.2: Code Flow Change
**Before**: `xe_page_reclaim_skip()` directly accesses
`vma->attr.pat_index` and checks L3 policy. For NULL VMAs, this produces
a potentially meaningless L3 policy result, and the function returns
false (don't skip), leading to PRL entry generation.
**After**: An `xe_vma_is_null(vma)` check at the top returns true (skip)
immediately for NULL VMAs, preventing any page reclaim processing.
### Step 2.3: Bug Mechanism
**Category**: Logic/correctness fix. NULL VMAs (`DRM_GPUVA_SPARSE`) have
PTEs with `XE_PTE_NULL` bit set (bit 9) but no real physical backing.
When processed through the PRL generation during unbind:
1. The PTE is non-zero (has `XE_PTE_NULL` set), so it passes the `if
(!pte)` check
2. `generate_reclaim_entry()` extracts `phys_addr = pte &
XE_PTE_ADDR_MASK` which gives address 0
3. This creates bogus PRL entries or triggers PRL abort, invalidating
the ENTIRE PRL for the batch
### Step 2.4: Fix Quality
- **Obviously correct**: NULL VMAs have no physical backing, so page
reclaim is meaningless for them
- **Minimal/surgical**: 2 lines of actual code
- **Regression risk**: Near zero - `xe_vma_is_null()` is used throughout
the codebase for exactly this purpose
- **No red flags**: Uses existing well-tested inline function
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: Blame
The buggy code (`xe_page_reclaim_skip` without NULL VMA check) was
introduced by commit `7c52f13b76c531` (2025-12-13) "drm/xe: Optimize
flushing of L2$ by skipping unnecessary page reclaim". This was part of
the initial page reclaim feature series.
### Step 3.2: Fixes Tag
No Fixes: tag present. The root cause is `7c52f13b76c53` which didn't
account for NULL VMAs when implementing the skip logic.
### Step 3.3: File History
The entire `xe_page_reclaim.c` was introduced in v7.0-rc1 (commit
`b912138df2993`, 2025-12-13). 6 commits have touched this file. The
sibling patch from the same series (`38b8dcde23164` "Skip over non leaf
pte for PRL generation") was already cherry-picked to
`stable/linux-7.0.y`.
### Step 3.4: Author
Brian Nguyen is the primary developer of the page reclaim feature
(authored all ~15 page reclaim commits). He is the domain expert for
this code.
### Step 3.5: Dependencies
This fix is standalone - it only adds a guard check to an existing
function. No prerequisite patches needed. The function
`xe_vma_is_null()` exists in all v7.0 trees.
## PHASE 4: MAILING LIST RESEARCH
### Step 4.1: Patch Discussion
b4 dig found the series as "Page Reclamation Fixes" (v3/v4 series, 3
patches). The series went through at least 3 revisions (v2, v3, v4)
before being accepted, indicating thorough review.
### Step 4.2: Reviewers
- Matthew Brost (xe maintainer) reviewed the patch
- Stuart Summers was CC'd
- Applied by Matt Roper (Intel xe maintainer)
### Steps 4.3-4.5:
Lore.kernel.org was inaccessible due to anti-bot protection. Could not
verify mailing list discussion details.
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1-5.2: Callers
`xe_page_reclaim_skip()` is called from a single location in `xe_pt.c`
line 2084:
```c
/* drivers/gpu/drm/xe/xe_pt.c:2083-2084 */
pt_op->prl = (xe_page_reclaim_list_valid(&pt_update_ops->prl) &&
	      !xe_page_reclaim_skip(tile, vma)) ? &pt_update_ops->prl : NULL;
```
This is in the unbind preparation path, called whenever a VMA is being
unbound from a tile.
### Step 5.3-5.4: Call Chain
The unbind path is reachable from userspace via
`ioctl(DRM_IOCTL_XE_VM_BIND)` with `DRM_XE_VM_BIND_OP_UNMAP`. NULL VMAs
are created via sparse binding operations, which are a normal GPU usage
pattern.
### Step 5.5: Similar Patterns
`xe_vma_is_null()` is already checked at multiple points in the Xe
driver:
- `xe_pt.c` line 449/479 (page table walk: "null VMA's do not have dma
addresses")
- `xe_vm.c` line 4033 (invalidation: `xe_assert(!xe_vma_is_null(vma))`)
- `xe_vm_madvise.c` line 209 (madvise: skip null VMAs)
This confirms the established pattern: NULL VMAs need special handling
throughout the driver.
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Code Existence in Stable
- **v7.0.y**: YES - file exists, code is present, fix is needed
- **v6.13.y and older**: NO - `xe_page_reclaim.c` does not exist
(`fatal: path exists on disk, but not in 'v6.13'`)
### Step 6.2: Backport Complications
The fix would apply cleanly to 7.0.y - the file in `stable/linux-7.0.y`
is identical to the file on the main branch at v7.0.
### Step 6.3: Related Fixes in Stable
The sibling patch `38b8dcde23164` ("Skip over non leaf pte for PRL
generation") from the same "Page Reclamation Fixes" series was already
cherry-picked to 7.0.y stable (has explicit `Fixes:` tag).
## PHASE 7: SUBSYSTEM CONTEXT
### Step 7.1: Subsystem
- **Subsystem**: GPU driver (drivers/gpu/drm/xe) - Intel Xe
discrete/integrated GPU
- **Criticality**: IMPORTANT - Intel Xe GPU users on newer hardware
(Lunar Lake, Arrow Lake, etc.)
### Step 7.2: Activity
Very active subsystem with many fixes flowing to 7.0.y stable (20+ xe
patches already cherry-picked).
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
Intel Xe GPU users with hardware that supports page reclaim (specific
newer GPUs with `has_page_reclaim_hw_assist`).
### Step 8.2: Trigger Conditions
Triggered when unbinding sparse/NULL VMAs, which happens during normal
GPU memory management operations. Common in graphics workloads using
sparse resources.
### Step 8.3: Failure Mode
- PRL abort -> fallback to full PPC (Page-Private Cache) invalidation
- Severity: MEDIUM - performance degradation (full cache flush instead
of targeted reclaim), not crash/corruption
- The abort invalidates the ENTIRE PRL batch, affecting all VMAs in the
unbind operation, not just the NULL one
### Step 8.4: Risk-Benefit
- **Benefit**: MEDIUM - prevents incorrect PRL processing and
unnecessary PRL aborts for all unbind batches containing NULL VMAs
- **Risk**: VERY LOW - 2-line guard check using existing well-tested
function
- **Ratio**: Favorable
## PHASE 9: SYNTHESIS
### Step 9.1: Evidence Compilation
**FOR backporting:**
- Small, surgical fix (2 lines of code)
- Obviously correct (NULL VMAs have no physical backing, well-
established pattern)
- Reviewed by subsystem maintainer (Matthew Brost)
- Same series as another commit already cherry-picked to 7.0.y
- Prevents incorrect behavior in page reclaim path
- Near-zero regression risk
- Author is the page reclaim feature developer
**AGAINST backporting:**
- No explicit Fixes: tag
- Not a crash/corruption/security fix - primarily
performance/correctness
- Only applicable to 7.0.y (code doesn't exist in older stable trees)
- PRL abort is handled gracefully (fallback mechanism exists)
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - trivial guard check, reviewed
by maintainer
2. Fixes a real bug? **YES** - NULL VMAs incorrectly processed through
page reclaim
3. Important issue? **MEDIUM** - causes PRL abort and full cache flush
fallback for all VMAs in batch
4. Small and contained? **YES** - 2 lines in one file
5. No new features? **YES** - just a guard check
6. Applies to stable? **YES** for 7.0.y only
### Step 9.3: Exception Categories
Not applicable.
### Step 9.4: Decision
This is a small, correct, well-reviewed fix for a real logic bug in the
Xe page reclaim path. While the consequence is primarily performance
(PRL abort causing full cache flush fallback) rather than crash, the fix
is extremely low-risk and the sibling patch from the same series was
already selected for 7.0.y stable. The fix prevents incorrect behavior
for a common GPU operation (unbinding sparse VMAs).
## Verification
- [Phase 1] Parsed tags: Reviewed-by Matthew Brost (xe maintainer),
applied by Matt Roper
- [Phase 2] Diff analysis: 2 functional lines added to
`xe_page_reclaim_skip()`, adding NULL VMA guard check
- [Phase 3] git blame: buggy code introduced in `7c52f13b76c531`
(v7.0-rc1, 2025-12-13)
- [Phase 3] git log: entire `xe_page_reclaim.c` file created in v7.0-rc1
- [Phase 3] git show: author Brian Nguyen wrote all page reclaim commits
(domain expert)
- [Phase 4] b4 dig -a: series "Page Reclamation Fixes" went through
v2→v3→v4, indicating thorough review
- [Phase 4] b4 dig -w: Matthew Brost, Stuart Summers, intel-xe@ involved
in review
- [Phase 4] UNVERIFIED: Could not access lore.kernel.org discussion due
to anti-bot protection
- [Phase 5] Grep for callers: `xe_page_reclaim_skip()` called only from
`xe_pt.c:2084` (unbind path)
- [Phase 5] Grep for `xe_vma_is_null`: used at 10+ locations in xe
driver, well-established pattern
- [Phase 6] `git show v6.13:drivers/gpu/drm/xe/xe_page_reclaim.c`
confirmed file does NOT exist in v6.13 or v6.12
- [Phase 6] `git show
stable/linux-7.0.y:drivers/gpu/drm/xe/xe_page_reclaim.c` confirmed
code exists in 7.0.y without fix
- [Phase 6] Sibling patch `38b8dcde23164` already in stable/linux-7.0.y
(confirmed via `git log stable/linux-7.0.y`)
- [Phase 8] PRL abort path verified: invalidates PRL, increments
counter, logs debug message - graceful fallback
**YES**
drivers/gpu/drm/xe/xe_page_reclaim.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_page_reclaim.c b/drivers/gpu/drm/xe/xe_page_reclaim.c
index e13c71a89da2c..390bcb82e4c5c 100644
--- a/drivers/gpu/drm/xe/xe_page_reclaim.c
+++ b/drivers/gpu/drm/xe/xe_page_reclaim.c
@@ -26,12 +26,18 @@
* flushes.
* - pat_index is transient display (1)
*
+ * For cases of NULL VMA, there should be no corresponding PRL entry
+ * so skip over.
+ *
* Return: true when page reclamation is unnecessary, false otherwise.
*/
bool xe_page_reclaim_skip(struct xe_tile *tile, struct xe_vma *vma)
{
u8 l3_policy;
+ if (xe_vma_is_null(vma))
+ return true;
+
l3_policy = xe_pat_index_get_l3_policy(tile->xe, vma->attr.pat_index);
/*
--
2.53.0
* [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset
From: Sasha Levin @ 2026-04-20 13:18 UTC (permalink / raw)
To: patches, stable
Cc: Sk Anirban, Badal Nilawar, Matt Roper, Daniele Ceraolo Spurio,
Sasha Levin, matthew.brost, thomas.hellstrom, rodrigo.vivi,
airlied, simona, intel-xe, dri-devel, linux-kernel
From: Sk Anirban <sk.anirban@intel.com>
[ Upstream commit c57db41b8d2cd410e7219729f446aa17965485ad ]
Prevent GuC firmware DMA failures during GuC-only reset by disabling
idle flow and verifying SRAM handling completion. Without this, reset
can be issued while SRAM handler is copying WOPCM to SRAM,
causing GuC HW to get stuck.
v2: Modify error message (Badal)
Rename reg bit name (Daniele)
Update WA skip condition (Daniele)
Update SRAM handling logic (Daniele)
v3: Reorder WA call (Badal)
Wait for GuC ready status (Daniele)
v4: Update reg name (Badal)
Add comment (Daniele)
Add extended graphics version (Daniele)
Modify rules
Signed-off-by: Sk Anirban <sk.anirban@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Acked-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patch.msgid.link/20260202105313.3338094-4-sk.anirban@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have a comprehensive picture. Let me compile my full analysis.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
Record: [drm/xe/guc] [Add] Hardware workaround Wa_14025883347 for GuC
DMA failure during GuC-only reset
**Step 1.2: Tags**
- Signed-off-by: Sk Anirban <sk.anirban@intel.com> (author)
- Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
- Acked-by: Matt Roper <matthew.d.roper@intel.com> (subsystem
maintainer)
- Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
- Link:
https://patch.msgid.link/20260202105313.3338094-4-sk.anirban@intel.com
- Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (applied by
maintainer)
- No Fixes: tag (expected), no Reported-by, no Cc: stable
**Step 1.3: Commit Body**
The bug: During GuC-only reset, if the SRAM handler is actively copying
WOPCM to SRAM, issuing the reset causes GuC HW to get stuck. The
workaround disables idle flow and waits for SRAM handling completion
before proceeding with reset.
**Step 1.4: Hidden Bug Fix Detection**
This is explicitly a hardware workaround for a known Intel hardware
erratum (Wa_14025883347). It prevents the GuC from getting stuck during
reset - this is a real bug fix for a hardware deficiency.
## PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
- `drivers/gpu/drm/xe/regs/xe_guc_regs.h`: +8 lines (new register
definitions)
- `drivers/gpu/drm/xe/xe_guc.c`: +38 lines (new function + call site)
- `drivers/gpu/drm/xe/xe_wa_oob.rules`: +3 lines (WA matching rules)
- Total: +49 lines, 0 removed. 3 files changed.
- Scope: Single-subsystem, well-contained
**Step 2.2: Code Flow Changes**
- New register definitions: BOOT_HASH_CHK, GUC_BOOT_UKERNEL_VALID,
GUC_SRAM_STATUS, GUC_SRAM_HANDLING_MASK, GUC_IDLE_FLOW_DISABLE
- New function `guc_prevent_fw_dma_failure_on_reset()`: reads GUC_STATUS
(skips if already in reset), reads BOOT_HASH_CHK (skips if ukernel not
valid), disables idle flow, waits for GuC ready status, waits for SRAM
handling completion
- Call site: injected in `xe_guc_reset()` between SRIOV VF check and the
actual reset write, gated by `XE_GT_WA(gt, 14025883347)`
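The new register definitions build on the kernel's `REG_BIT`/`REG_GENMASK`/`REG_FIELD_PREP` helpers (u32 variants of `BIT`, `GENMASK`, and `FIELD_PREP` from `<linux/bits.h>` and `<linux/bitfield.h>`). A userspace sketch of the bit arithmetic, using the field positions from the diff - the macro reimplementations below are simplified stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's u32 register helpers. */
#define REG_BIT(n)         (1u << (n))
#define REG_GENMASK(h, l)  ((~0u >> (31 - (h))) & ~((1u << (l)) - 1u))
/* FIELD_PREP: shift val up to the mask's lowest set bit, then mask. */
#define REG_FIELD_PREP(mask, val) \
	(((uint32_t)(val) * ((mask) & -(mask))) & (mask))

/* Register fields added by the patch. */
#define GUC_BOOT_UKERNEL_VALID  REG_BIT(31)       /* BOOT_HASH_CHK bit 31 */
#define GUC_SRAM_HANDLING_MASK  REG_GENMASK(8, 7) /* GUC_SRAM_STATUS bits 8:7 */
#define GUC_IDLE_FLOW_DISABLE   REG_BIT(31)       /* GUC_MAX_IDLE_COUNT bit 31 */
```

So the WA's "SRAM handling complete" condition is simply `(GUC_SRAM_STATUS & 0x180) == 0`, and the ukernel-valid check tests bit 31 of BOOT_HASH_CHK.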
**Step 2.3: Bug Mechanism**
This is a hardware workaround (category h). Race condition between SRAM
save/restore and reset issuance. Without the WA, reset can arrive while
DMA is in progress, causing hardware hang.
**Step 2.4: Fix Quality**
- Gated behind hardware version checks (only runs on affected hardware)
- Has early-return safety checks (already in reset, ukernel not valid)
- Uses existing MMIO wait infrastructure with timeouts
- Only emits warnings on timeout, doesn't abort the reset
- Very low regression risk for unaffected hardware (gated by XE_GT_WA)
- For affected hardware, the risk is also low: it adds delays before
reset which is inherently safe
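The wait-with-timeout idiom the workaround leans on can be sketched in plain C. This is a simplified model of what `xe_mmio_wait32` does (the real helper takes a register, mask, expected value, a timeout in microseconds, and sleeps between reads); the fake register and attempt counter below are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Fake register backing store for the sketch. */
static uint32_t fake_reg;
static int reads_until_ready;

static uint32_t fake_read32(void)
{
	if (reads_until_ready > 0 && --reads_until_ready == 0)
		fake_reg |= 0x1; /* hardware eventually sets the ready bit */
	return fake_reg;
}

/*
 * Poll until (reg & mask) == val or the attempt budget runs out.
 * Returns 0 on success, -1 on timeout; the last observed value is
 * reported via *out either way, matching the shape of xe_mmio_wait32.
 */
static int mock_wait32(uint32_t mask, uint32_t val, int attempts, uint32_t *out)
{
	uint32_t v = 0;

	while (attempts--) {
		v = fake_read32();
		if ((v & mask) == val) {
			*out = v;
			return 0;
		}
	}
	*out = v;
	return -1;
}
```

The "warn but continue" behavior noted above corresponds to the caller checking the return value, logging the last observed value on timeout, and proceeding with the reset regardless.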
## PHASE 3: GIT HISTORY INVESTIGATION
**Step 3.1: Blame**
The `xe_guc_reset()` function was introduced with the xe driver in
commit dd08ebf6c3525a (Matthew Brost, 2023-03-30, "Introduce a new DRM
driver for Intel GPUs"). The function has been stable since, with minor
API changes (MMIO parameter refactoring by Matt Roper in
c18d4193b53be7).
**Step 3.2: Fixes tag**
No Fixes: tag present. The bug is inherent in the hardware itself, not
introduced by any specific software commit.
**Step 3.3: File History**
`xe_guc.c` has had 20 recent commits mostly around GuC
load/submit/communication. `xe_wa_oob.rules` has had 35 changes since
v6.12.
**Step 3.4: Author**
Sk Anirban has 4 xe-related commits including this one, with
d72779c29d82c ("drm/xe/ptl: Apply Wa_16026007364") also being a WA
patch. A regular Intel contributor focused on WA/frequency work.
**Step 3.5: Dependencies**
This is "PATCH v4 1/1" - a standalone single patch. No dependencies on
other patches. It uses existing infrastructure: XE_GT_WA macro,
xe_mmio_* functions, existing register headers.
## PHASE 4: MAILING LIST RESEARCH
**Step 4.1: Original Discussion**
Found on freedesktop.org/archives/intel-xe/2026-February/. The patch
went through 4 revisions (v1-v4) with extensive review from Daniele
Ceraolo Spurio and Badal Nilawar. Each version addressed reviewer
feedback.
**Step 4.2: Reviewers**
- Daniele Ceraolo Spurio: Intel GuC expert, provided detailed review
across all 4 versions, gave final Reviewed-by
- Matt Roper: Subsystem maintainer, discussed the WA range policy, gave
Acked-by and applied the patch
- Badal Nilawar: Intel engineer, reviewed and gave Reviewed-by
Daniele's only concern was about using large version ranges in the WA
table; Matt Roper acked this explicitly. No technical concerns about the
fix itself.
**Step 4.3: No external bug report found** - this is an internal Intel
hardware erratum workaround.
**Step 4.4: Series Context**
Standalone patch (1/1). No dependencies.
**Step 4.5: No stable-specific discussion found.**
## PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1: Functions Modified**
- New: `guc_prevent_fw_dma_failure_on_reset()` (static, only called from
xe_guc_reset)
- Modified: `xe_guc_reset()` (3-line addition)
**Step 5.2: Callers of xe_guc_reset**
- `uc_reset()` in xe_uc.c -> called from `xe_uc_sanitize_reset()`
- Called during GT reset paths and UC initialization
**Step 5.3-5.4: Call Chain**
xe_gt reset path -> xe_uc_sanitize_reset -> uc_reset -> xe_guc_reset.
This is the standard GPU reset path, triggered when the GPU needs reset
(hang recovery, device suspend/resume, driver load).
**Step 5.5: Similar Patterns**
The xe driver has many similar XE_GT_WA patterns throughout the codebase
(8 existing uses in xe_guc.c alone).
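The idle-flow disable itself uses the driver's read-modify-write helper: as I read it, `xe_mmio_rmw32(mmio, reg, clr, set)` clears the `clr` bits, sets the `set` bits, and returns the old value. A standalone model of that semantics (fake register and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Fake MMIO backing store for the sketch. */
static uint32_t fake_idle_count_reg = 0x0000ffffu;

static uint32_t mock_read32(void)
{
	return fake_idle_count_reg;
}

static void mock_write32(uint32_t v)
{
	fake_idle_count_reg = v;
}

/* Read-modify-write: clear 'clr' bits, set 'set' bits, return old value. */
static uint32_t mock_rmw32(uint32_t clr, uint32_t set)
{
	uint32_t old = mock_read32();

	mock_write32((old & ~clr) | set);
	return old;
}
```

The patch's call, `xe_mmio_rmw32(..., GUC_MAX_IDLE_COUNT, 0, GUC_IDLE_FLOW_DISABLE)`, corresponds to `clr = 0`, `set = bit 31`: only the disable bit changes and the rest of the idle-count register is preserved.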
## PHASE 6: STABLE TREE ANALYSIS
**Step 6.1: Buggy Code Existence**
The xe driver was introduced in v6.8. `xe_guc_reset()` exists in v6.8+.
The hardware affected (MEDIA_VERSION_RANGE 1301-3503,
GRAPHICS_VERSION_RANGE 2004-3005) includes Panther Lake and newer
platforms. Some of these platforms were only added in recent kernel
versions.
**Step 6.2: Backport Complications**
- For 7.0.y: Should apply cleanly. The tree is at v7.0, and the MMIO API
and wa_oob.rules match.
- For 6.12.y: The MMIO API changed (`xe_mmio_write32(gt, ...)` vs
`xe_mmio_write32(&gt->mmio, ...)`). Also, `xe_guc.c` has `struct
xe_mmio *mmio` variable in v7.0 but not in v6.12. Significant rework
needed.
- For 6.6.y and earlier: xe driver doesn't exist.
**Step 6.3: No related fixes already in stable.**
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
**Step 7.1: Subsystem Criticality**
drm/xe is the Intel GPU driver. It's IMPORTANT - affects all users with
Intel discrete and integrated GPUs running the xe driver.
**Step 7.2: Subsystem Activity**
Very active (20+ commits recently). The xe driver is under rapid
development.
## PHASE 8: IMPACT AND RISK ASSESSMENT
**Step 8.1: Affected Users**
Users with Intel GPUs matching MEDIA_VERSION_RANGE(1301, 3503) or
GRAPHICS_VERSION_RANGE(2004, 3005). This includes Panther Lake and some
newer Intel GPU generations.
**Step 8.2: Trigger Conditions**
The bug triggers during GuC-only reset when SRAM handler is actively
copying WOPCM to SRAM. This is a timing-dependent race that can occur
during any GPU reset operation (hang recovery, suspend/resume, etc.).
**Step 8.3: Failure Mode**
GuC HW gets stuck - this is effectively a GPU hang. Severity: HIGH.
Without recovery, the GPU becomes unusable requiring a reboot.
**Step 8.4: Risk-Benefit**
- BENEFIT: Prevents GPU hangs on affected Intel hardware during reset.
HIGH benefit for affected hardware users.
- RISK: Very low. The fix is gated behind XE_GT_WA (only active on
affected hardware), adds only MMIO reads and waits before the existing
reset sequence, and emits warnings rather than aborting on timeout.
- Ratio: HIGH benefit / very low risk = favorable
## PHASE 9: FINAL SYNTHESIS
**Evidence FOR backporting:**
- Hardware workaround (WA) - a standard exception category for stable
- Prevents GPU hangs (GuC stuck) during reset operations
- Well-reviewed: 3 Intel engineers (including subsystem maintainer)
reviewed/acked
- Went through 4 revision cycles addressing reviewer feedback
- CI passed (Xe.CI.BAT: success)
- Standalone patch (1/1), no dependencies
- Well-contained: 49 lines across 3 files
- Gated behind hardware version check (no impact on unaffected hardware)
- Uses existing infrastructure (XE_GT_WA, xe_mmio_wait32)
- Should apply cleanly to v7.0.y
**Evidence AGAINST backporting:**
- Adds new register definitions and a new function (albeit small and
contained)
- The WA uses version ranges that span many hardware generations
(discussion concern from Daniele)
- For stable trees older than 7.0.y (e.g., 6.12.y), the MMIO API changed
and significant rework would be needed
- Affects only specific newer Intel GPU hardware (Panther Lake and
beyond)
- No user bug reports - this is a proactive hardware errata fix
- The xe driver is evolving rapidly, making older stable tree backports
risky
**Stable Rules Checklist:**
1. Obviously correct and tested? YES - extensive review, CI tested,
straightforward register reads/waits
2. Fixes a real bug? YES - GPU hang during reset
3. Important issue? YES - hardware hang requiring reboot
4. Small and contained? YES - 49 lines, 3 files, single subsystem
5. No new features or APIs? CORRECT - hardware workaround only
6. Can apply to stable? For 7.0.y: YES (clean). For 6.12.y: needs
rework.
**Exception Category:** This is a hardware quirk/workaround - these are
explicitly allowed in stable.
## Verification
- [Phase 1] Parsed all tags: Reviewed-by (2), Acked-by (1), Link
present, no Fixes, no Reported-by
- [Phase 2] Diff analysis: +49 lines across 3 files - new register defs,
new WA function, WA rules entry
- [Phase 3] git blame: xe_guc_reset() introduced in dd08ebf6c3525a
(March 2023, initial xe driver)
- [Phase 3] git log: no prerequisite commits needed, standalone patch
- [Phase 4] Freedesktop archive: Found [PATCH v4 0/1] and [PATCH v4 1/1]
confirming single standalone patch
- [Phase 4] Freedesktop archive: Daniele's review comment on version
ranges, Matt Roper's ack and policy discussion
- [Phase 4] Daniele gave final Reviewed-by after Matt acked the ranges
approach
- [Phase 5] xe_guc_reset() called from uc_reset() in xe_uc.c, part of
standard GT reset path
- [Phase 5] XE_GT_WA macro used 8 times in xe_guc.c already -
established pattern
- [Phase 6] xe driver exists in v6.8+, not in v6.6. MMIO API changed
between v6.12 and v7.0
- [Phase 6] For v7.0.y: patch should apply cleanly (tree matches patch
base)
- [Phase 6] For v6.12.y: MMIO API mismatch would require rework
- [Phase 8] Failure mode: GuC stuck = GPU hang = severity HIGH
- UNVERIFIED: Cannot confirm which exact kernel versions first support
the specific GPU generations targeted by MEDIA_VERSION_RANGE(1301,
3503) and GRAPHICS_VERSION_RANGE(2004, 3005)
This is a well-reviewed, well-contained hardware workaround that
prevents GPU hangs during reset on affected Intel hardware. It falls
squarely into the "hardware quirk/workaround" exception category for
stable kernels. The fix is gated behind hardware detection, uses
existing infrastructure, and was authored as a standalone patch with no
dependencies. It should apply cleanly to the 7.0 stable tree.
**YES**
drivers/gpu/drm/xe/regs/xe_guc_regs.h | 8 ++++++
drivers/gpu/drm/xe/xe_guc.c | 38 +++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_wa_oob.rules | 3 +++
3 files changed, 49 insertions(+)
diff --git a/drivers/gpu/drm/xe/regs/xe_guc_regs.h b/drivers/gpu/drm/xe/regs/xe_guc_regs.h
index 87984713dd126..5faac8316b66c 100644
--- a/drivers/gpu/drm/xe/regs/xe_guc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_guc_regs.h
@@ -40,6 +40,9 @@
#define GS_BOOTROM_JUMP_PASSED REG_FIELD_PREP(GS_BOOTROM_MASK, 0x76)
#define GS_MIA_IN_RESET REG_BIT(0)
+#define BOOT_HASH_CHK XE_REG(0xc010)
+#define GUC_BOOT_UKERNEL_VALID REG_BIT(31)
+
#define GUC_HEADER_INFO XE_REG(0xc014)
#define GUC_WOPCM_SIZE XE_REG(0xc050)
@@ -83,7 +86,12 @@
#define GUC_WOPCM_OFFSET_MASK REG_GENMASK(31, GUC_WOPCM_OFFSET_SHIFT)
#define HUC_LOADING_AGENT_GUC REG_BIT(1)
#define GUC_WOPCM_OFFSET_VALID REG_BIT(0)
+
+#define GUC_SRAM_STATUS XE_REG(0xc398)
+#define GUC_SRAM_HANDLING_MASK REG_GENMASK(8, 7)
+
#define GUC_MAX_IDLE_COUNT XE_REG(0xc3e4)
+#define GUC_IDLE_FLOW_DISABLE REG_BIT(31)
#define GUC_PMTIMESTAMP_LO XE_REG(0xc3e8)
#define GUC_PMTIMESTAMP_HI XE_REG(0xc3ec)
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index 4ab65cae87433..96c28014f3887 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -900,6 +900,41 @@ int xe_guc_post_load_init(struct xe_guc *guc)
return xe_guc_submit_enable(guc);
}
+/*
+ * Wa_14025883347: Prevent GuC firmware DMA failures during GuC-only reset by ensuring
+ * SRAM save/restore operations are complete before reset.
+ */
+static void guc_prevent_fw_dma_failure_on_reset(struct xe_guc *guc)
+{
+ struct xe_gt *gt = guc_to_gt(guc);
+ u32 boot_hash_chk, guc_status, sram_status;
+ int ret;
+
+	guc_status = xe_mmio_read32(&gt->mmio, GUC_STATUS);
+ if (guc_status & GS_MIA_IN_RESET)
+ return;
+
+	boot_hash_chk = xe_mmio_read32(&gt->mmio, BOOT_HASH_CHK);
+ if (!(boot_hash_chk & GUC_BOOT_UKERNEL_VALID))
+ return;
+
+ /* Disable idle flow during reset (GuC reset re-enables it automatically) */
+	xe_mmio_rmw32(&gt->mmio, GUC_MAX_IDLE_COUNT, 0, GUC_IDLE_FLOW_DISABLE);
+
+	ret = xe_mmio_wait32(&gt->mmio, GUC_STATUS, GS_UKERNEL_MASK,
+ FIELD_PREP(GS_UKERNEL_MASK, XE_GUC_LOAD_STATUS_READY),
+ 100000, &guc_status, false);
+ if (ret)
+ xe_gt_warn(gt, "GuC not ready after disabling idle flow (GUC_STATUS: 0x%x)\n",
+ guc_status);
+
+	ret = xe_mmio_wait32(&gt->mmio, GUC_SRAM_STATUS, GUC_SRAM_HANDLING_MASK,
+ 0, 5000, &sram_status, false);
+ if (ret)
+ xe_gt_warn(gt, "SRAM handling not complete (GUC_SRAM_STATUS: 0x%x)\n",
+ sram_status);
+}
+
int xe_guc_reset(struct xe_guc *guc)
{
struct xe_gt *gt = guc_to_gt(guc);
@@ -912,6 +947,9 @@ int xe_guc_reset(struct xe_guc *guc)
if (IS_SRIOV_VF(gt_to_xe(gt)))
return xe_gt_sriov_vf_bootstrap(gt);
+ if (XE_GT_WA(gt, 14025883347))
+ guc_prevent_fw_dma_failure_on_reset(guc);
+
xe_mmio_write32(mmio, GDRST, GRDOM_GUC);
ret = xe_mmio_wait32(mmio, GDRST, GRDOM_GUC, 0, 5000, &gdrst, false);
diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
index 5cd7fa6d2a5c0..ac08f94f90a14 100644
--- a/drivers/gpu/drm/xe/xe_wa_oob.rules
+++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
@@ -73,3 +73,6 @@
15015404425_disable PLATFORM(PANTHERLAKE), MEDIA_STEP(B0, FOREVER)
16026007364 MEDIA_VERSION(3000)
14020316580 MEDIA_VERSION(1301)
+
+14025883347 MEDIA_VERSION_RANGE(1301, 3503)
+ GRAPHICS_VERSION_RANGE(2004, 3005)
--
2.53.0