From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Riana Tauro <riana.tauro@intel.com>,
Matthew Brost <matthew.brost@intel.com>,
Raag Jadav <raag.jadav@intel.com>,
Rodrigo Vivi <rodrigo.vivi@intel.com>,
Sasha Levin <sashal@kernel.org>,
lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/xe: Set GT as wedged before sending wedged uevent
Date: Sat, 25 Oct 2025 11:55:04 -0400 [thread overview]
Message-ID: <20251025160905.3857885-73-sashal@kernel.org> (raw)
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
From: Riana Tauro <riana.tauro@intel.com>
[ Upstream commit 90fdcf5f89e9288c153923f16a60e6f7da18ba76 ]
Userspace should be notified after setting the device as wedged.
Re-order function calls to set gt wedged before sending uevent.
Cc: Matthew Brost <matthew.brost@intel.com>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-4-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Backport Recommendation: YES (with dependency)
**Answer: YES** - This commit should be backported to stable kernel
trees, but ONLY together with its prerequisite commit 60439ac3f2354
("drm/xe: Add a helper function to set recovery method").
---
## Comprehensive Analysis
### Race Condition Being Fixed
This commit addresses a critical **Time-of-Check-to-Time-of-Use (TOCTOU)
race condition** in the device wedging error path.
**The Problem (OLD CODE):**
```c
if (!atomic_xchg(&xe->wedged.flag, 1)) {
// Set device as wedged
xe->needs_flr_on_fini = true;
drm_err(&xe->drm, ...);
// BUG: Send uevent to userspace HERE
drm_dev_wedged_event(&xe->drm, ...);
}
// Wedge GTs AFTER userspace notification
for_each_gt(gt, xe, id)
xe_gt_declare_wedged(gt);
```
**The Race:**
1. Thread sets `xe->wedged.flag = 1`
2. Thread sends uevent to userspace notifying of wedged state
3. **Userspace receives notification and may check device state**
4. **BUT: GTs are NOT yet wedged!**
5. Thread finally calls `xe_gt_declare_wedged()` which:
- Stops submission via `xe_guc_submit_wedge()`
- Stops command transport via `xe_guc_ct_stop()`
- Resets TLB invalidation via `xe_tlb_inval_reset()`
**The Impact:**
Userspace receiving the wedged uevent might:
- Query device state and see inconsistent information
- Initiate recovery procedures on a partially-wedged device
- Experience race conditions in error handling logic
- See submissions still active when device should be fully wedged
### The Fix (NEW CODE)
The commit reorders operations to ensure atomicity from userspace's
perspective:
```c
if (!atomic_xchg(&xe->wedged.flag, 1)) {
xe->needs_flr_on_fini = true;
drm_err(&xe->drm, ...);
} // ← Close the atomic block
// Wedge ALL GTs FIRST (always executed)
for_each_gt(gt, xe, id)
xe_gt_declare_wedged(gt);
// Then notify userspace (always executed if flag is set)
if (xe_device_wedged(xe)) {
if (!xe->wedged.method)
xe_device_set_wedged_method(xe, ...);
drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL);
}
```
This ensures that:
1. Device wedged flag is set
2. **ALL GTs are fully wedged (submissions stopped, CT stopped, TLB
reset)**
3. **ONLY THEN is userspace notified**
### Code Changes Analysis
**From xe_device.c:1260-1280:**
The key changes are:
1. **Moved closing brace** - The uevent call is moved OUT of the `if
(!atomic_xchg(...))` block
2. **Reordered operations** - `for_each_gt()` loop moved BEFORE the
uevent
3. **Added new guard** - `if (xe_device_wedged(xe))` wraps the uevent
notification
4. **Uses new infrastructure** - References `xe->wedged.method` (from
dependency commit)
### Behavioral Changes
**Minor behavioral change:** The uevent is now sent on every call after
the flag is set (not just the first call). However, this is likely
benign because:
1. Most callers check `xe_device_wedged()` before calling (see
xe_gt.c:816, xe_guc_submit.c:1038)
2. These are error recovery paths that shouldn't execute repeatedly
3. Userspace should handle wedged events idempotently anyway
### Critical Dependency
**This commit has a HARD DEPENDENCY on commit 60439ac3f2354** ("drm/xe:
Add a helper function to set recovery method") which:
1. Adds `xe->wedged.method` field to `struct xe_device`
(xe_device_types.h:544)
2. Adds `xe_device_set_wedged_method()` function (xe_device.c:1186)
3. Modifies `drm_dev_wedged_event()` call to use `xe->wedged.method`
**Without this dependency, the commit will NOT compile!**
The code in lines 1274-1276 references:
```c
if (!xe->wedged.method)
xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND |
DRM_WEDGE_RECOVERY_BUS_RESET);
```
And line 1279 uses:
```c
drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL);
```
Both require the infrastructure from commit 60439ac3f2354.
### Backporting Criteria Evaluation
1. **Does it fix a bug that affects users?** ✓ YES
- Fixes a race condition in critical error handling
- Affects device recovery and error reporting
- Can cause inconsistent state reporting to userspace
2. **Is the fix relatively small and contained?** ✓ YES
- Just 8 lines added, 4 lines removed
- Single function modified
- Localized to xe_device_declare_wedged()
3. **Does it have clear side effects beyond fixing the issue?** ✓ NO
- Only minor behavioral change (potential multiple uevents)
- No new functionality added
- No API changes
4. **Does it include major architectural changes?** ✓ NO
- Simple code reordering
- No new subsystems or abstractions
5. **Does it touch critical kernel subsystems?** ⚠ YES (but contained)
- Touches DRM Xe driver (Intel GPU driver)
- Critical for device error handling
- But change is confined to xe driver only
6. **Is there explicit mention of stable tree backporting?** ✗ NO
- No `Cc: stable@vger.kernel.org` tag
- No `Fixes:` tag
7. **Does it follow stable tree rules?** ✓ YES
- Important bugfix (race condition)
- Minimal risk of regression
- No new features
- Confined to single driver
### Risk Assessment
**Risk Level: LOW**
**Reasons:**
- The reordering is logically sound and correct
- Code has been reviewed by maintainers (Reviewed-by: Matthew Brost)
- No reverts found in subsequent commits
- The potential multiple-uevent issue is mitigated by caller checks
- Change is contained to error handling paths
**Testing considerations:**
- Error paths are inherently difficult to test
- Requires triggering GT reset failures or GuC load failures
- May need fault injection testing
### Recommendation
**YES - Backport this commit, BUT as part of a series with its
dependency.**
**Required commits for backport (in order):**
1. **60439ac3f2354** - "drm/xe: Add a helper function to set recovery
method"
2. **90fdcf5f89e92** - "drm/xe: Set GT as wedged before sending wedged
uevent" (this commit)
**Rationale:**
- Fixes a real race condition that can cause inconsistent device state
- Small, contained, and low-risk change
- Important for proper error handling and recovery
- Has clear benefit for users with Intel Xe GPUs
- No known regressions or issues
**Target stable trees:**
- Any stable kernel that includes the DRM Xe driver
- Likely 6.8+ (when Xe driver was merged)
**Note:** The commits should be backported as a pair in the correct
order to maintain compilation and functionality.
drivers/gpu/drm/xe/xe_device.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 1c9907b8a4e9e..d399c2628fa33 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1157,8 +1157,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
* xe_device_declare_wedged - Declare device wedged
* @xe: xe device instance
*
- * This is a final state that can only be cleared with a module
+ * This is a final state that can only be cleared with the recovery method
+ * specified in the drm wedged uevent. The default recovery method is
* re-probe (unbind + bind).
+ *
* In this state every IOCTL will be blocked so the GT cannot be used.
* In general it will be called upon any critical error such as gt reset
* failure or guc loading failure. Userspace will be notified of this state
@@ -1192,13 +1194,15 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
dev_name(xe->drm.dev));
+ }
+
+ for_each_gt(gt, xe, id)
+ xe_gt_declare_wedged(gt);
+ if (xe_device_wedged(xe)) {
/* Notify userspace of wedged device */
drm_dev_wedged_event(&xe->drm,
DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
NULL);
}
-
- for_each_gt(gt, xe, id)
- xe_gt_declare_wedged(gt);
}
--
2.51.0
next prev parent reply other threads:[~2025-10-25 16:13 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/pcode: Initialize data0 for pcode read routine Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: improve dma-resv handling for backup object Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: Extend wa_13012615864 to additional Xe2 and Xe3 platforms Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/ptl: Apply Wa_16026007364 Sasha Levin
2025-10-25 15:55 ` Sasha Levin [this message]
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/i2c: Enable bus mastering Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/configfs: Enforce canonical device names Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Extend Wa_22021007897 to Xe3 platforms Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Increase GuC crash dump buffer size Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/wcl: Extend L3bank mask workaround Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Set upper limit of H2G retries over CTB Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe: Make page size consistent in loop Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Add devm release action to safely tear down CT Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Program LMTT directory pointer on all GTs within a tile Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Always add CT disable action during second init step Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Don't resume device from restart worker Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: Ensure GT is in C0 during resumes Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: rework PDE PAT index selection Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Add more GuC load error status codes Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251025160905.3857885-73-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=matthew.brost@intel.com \
--cc=patches@lists.linux.dev \
--cc=raag.jadav@intel.com \
--cc=riana.tauro@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=stable@vger.kernel.org \
--cc=thomas.hellstrom@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox