Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Riana Tauro <riana.tauro@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Raag Jadav <raag.jadav@intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
	intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/xe: Set GT as wedged before sending wedged uevent
Date: Sat, 25 Oct 2025 11:55:04 -0400	[thread overview]
Message-ID: <20251025160905.3857885-73-sashal@kernel.org> (raw)
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>

From: Riana Tauro <riana.tauro@intel.com>

[ Upstream commit 90fdcf5f89e9288c153923f16a60e6f7da18ba76 ]

Userspace should be notified after setting the device as wedged.
Re-order function calls to set gt wedged before sending uevent.

Cc: Matthew Brost <matthew.brost@intel.com>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-4-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Backport Recommendation: YES (with dependency)

**Answer: YES** - This commit should be backported to stable kernel
trees, but ONLY together with its prerequisite commit 60439ac3f2354
("drm/xe: Add a helper function to set recovery method").

---

## Comprehensive Analysis

### Race Condition Being Fixed

This commit addresses a critical **Time-of-Check-to-Time-of-Use (TOCTOU)
race condition** in the device wedging error path.

**The Problem (OLD CODE):**
```c
if (!atomic_xchg(&xe->wedged.flag, 1)) {
    // Set device as wedged
    xe->needs_flr_on_fini = true;
    drm_err(&xe->drm, ...);

    // BUG: Send uevent to userspace HERE
    drm_dev_wedged_event(&xe->drm, ...);
}

// Wedge GTs AFTER userspace notification
for_each_gt(gt, xe, id)
    xe_gt_declare_wedged(gt);
```

**The Race:**
1. Thread sets `xe->wedged.flag = 1`
2. Thread sends uevent to userspace notifying of wedged state
3. **Userspace receives notification and may check device state**
4. **BUT: GTs are NOT yet wedged!**
5. Thread finally calls `xe_gt_declare_wedged()` which:
   - Stops submission via `xe_guc_submit_wedge()`
   - Stops command transport via `xe_guc_ct_stop()`
   - Resets TLB invalidation via `xe_tlb_inval_reset()`

**The Impact:**
Userspace receiving the wedged uevent might:
- Query device state and see inconsistent information
- Initiate recovery procedures on a partially-wedged device
- Experience race conditions in error handling logic
- See submissions still active when device should be fully wedged

### The Fix (NEW CODE)

The commit reorders operations to ensure atomicity from userspace's
perspective:

```c
if (!atomic_xchg(&xe->wedged.flag, 1)) {
    xe->needs_flr_on_fini = true;
    drm_err(&xe->drm, ...);
}  // ← Close the atomic block

// Wedge ALL GTs FIRST (always executed)
for_each_gt(gt, xe, id)
    xe_gt_declare_wedged(gt);

// Then notify userspace (always executed if flag is set)
if (xe_device_wedged(xe)) {
    if (!xe->wedged.method)
        xe_device_set_wedged_method(xe, ...);
    drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL);
}
```

This ensures that:
1. Device wedged flag is set
2. **ALL GTs are fully wedged (submissions stopped, CT stopped, TLB
   reset)**
3. **ONLY THEN is userspace notified**

### Code Changes Analysis

**From xe_device.c:1260-1280:**

The key changes are:
1. **Moved closing brace** - The uevent call is moved OUT of the `if
   (!atomic_xchg(...))` block
2. **Reordered operations** - `for_each_gt()` loop moved BEFORE the
   uevent
3. **Added new guard** - `if (xe_device_wedged(xe))` wraps the uevent
   notification
4. **Uses new infrastructure** - References `xe->wedged.method` (from
   dependency commit)

### Behavioral Changes

**Minor behavioral change:** The uevent is now sent on every call after
the flag is set (not just the first call). However, this is likely
benign because:

1. Most callers check `xe_device_wedged()` before calling (see
   xe_gt.c:816, xe_guc_submit.c:1038)
2. These are error recovery paths that shouldn't execute repeatedly
3. Userspace should handle wedged events idempotently anyway

### Critical Dependency

**This commit has a HARD DEPENDENCY on commit 60439ac3f2354** ("drm/xe:
Add a helper function to set recovery method") which:

1. Adds `xe->wedged.method` field to `struct xe_device`
   (xe_device_types.h:544)
2. Adds `xe_device_set_wedged_method()` function (xe_device.c:1186)
3. Modifies `drm_dev_wedged_event()` call to use `xe->wedged.method`

**Without this dependency, the commit will NOT compile!**

The code in lines 1274-1276 references:
```c
if (!xe->wedged.method)
    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND |
                                DRM_WEDGE_RECOVERY_BUS_RESET);
```

And line 1279 uses:
```c
drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL);
```

Both require the infrastructure from commit 60439ac3f2354.

### Backporting Criteria Evaluation

1. **Does it fix a bug that affects users?** ✓ YES
   - Fixes a race condition in critical error handling
   - Affects device recovery and error reporting
   - Can cause inconsistent state reporting to userspace

2. **Is the fix relatively small and contained?** ✓ YES
   - Just 8 lines added, 4 lines removed
   - Single function modified
   - Localized to xe_device_declare_wedged()

3. **Does it have clear side effects beyond fixing the issue?** ✓ NO
   - Only minor behavioral change (potential multiple uevents)
   - No new functionality added
   - No API changes

4. **Does it include major architectural changes?** ✓ NO
   - Simple code reordering
   - No new subsystems or abstractions

5. **Does it touch critical kernel subsystems?** ⚠ YES (but contained)
   - Touches DRM Xe driver (Intel GPU driver)
   - Critical for device error handling
   - But change is confined to xe driver only

6. **Is there explicit mention of stable tree backporting?** ✗ NO
   - No `Cc: stable@vger.kernel.org` tag
   - No `Fixes:` tag

7. **Does it follow stable tree rules?** ✓ YES
   - Important bugfix (race condition)
   - Minimal risk of regression
   - No new features
   - Confined to single driver

### Risk Assessment

**Risk Level: LOW**

**Reasons:**
- The reordering is logically sound and correct
- Code has been reviewed by maintainers (Reviewed-by: Matthew Brost)
- No reverts found in subsequent commits
- The potential multiple-uevent issue is mitigated by caller checks
- Change is contained to error handling paths

**Testing considerations:**
- Error paths are inherently difficult to test
- Requires triggering GT reset failures or GuC load failures
- May need fault injection testing

### Recommendation

**YES - Backport this commit, BUT as part of a series with its
dependency.**

**Required commits for backport (in order):**
1. **60439ac3f2354** - "drm/xe: Add a helper function to set recovery
   method"
2. **90fdcf5f89e92** - "drm/xe: Set GT as wedged before sending wedged
   uevent" (this commit)

**Rationale:**
- Fixes a real race condition that can cause inconsistent device state
- Small, contained, and low-risk change
- Important for proper error handling and recovery
- Has clear benefit for users with Intel Xe GPUs
- No known regressions or issues

**Target stable trees:**
- Any stable kernel that includes the DRM Xe driver
- Likely 6.8+ (when Xe driver was merged)

**Note:** The commits should be backported as a pair in the correct
order to maintain compilation and functionality.

 drivers/gpu/drm/xe/xe_device.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 1c9907b8a4e9e..d399c2628fa33 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1157,8 +1157,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a module
+ * This is a final state that can only be cleared with the recovery method
+ * specified in the drm wedged uevent. The default recovery method is
  * re-probe (unbind + bind).
+ *
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
  * failure or guc loading failure. Userspace will be notified of this state
@@ -1192,13 +1194,15 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
 			dev_name(xe->drm.dev));
+	}
+
+	for_each_gt(gt, xe, id)
+		xe_gt_declare_wedged(gt);
 
+	if (xe_device_wedged(xe)) {
 		/* Notify userspace of wedged device */
 		drm_dev_wedged_event(&xe->drm,
 				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
 				     NULL);
 	}
-
-	for_each_gt(gt, xe, id)
-		xe_gt_declare_wedged(gt);
 }
-- 
2.51.0


  parent reply	other threads:[~2025-10-25 16:13 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/pcode: Initialize data0 for pcode read routine Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: improve dma-resv handling for backup object Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: Extend wa_13012615864 to additional Xe2 and Xe3 platforms Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/ptl: Apply Wa_16026007364 Sasha Levin
2025-10-25 15:55 ` Sasha Levin [this message]
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/i2c: Enable bus mastering Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/configfs: Enforce canonical device names Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Extend Wa_22021007897 to Xe3 platforms Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Increase GuC crash dump buffer size Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/wcl: Extend L3bank mask workaround Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Set upper limit of H2G retries over CTB Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe: Make page size consistent in loop Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Add devm release action to safely tear down CT Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Program LMTT directory pointer on all GTs within a tile Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Always add CT disable action during second init step Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Don't resume device from restart worker Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: Ensure GT is in C0 during resumes Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: rework PDE PAT index selection Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Add more GuC load error status codes Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251025160905.3857885-73-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=patches@lists.linux.dev \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox