From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 15192301022; Sat, 25 Oct 2025 16:13:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761408794; cv=none; b=hDHcykNjyzZxW/AwdP7D4Hr02tRFvpsB+4FnfhA4T2BBUEMy5/0NmwXzOr4GZ7lB+EEo5SuFmWsPDtmfil1iX6kUH748UafeDGPRsCBNDxjRcloeSJy6PqHQMu9OI/EEvKy2vvXMDdVRCLo0ou9rIqNTcxWrFTjYr5FetyI9zjg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761408794; c=relaxed/simple; bh=6G/gvk48evdMBUemFgDfd5CZY6KsXKzKU0k9aBxnDNg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=EwX7gbJ3fmzTOUj0mX/21BAlPneNikTbF86qCkhiDcOQbOieuJEEdUYo12Y7cILCLCih5CCcq/ZVS2rzwWAfa6RmTNudfCjkXk1X2A3YO7+pfiVKTpsSeH0JgJ1VZspJ/jfFY7OMCQwN/5qyQUknxsB8AiVMn4XHwKUSfq4Qr6M= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=oIauK5VY; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="oIauK5VY" Received: by smtp.kernel.org (Postfix) with ESMTPSA id D4992C4CEF5; Sat, 25 Oct 2025 16:13:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1761408794; bh=6G/gvk48evdMBUemFgDfd5CZY6KsXKzKU0k9aBxnDNg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=oIauK5VYMUA362OEEwc+Uzgv51oV3T13bwxLmDEvQrKCM7/laHd5a0U6Dkp7QlsZp Mknj5N+3cfz4/CBM7q6rEbpCMYOO51al/gwahewkQ2NB+2o1LdTpBkCZZVwDmCg1Ao c8EN0Se0IGOpIvhJnFbhppdsZ6ErvDrLU71zr8uSjprRlatICuquMylyXWwrcwnzwI ojMEF3E669VRrobPZ/drG1exnJtGLnKOdwRRU6CD5AetPRNWrACJpjLsizi9KnwcW8 doUD1I5jiOW65DPIwNqrsWBNxHu0eFNUqF83gUFPWVYQXGJrdac9UFv/UjZurNBPwr ccLJC8zur4L1Q== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Riana Tauro , Matthew Brost , Raag Jadav , Rodrigo Vivi , Sasha Levin , lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, intel-xe@lists.freedesktop.org Subject: [PATCH AUTOSEL 6.17] drm/xe: Set GT as wedged before sending wedged uevent Date: Sat, 25 Oct 2025 11:55:04 -0400 Message-ID: <20251025160905.3857885-73-sashal@kernel.org> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org> References: <20251025160905.3857885-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.17.5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Riana Tauro [ Upstream commit 90fdcf5f89e9288c153923f16a60e6f7da18ba76 ] Userspace should be notified after setting the device as wedged. Re-order function calls to set gt wedged before sending uevent. Cc: Matthew Brost Suggested-by: Raag Jadav Signed-off-by: Riana Tauro Reviewed-by: Matthew Brost Link: https://lore.kernel.org/r/20250826063419.3022216-4-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: ## Backport Recommendation: YES (with dependency) **Answer: YES** - This commit should be backported to stable kernel trees, but ONLY together with its prerequisite commit 60439ac3f2354 ("drm/xe: Add a helper function to set recovery method"). --- ## Comprehensive Analysis ### Race Condition Being Fixed This commit addresses a critical **Time-of-Check-to-Time-of-Use (TOCTOU) race condition** in the device wedging error path. **The Problem (OLD CODE):** ```c if (!atomic_xchg(&xe->wedged.flag, 1)) { // Set device as wedged xe->needs_flr_on_fini = true; drm_err(&xe->drm, ...); // BUG: Send uevent to userspace HERE drm_dev_wedged_event(&xe->drm, ...); } // Wedge GTs AFTER userspace notification for_each_gt(gt, xe, id) xe_gt_declare_wedged(gt); ``` **The Race:** 1. Thread sets `xe->wedged.flag = 1` 2. Thread sends uevent to userspace notifying of wedged state 3. **Userspace receives notification and may check device state** 4. **BUT: GTs are NOT yet wedged!** 5. Thread finally calls `xe_gt_declare_wedged()` which: - Stops submission via `xe_guc_submit_wedge()` - Stops command transport via `xe_guc_ct_stop()` - Resets TLB invalidation via `xe_tlb_inval_reset()` **The Impact:** Userspace receiving the wedged uevent might: - Query device state and see inconsistent information - Initiate recovery procedures on a partially-wedged device - Experience race conditions in error handling logic - See submissions still active when device should be fully wedged ### The Fix (NEW CODE) The commit reorders operations to ensure atomicity from userspace's perspective: ```c if (!atomic_xchg(&xe->wedged.flag, 1)) { xe->needs_flr_on_fini = true; drm_err(&xe->drm, ...); } // ← Close the atomic block // Wedge ALL GTs FIRST (always executed) for_each_gt(gt, xe, id) xe_gt_declare_wedged(gt); // Then notify userspace (always executed if flag is set) if (xe_device_wedged(xe)) { if (!xe->wedged.method) xe_device_set_wedged_method(xe, ...); drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL); } ``` This ensures that: 1. Device wedged flag is set 2. **ALL GTs are fully wedged (submissions stopped, CT stopped, TLB reset)** 3. **ONLY THEN is userspace notified** ### Code Changes Analysis **From xe_device.c:1260-1280:** The key changes are: 1. **Moved closing brace** - The uevent call is moved OUT of the `if (!atomic_xchg(...))` block 2. **Reordered operations** - `for_each_gt()` loop moved BEFORE the uevent 3. **Added new guard** - `if (xe_device_wedged(xe))` wraps the uevent notification 4. **Uses new infrastructure** - References `xe->wedged.method` (from dependency commit) ### Behavioral Changes **Minor behavioral change:** The uevent is now sent on every call after the flag is set (not just the first call). However, this is likely benign because: 1. Most callers check `xe_device_wedged()` before calling (see xe_gt.c:816, xe_guc_submit.c:1038) 2. These are error recovery paths that shouldn't execute repeatedly 3. Userspace should handle wedged events idempotently anyway ### Critical Dependency **This commit has a HARD DEPENDENCY on commit 60439ac3f2354** ("drm/xe: Add a helper function to set recovery method") which: 1. Adds `xe->wedged.method` field to `struct xe_device` (xe_device_types.h:544) 2. Adds `xe_device_set_wedged_method()` function (xe_device.c:1186) 3. Modifies `drm_dev_wedged_event()` call to use `xe->wedged.method` **Without this dependency, the commit will NOT compile!** The code in lines 1274-1276 references: ```c if (!xe->wedged.method) xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET); ``` And line 1279 uses: ```c drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL); ``` Both require the infrastructure from commit 60439ac3f2354. ### Backporting Criteria Evaluation 1. **Does it fix a bug that affects users?** ✓ YES - Fixes a race condition in critical error handling - Affects device recovery and error reporting - Can cause inconsistent state reporting to userspace 2. **Is the fix relatively small and contained?** ✓ YES - Just 8 lines added, 4 lines removed - Single function modified - Localized to xe_device_declare_wedged() 3. **Does it have clear side effects beyond fixing the issue?** ✓ NO - Only minor behavioral change (potential multiple uevents) - No new functionality added - No API changes 4. **Does it include major architectural changes?** ✓ NO - Simple code reordering - No new subsystems or abstractions 5. **Does it touch critical kernel subsystems?** ⚠ YES (but contained) - Touches DRM Xe driver (Intel GPU driver) - Critical for device error handling - But change is confined to xe driver only 6. **Is there explicit mention of stable tree backporting?** ✗ NO - No `Cc: stable@vger.kernel.org` tag - No `Fixes:` tag 7. **Does it follow stable tree rules?** ✓ YES - Important bugfix (race condition) - Minimal risk of regression - No new features - Confined to single driver ### Risk Assessment **Risk Level: LOW** **Reasons:** - The reordering is logically sound and correct - Code has been reviewed by maintainers (Reviewed-by: Matthew Brost) - No reverts found in subsequent commits - The potential multiple-uevent issue is mitigated by caller checks - Change is contained to error handling paths **Testing considerations:** - Error paths are inherently difficult to test - Requires triggering GT reset failures or GuC load failures - May need fault injection testing ### Recommendation **YES - Backport this commit, BUT as part of a series with its dependency.** **Required commits for backport (in order):** 1. **60439ac3f2354** - "drm/xe: Add a helper function to set recovery method" 2. **90fdcf5f89e92** - "drm/xe: Set GT as wedged before sending wedged uevent" (this commit) **Rationale:** - Fixes a real race condition that can cause inconsistent device state - Small, contained, and low-risk change - Important for proper error handling and recovery - Has clear benefit for users with Intel Xe GPUs - No known regressions or issues **Target stable trees:** - Any stable kernel that includes the DRM Xe driver - Likely 6.8+ (when Xe driver was merged) **Note:** The commits should be backported as a pair in the correct order to maintain compilation and functionality. drivers/gpu/drm/xe/xe_device.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index 1c9907b8a4e9e..d399c2628fa33 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -1157,8 +1157,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg) * xe_device_declare_wedged - Declare device wedged * @xe: xe device instance * - * This is a final state that can only be cleared with a module + * This is a final state that can only be cleared with the recovery method + * specified in the drm wedged uevent. The default recovery method is * re-probe (unbind + bind). + * * In this state every IOCTL will be blocked so the GT cannot be used. * In general it will be called upon any critical error such as gt reset * failure or guc loading failure. Userspace will be notified of this state @@ -1192,13 +1194,15 @@ void xe_device_declare_wedged(struct xe_device *xe) "IOCTLs and executions are blocked. Only a rebind may clear the failure\n" "Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n", dev_name(xe->drm.dev)); + } + + for_each_gt(gt, xe, id) + xe_gt_declare_wedged(gt); + if (xe_device_wedged(xe)) { /* Notify userspace of wedged device */ drm_dev_wedged_event(&xe->drm, DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET, NULL); } - - for_each_gt(gt, xe, id) - xe_gt_declare_wedged(gt); } -- 2.51.0