From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
Michal Wajdeczko <michal.wajdeczko@intel.com>,
Matthew Brost <matthew.brost@intel.com>,
Matthew Auld <matthew.auld@intel.com>,
Summers Stuart <stuart.summers@intel.com>,
Sasha Levin <sashal@kernel.org>,
lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
rodrigo.vivi@intel.com, intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/xe/guc: Add devm release action to safely tear down CT
Date: Sat, 25 Oct 2025 11:57:44 -0400 [thread overview]
Message-ID: <20251025160905.3857885-233-sashal@kernel.org> (raw)
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
[ Upstream commit ee4b32220a6b41e71512e8804585325e685456ba ]
When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE
flag, the driver initiates TLB invalidation requests via the CTB mechanism
while releasing the BO. However a premature release of the CTB BO can lead
to system crashes, as observed in:
Oops: Oops: 0000 [#1] SMP NOPTI
RIP: 0010:h2g_write+0x2f3/0x7c0 [xe]
Call Trace:
guc_ct_send_locked+0x8b/0x670 [xe]
xe_guc_ct_send_locked+0x19/0x60 [xe]
send_tlb_invalidation+0xb4/0x460 [xe]
xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe]
ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe]
ggtt_node_remove+0x110/0x140 [xe]
xe_ggtt_node_remove+0x40/0xa0 [xe]
xe_ggtt_remove_bo+0x87/0x250 [xe]
Introduce a devm-managed release action during xe_guc_ct_init() and
xe_guc_ct_init_post_hwconfig() to ensure proper CTB disablement before
resource deallocation, preventing the use-after-free scenario.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Summers Stuart <stuart.summers@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250901072541.31461-1-satyanarayana.k.v.p@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- Fixes a real crash/UAF: The commit addresses a use-after-free when BOs
allocated with `XE_BO_FLAG_GGTT_INVALIDATE` trigger GGTT TLB
invalidations during teardown via the GuC CTB path, while the CTB BO
may have already been released. The reported call trace shows the GuC
CT send path being used during BO removal, leading to a crash if CT
resources are torn down too early (h2g_write → guc_ct_send_locked →
send_tlb_invalidation → ggtt_* → xe_ggtt_remove_bo).
- Core fix (devm-managed CT disable): A devm release action is added so
CT is transitioned to the disabled state during device-managed
teardown before dependent resources are freed. This is implemented by
registering a managed action that calls `guc_ct_change_state(ct,
XE_GUC_CT_STATE_DISABLED)`:
- New action: `guc_action_disable_ct()` calls the internal state
change to disabled, canceling fences, quiescing paths and preventing
further CT traffic (drivers/gpu/drm/xe/xe_guc_ct.c:257).
- Action registration in init: `devm_add_action_or_reset(xe->drm.dev,
guc_action_disable_ct, ct)` ensures the disable runs during teardown
(drivers/gpu/drm/xe/xe_guc_ct.c:281).
- This is small, contained, and only affects the XE GuC CT teardown
behavior.
- Ensures correct teardown ordering across reinit: The CT buffer is
reallocated into VRAM post-hwconfig for dGFX, which changes devres
ordering. To keep the “disable CT” action running before releasing the
(new) CT BO, the patch removes and re-adds the devm action after the
VRAM reinit so the disable action is the last registered and runs
first (LIFO) during teardown:
- CT VRAM reinit helper: `xe_guc_ct_init_post_hwconfig()` performs
`xe_managed_bo_reinit_in_vram()` for `ct->bo` and then removes and
re-adds the devm action to fix ordering
(drivers/gpu/drm/xe/xe_guc_ct.c:294-311).
- The GuC-level post-hwconfig flow calls this new helper after generic
reallocations (drivers/gpu/drm/xe/xe_guc.c:833,
drivers/gpu/drm/xe/xe_guc.c:837-839). This also removes the previous
attempt to reinit `guc->ct.bo` in the generic realloc function to
avoid ordering issues.
- Prevents the UAF in practice: TLB invalidation uses the CT path only
if CT is enabled; otherwise it falls back to a safe MMIO path:
- The GGTT invalidation path checks `xe_guc_ct_enabled(&guc->ct)` and
submission state; if disabled, it uses MMIO-based invalidation
instead (drivers/gpu/drm/xe/xe_guc_tlb_inval.c:64-72,
drivers/gpu/drm/xe/xe_guc_tlb_inval.c:72-90).
- By setting CT state to disabled via the devm action before CT BO or
other resources are freed, teardown-time invalidations avoid the CT
path, eliminating the use-after-free.
- Scope and risk:
- Driver-only fix confined to the XE GuC/CT code paths
(drivers/gpu/drm/xe/*).
- No ABI changes, no feature additions, no architectural refactor.
- The devm action calls the same internal state transition used by
existing disable flows, with proper locking and fence cancelation.
The change is minimal and low risk.
- Stable suitability:
- Clearly fixes an important, user-affecting crash (Oops/UAF) during
teardown.
- Small, self-contained, and limited to the XE GuC CT
initialization/teardown ordering.
- Aligns with stable rules: bugfix, minimal risk, no new features,
confined to a subsystem.
- Note: Depending on stable branch, the helper used to drop/re-add the
devm action may be `devm_remove_action` vs `devm_release_action`;
trivial adaptation may be required, but the logic remains the same.
Given the crash it prevents, the minimal and targeted nature of the
changes, and the clear correctness rationale tied to teardown ordering
and the CT-enabled check in TLB invalidation, this is a strong candidate
for backporting to stable.
drivers/gpu/drm/xe/xe_guc.c | 8 +++----
drivers/gpu/drm/xe/xe_guc_ct.c | 41 +++++++++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_guc_ct.h | 1 +
3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index 9e0ed8fabcd54..62c76760fd26f 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -701,10 +701,6 @@ static int xe_guc_realloc_post_hwconfig(struct xe_guc *guc)
if (ret)
return ret;
- ret = xe_managed_bo_reinit_in_vram(xe, tile, &guc->ct.bo);
- if (ret)
- return ret;
-
return 0;
}
@@ -839,6 +835,10 @@ int xe_guc_init_post_hwconfig(struct xe_guc *guc)
if (ret)
return ret;
+ ret = xe_guc_ct_init_post_hwconfig(&guc->ct);
+ if (ret)
+ return ret;
+
guc_init_params_post_hwconfig(guc);
ret = xe_guc_submit_init(guc, ~0);
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 3f4e6a46ff163..6d70dd1c106d4 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -39,6 +39,8 @@ static void receive_g2h(struct xe_guc_ct *ct);
static void g2h_worker_func(struct work_struct *w);
static void safe_mode_worker_func(struct work_struct *w);
static void ct_exit_safe_mode(struct xe_guc_ct *ct);
+static void guc_ct_change_state(struct xe_guc_ct *ct,
+ enum xe_guc_ct_state state);
#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
enum {
@@ -252,6 +254,13 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct)
}
ALLOW_ERROR_INJECTION(xe_guc_ct_init_noalloc, ERRNO); /* See xe_pci_probe() */
+static void guc_action_disable_ct(void *arg)
+{
+ struct xe_guc_ct *ct = arg;
+
+ guc_ct_change_state(ct, XE_GUC_CT_STATE_DISABLED);
+}
+
int xe_guc_ct_init(struct xe_guc_ct *ct)
{
struct xe_device *xe = ct_to_xe(ct);
@@ -268,10 +277,40 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
return PTR_ERR(bo);
ct->bo = bo;
- return 0;
+
+ return devm_add_action_or_reset(xe->drm.dev, guc_action_disable_ct, ct);
}
ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
+/**
+ * xe_guc_ct_init_post_hwconfig - Reinitialize the GuC CTB in VRAM
+ * @ct: the &xe_guc_ct
+ *
+ * Allocate a new BO in VRAM and free the previous BO that was allocated
+ * in system memory (SMEM). Applicable only for DGFX products.
+ *
+ * Return: 0 on success, or a negative errno on failure.
+ */
+int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct)
+{
+ struct xe_device *xe = ct_to_xe(ct);
+ struct xe_gt *gt = ct_to_gt(ct);
+ struct xe_tile *tile = gt_to_tile(gt);
+ int ret;
+
+ xe_assert(xe, !xe_guc_ct_enabled(ct));
+
+ if (!IS_DGFX(xe))
+ return 0;
+
+ ret = xe_managed_bo_reinit_in_vram(xe, tile, &ct->bo);
+ if (ret)
+ return ret;
+
+ devm_release_action(xe->drm.dev, guc_action_disable_ct, ct);
+ return devm_add_action_or_reset(xe->drm.dev, guc_action_disable_ct, ct);
+}
+
#define desc_read(xe_, guc_ctb__, field_) \
xe_map_rd_field(xe_, &guc_ctb__->desc, 0, \
struct guc_ct_buffer_desc, field_)
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index 18d4225e65024..cf41210ab30ae 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -13,6 +13,7 @@ struct xe_device;
int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
int xe_guc_ct_init(struct xe_guc_ct *ct);
+int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
int xe_guc_ct_enable(struct xe_guc_ct *ct);
void xe_guc_ct_disable(struct xe_guc_ct *ct);
void xe_guc_ct_stop(struct xe_guc_ct *ct);
--
2.51.0
next prev parent reply other threads:[~2025-10-25 16:19 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/pcode: Initialize data0 for pcode read routine Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: improve dma-resv handling for backup object Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe: Extend wa_13012615864 to additional Xe2 and Xe3 platforms Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] drm/xe/ptl: Apply Wa_16026007364 Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe: Set GT as wedged before sending wedged uevent Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/i2c: Enable bus mastering Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17] drm/xe/configfs: Enforce canonical device names Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Extend Wa_22021007897 to Xe3 platforms Sasha Levin
2025-10-25 15:56 ` [PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Increase GuC crash dump buffer size Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/wcl: Extend L3bank mask workaround Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Set upper limit of H2G retries over CTB Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe: Make page size consistent in loop Sasha Levin
2025-10-25 15:57 ` Sasha Levin [this message]
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Program LMTT directory pointer on all GTs within a tile Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/guc: Always add CT disable action during second init step Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] drm/xe/pf: Don't resume device from restart worker Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Return an error code if the GuC load fails Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: Ensure GT is in C0 during resumes Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] drm/xe: rework PDE PAT index selection Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Add more GuC load error status codes Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-6.12] drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251025160905.3857885-233-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=matthew.auld@intel.com \
--cc=matthew.brost@intel.com \
--cc=michal.wajdeczko@intel.com \
--cc=patches@lists.linux.dev \
--cc=rodrigo.vivi@intel.com \
--cc=satyanarayana.k.v.p@intel.com \
--cc=stable@vger.kernel.org \
--cc=stuart.summers@intel.com \
--cc=thomas.hellstrom@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox