[PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Stuart Summers <stuart.summers@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
	rodrigo.vivi@intel.com, intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/xe: Cancel pending TLB inval workers on teardown
Date: Sat, 25 Oct 2025 11:56:18 -0400
Message-ID: <20251025160905.3857885-147-sashal@kernel.org>
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>

From: Stuart Summers <stuart.summers@intel.com>

[ Upstream commit 76186a253a4b9eb41c5a83224c14efdf30960a71 ]

Add a new _fini() routine on the GT TLB invalidation
side to handle this worker cleanup on driver teardown.

v2: Move the TLB teardown to the gt fini() routine called during
    gt_init rather than in gt_alloc. This way the GT structure stays
    alive while we reset the TLB state.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-3-stuart.summers@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

- What it fixes
  - Prevents use-after-free/hangs on driver teardown by cancelling
    pending TLB-invalidation workers/fences before GT resources are
    dismantled. The reset path already handles this during GT resets;
    this commit ensures the same cleanup occurs on teardown.

- Key changes and why they matter
  - drivers/gpu/drm/xe/xe_gt.c: `xe_gt_fini()` now calls
    `xe_gt_tlb_invalidation_fini(gt)` first. This ensures TLB
    invalidation workers/fences are cancelled while the GT is still
    alive, avoiding races/UAF during teardown.
  - drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c: Adds
    `xe_gt_tlb_invalidation_fini(struct xe_gt *gt)` which simply calls
    `xe_gt_tlb_invalidation_reset(gt)`. The reset routine:
    - Computes a “pending” seqno and updates `seqno_recv` so waiters see
      all prior invalidations as complete.
    - Iterates `pending_fences` and signals them, waking any kworkers
      waiting for TLB flush completion.
    - This mirrors the existing reset behavior (cancel delayed work,
      advance seqno, signal fences) used during GT resets to guarantee
      no waiter is left behind.
  - drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h: Adds the prototype for
    the new fini, keeping the API consistent.

- Concrete evidence in the code changes
  - The commit places `xe_gt_tlb_invalidation_fini(gt)` at the start of
    GT teardown (xe_gt.c: in `xe_gt_fini()`), so TLB/worker cleanup runs
    before `xe_hw_fence_irq_finish()` and
    `xe_gt_disable_host_l2_vram()`. This ordering minimizes races with
    IRQ/fence infrastructure and other GT resources during teardown.
  - The fini routine calls into the reset path, which explicitly:
    - Sets `seqno_recv` to a value covering all outstanding requests.
    - Signals all pending invalidation fences via
      `list_for_each_entry_safe(... pending_fences ...)`, ensuring
      waiters are released.
  - This matches the comment in the reset path about kworkers that are
    not tracked by explicit TLB fences and that must be woken on the
    assumption of a full GT reset.

- Mapping to current tree (for context/impact assessment)
  - In this tree, the corresponding logic lives under the “tlb_inval”
    names:
    - The reset path is implemented in
      `drivers/gpu/drm/xe/xe_tlb_inval.c:156` (`xe_tlb_inval_reset()`),
      which cancels the delayed timeout work, updates `seqno_recv`, and
      signals all `pending_fences`.
    - This path is already invoked during GT reset flows (e.g.,
      `drivers/gpu/drm/xe/xe_gt.c:853, 1067, 1139`), so the same logic
      is already exercised at runtime whenever a GT reset occurs.
    - A drmm-managed teardown hook exists
      (`drivers/gpu/drm/xe/xe_tlb_inval.c:114`), but that operates at
      DRM device teardown. If GT devm teardown runs earlier, there is a
      window where TLB invalidation workers could outlive GT, risking
      UAF. Moving the cleanup into `xe_gt_fini()` (devm action, see
      `drivers/gpu/drm/xe/xe_gt.c:624`) closes that gap, which is
      exactly what this commit does in its codebase.

- Stable backport criteria
  - Important bugfix: avoids teardown-time UAF/hangs/leaks by cancelling
    and signalling all pending TLB invalidation work.
  - Small and contained: touches only the xe GT/TLB invalidation
    teardown path; adds one call-site and a thin wrapper.
  - No feature or architectural change: purely lifecycle/cleanup
    ordering.
  - Low regression risk: uses the same reset logic already exercised in
    GT reset paths.
  - Driver subsystem only (DRM xe), not core kernel.

- Conclusion
  - This is a clear, low-risk correctness fix for teardown-time resource
    and worker cleanup in the xe driver. It should be backported to
    stable trees where the xe driver and TLB invalidation workers exist,
    adapting symbol/file names as needed (e.g., calling
    `xe_tlb_inval_reset(&gt->tlb_inval)` from `xe_gt_fini()` in trees
    that use the `tlb_inval` naming).
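
The reset behaviour described above (drop the timeout work, advance
`seqno_recv` past everything issued, signal every pending fence) can be
sketched as a small userspace model. This is only an illustration under
stated assumptions, not the driver code: every `model_*` name, the
singly linked list, and the `MODEL_SEQNO_MAX` wrap point are
hypothetical, and the real routine also takes the CT lock and uses
kernel list and workqueue helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrap point for the model; seqno 0 is assumed unused. */
#define MODEL_SEQNO_MAX 0x100000

/* Stand-in for a pending TLB invalidation fence. */
struct model_fence {
	int seqno;
	bool signaled;
	struct model_fence *next;
};

/* Stand-in for the per-GT TLB invalidation state. */
struct model_tlb_inval {
	int seqno;                   /* next seqno to be issued */
	int seqno_recv;              /* last seqno seen as completed */
	bool timeout_work_pending;   /* models the delayed timeout work */
	struct model_fence *pending; /* models the pending_fences list */
};

/* Model of the reset path; returns the number of fences signaled. */
int model_tlb_inval_reset(struct model_tlb_inval *t)
{
	int signaled = 0;

	assert(t != NULL);

	/* Stands in for cancelling the delayed timeout work. */
	t->timeout_work_pending = false;

	/* Treat every seqno issued so far as received. */
	t->seqno_recv = (t->seqno == 1) ? MODEL_SEQNO_MAX - 1 : t->seqno - 1;

	/* Pop each fence before signaling it, so removal is safe. */
	while (t->pending) {
		struct model_fence *f = t->pending;

		t->pending = f->next;
		f->next = NULL;
		f->signaled = true;
		signaled++;
	}

	return signaled;
}

/* Model of the new fini hook: a thin wrapper around reset. */
int model_tlb_inval_fini(struct model_tlb_inval *t)
{
	return model_tlb_inval_reset(t);
}
```

In the model, calling `model_tlb_inval_fini()` before the owning
structure is released leaves no fence unsignaled and no timeout work
pending, which is the property the ordering in `xe_gt_fini()` is meant
to provide.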

 drivers/gpu/drm/xe/xe_gt.c                  |  2 ++
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 12 ++++++++++++
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 17634195cdc26..6f63c658c341f 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -605,6 +605,8 @@ static void xe_gt_fini(void *arg)
 	struct xe_gt *gt = arg;
 	int i;
 
+	xe_gt_tlb_invalidation_fini(gt);
+
 	for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i)
 		xe_hw_fence_irq_finish(&gt->fence_irq[i]);
 
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 086c12ee3d9de..64cd6cf0ab8df 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -173,6 +173,18 @@ void xe_gt_tlb_invalidation_reset(struct xe_gt *gt)
 	mutex_unlock(&gt->uc.guc.ct.lock);
 }
 
+/**
+ *
+ * xe_gt_tlb_invalidation_fini - Clean up GT TLB invalidation state
+ *
+ * Cancel pending fence workers and clean up any additional
+ * GT TLB invalidation state.
+ */
+void xe_gt_tlb_invalidation_fini(struct xe_gt *gt)
+{
+	xe_gt_tlb_invalidation_reset(gt);
+}
+
 static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int seqno)
 {
 	int seqno_recv = READ_ONCE(gt->tlb_invalidation.seqno_recv);
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
index f7f0f2eaf4b59..3e4cff3922d6f 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
@@ -16,6 +16,7 @@ struct xe_vm;
 struct xe_vma;
 
 int xe_gt_tlb_invalidation_init_early(struct xe_gt *gt);
+void xe_gt_tlb_invalidation_fini(struct xe_gt *gt);
 
 void xe_gt_tlb_invalidation_reset(struct xe_gt *gt);
 int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
-- 
2.51.0
