From: Stuart Summers <stuart.summers@intel.com>
Cc: intel-xe@lists.freedesktop.org, matthew.brost@intel.com,
niranjana.vishwanathapura@intel.com, zhanjun.dong@intel.com,
shuicheng.lin@intel.com,
Stuart Summers <stuart.summers@intel.com>
Subject: [PATCH 0/6] Fix a couple of wedge corner-case memory leaks
Date: Mon, 27 Oct 2025 18:04:06 +0000
Message-ID: <20251027180412.63743-1-stuart.summers@intel.com>

Most of the patches in this series just add debug hints
to help track these issues down. I split them up in case
we want to pick and choose which ones to include in the
tree; I found them useful.

The main patch of interest is the last one in the series,
which fixes corner cases where the driver becomes wedged
either in the middle of communication with the DRM
scheduler or when the GuC becomes unresponsive. In both
cases there is a chance we leak memory around exec queue
members such as the LRC and the LRC BO. This patch fixes
those scenarios.

This series depends on [1].

v2: Address feedback from Matt:
- Let the DRM scheduler handle pausing/unpausing
- Still do the wait after the scheduling disable/deregister
as with the previous patch, but skip the intermediate
software-based schedule disable using the "banned"
flag and instead jump straight to the deregister
handling, which fully resets the queue state.
Note that for this case I am seeing a hardware failure
after submitting to GuC but before receiving the
response from GuC. So even if we wedge in this case
(monitoring the hardware state change), the queue
itself is not wedged because of the active GuC
submission (CT is not stalled at that point).
v3: Add back the xe pause checks and instead just kickstart
message handling in the guc_submit_fini() routine before
doing the async wait there.

v4: Handle the CT communication loss during wedge asynchronously.
Also combine those last two patches into one to handle
wedge cleanup generally.

v5: Add a new patch with a little documentation on the GuC
submission handling stages.
Move the scheduler kickstart and destruction call on the
dangling queues into the wedged_fini() callback. These are
now only called for queues in an error state - wedge was
called, but the queues weren't fully cleaned up, as seen
by the lack of an exec_queue reference at the time of
wedging.
Also fix the migration teardown ordering reference mistake
pointed out by Matt in the previous series rev.

v6: Implement and test against [1] with the changes Matt suggested.
[1]: https://patchwork.freedesktop.org/series/155315/
Stuart Summers (6):
drm/xe: Add additional trace points for LRCs
drm/xe: Add a trace point for VM close
drm/xe: Add the BO pointer info to the BO trace
drm/xe: Add new exec queue trace points
drm/xe: Correct migration VM teardown order
drm/xe: Clean up GuC software state after a wedge
drivers/gpu/drm/xe/xe_exec_queue.c | 4 +++
drivers/gpu/drm/xe/xe_guc_submit.c | 17 +++++++++---
drivers/gpu/drm/xe/xe_lrc.c | 4 +++
drivers/gpu/drm/xe/xe_lrc.h | 3 +++
drivers/gpu/drm/xe/xe_migrate.c | 7 ++---
drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++--
drivers/gpu/drm/xe/xe_trace_bo.h | 12 +++++++--
drivers/gpu/drm/xe/xe_trace_lrc.h | 42 +++++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_vm.c | 2 ++
9 files changed, 101 insertions(+), 12 deletions(-)
--
2.34.1