From: Aradhya Bhatia <aradhya.bhatia@intel.com>
To: Matt Roper <matthew.d.roper@intel.com>
Cc: Intel XE List <intel-xe@lists.freedesktop.org>,
Lucas De Marchi <lucas.demarchi@intel.com>,
Thomas Hellstrom <thomas.hellstrom@intel.com>,
Tejas Upadhyay <tejas.upadhyay@intel.com>,
Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>,
Aradhya Bhatia <aradhya.bhatia@intel.com>
Subject: [PATCH 0/2] drm/xe: Fix the hotunplug NULL ptr dereference
Date: Fri, 28 Feb 2025 06:52:22 +0000
Message-ID: <20250228065224.320811-1-aradhya.bhatia@intel.com>
Hi,
This patch series mitigates the kernel NULL pointer dereference errors that
have recently been seen on a few platforms with the xe driver when the core
hotunplug IGT tests are run [0].
*Brief explanation of the error and its cause*
When the drm file descriptor (fd) is closed after the xe GPU device has been
hotunplugged, the kernel hits a NULL pointer dereference.
This happens when the close(fd) call puts back the final DRM reference. At
that point the migrate subsystem's destroy action (xe_migrate_fini()) is
invoked, and it tries to put back the page-table tree (pt) BO.
However, the underlying struct device has already been removed during the
hotunplug, so the iommu group for that pt BO has been destroyed and is no
longer available.
When xe_migrate_fini() then tries to put back the pt BO, the missing iommu
group causes the NULL pointer dereference.
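To make the ordering problem concrete, here is a minimal userspace model of the failing sequence. It is only a sketch under stated assumptions: struct sim_drm_dev, sim_get(), sim_put() and sim_release() are hypothetical stand-ins (not DRM or xe APIs), the iommu group is reduced to a boolean, and "the driver drops its reference on unplug" is a simplification. It shows that while an open fd still holds a DRM reference, the drm-managed cleanup only runs at close(fd), after the iommu group is already gone.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model, NOT the real DRM structures: a refcounted DRM device
 * whose drm-managed cleanup runs only when the last reference is dropped. */
struct sim_drm_dev {
	int refcount;
	bool *iommu_group_alive;   /* owned by the underlying struct device */
	bool released;
	bool pt_bo_put_ok;
};

/* Stand-in for the drm-managed xe_migrate_fini(): putting back the pt BO
 * requires the device's iommu group to still exist. */
static void sim_release(struct sim_drm_dev *ddev)
{
	ddev->released = true;
	ddev->pt_bo_put_ok = *ddev->iommu_group_alive;
	if (!ddev->pt_bo_put_ok)
		printf("xe_migrate_fini: iommu group gone (NULL ptr dereference in the kernel)\n");
}

static void sim_get(struct sim_drm_dev *ddev)
{
	ddev->refcount++;
}

static void sim_put(struct sim_drm_dev *ddev)
{
	if (--ddev->refcount == 0)
		sim_release(ddev);
}

/* Replays the failing sequence; returns true only if cleanup was safe. */
static bool sim_hotunplug_then_close(void)
{
	bool iommu_group_alive = true;
	struct sim_drm_dev ddev = { .refcount = 1, .iommu_group_alive = &iommu_group_alive };

	sim_get(&ddev);             /* open(): the fd holds a DRM reference */

	sim_put(&ddev);             /* hotunplug: driver drops its reference ... */
	iommu_group_alive = false;  /* ... and the struct device / iommu group go away */

	sim_put(&ddev);             /* close(fd): final reference, cleanup runs only now */
	return ddev.released && ddev.pt_bo_put_ok;
}
```

Calling sim_hotunplug_then_close() from a small main() returns false and prints the message, mirroring the report: the cleanup itself is fine, it simply runs too late.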
*Brief description of the fix*
The xe migrate subsystem has been changed from being drm managed to being
dev managed. This way, the migrate subsystem's destroy action is called
earlier, during the unplug process, which allows the pt BO to be put back
while the iommu group still exists.
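The effect of moving the cleanup from one release list to the other can be sketched with a toy model of two LIFO action lists, one released at driver unbind (dev managed) and one released when the last DRM reference is put (drm managed). In kernel terms this is roughly the difference between registering with devm_add_action_or_reset() and drmm_add_action_or_reset(), but the code below does not use those APIs. Everything in it (struct action_list, simulate_unplug(), the log strings) is hypothetical and only illustrates the ordering argument.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ACTIONS 8

/* Hypothetical simulation state; not real xe/DRM structures. */
struct sim_state {
	bool iommu_group_alive;
	bool pt_bo_put_ok;
	char log[128];
};

typedef void (*action_fn)(struct sim_state *);

/* Minimal managed-release list: actions run in reverse registration
 * order (LIFO) when the list is released, as devres/drmres do. */
struct action_list {
	action_fn fns[MAX_ACTIONS];
	int count;
};

static void add_action(struct action_list *l, action_fn fn)
{
	if (l->count < MAX_ACTIONS)
		l->fns[l->count++] = fn;
}

static void release_all(struct action_list *l, struct sim_state *s)
{
	while (l->count > 0)
		l->fns[--l->count](s);
}

static void log_step(struct sim_state *s, const char *step)
{
	strncat(s->log, step, sizeof(s->log) - strlen(s->log) - 1);
}

/* Stand-in for xe_migrate_fini(): putting back the pt BO needs the iommu group. */
static void migrate_fini(struct sim_state *s)
{
	s->pt_bo_put_ok = s->iommu_group_alive;
	log_step(s, s->iommu_group_alive ? "migrate_fini:ok " : "migrate_fini:NULL-deref ");
}

/* Replays hotunplug + close(fd); returns true if the pt BO was put back safely. */
static bool simulate_unplug(bool migrate_dev_managed, struct sim_state *s)
{
	struct action_list devm = { .count = 0 };
	struct action_list drmm = { .count = 0 };

	memset(s, 0, sizeof(*s));
	s->iommu_group_alive = true;

	/* Probe: register the migrate cleanup on one of the two lists. */
	add_action(migrate_dev_managed ? &devm : &drmm, migrate_fini);

	/* Hotunplug: driver unbind releases the dev-managed actions first ... */
	log_step(s, "unbind ");
	release_all(&devm, s);
	/* ... then the struct device is removed and its iommu group destroyed. */
	s->iommu_group_alive = false;
	log_step(s, "iommu_group_gone ");

	/* close(fd) drops the final DRM reference: drm-managed actions run now. */
	log_step(s, "close(fd) ");
	release_all(&drmm, s);

	return s->pt_bo_put_ok;
}
```

With migrate_dev_managed set to false, the log reads "unbind iommu_group_gone close(fd) migrate_fini:NULL-deref "; with true, migrate_fini runs right after "unbind", before the iommu group disappears.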
Since the migrate subsystem is now dev managed, it is long gone by the time
the TTM and VRAM resource managers are destroyed (their destruction is
still drm managed). So, when these resource managers attempt to evict all
the BOs before their destruction, that is no longer possible, as the BOs
were already destroyed during the unplug process. Hence, we extend the fix
by performing all the VRAM evictions during xe device removal.
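The teardown order this second patch relies on can be sketched as follows. This is a toy model with hypothetical names (struct sim_xe, sim_evict_all_vram(), sim_teardown()); it assumes, for illustration, that evicting a VRAM BO needs the migrate subsystem, and that the driver's remove callback runs before the dev-managed actions are released, which is our understanding of the driver-core unbind order. It counts the evictions that would be attempted after migrate is gone.

```c
#include <stdbool.h>

/* Hypothetical model of the teardown order; names are illustrative only. */
struct sim_xe {
	bool migrate_alive;   /* dev managed: torn down at unbind */
	int vram_bos;         /* BOs still resident in VRAM */
	int failed_evictions; /* evictions attempted without a migrate context */
};

/* Model assumption: evicting VRAM BOs needs the migrate subsystem. */
static void sim_evict_all_vram(struct sim_xe *xe)
{
	if (xe->vram_bos == 0)
		return;
	if (xe->migrate_alive)
		xe->vram_bos = 0;
	else
		xe->failed_evictions += xe->vram_bos;
}

/* Returns how many evictions would be attempted after migrate is gone. */
static int sim_teardown(bool evict_in_remove)
{
	struct sim_xe xe = { .migrate_alive = true, .vram_bos = 3 };

	/* 1. Device remove callback (an xe_device_remove()-like step). */
	if (evict_in_remove)
		sim_evict_all_vram(&xe);

	/* 2. Dev-managed actions released after remove: migrate fini. */
	xe.migrate_alive = false;

	/* 3. Final DRM put: drm-managed TTM/VRAM manager teardown evicts leftovers. */
	sim_evict_all_vram(&xe);

	return xe.failed_evictions;
}
```

sim_teardown(false) reports 3 stranded evictions, while sim_teardown(true) reports 0, because everything was already evicted while the migrate subsystem was still available.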
Regards,
Aradhya
[0]: Link to the gitlab issue
https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3914
Aradhya Bhatia (2):
drm/xe_migrate: Switch from drm to dev managed actions
drm/xe_device: Evict all the VRAM objects during device remove
drivers/gpu/drm/xe/xe_device.c | 14 ++++++++++++++
drivers/gpu/drm/xe/xe_migrate.c | 6 +++---
2 files changed, 17 insertions(+), 3 deletions(-)
base-commit: 873b1a50bb4394e95332cfa611aa6463de6b7cb0
--
2.45.2
Thread overview: 22+ messages
2025-02-28 6:52 Aradhya Bhatia [this message]
2025-02-28 6:52 ` [PATCH 1/2] drm/xe_migrate: Switch from drm to dev managed actions Aradhya Bhatia
2025-02-28 7:01 ` Upadhyay, Tejas
2025-02-28 10:21 ` Matthew Auld
2025-02-28 11:11 ` Hellstrom, Thomas
2025-02-28 12:28 ` Matthew Auld
2025-02-28 12:57 ` Hellstrom, Thomas
2025-02-28 14:47 ` Hellstrom, Thomas
2025-02-28 18:38 ` Matthew Auld
2025-03-10 10:26 ` Aradhya Bhatia
2025-03-03 20:27 ` Lucas De Marchi
2025-02-28 6:52 ` [PATCH 2/2] drm/xe_device: Evict all the VRAM objects during device remove Aradhya Bhatia
2025-02-28 7:18 ` Upadhyay, Tejas
2025-02-28 11:21 ` Hellstrom, Thomas
2025-02-28 7:30 ` ✓ CI.Patch_applied: success for drm/xe: Fix the hotunplug NULL ptr dereference Patchwork
2025-02-28 7:30 ` ✓ CI.checkpatch: " Patchwork
2025-02-28 7:32 ` ✓ CI.KUnit: " Patchwork
2025-02-28 7:48 ` ✓ CI.Build: " Patchwork
2025-02-28 7:51 ` ✓ CI.Hooks: " Patchwork
2025-02-28 7:52 ` ✓ CI.checksparse: " Patchwork
2025-02-28 8:10 ` ✓ Xe.CI.BAT: " Patchwork
2025-02-28 13:05 ` ✗ Xe.CI.Full: failure " Patchwork