From: "Hellstrom, Thomas" <thomas.hellstrom@intel.com>
To: "Roper, Matthew D" <matthew.d.roper@intel.com>,
"Auld, Matthew" <matthew.auld@intel.com>,
"Bhatia, Aradhya" <aradhya.bhatia@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Upadhyay, Tejas" <tejas.upadhyay@intel.com>,
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"De Marchi, Lucas" <lucas.demarchi@intel.com>
Subject: Re: [PATCH 1/2] drm/xe_migrate: Switch from drm to dev managed actions
Date: Fri, 28 Feb 2025 11:11:59 +0000
Message-ID: <3a74a434957fb52ebcc36a4d5b88e8428ac2165f.camel@intel.com>
In-Reply-To: <1b66c78c-82d4-4002-9fa4-6f30e97e0268@intel.com>
Hi, Matthew
On Fri, 2025-02-28 at 10:21 +0000, Matthew Auld wrote:
> On 28/02/2025 06:52, Aradhya Bhatia wrote:
> > Change the scope of the migrate subsystem to be dev managed instead of
> > drm managed.
> >
> > The parent pci struct &device, that the xe struct &drm_device is a part
> > of, gets removed when a hot unplug is triggered, which causes the
> > underlying iommu group to get destroyed as well.
>
> Nice find. But if that's the case then the migrate BO here is just one
> of many where we will see this. Basically all system memory BOs can
> suffer from this AFAICT, including userspace owned. So I think we need
> to rather solve this for all, and I don't think we can really tie the
> lifetime of all BOs to devm, so likely need a different approach.
>
> I think we might instead need to tear down all dma mappings when removing
> the device, leaving the BO intact. It looks like there is already a
> helper for this, so maybe something roughly like this:
>
> @@ -980,6 +980,8 @@ void xe_device_remove(struct xe_device *xe)
>
> drm_dev_unplug(&xe->drm);
>
> + ttm_device_clear_dma_mappings(&xe->ttm);
> +
> xe_display_driver_remove(xe);
>
> xe_heci_gsc_fini(xe);
>
> But I don't think userptr will be covered by that, just all BOs, so
> likely need an extra step to also nuke all userptr dma mappings
> somewhere.
I have been discussing this a bit with Aradhya, and the problem is a
bit complex.
Really if the PCI device goes away, we need to move *all* bos to
system. That should clear out any dma mappings, since we map dma when
moving to TT. Also VRAM bos should be moved to system which will also
handle things like SVM migration.
But that doesn't really help with pinned bos. So IMO subsystems that
allocate pinned bos need to be devm managed, and this is a step in that
direction. But we probably need to deal with some fallout. For example,
when we take down the vram managers using drmm_ they'd want to call
evict_all() and the migrate subsystem is already down.
So I think a natural place to deal with such fallout is the remove()
callback, while subsystems with pinned bos use devm_.
But admittedly to me it's not really clear how to *best* handle this
situation. I suspect if we really stress-test device unbinding on a
running app we're going to hit a lot of problems.
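To make the ordering issue concrete, here is a minimal userspace sketch in plain C. It is not kernel code, and every name in it (toy_device, toy_bo, toy_remove and so on) is hypothetical. It only models two action lists: one that runs at remove() time, loosely standing in for devm_, and one that runs when the last drm reference is dropped, loosely standing in for drmm_. The assumption it encodes is the one from the commit message: the iommu group is gone once remove() has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *arg);

struct action {
	action_fn fn;
	void *arg;
};

struct action_list {
	struct action items[MAX_ACTIONS];
	size_t count;
};

/* Hypothetical stand-in for the PCI device plus its drm_device. */
struct toy_device {
	bool iommu_group_alive;
	int drm_refcount;
	struct action_list devm;	/* runs inside remove() */
	struct action_list drmm;	/* runs on the final drm reference put */
};

/* Hypothetical pinned BO that holds a dma mapping. */
struct toy_bo {
	struct toy_device *dev;
	bool dma_mapped;
	bool unmapped_safely;
};

static int add_action(struct action_list *l, action_fn fn, void *arg)
{
	if (l->count >= MAX_ACTIONS)
		return -1;
	l->items[l->count++] = (struct action){ fn, arg };
	return 0;
}

/* Managed actions run in reverse order of registration. */
static void run_actions(struct action_list *l)
{
	while (l->count > 0) {
		struct action a = l->items[--l->count];

		a.fn(a.arg);
	}
}

/* Unmapping is only safe while the iommu group still exists. */
static void toy_bo_fini(void *arg)
{
	struct toy_bo *bo = arg;

	bo->unmapped_safely = bo->dev->iommu_group_alive;
	bo->dma_mapped = false;
}

static void toy_remove(struct toy_device *dev)
{
	run_actions(&dev->devm);
	dev->iommu_group_alive = false;	/* assumption: destroyed with the device */
}

static void toy_drm_put(struct toy_device *dev)
{
	assert(dev->drm_refcount > 0);
	if (--dev->drm_refcount == 0)
		run_actions(&dev->drmm);
}

/* Hot unplug while userspace still holds one drm reference. */
bool toy_unplug_is_safe(bool use_devm)
{
	struct toy_device dev = { .iommu_group_alive = true, .drm_refcount = 2 };
	struct toy_bo bo = { .dev = &dev, .dma_mapped = true };
	int err = add_action(use_devm ? &dev.devm : &dev.drmm, toy_bo_fini, &bo);

	assert(err == 0);
	toy_remove(&dev);	/* hot unplug */
	toy_drm_put(&dev);	/* driver drops its reference */
	toy_drm_put(&dev);	/* userspace closes its fd later */
	return bo.unmapped_safely;
}
```

In this toy model the BO is only unmapped safely when its cleanup is registered on the remove()-time list. It says nothing about the fallout mentioned above, such as a drmm_ managed vram manager wanting evict_all() after the migrate subsystem is already gone.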
As for the userptrs, I just posted a series that introduces a per-
userptr dedicated unmap function. We could probably put them on a
device list or something similar that calls that function (or just
general userptr invalidation) for all userptrs.
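A rough userspace sketch of the device-list idea, in plain C with hypothetical names throughout (toy_userptr, toy_userptr_unmap, toy_device_unmap_all_userptrs). None of this is the xe API, and it ignores locking entirely. It only shows the shape: each userptr is tracked on a per-device list, and the remove path walks that list and calls the dedicated unmap function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userptr with a dma mapping and a link on the device list. */
struct toy_userptr {
	bool dma_mapped;
	struct toy_userptr *next;
};

/* Hypothetical device that tracks every userptr created against it. */
struct toy_device {
	struct toy_userptr *userptrs;
};

/* Called when a userptr is created: put it on the device list. */
static void toy_userptr_track(struct toy_device *dev, struct toy_userptr *up)
{
	up->next = dev->userptrs;
	dev->userptrs = up;
}

/* Stand-in for a per-userptr dedicated unmap function. */
static void toy_userptr_unmap(struct toy_userptr *up)
{
	up->dma_mapped = false;
}

/*
 * Called from the remove path: drop the dma mapping of every tracked
 * userptr that still has one. Returns how many were unmapped.
 */
size_t toy_device_unmap_all_userptrs(struct toy_device *dev)
{
	size_t unmapped = 0;

	for (struct toy_userptr *up = dev->userptrs; up; up = up->next) {
		if (up->dma_mapped) {
			toy_userptr_unmap(up);
			unmapped++;
		}
	}
	return unmapped;
}
```

A real implementation would need to protect the list against concurrent userptr creation and invalidation, which this sketch leaves out.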
/Thomas
>
> >
> > The migrate subsystem, which handles the lifetime of the page-table tree
> > (pt) BO, doesn't get a chance to put the BO back during the hot unplug,
> > as all the references to DRM haven't been put back.
> > When all the references to DRM are indeed put back later, the migrate
> > subsystem tries to put back the pt BO. Since the underlying iommu group
> > has already been destroyed, a kernel NULL ptr dereference takes place
> > while attempting to put back the pt BO.
> >
> > Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_migrate.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index 278bc96cf593..4e23adfa208a 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -97,7 +97,7 @@ struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
> > return tile->migrate->q;
> > }
> >
> > -static void xe_migrate_fini(struct drm_device *dev, void *arg)
> > +static void xe_migrate_fini(void *arg)
> > {
> > struct xe_migrate *m = arg;
> >
> > @@ -401,7 +401,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
> > struct xe_vm *vm;
> > int err;
> >
> > - m = drmm_kzalloc(&xe->drm, sizeof(*m), GFP_KERNEL);
> > + m = devm_kzalloc(xe->drm.dev, sizeof(*m), GFP_KERNEL);
> > if (!m)
> > return ERR_PTR(-ENOMEM);
> >
> > @@ -455,7 +455,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
> > might_lock(&m->job_mutex);
> > fs_reclaim_release(GFP_KERNEL);
> >
> > - err = drmm_add_action_or_reset(&xe->drm, xe_migrate_fini, m);
> > + err = devm_add_action_or_reset(xe->drm.dev, xe_migrate_fini, m);
> > if (err)
> > return ERR_PTR(err);
> >
>
Thread overview: 22+ messages
2025-02-28 6:52 [PATCH 0/2] drm/xe: Fix the hotunplug NULL ptr dereference Aradhya Bhatia
2025-02-28 6:52 ` [PATCH 1/2] drm/xe_migrate: Switch from drm to dev managed actions Aradhya Bhatia
2025-02-28 7:01 ` Upadhyay, Tejas
2025-02-28 10:21 ` Matthew Auld
2025-02-28 11:11 ` Hellstrom, Thomas [this message]
2025-02-28 12:28 ` Matthew Auld
2025-02-28 12:57 ` Hellstrom, Thomas
2025-02-28 14:47 ` Hellstrom, Thomas
2025-02-28 18:38 ` Matthew Auld
2025-03-10 10:26 ` Aradhya Bhatia
2025-03-03 20:27 ` Lucas De Marchi
2025-02-28 6:52 ` [PATCH 2/2] drm/xe_device: Evict all the VRAM objects during device remove Aradhya Bhatia
2025-02-28 7:18 ` Upadhyay, Tejas
2025-02-28 11:21 ` Hellstrom, Thomas
2025-02-28 7:30 ` ✓ CI.Patch_applied: success for drm/xe: Fix the hotunplug NULL ptr dereference Patchwork
2025-02-28 7:30 ` ✓ CI.checkpatch: " Patchwork
2025-02-28 7:32 ` ✓ CI.KUnit: " Patchwork
2025-02-28 7:48 ` ✓ CI.Build: " Patchwork
2025-02-28 7:51 ` ✓ CI.Hooks: " Patchwork
2025-02-28 7:52 ` ✓ CI.checksparse: " Patchwork
2025-02-28 8:10 ` ✓ Xe.CI.BAT: " Patchwork
2025-02-28 13:05 ` ✗ Xe.CI.Full: failure " Patchwork