From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Auld <matthew.auld@intel.com>,
Matthew Brost <matthew.brost@intel.com>,
Raag Jadav <raag.jadav@intel.com>
Cc: intel-xe@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
matthew.d.roper@intel.com
Subject: Re: [PATCH v2] drm/xe: Drop all mappings for wedged device
Date: Mon, 30 Mar 2026 09:45:22 +0200 [thread overview]
Message-ID: <52e4f2e61ab7df9b46c1c16ac4dfde929732629a.camel@linux.intel.com> (raw)
In-Reply-To: <156d287e-117f-4575-a90c-3aaa233ed670@intel.com>
On Fri, 2026-03-27 at 15:03 +0000, Matthew Auld wrote:
> On 27/03/2026 10:41, Thomas Hellström wrote:
> > On Fri, 2026-03-27 at 10:18 +0000, Matthew Auld wrote:
> > > On 26/03/2026 21:19, Matthew Brost wrote:
> > > > On Thu, Mar 26, 2026 at 06:58:16PM +0530, Raag Jadav wrote:
> > > > > As per uapi documentation[1], the prerequisite for wedged
> > > > > device is to drop all memory mappings. Follow it.
> > > > >
> > > > > [1] Documentation/gpu/drm-uapi.rst
> > > > >
> > > > > v2: Also drop CPU mappings (Matthew Auld)
> > > > >
> > > > > Fixes: 7bc00751f877 ("drm/xe: Use device wedged event")
> > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > ---
> > > > > drivers/gpu/drm/xe/xe_bo_evict.c | 8 +++++++-
> > > > > drivers/gpu/drm/xe/xe_bo_evict.h | 1 +
> > > > > drivers/gpu/drm/xe/xe_device.c | 5 +++++
> > > > > 3 files changed, 13 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo_evict.c b/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > index 7661fca7f278..f741cda50b2d 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > @@ -270,7 +270,13 @@ int xe_bo_restore_late(struct xe_device *xe)
> > > > >  	return ret;
> > > > >  }
> > > > >  
> > > > > -static void xe_bo_pci_dev_remove_pinned(struct xe_device *xe)
> > > > > +/**
> > > > > + * xe_bo_pci_dev_remove_pinned() - Unmap external bos
> > > > > + * @xe: xe device
> > > > > + *
> > > > > + * Drop dma mappings of all external pinned bos.
> > > > > + */
> > > > > +void xe_bo_pci_dev_remove_pinned(struct xe_device *xe)
> > > > > {
> > > > > struct xe_tile *tile;
> > > > > unsigned int id;
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo_evict.h b/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > index e8385cb7f5e9..6ce27e272780 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > @@ -15,6 +15,7 @@ void xe_bo_notifier_unprepare_all_pinned(struct xe_device *xe);
> > > > >  int xe_bo_restore_early(struct xe_device *xe);
> > > > >  int xe_bo_restore_late(struct xe_device *xe);
> > > > >  
> > > > > +void xe_bo_pci_dev_remove_pinned(struct xe_device *xe);
> > > > >  void xe_bo_pci_dev_remove_all(struct xe_device *xe);
> > > > >
> > > > > int xe_bo_pinned_init(struct xe_device *xe);
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > index b17d4a878686..4c0097f3aefb 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > @@ -1347,6 +1347,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
> > > > >  	for_each_gt(gt, xe, id)
> > > > >  		xe_gt_declare_wedged(gt);
> > > > >  
> > > > > +	/* Drop dma mappings of external bos */
> > > > > +	xe_bo_pci_dev_remove_pinned(xe);
> > > >
> > > > Do we even need the part above? unmap_mapping_range() should
> > > > drop all DMA mappings for the device being wedged, right? In
> > > > other words, the device should no longer be able to access
> > > > system memory or other devices’ memory via PCIe P2P. I'm not
> > > > 100% sure about this, though.
> > >
> > > AFAIK unmap_mapping_range() is just for the CPU mmap side. It
> > > should ensure ~everything is refaulted on the next CPU access, so
> > > we can point to dummy page.
> > >
> > > For dma mapping side, I'm still not completely sure what the best
> > > approach is. On the one hand, device is wedged so we should not
> > > really be doing new GPU access? Ioctls are all blocked, and with
> > > below, CPU access will be re-directed to dummy page. So perhaps
> > > doing nothing for dma mapping side is OK? If we want to actually
> > > remove all dma mappings for extra safety, I think closest thing
> > > is maybe purge all BOs? Similar to what we do for an unplug.
> >
> > It sounds like, when the device is wedged beyond recovery, not
> > even the unplug / pcie device unbind path should be doing any
> > hardware accesses. So if that path is fixed up to avoid that, then
> > perhaps we can just unbind the pcie device just after wedging?
> > That is, of course, if it's acceptable that the drm_device <->
> > pcie device association is broken.
>
> Yeah, it does seem kind of similar to the unplug flow, just that the
> device is potentially inaccessible through that. All of the devm
> cleanup, plus already nuking CPU mappings and the purge stuff is
> relevant for wedge, I think.
>
> Do we know if coredump is maybe interesting after wedge? Maybe that
> is one potential issue, since it looks like that is attached to the
> physical device? AFAICT that would currently get nuked on
> unbind/unplug? Perhaps if we get into a wedge state, the user might
> want to collect any logs/coredump first, assuming there are some?
>
> Also thinking some more, there is also the dma-buf exported to
> another device edgecase, like with VRAM? With unplug, I think we
> move it to system memory and notify the importer. But here, if
> device is cooked, we maybe can't actually move it with hw? Do we
> know if this is handled in some way already?
For dma-buf, there is a new interface landing to "revoke" a dma-buf
completely. I think it's already in the code but I'm not fully sure;
we should take a look at that. Otherwise it sounds like, for dynamic
importers at least, we need to reject new mappings and run
move_notify() on all exported dma-bufs.
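For the reject + move_notify() part, I'm thinking of something along
these lines (rough sketch only, not tested; the helper for walking the
exported bos is made up here, and the exact hook names would need
checking against xe_dma_buf.c):

```c
/* In the attach/map path: refuse new device mappings once wedged. */
static struct sg_table *xe_dma_buf_map(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
{
	struct drm_gem_object *obj = attach->dmabuf->priv;
	struct xe_device *xe = to_xe_device(obj->dev);

	if (xe_device_wedged(xe))
		return ERR_PTR(-ENODEV);

	/* ... existing mapping code ... */
}

/*
 * On wedge: tell dynamic importers to drop their existing mappings.
 * xe_for_each_exported_bo() is hypothetical - we'd need some way to
 * walk the bos that currently have a dma_buf attached.
 */
static void xe_wedged_revoke_dma_bufs(struct xe_device *xe)
{
	struct xe_bo *bo;

	xe_for_each_exported_bo(xe, bo) {
		dma_resv_lock(bo->ttm.base.resv, NULL);
		dma_buf_move_notify(bo->ttm.base.dma_buf);
		dma_resv_unlock(bo->ttm.base.resv);
	}
}
```

Note dma_buf_move_notify() requires the reservation lock and only
reaches importers that registered a move_notify callback, so pinned
(non-dynamic) importers would still need the revoke interface
mentioned above.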
Apart from that, following
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-wedging
and from discussions with Rodrigo, IMO we need to
* Block user-space mappings of device memory - unmap_mapping_range().
* Figure out how to handle device-private SVM pages. At the very least
we need to invalidate the mm ranges, error out on new migration
attempts and block dma setup for peer devices.
* Kill all dma mappings.
* Signal all fences.
* Block and drain user-space accesses (ioctl, read, write, fault).
Looks like much is blocked today but not drained. To drain we need to
be using drm_dev_enter() and drm_dev_exit() and add wedge
synchronization with those.
* Figure out which user-space accesses remain needed for post-wedge
access of debug / telemetry data.
* Ensure all this doesn't make unplug / devres / drmm cleanup trip.
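For the block-and-drain point, the pattern I have in mind is the one
the unplug code already uses (sketch only; whether we reuse
drm_dev_unplug() or add a wedge-specific variant that trips
drm_dev_enter() is exactly what needs figuring out, and
do_the_actual_work() is just a placeholder):

```c
/* Every user-visible entry point that may touch hardware: */
static int xe_some_ioctl(struct drm_device *dev, void *data,
			 struct drm_file *file)
{
	int idx, ret;

	/* Fails once the device has been marked unplugged / wedged. */
	if (!drm_dev_enter(dev, &idx))
		return -ENODEV;

	ret = do_the_actual_work(dev, data); /* device access safe here */

	drm_dev_exit(idx);
	return ret;
}
```

drm_dev_unplug() then both makes subsequent drm_dev_enter() calls fail
and synchronizes (via SRCU) against sections already inside
enter/exit, which is the "drain" part.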
/Thomas
>
> >
> > /Thomas
> >
> >
> >
> > >
> > > So perhaps xe_bo_pci_dev_remove_all() is better here? Also I
> > > guess would need:
> > >
> > > @@ -349,7 +349,8 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
> > >                 return;
> > >         }
> > >
> > > -       if (device_unplugged && !tbo->base.dma_buf) {
> > > +       if ((device_unplugged || xe_device_wedged(xe)) &&
> > > +           !tbo->base.dma_buf) {
> > >                 *placement = purge_placement;
> > >                 return;
> > >         }
> > >
> > > >
> > > > Matt
> > > >
> > > > > +	/* Drop all CPU mappings pointing to this device */
> > > > > +	unmap_mapping_range(xe->drm.anon_inode->i_mapping, 0, 0, 1);
> > > > > +
> > > > >  	if (xe_device_wedged(xe)) {
> > > > >  		/*
> > > > >  		 * XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET is intended for debugging
> > > > > --
> > > > > 2.43.0
> > > > >
Thread overview: 11+ messages
2026-03-26 13:28 [PATCH v2] drm/xe: Drop all mappings for wedged device Raag Jadav
2026-03-26 14:06 ` ✓ CI.KUnit: success for " Patchwork
2026-03-26 14:42 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-26 21:19 ` [PATCH v2] " Matthew Brost
2026-03-27 10:18 ` Matthew Auld
2026-03-27 10:41 ` Thomas Hellström
2026-03-27 14:24 ` Raag Jadav
2026-03-27 15:03 ` Matthew Auld
2026-03-30 7:45 ` Thomas Hellström [this message]
2026-03-27 19:51 ` Matthew Brost
2026-03-27 4:56 ` ✓ Xe.CI.FULL: success for " Patchwork