public inbox for intel-xe@lists.freedesktop.org
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Auld <matthew.auld@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Raag Jadav <raag.jadav@intel.com>
Cc: intel-xe@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
	 matthew.d.roper@intel.com
Subject: Re: [PATCH v2] drm/xe: Drop all mappings for wedged device
Date: Mon, 30 Mar 2026 09:45:22 +0200	[thread overview]
Message-ID: <52e4f2e61ab7df9b46c1c16ac4dfde929732629a.camel@linux.intel.com> (raw)
In-Reply-To: <156d287e-117f-4575-a90c-3aaa233ed670@intel.com>

On Fri, 2026-03-27 at 15:03 +0000, Matthew Auld wrote:
> On 27/03/2026 10:41, Thomas Hellström wrote:
> > On Fri, 2026-03-27 at 10:18 +0000, Matthew Auld wrote:
> > > On 26/03/2026 21:19, Matthew Brost wrote:
> > > > On Thu, Mar 26, 2026 at 06:58:16PM +0530, Raag Jadav wrote:
> > > > > As per uapi documentation[1], the prerequisite for wedged
> > > > > device
> > > > > is to
> > > > > drop all memory mappings. Follow it.
> > > > > 
> > > > > [1] Documentation/gpu/drm-uapi.rst
> > > > > 
> > > > > v2: Also drop CPU mappings (Matthew Auld)
> > > > > 
> > > > > Fixes: 7bc00751f877 ("drm/xe: Use device wedged event")
> > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_bo_evict.c | 8 +++++++-
> > > > >    drivers/gpu/drm/xe/xe_bo_evict.h | 1 +
> > > > >    drivers/gpu/drm/xe/xe_device.c   | 5 +++++
> > > > >    3 files changed, 13 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > b/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > index 7661fca7f278..f741cda50b2d 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo_evict.c
> > > > > @@ -270,7 +270,13 @@ int xe_bo_restore_late(struct xe_device
> > > > > *xe)
> > > > >    	return ret;
> > > > >    }
> > > > >    
> > > > > -static void xe_bo_pci_dev_remove_pinned(struct xe_device
> > > > > *xe)
> > > > > +/**
> > > > > + * xe_bo_pci_dev_remove_pinned() - Unmap external bos
> > > > > + * @xe: xe device
> > > > > + *
> > > > > + * Drop dma mappings of all external pinned bos.
> > > > > + */
> > > > > +void xe_bo_pci_dev_remove_pinned(struct xe_device *xe)
> > > > >    {
> > > > >    	struct xe_tile *tile;
> > > > >    	unsigned int id;
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > b/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > index e8385cb7f5e9..6ce27e272780 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo_evict.h
> > > > > @@ -15,6 +15,7 @@ void
> > > > > xe_bo_notifier_unprepare_all_pinned(struct
> > > > > xe_device *xe);
> > > > >    int xe_bo_restore_early(struct xe_device *xe);
> > > > >    int xe_bo_restore_late(struct xe_device *xe);
> > > > >    
> > > > > +void xe_bo_pci_dev_remove_pinned(struct xe_device *xe);
> > > > >    void xe_bo_pci_dev_remove_all(struct xe_device *xe);
> > > > >    
> > > > >    int xe_bo_pinned_init(struct xe_device *xe);
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > > b/drivers/gpu/drm/xe/xe_device.c
> > > > > index b17d4a878686..4c0097f3aefb 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > @@ -1347,6 +1347,11 @@ void xe_device_declare_wedged(struct
> > > > > xe_device *xe)
> > > > >    	for_each_gt(gt, xe, id)
> > > > >    		xe_gt_declare_wedged(gt);
> > > > >    
> > > > > +	/* Drop dma mappings of external bos */
> > > > > +	xe_bo_pci_dev_remove_pinned(xe);
> > > > 
> > > > Do we even need the part above? unmap_mapping_range() should
> > > > drop
> > > > all
> > > > DMA mappings for the device being wedged, right? In other
> > > > words,
> > > > the device
> > > > should no longer be able to access system memory or other
> > > > devices’
> > > > memory
> > > > via PCIe P2P. I'm not 100% sure about this, though.
> > > 
> > > AFAIK unmap_mapping_range() is just for the CPU mmap side. It
> > > should
> > > ensure ~everything is refaulted on the next CPU access, so we can
> > > point
> > > to dummy page.
> > > 
> > > For dma mapping side, I'm still not completely sure what the best
> > > approach is. On the one hand, device is wedged so we should not
> > > really
> > > be doing new GPU access? Ioctls are all blocked, and with below,
> > > CPU
> > > access will be re-directed to dummy page. So perhaps doing
> > > nothing
> > > for
> > > dma mapping side is OK? If we want to actually remove all dma
> > > mappings
> > > for extra safety, I think closest thing is maybe purge all BOs?
> > > Similar
> > > to what we do for an unplug.
> > 
> > It sounds like, when the device is wedged beyond recovery, not even
> > the
> > unplug / pcie device unbind path should be doing any hardware
> > accesses. So if that path is fixed up to avoid that, then perhaps
> > we
> > can just unbind the pcie device just after wedging? That is, of
> > course
> > if it's acceptable that the drm_device <-> pcie device association
> > is
> > broken.
> 
> Yeah, it does seem kind of similar to the unplug flow, just that
> device 
> is potentially inaccessible through that. All of the devm cleanup,
> plus 
> already nuking CPU mappings and the purge stuff is relevant for
> wedge, 
> I think.
> 
> Do we know if coredump is maybe interesting after wedge? Maybe that
> is 
> one potential issue, since it looks like that is attached to the 
> physical device? AFAICT that would currently get nuked on
> unbind/unplug? 
> Perhaps if we get into a wedge state, the user might want to collect
> any 
> logs/coredump first, assuming there are some?
> 
> Also thinking some more, there is also the dma-buf exported to
> another 
> device edgecase, like with VRAM? With unplug, I think we move it to 
> system memory and notify the importer. But here, if device is cooked,
> we 
> maybe can't actually move it with hw? Do we know if this is handled
> in 
> some way already?

For dma-buf, there is a new interface landing to "revoke" a dma-buf
completely. I think it's already in the code but not fully sure. We
should take a look at that. Otherwise it sounds like, for dynamic
importers at least, we need to reject new mappings and run
move_notify() on all exported dma-bufs.

Apart from that, following
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-wedging

and from discussions with Rodrigo IMO we need to
* Block user-space mappings of device memory - unmap_mapping_range().
* Figure out how to handle device-private SVM pages. At the very least
we need to invalidate the mm ranges, error on new migration attempts
and block dma-setup for peer devices. 
* Kill all dma-mappings.
* Signal all fences.
* Block and drain user-space accesses (ioctl, read, write, fault).
Much looks blocked today but not drained. To drain we need to use
drm_dev_enter() / drm_dev_exit() and add wedge synchronization with
those.
* Figure out what user-space accesses remain needed for post-wedge
access of debug / telemetry data.
* Ensure this all doesn't make unplug / devres / dmres cleanup trip.

/Thomas


> 
> > 
> > /Thomas
> > 
> > 
> > 
> > > 
> > > So perhaps xe_bo_pci_dev_remove_all() is better here? Also I
> > > guess
> > > would
> > > need:
> > > 
> > > @@ -349,7 +349,8 @@ static void xe_evict_flags(struct
> > > ttm_buffer_object
> > > *tbo,
> > >                   return;
> > >           }
> > > 
> > > -       if (device_unplugged && !tbo->base.dma_buf) {
> > > +       if ((device_unplugged || xe_device_wedged(xe)) &&
> > > +           !tbo->base.dma_buf) {
> > >                   *placement = purge_placement;
> > >                   return;
> > >           }
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > +	/* Drop all CPU mappings pointing to this device */
> > > > > +	unmap_mapping_range(xe->drm.anon_inode->i_mapping,
> > > > > 0, 0,
> > > > > 1);
> > > > > +
> > > > >    	if (xe_device_wedged(xe)) {
> > > > >    		/*
> > > > >    		 * XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET is
> > > > > intended for debugging
> > > > > -- 
> > > > > 2.43.0
> > > > > 

Thread overview: 11+ messages
2026-03-26 13:28 [PATCH v2] drm/xe: Drop all mappings for wedged device Raag Jadav
2026-03-26 14:06 ` ✓ CI.KUnit: success for " Patchwork
2026-03-26 14:42 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-26 21:19 ` [PATCH v2] " Matthew Brost
2026-03-27 10:18   ` Matthew Auld
2026-03-27 10:41     ` Thomas Hellström
2026-03-27 14:24       ` Raag Jadav
2026-03-27 15:03       ` Matthew Auld
2026-03-30  7:45         ` Thomas Hellström [this message]
2026-03-27 19:51     ` Matthew Brost
2026-03-27  4:56 ` ✓ Xe.CI.FULL: success for " Patchwork
