Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Matt Roper <matthew.d.roper@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>,
	<intel-xe@lists.freedesktop.org>, <matthew.auld@intel.com>,
	<thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
Date: Tue, 10 Feb 2026 16:02:17 -0800	[thread overview]
Message-ID: <aYvHCeTd5pOFo2p5@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com>

On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > "eXtended Architecture" (XA) tagged memory—memory shared between the CPU and GPU
> 
> I'm pretty sure this expansion of "XA" is wrong; where are you seeing
> this definition?  Everything in the bspec indicates that XA means "wb -
> transient app" (similar to how "XD" is "wb - transient display").  I'm
> not sure why exactly they picked "X" to refer to transient in both of
> these cases, but I've never seen any documentation that refers to it as
> "extended."
> 
> > is treated differently from other GPU memory when the Media engine is power-gated.
> > 
> > XA is *always* flushed, like at the end-of-submission (and maybe other
> 
> I assume you're referring to the fact that the driver performs flushes
> at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
> depending on other state/optimizations in the system, those flushes may
> flush the entire device cache, or may only flush the subset of cache
> data that is not marked as transient.  The way you worded this was
> confusing since it makes it sound like cache flushes happen
> automatically somewhere in hardware/firmware.
> 
> > places), just that internally as an optimisation hw doesn't need to make
> > that a full flush (which will also include XA) when Media is
> > off/powergated, since it doesn't need to worry about GT caches vs Media
> > coherency, and only CPU vs GPU coherency, so can make that flush a
> > targeted XA flush, since stuff tagged with XA now means it's shared with
> > the CPU. The main implication is that we now need to somehow flush non-XA before
> > freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> > (like if Media suddenly turns on and does a full flush)
> 
> This description seems really confusing.  My understanding is that
> marking something as wb-transient-app indicates that it might be
> accessed by something other than our graphics/media IP (i.e., accessed
> from the CPU, exported to another device, etc.), so transient data truly
> does need to be flushed at the points in the driver where a flush
> typically happens.
> 
> However when something is _not_ transient, then either:
>  - it's "private" to the GPU and only our graphics/media IP will be
>    accessing it
>  - it's bound with a coherent PAT index so that outside observers like
>    the CPU can snoop the device cache, even when the cache hasn't been
>    flushed
> 
> If media is not active, then there's really no need to include
> non-transient data when a device cache flush happens since there's no
> real need for the data to get to RAM.  So that enables an optimization
> (which comes in your next patch), that allows flushes to only operate on
> the subset of the device cache tagged as "transient" if media is idle.
> 
> As you said, we eventually do want to force a flush of the non-transient
> data as well once we're freeing the underlying pages.  So how do we do
> that?  It's not clear to me how the changes below are accomplishing
> that.  Is there a way to explicitly request a full device cache flush
> (ignoring the transient vs non-transient tagging)?  Since the GuC
> handles the optimization in the next patch (toggling whether flushes are
> full flushes vs non-transient flushes depending on whether media is
> active), I thought there might be some kind of GuC interface to request
> "please do one full flush now, even if media is idle."
> 

I’m not an expert here by any means, but everything above from Matt
seems like a valid concern. Thomas also raised concerns in the two
previous revisions; again, I’m not an expert, but reading through
those, it doesn’t look like his questions ever received proper
answers.

A couple of comments below.

> 
> Matt
> 
> > 
> > V2(MattA): Expand commit description
> > 
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >  4 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index e9180b01a4e4..4455886b211e 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> >  
> >  		if (!xe_vm_in_fault_mode(vm)) {
> >  			drm_gpuvm_bo_evict(vm_bo, true);
> > -			continue;
> > +			if (!xe_device_needs_cache_flush(xe))
> > +				continue;

This will trigger a TLB invalidation (and I assume a cache flush) every
time we move or free memory in the 3D stack if it has a binding. It also
performs a synchronous wait on the BO being idle. Both of these are very
expensive operations. I can’t imagine the granularity we want here is to
do this on every move/free with bindings.

Also, for LR compute with preempt fences, we would trigger the preempt
fences during the wait, so a TLB invalidation after this seems
unnecessary, though perhaps the cache flush is still required?

I think this needs a bit more explanation, because without knowing a lot
about the exact requirements, the implementation does not look correct.

> >  		}
> >  
> >  		if (!idle) {
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 743c18e0c580..da2abed94bc0 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> >  	}
> >  }
> >  
> > +/**
> > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > + * @xe: The device to check.
> > + *
> > + * Return: true if the device needs cache flush, false otherwise.
> > + */
> > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > +{
> > +	/* XA is *always* flushed, like at the end-of-submission (and maybe other
> > +	 * places), just that internally as an optimisation hw doesn't need to make
> > +	 * that a full flush (which will also include XA) when Media is
> > +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> > +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> > +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> > +	 * the CPU. The main implication is that we now need to somehow flush non-XA before
> > +	 * freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> > +	 * (like if Media suddenly turns on and does a full flush)
> > +	 */
> > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > +		return true;
> > +	return false;
> > +}
> > +
> >  void xe_device_l2_flush(struct xe_device *xe)
> >  {
> >  	struct xe_gt *gt;
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index 39464650533b..baf386e0e037 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >  
> > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >  void xe_device_td_flush(struct xe_device *xe);
> >  void xe_device_l2_flush(struct xe_device *xe);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > index e120323c43bc..b435ea7f9b66 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> >  				    false, MAX_SCHEDULE_TIMEOUT);
> >  	XE_WARN_ON(err <= 0);
> >  
> > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > +	    userptr->initial_bind) {

Same concern with the LR preempt fences as above: the hardware will be
interrupted via the preempt fences, so invalidating the TLBs doesn’t
seem necessary, but perhaps we still need a cache flush and a TLB
invalidation is the mechanism for that too?

Matt

> >  		err = xe_vm_invalidate_vma(vma);
> >  		XE_WARN_ON(err);
> >  	}
> > -- 
> > 2.52.0
> > 
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation


Thread overview: 42+ messages
2026-02-10 12:51 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
2026-02-10 12:51 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2026-02-10 21:05   ` Matt Roper
2026-02-11  0:02     ` Matthew Brost [this message]
2026-02-11 19:06       ` Upadhyay, Tejas
2026-02-11 21:11         ` Matt Roper
2026-02-12  9:53           ` Matthew Auld
2026-02-13 11:17             ` Upadhyay, Tejas
2026-02-13 13:27               ` Matthew Auld
2026-02-13 13:30                 ` Souza, Jose
2026-02-13 16:23           ` Upadhyay, Tejas
2026-02-13 16:48             ` Souza, Jose
2026-02-13 17:16               ` Matt Roper
2026-02-13 17:31                 ` Souza, Jose
2026-02-13 17:31                 ` Matthew Auld
2026-02-16 10:23                   ` Thomas Hellström
2026-02-16 10:58                     ` Matthew Auld
2026-02-16 12:07                       ` Thomas Hellström
2026-02-16 14:55                         ` Matthew Auld
2026-02-16 15:38                           ` Thomas Hellström
2026-02-16 16:41                             ` Matthew Auld
2026-02-17  6:19                               ` Upadhyay, Tejas
2026-02-17  9:53                                 ` Thomas Hellström
2026-02-17 17:04                               ` Thomas Hellström
2026-02-17 18:41                                 ` Matthew Auld
2026-02-16 10:56             ` Thomas Hellström
2026-02-16 11:26               ` Upadhyay, Tejas
2026-02-13 17:29           ` Matthew Auld
2026-02-10 12:51 ` [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature Tejas Upadhyay
2026-02-10 12:51 ` [PATCH 3/3] drm/xe/xe3p: Skip TD flush Tejas Upadhyay
2026-02-10 21:22   ` Matt Roper
2026-02-13 11:06     ` Upadhyay, Tejas
2026-02-10 13:35 ` ✗ CI.checkpatch: warning for drm/xe/xe3p_lpg: L2 flush optimization (rev2) Patchwork
2026-02-10 13:36 ` ✓ CI.KUnit: success " Patchwork
2026-02-10 14:33 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-10 18:06 ` ✗ Xe.CI.FULL: " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2025-11-25 10:17   ` Matthew Auld
2025-11-25 13:39     ` Souza, Jose
2025-11-25 15:06   ` Thomas Hellström
2025-11-25 15:31     ` Upadhyay, Tejas
2025-11-26 10:26       ` Thomas Hellström
