Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Auld <matthew.auld@intel.com>
To: Matt Roper <matthew.d.roper@intel.com>,
	"Upadhyay, Tejas" <tejas.upadhyay@intel.com>
Cc: "Brost, Matthew" <matthew.brost@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"thomas.hellstrom@linux.intel.com"
	<thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
Date: Fri, 13 Feb 2026 17:29:51 +0000	[thread overview]
Message-ID: <5657d6e8-ab61-482f-9e84-74e4b2ff244b@intel.com> (raw)
In-Reply-To: <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com>

On 11/02/2026 21:11, Matt Roper wrote:
> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>
>>
>>> -----Original Message-----
>>> From: Brost, Matthew <matthew.brost@intel.com>
>>> Sent: 11 February 2026 05:32
>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
>>> thomas.hellstrom@linux.intel.com
>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>> cachelines manually
>>>
>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>> the
>>>>> CPU and GPU
>>>>
>>>> I'm pretty sure this expansion of "XA" is wrong; where are you seeing
>>>> this definition?  Everything in the bspec indicates that XA means "wb
>>>> - transient app" (similar to how "XD" is 'wb - transient display").
>>>> I'm not sure why exactly they picked "X" to refer to transient in both
>>>> of these cases, but I've never seen any documentation that refers to
>>>> it as "extended."
>>>>
>>>>> is treated differently from other GPU memory when the Media engine is
>>> power-gated.
>>>>>
>>>>> XA is *always* flushed, like at the end-of-submssion (and maybe
>>>>> other
>>>>
>>>> I assume you're referring to the fact that the driver performs flushes
>>>> at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
>>>> depending on other state/optimizations in the system, those flushes
>>>> may flush the entire device cache, or may only flush the subset of
>>>> cache data that is not marked as transient.  The way you worded this
>>>> was confusing since it makes it sound like cache flushes happen
>>>> automatically somewhere in hardware/firmware.
>>>>
>>>>> places), just that internally as an optimisation hw doesn't need to
>>>>> make that a full flush (which will also include XA) when Media is
>>>>> off/powergated, since it doesn't need to worry about GT caches vs
>>>>> Media coherency, and only CPU vs GPU coherency, so can make that
>>>>> flush a targeted XA flush, since stuff tagged with XA now means it's
>>>>> shared with the CPU. The main implication is that we now need to
>>>>> somehow flush non-XA before freeing system memory pages, otherwise
>>>>> dirty cachelines could be flushed after the free (like if Media
>>>>> suddenly turns on and does a full flush)
>>>>
>>>> This description seems really confusing.  My understanding is that
>>>> marking something as wb-transient-app indicates that it might be
>>>> accessed by something other than our graphics/media IP (i.e., accessed
>>>> from the CPU, exported to another device, etc.), so transient data
>>>> truly does need to be flushed at the points in the driver where a
>>>> flush typically happens.
>>>>
>>>> However when something is _not_ transient, then either:
>>>>   - it's "private" to the GPU and only our graphics/media IP will be
>>>>     accessing it
>>>>   - it's bound with a coherent PAT index so that outside observers like
>>>>     the CPU can snoop the device cache, even when the cache hasn't been
>>>>     flushed
>>>>
>>>> If media is not active, then there's really no need to include
>>>> non-transient data when an device cache flush happens since there's no
>>>> real need for the data to get to RAM.  So that enables an optimization
>>>> (which comes in your next patch), that allows flushes to only operate
>>>> on the subset of the device cache tagged as "transient" if media is idle.
>>
>> But what If we have stale non-XA marked pages for userptr, and that
>> object moves out and at the same time media comes back, will end up in
>> full flush and flush the stale entry to RAM.
> 
> What makes userptr special here?  During general, active usage, userptr
> would be data that's accessible by the CPU, so it needs to either be
> transient (so CPU can see the data in RAM after explicit flushes) or it
> needs to be using a coherent PAT (so that the CPU can just snoop the GPU
> cache).  If you marked userptr as both non-XA and non-coherent, then
> that sounds likely to be a userspace bug (and probably something we can
> catch and reject as an invalid case on any Xe3p or later platforms that
> support this) since the CPU wouldn't have any reliable way of seeing GPU
> updates.

Yes, we have always rejected coh_none + useptr, if someone tries that 
with vm_bind.

My understanding is that you can now no longer use vanilla 1WAY and 
expect it to always be flushed from the GPU caches when the fence has 
signalled i.e end of submission, like when media is off. So you have to 
only allow 2WAY or XA for userptr, or you need a manual sync flush here, 
for security, before the pages potentially vanish.

> 
> If something happens that changes the GTT mapping of an object, then
> doesn't that already trigger a TLB invalidation when necessary in the
> driver today?  It was my understanding that "heavy" TLB invalidations

Right, but the driver is clever and in some cases postpones that TLB 
flush (with PPC bit set) to the next re-bind, so when you next call the 
exec() ioctl, for example. I think the idea is you then only need one 
flush, instead of two. But that is way too late for this, as we need to 
do the flush before the memory is potentially freed. Otherwise the 
memory can be freed and when media next wakes up or you trigger full GPU 
flush on next exec() you potentially corrupt the next user of that memory.

For useptr case, this patch was just trying to make that TLB flush (with 
PPC bit) happen immediately before we dma unmap the physical pages, and 
before the memory is potentially freed.


> wait for data values to be globally observable before starting, so I
> think that would ensure that any non-XA data makes it to RAM before any
> binding changes, object, destruction, etc.?  Is there something special
> about userptr that makes that case more of a problem?
> 
> I just found bspec page 74635 which gives an overview of the various
> flush and invalidate cases, and I don't see anything there that makes it
> obvious to me that userptr would be special.
> 
> 
>>
>>>>
>>>> As you said, we eventually do want to force a flush of the
>>>> non-transient data as well once we're freeing the underlying pages.
>>>> So how do we do that?  It's not clear to me how the changes below are
>>>> accomplishing that.  Is there a way to explicitly request a full
>>>> device cache flush (ignoring the transient vs non-transient tagging)?
>>>> Since the GuC handles the optimization in the next patch (toggling
>>>> whether flushes are full flushes vs non-transient flushes depending on
>>>> whether media is active), I thought there might be some kind of GuC
>>>> interface to request "please do one full flush now, even if media is idle."
>>>>
>>>
>>> I’m not an expert here by any means, but everything above from Matt seems
>>> like valid concerns. Thomas also raised some concerns in the two previous
>>> revisions; again I’m not an expert, but reading through those, it doesn’t really
>>> seem like he received proper answers to his questions.
>>
>> Its forcing flush via tlb invalidation PPC flag under xe_invalidate_vma( ).
> 
> By the way, what is "PPC?"  It seems like it's another new synonym for
> the device cache?  It's already really confusing that some of our
> hardware docs use a mix of both "L2" and "L3" to refer to the same
> device cache for historical reasons...
> 
> 
> Matt
> 
>>
>>>
>>> A couple of comments below.
>>>
>>>>
>>>> Matt
>>>>
>>>>>
>>>>> V2(MattA): Expand commit description
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>   drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
>>>>> *xe, struct xe_bo *bo,
>>>>>
>>>>>   		if (!xe_vm_in_fault_mode(vm)) {
>>>>>   			drm_gpuvm_bo_evict(vm_bo, true);
>>>>> -			continue;
>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>> +				continue;
>>>
>>> This will trigger a TLB invalidation (and I assume a cache flush) every time we
>>> move or free memory in the 3D stack if it has a binding. It also performs a
>>> synchronous wait on the BO being idle. Both of these are very expensive
>>> operations. I can’t imagine the granularity we want here is to do this on every
>>> move/free with bindings.
>>>
>>> Also, for LR compute with preempt fences, we would trigger the preempt
>>> fences during the wait, so a TLB invalidation after this seems unnecessary,
>>> though perhaps the cache flush is still required?
>>>
>>> I think this needs a bit more explanation, because without knowing a lot about
>>> the exact requirements, the implementation does not look correct.
>>
>> The thing is that we are trying to solve problem with userptr with non-XA pat, consider if that BO got moved while media is not active. As soon as media will come back active, stale cached entries of that object will be flushed as part of full flush , which may corrupt things.
>> There was thinking that with this patch we would at least solve the problem of corruption and later when page_reclamation feature comes in will help in performance as well. But now when page reclamation feature is merged earlier and it tightly coupled with bind/unbind some cases like discussed above (which are not doing unbind immediately on move/free) are missed in reclamation.
>>   
>> So thought was to let this solution go in with little perf hit and discuss with page reclamation owner to come with cleaner solution together.
>>
>> Tejas
>>>
>>>>>   		}
>>>>>
>>>>>   		if (!idle) {
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>> b/drivers/gpu/drm/xe/xe_device.c index 743c18e0c580..da2abed94bc0
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device
>>> *xe)
>>>>>   	}
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be
>>>>> +flushed
>>>>> + * @xe: The device to check.
>>>>> + *
>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>> + */
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe) {
>>>>> +	/* XA is *always* flushed, like at the end-of-submssion (and maybe
>>> other
>>>>> +	 * places), just that internally as an optimisation hw doesn't need to
>>> make
>>>>> +	 * that a full flush (which will also include XA) when Media is
>>>>> +	 * off/powergated, since it doesn't need to worry about GT caches vs
>>> Media
>>>>> +	 * coherency, and only CPU vs GPU coherency, so can make that flush
>>> a
>>>>> +	 * targeted XA flush, since stuff tagged with XA now means it's shared
>>> with
>>>>> +	 * the CPU. The main implication is that we now need to somehow
>>> flush non-XA before
>>>>> +	 * freeing system memory pages, otherwise dirty cachelines could be
>>> flushed after the free
>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>> +	 */
>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>> +		return true;
>>>>> +	return false;
>>>>> +}
>>>>> +
>>>>>   void xe_device_l2_flush(struct xe_device *xe)  {
>>>>>   	struct xe_gt *gt;
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>> b/drivers/gpu/drm/xe/xe_device.h index 39464650533b..baf386e0e037
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device
>>>>> *xe, struct drm_printer *p);
>>>>>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
>>>>> address);
>>>>>
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>   void xe_device_td_flush(struct xe_device *xe);  void
>>>>> xe_device_l2_flush(struct xe_device *xe);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>> b/drivers/gpu/drm/xe/xe_userptr.c index e120323c43bc..b435ea7f9b66
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
>>> *vm, struct xe_userptr_vma *uv
>>>>>   				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>   	XE_WARN_ON(err <= 0);
>>>>>
>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
>>>> xe)) &&
>>>>> +	    userptr->initial_bind) {
>>>
>>> Same concern with the LR preempt fence as above — the hardware will be
>>> interrupted via preempt fences, so it doesn’t seem necessary to invalidate the
>>> TLBs but perhaps we need a cflush and TLB invalidation is the mechanism for
>>> that too?
>>>
>>> Matt
>>>
>>>>>   		err = xe_vm_invalidate_vma(vma);
>>>>>   		XE_WARN_ON(err);
>>>>>   	}
>>>>> --
>>>>> 2.52.0
>>>>>
>>>>
>>>> --
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
> 


  parent reply	other threads:[~2026-02-13 17:29 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-10 12:51 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
2026-02-10 12:51 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2026-02-10 21:05   ` Matt Roper
2026-02-11  0:02     ` Matthew Brost
2026-02-11 19:06       ` Upadhyay, Tejas
2026-02-11 21:11         ` Matt Roper
2026-02-12  9:53           ` Matthew Auld
2026-02-13 11:17             ` Upadhyay, Tejas
2026-02-13 13:27               ` Matthew Auld
2026-02-13 13:30                 ` Souza, Jose
2026-02-13 16:23           ` Upadhyay, Tejas
2026-02-13 16:48             ` Souza, Jose
2026-02-13 17:16               ` Matt Roper
2026-02-13 17:31                 ` Souza, Jose
2026-02-13 17:31                 ` Matthew Auld
2026-02-16 10:23                   ` Thomas Hellström
2026-02-16 10:58                     ` Matthew Auld
2026-02-16 12:07                       ` Thomas Hellström
2026-02-16 14:55                         ` Matthew Auld
2026-02-16 15:38                           ` Thomas Hellström
2026-02-16 16:41                             ` Matthew Auld
2026-02-17  6:19                               ` Upadhyay, Tejas
2026-02-17  9:53                                 ` Thomas Hellström
2026-02-17 17:04                               ` Thomas Hellström
2026-02-17 18:41                                 ` Matthew Auld
2026-02-16 10:56             ` Thomas Hellström
2026-02-16 11:26               ` Upadhyay, Tejas
2026-02-13 17:29           ` Matthew Auld [this message]
2026-02-10 12:51 ` [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature Tejas Upadhyay
2026-02-10 12:51 ` [PATCH 3/3] drm/xe/xe3p: Skip TD flush Tejas Upadhyay
2026-02-10 21:22   ` Matt Roper
2026-02-13 11:06     ` Upadhyay, Tejas
2026-02-10 13:35 ` ✗ CI.checkpatch: warning for drm/xe/xe3p_lpg: L2 flush optimization (rev2) Patchwork
2026-02-10 13:36 ` ✓ CI.KUnit: success " Patchwork
2026-02-10 14:33 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-10 18:06 ` ✗ Xe.CI.FULL: " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2025-11-25 10:17   ` Matthew Auld
2025-11-25 13:39     ` Souza, Jose
2025-11-25 15:06   ` Thomas Hellström
2025-11-25 15:31     ` Upadhyay, Tejas
2025-11-26 10:26       ` Thomas Hellström

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5657d6e8-ab61-482f-9e84-74e4b2ff244b@intel.com \
    --to=matthew.auld@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=matthew.d.roper@intel.com \
    --cc=tejas.upadhyay@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox