From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id CC950E81A2F for ; Mon, 16 Feb 2026 14:55:07 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7C15C10E13F; Mon, 16 Feb 2026 14:55:07 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="C1zJhwvj"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) by gabe.freedesktop.org (Postfix) with ESMTPS id B2A4E10E13F for ; Mon, 16 Feb 2026 14:55:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1771253706; x=1802789706; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=BrFAqswcvo3eAE24XHkURucHaYeaOlCvoyy0C/6X/hk=; b=C1zJhwvjV7CbO7c8Cc6qWO+d41jybQy2Np3Uaspmn5kV35YzSgbuJAbU CqaJ8fFWcIG7DFo0jgnI2QSiFP/0wXcpsGorXFjIoIkGektBlD2weIFOX Z7z07WzfOdpzt0wJokQ9F/ArBCN/0TGSPSXakI+51J+WP9V7uPnhMWKsq YWNqHMa/kBMlRMtidUDb57itD1E1ZbMsdBjuh5HjJkXniAv5Ed2MkwsFC ny6NZ1zxqcmqhse93Uh7ntP6x/n2beXLdN68ij7QXvsAYRCYTg6rsxddY t4q/rnc/hS7fqvhA3APMooWWp+EprgduQ+t3ja4A/P9FuRTRusKglYXq7 g==; X-CSE-ConnectionGUID: pLesLGP7RlSeuRYNDfhEYg== X-CSE-MsgGUID: sApLU28ZTGyBP/a4YLqdpQ== X-IronPort-AV: E=McAfee;i="6800,10657,11702"; a="83771834" X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="83771834" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Feb 2026 06:55:05 -0800 X-CSE-ConnectionGUID: B+DJtMjyS2e6b0P9wG8WFg== X-CSE-MsgGUID: Be0w15kFQZqWoBmpX7LZhw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="213473012" Received: from ijarvine-mobl1.ger.corp.intel.com (HELO [10.245.244.239]) ([10.245.244.239]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Feb 2026 06:55:03 -0800 Message-ID: <3da2dfce-9bbf-42cd-a178-24a6c44b18c3@intel.com> Date: Mon, 16 Feb 2026 14:55:01 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually To: =?UTF-8?Q?Thomas_Hellstr=C3=B6m?= , Matt Roper , "Souza, Jose" Cc: "Upadhyay, Tejas" , "Mrozek, Michal" , "intel-xe@lists.freedesktop.org" , "Brost, Matthew" References: <20260210125120.1329411-5-tejas.upadhyay@intel.com> <20260210125120.1329411-6-tejas.upadhyay@intel.com> <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com> <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com> <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com> <75dcc80b39ed33a7abc620b2614b0e81586a6299.camel@linux.intel.com> <8ce35a23-b639-4c4f-acfd-993c4f9d5008@intel.com> <9ce0acf02e8ef63bf81cfbb9e1053bfa1437362f.camel@linux.intel.com> Content-Language: en-GB From: Matthew Auld In-Reply-To: <9ce0acf02e8ef63bf81cfbb9e1053bfa1437362f.camel@linux.intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On 16/02/2026 12:07, Thomas Hellström wrote: > On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote: >> On 16/02/2026 10:23, Thomas Hellström wrote: >>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote: >>>> On 13/02/2026 17:16, Matt Roper wrote: >>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote: >>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote: >>>>>>> >>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Roper, Matthew D >>>>>>>> Sent: 12 February 2026 02:41 >>>>>>>> To: Upadhyay, Tejas >>>>>>>> Cc: Brost, Matthew ; intel- >>>>>>>> xe@lists.freedesktop.org; Auld, Matthew >>>>>>>> ; >>>>>>>> thomas.hellstrom@linux.intel.com >>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush >>>>>>>> userptr/shrinker bo >>>>>>>> cachelines manually >>>>>>>> >>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>>> -----Original Message----- >>>>>>>>>> From: Brost, Matthew >>>>>>>>>> Sent: 11 February 2026 05:32 >>>>>>>>>> To: Roper, Matthew D >>>>>>>>>> Cc: Upadhyay, Tejas ; >>>>>>>>>> intel- >>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew >>>>>>>>>> ; >>>>>>>>>> thomas.hellstrom@linux.intel.com >>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush >>>>>>>>>> userptr/shrinker bo >>>>>>>>>> cachelines manually >>>>>>>>>> >>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper >>>>>>>>>> wrote: >>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas >>>>>>>>>>> Upadhyay >>>>>>>>>>> wrote: >>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory >>>>>>>>>>>> shared >>>>>>>> between >>>>>>>>>> the >>>>>>>>>>>> CPU and GPU >>>>>>>>>>> >>>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong; >>>>>>>>>>> where >>>>>>>>>>> are >>>>>>>>>>> you >>>>>>>>>>> seeing this definition?  Everything in the bspec >>>>>>>>>>> indicates >>>>>>>>>>> that XA >>>>>>>>>>> means "wb >>>>>>>>>>> - transient app" (similar to how "XD" is 'wb - >>>>>>>>>>> transient >>>>>>>>>>> display"). >>>>>>>>>>> I'm not sure why exactly they picked "X" to refer >>>>>>>>>>> to >>>>>>>>>>> transient in >>>>>>>>>>> both of these cases, but I've never seen any >>>>>>>>>>> documentation >>>>>>>>>>> that >>>>>>>>>>> refers to it as "extended." >>>>>>>>>>> >>>>>>>>>>>> is treated differently from other GPU memory when >>>>>>>>>>>> the >>>>>>>>>>>> Media >>>>>>>>>>>> engine is >>>>>>>>>> power-gated. >>>>>>>>>>>> >>>>>>>>>>>> XA is *always* flushed, like at the end-of- >>>>>>>>>>>> submssion >>>>>>>>>>>> (and >>>>>>>>>>>> maybe >>>>>>>>>>>> other >>>>>>>>>>> >>>>>>>>>>> I assume you're referring to the fact that the >>>>>>>>>>> driver >>>>>>>>>>> performs >>>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL >>>>>>>>>>> or >>>>>>>>>>> MI_FLUSH_DW), and that depending on other >>>>>>>>>>> state/optimizations >>>>>>>>>>> in >>>>>>>>>>> the system, those flushes may flush the entire >>>>>>>>>>> device >>>>>>>>>>> cache, >>>>>>>>>>> or >>>>>>>>>>> may only flush the subset of cache data that is not >>>>>>>>>>> marked as >>>>>>>>>>> transient.  The way you worded this was confusing >>>>>>>>>>> since >>>>>>>>>>> it >>>>>>>>>>> makes >>>>>>>>>>> it sound like cache flushes happen automatically >>>>>>>>>>> somewhere in >>>>>>>> hardware/firmware. >>>>>>>>>>> >>>>>>>>>>>> places), just that internally as an optimisation >>>>>>>>>>>> hw >>>>>>>>>>>> doesn't >>>>>>>>>>>> need >>>>>>>>>>>> to make that a full flush (which will also >>>>>>>>>>>> include >>>>>>>>>>>> XA) when >>>>>>>>>>>> Media is off/powergated, since it doesn't need to >>>>>>>>>>>> worry >>>>>>>>>>>> about GT >>>>>>>>>>>> caches vs Media coherency, and only CPU vs GPU >>>>>>>>>>>> coherency, >>>>>>>>>>>> so can >>>>>>>>>>>> make that flush a targeted XA flush, since stuff >>>>>>>>>>>> tagged >>>>>>>>>>>> with XA >>>>>>>>>>>> now means it's shared with the CPU. The main >>>>>>>>>>>> implication is >>>>>>>>>>>> that >>>>>>>>>>>> we now need to somehow flush non-XA before >>>>>>>>>>>> freeing >>>>>>>>>>>> system >>>>>>>>>>>> memory >>>>>>>>>>>> pages, otherwise dirty cachelines could be >>>>>>>>>>>> flushed >>>>>>>>>>>> after >>>>>>>>>>>> the >>>>>>>>>>>> free (like if Media suddenly turns on and does a >>>>>>>>>>>> full >>>>>>>>>>>> flush) >>>>>>>>>>> >>>>>>>>>>> This description seems really confusing.  My >>>>>>>>>>> understanding is >>>>>>>>>>> that >>>>>>>>>>> marking something as wb-transient-app indicates >>>>>>>>>>> that it >>>>>>>>>>> might >>>>>>>>>>> be >>>>>>>>>>> accessed by something other than our graphics/media >>>>>>>>>>> IP >>>>>>>>>>> (i.e., >>>>>>>>>>> accessed from the CPU, exported to another device, >>>>>>>>>>> etc.), so >>>>>>>>>>> transient data truly does need to be flushed at the >>>>>>>>>>> points in >>>>>>>>>>> the >>>>>>>>>>> driver where a flush typically happens. >>>>>>>>>>> >>>>>>>>>>> However when something is _not_ transient, then >>>>>>>>>>> either: >>>>>>>>>>>    - it's "private" to the GPU and only our >>>>>>>>>>> graphics/media IP >>>>>>>>>>> will be >>>>>>>>>>>      accessing it >>>>>>>>>>>    - it's bound with a coherent PAT index so that >>>>>>>>>>> outside >>>>>>>>>>> observers like >>>>>>>>>>>      the CPU can snoop the device cache, even when >>>>>>>>>>> the >>>>>>>>>>> cache >>>>>>>>>>> hasn't been >>>>>>>>>>>      flushed >>>>>>>>>>> >>>>>>>>>>> If media is not active, then there's really no need >>>>>>>>>>> to >>>>>>>>>>> include >>>>>>>>>>> non-transient data when an device cache flush >>>>>>>>>>> happens >>>>>>>>>>> since >>>>>>>>>>> there's no real need for the data to get to RAM. >>>>>>>>>>> So >>>>>>>>>>> that >>>>>>>>>>> enables >>>>>>>>>>> an optimization (which comes in your next patch), >>>>>>>>>>> that >>>>>>>>>>> allows >>>>>>>>>>> flushes to only operate on the subset of the device >>>>>>>>>>> cache >>>>>>>>>>> tagged as >>>>>>>> "transient" if media is idle. >>>>>>>>> >>>>>>>>> But what If we have stale non-XA marked pages for >>>>>>>>> userptr, >>>>>>>>> and >>>>>>>>> that >>>>>>>>> object moves out and at the same time media comes back, >>>>>>>>> will end >>>>>>>>> up in >>>>>>>>> full flush and flush the stale entry to RAM. >>>>>>>> >>>>>>>> What makes userptr special here?  During general, active >>>>>>>> usage, >>>>>>>> userptr would >>>>>>>> be data that's accessible by the CPU, so it needs to >>>>>>>> either >>>>>>>> be >>>>>>>> transient (so CPU >>>>>>>> can see the data in RAM after explicit flushes) or it >>>>>>>> needs >>>>>>>> to be >>>>>>>> using a >>>>>>>> coherent PAT (so that the CPU can just snoop the GPU >>>>>>>> cache). >>>>>>>> If >>>>>>>> you marked >>>>>>>> userptr as both non-XA and non-coherent, then that sounds >>>>>>>> likely to >>>>>>>> be a >>>>>>>> userspace bug (and probably something we can catch and >>>>>>>> reject >>>>>>>> as an >>>>>>>> invalid >>>>>>>> case on any Xe3p or later platforms that support this) >>>>>>>> since >>>>>>>> the >>>>>>>> CPU wouldn't >>>>>>>> have any reliable way of seeing GPU updates. >>>>>>> >>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose >>>>>>> For userptr, as explained above, it needs to be either >>>>>>> coherent >>>>>>> or XA >>>>>>> pat index, or else KMD will reject as invalid case. >>>>>>> >>>>>>> >>>>>> The coherency restriction is already in the uAPI: >>>>>> >>>>>> "Note: For userptr and externally imported dma-buf the kernel >>>>>> expects >>>>>> either 1WAY or 2WAY for the @pat_index." >>>>>> >>>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing >>>>>> GPU >>>>>> caches >>>>>> at the end of batch buffers. >>>>> >>>>> But isn't that what we're discussing here?  1-way *won't* >>>>> necessarily be >>>>> enough anymore because PIPE_CONTROL instructions don't flush >>>>> the >>>>> entire >>>>> cache anymore.  Whenever the GuC determines that media is >>>>> inactive >>>>> and >>>>> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. >>>>> change >>>>> behavior to only flush out the subset of data that was marked >>>>> as >>>>> app-transient; anything not marked that way doesn't get flushed >>>>> now.  So >>>>> there's a new requirement here that you ensure you're using an >>>>> XA >>>>> PAT >>>>> index, or you switch to use 2-way coherency which will allow >>>>> the >>>>> CPU to >>>>> snoop the GPU's caches. >>>> >>>> That exactly matches my understanding also. >>> >>> This only ever affects IGFX, right? Since AFAIU we don't have 2-way >>> coherency with DGFX? >> >> Yeah, this should be igpu only. I seem to also recall that on dgpu, >> Media is coherent with l2/l3, but also I don't think system memory >> can >> be cached in l2/l3 (only VRAM), which I assume is why there is the >> special SMRO (system-memory-read-only) cache only on dgpu, which is >> flushed when the fence signals, unlike the l2/l3. > > Yes that sounds reasonable. > >> >>> >>> It sounds like the same PAT restriction is needed also for imported >>> dma-buf, right? >> >> Good point. Looks like we are missing that still. Otherwise we can >> run >> into the same issues with stale l2/l3/ppc. > > So if this affects only system memory could we instead of relying on 2- > way coherency or XA, just flush at dma unmap time, because that's > typically just before releasing the pages. Yeah, I think we could make it work, from security pov, similar to userptr, with the right manual flushes in KMD. Maybe just a question if userspace wants such a model? Anything cached in l2/l3 might require manual flushing by userspace (if that is even possible)? > > The exception, though, is dma-buf where the exporter can actually > release memory before all importers have given up their dma-mappings. > > /Thomas > >> >>> >>> /Thomas >>> >>> >>>> >>>>> >>>>> >>>>> Matt >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> If something happens that changes the GTT mapping of an >>>>>>>> object, >>>>>>>> then >>>>>>>> doesn't that already trigger a TLB invalidation when >>>>>>>> necessary in >>>>>>>> the driver >>>>>>>> today?  It was my understanding that "heavy" TLB >>>>>>>> invalidations wait >>>>>>>> for data >>>>>>>> values to be globally observable before starting, so I >>>>>>>> think >>>>>>>> that >>>>>>>> would ensure >>>>>>>> that any non-XA data makes it to RAM before any binding >>>>>>>> changes, >>>>>>>> object, >>>>>>>> destruction, etc.?  Is there something special about >>>>>>>> userptr >>>>>>>> that >>>>>>>> makes that >>>>>>>> case more of a problem? >>>>>>>> >>>>>>>> I just found bspec page 74635 which gives an overview of >>>>>>>> the >>>>>>>> various flush >>>>>>>> and invalidate cases, and I don't see anything there that >>>>>>>> makes it >>>>>>>> obvious to >>>>>>>> me that userptr would be special. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> As you said, we eventually do want to force a flush >>>>>>>>>>> of >>>>>>>>>>> the >>>>>>>>>>> non-transient data as well once we're freeing the >>>>>>>>>>> underlying >>>>>>>>>>> pages. >>>>>>>>>>> So how do we do that?  It's not clear to me how the >>>>>>>>>>> changes >>>>>>>>>>> below >>>>>>>>>>> are accomplishing that.  Is there a way to >>>>>>>>>>> explicitly >>>>>>>>>>> request >>>>>>>>>>> a >>>>>>>>>>> full device cache flush (ignoring the transient vs >>>>>>>>>>> non- >>>>>>>>>>> transient tagging)? >>>>>>>>>>> Since the GuC handles the optimization in the next >>>>>>>>>>> patch >>>>>>>>>>> (toggling >>>>>>>>>>> whether flushes are full flushes vs non-transient >>>>>>>>>>> flushes >>>>>>>>>>> depending on whether media is active), I thought >>>>>>>>>>> there >>>>>>>>>>> might >>>>>>>>>>> be >>>>>>>>>>> some kind of GuC interface to request "please do >>>>>>>>>>> one >>>>>>>>>>> full >>>>>>>>>>> flush now, even >>>>>>>> if media is idle." >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I’m not an expert here by any means, but everything >>>>>>>>>> above >>>>>>>>>> from >>>>>>>>>> Matt >>>>>>>>>> seems like valid concerns. Thomas also raised some >>>>>>>>>> concerns in >>>>>>>>>> the >>>>>>>>>> two previous revisions; again I’m not an expert, but >>>>>>>>>> reading >>>>>>>>>> through >>>>>>>>>> those, it doesn’t really seem like he received proper >>>>>>>>>> answers >>>>>>>>>> to his >>>>>>>> questions. >>>>>>>>> >>>>>>>>> Its forcing flush via tlb invalidation PPC flag under >>>>>>>>> xe_invalidate_vma( ). >>>>>>>> >>>>>>>> By the way, what is "PPC?"  It seems like it's another >>>>>>>> new >>>>>>>> synonym >>>>>>>> for the >>>>>>>> device cache?  It's already really confusing that some of >>>>>>>> our >>>>>>>> hardware docs use >>>>>>>> a mix of both "L2" and "L3" to refer to the same device >>>>>>>> cache >>>>>>>> for >>>>>>>> historical >>>>>>>> reasons... >>>>>>>> >>>>>>>> >>>>>>>> Matt >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> A couple of comments below. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Matt >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> V2(MattA): Expand commit description >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Tejas Upadhyay >>>>>>>>>>>> >>>>>>>>>>>> --- >>>>>>>>>>>>    drivers/gpu/drm/xe/xe_bo.c      |  3 ++- >>>>>>>>>>>>    drivers/gpu/drm/xe/xe_device.c  | 23 >>>>>>>>>>>> +++++++++++++++++++++++ >>>>>>>>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 + >>>>>>>>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++- >>>>>>>>>>>>    4 files changed, 28 insertions(+), 2 >>>>>>>>>>>> deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c >>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_bo.c index >>>>>>>>>>>> e9180b01a4e4..4455886b211e >>>>>>>>>>>> 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c >>>>>>>>>>>> @@ -689,7 +689,8 @@ static int >>>>>>>>>>>> xe_bo_trigger_rebind(struct >>>>>>>>>>>> xe_device *xe, struct xe_bo *bo, >>>>>>>>>>>> >>>>>>>>>>>>     if (!xe_vm_in_fault_mode(vm)) { >>>>>>>>>>>>     drm_gpuvm_bo_evict(vm_bo >>>>>>>>>>>> , >>>>>>>>>>>> true); >>>>>>>>>>>> - continue; >>>>>>>>>>>> + if >>>>>>>>>>>> (!xe_device_needs_cache_flush(xe)) >>>>>>>>>>>> + continue; >>>>>>> >>>>>>> Matt R, >>>>>>> This flush will be still needed as there can be non-xa >>>>>>> buffers >>>>>>> which >>>>>>> can be evicted while media was off and stale entries can be >>>>>>> flushed >>>>>>> when media comes back on. Which was not case earlier as >>>>>>> full >>>>>>> flush >>>>>>> was happening at regular sync points and that’s where this >>>>>>> feature is >>>>>>> bringing optimization now. >>>>>>> >>>>>>> Tejas >>>>>>> >>>>>>>>>> >>>>>>>>>> This will trigger a TLB invalidation (and I assume a >>>>>>>>>> cache >>>>>>>>>> flush) >>>>>>>>>> every time we move or free memory in the 3D stack if >>>>>>>>>> it >>>>>>>>>> has a >>>>>>>>>> binding. It also performs a synchronous wait on the >>>>>>>>>> BO >>>>>>>>>> being >>>>>>>>>> idle. >>>>>>>>>> Both of these are very expensive operations. I can’t >>>>>>>>>> imagine >>>>>>>>>> the >>>>>>>>>> granularity we want here is to do this on every >>>>>>>>>> move/free >>>>>>>>>> with >>>>>>>>>> bindings. >>>>>>>>>> >>>>>>>>>> Also, for LR compute with preempt fences, we would >>>>>>>>>> trigger the >>>>>>>>>> preempt fences during the wait, so a TLB invalidation >>>>>>>>>> after >>>>>>>>>> this >>>>>>>>>> seems unnecessary, though perhaps the cache flush is >>>>>>>>>> still >>>>>>>>>> required? >>>>>>>>>> >>>>>>>>>> I think this needs a bit more explanation, because >>>>>>>>>> without >>>>>>>>>> knowing a >>>>>>>>>> lot about the exact requirements, the implementation >>>>>>>>>> does >>>>>>>>>> not >>>>>>>>>> look >>>>>>>> correct. >>>>>>>>> >>>>>>>>> The thing is that we are trying to solve problem with >>>>>>>>> userptr >>>>>>>>> with non-XA >>>>>>>> pat, consider if that BO got moved while media is not >>>>>>>> active. >>>>>>>> As >>>>>>>> soon as media >>>>>>>> will come back active, stale cached entries of that >>>>>>>> object >>>>>>>> will be >>>>>>>> flushed as part >>>>>>>> of full flush , which may corrupt things. >>>>>>>>> There was thinking that with this patch we would at >>>>>>>>> least >>>>>>>>> solve >>>>>>>>> the problem >>>>>>>> of corruption and later when page_reclamation feature >>>>>>>> comes >>>>>>>> in will >>>>>>>> help in >>>>>>>> performance as well. But now when page reclamation >>>>>>>> feature is >>>>>>>> merged earlier >>>>>>>> and it tightly coupled with bind/unbind some cases like >>>>>>>> discussed >>>>>>>> above >>>>>>>> (which are not doing unbind immediately on move/free) are >>>>>>>> missed in >>>>>>>> reclamation. >>>>>>>>> >>>>>>>>> So thought was to let this solution go in with little >>>>>>>>> perf >>>>>>>>> hit >>>>>>>>> and discuss with >>>>>>>> page reclamation owner to come with cleaner solution >>>>>>>> together. >>>>>>>>> >>>>>>>>> Tejas >>>>>>>>>> >>>>>>>>>>>>     } >>>>>>>>>>>> >>>>>>>>>>>>     if (!idle) { >>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c >>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.c index >>>>>>>>>>>> 743c18e0c580..da2abed94bc0 >>>>>>>>>>>> 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c >>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void >>>>>>>>>>>> tdf_request_sync(struct >>>>>>>>>>>> xe_device >>>>>>>>>> *xe) >>>>>>>>>>>>     } >>>>>>>>>>>>    } >>>>>>>>>>>> >>>>>>>>>>>> +/** >>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the >>>>>>>>>>>> cache >>>>>>>>>>>> needs >>>>>>>>>>>> to be >>>>>>>>>>>> +flushed >>>>>>>>>>>> + * @xe: The device to check. >>>>>>>>>>>> + * >>>>>>>>>>>> + * Return: true if the device needs cache flush, >>>>>>>>>>>> false >>>>>>>>>>>> otherwise. >>>>>>>>>>>> + */ >>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct >>>>>>>>>>>> xe_device >>>>>>>>>>>> *xe) { >>>>>>>>>>>> + /* XA is *always* flushed, like at the >>>>>>>>>>>> end- >>>>>>>>>>>> of- >>>>>>>>>>>> submssion (and >>>>>>>>>>>> +maybe >>>>>>>>>> other >>>>>>>>>>>> + * places), just that internally as an >>>>>>>>>>>> optimisation hw doesn't >>>>>>>>>>>> +need to >>>>>>>>>> make >>>>>>>>>>>> + * that a full flush (which will also >>>>>>>>>>>> include XA) >>>>>>>>>>>> when Media is >>>>>>>>>>>> + * off/powergated, since it doesn't need >>>>>>>>>>>> to >>>>>>>>>>>> worry >>>>>>>>>>>> about GT >>>>>>>>>>>> +caches vs >>>>>>>>>> Media >>>>>>>>>>>> + * coherency, and only CPU vs GPU >>>>>>>>>>>> coherency, >>>>>>>>>>>> so >>>>>>>>>>>> can make >>>>>>>> that >>>>>>>>>>>> +flush >>>>>>>>>> a >>>>>>>>>>>> + * targeted XA flush, since stuff tagged >>>>>>>>>>>> with XA >>>>>>>>>>>> now means >>>>>>>>>>>> +it's shared >>>>>>>>>> with >>>>>>>>>>>> + * the CPU. The main implication is that >>>>>>>>>>>> we >>>>>>>>>>>> now >>>>>>>>>>>> need to >>>>>>>>>>>> +somehow >>>>>>>>>> flush non-XA before >>>>>>>>>>>> + * freeing system memory pages, >>>>>>>>>>>> otherwise >>>>>>>>>>>> dirty >>>>>>>>>>>> cachelines >>>>>>>>>>>> +could be >>>>>>>>>> flushed after the free >>>>>>>>>>>> + * (like if Media suddenly turns on and >>>>>>>>>>>> does >>>>>>>>>>>> a >>>>>>>>>>>> full flush) >>>>>>>>>>>> + */ >>>>>>>>>>>> + if (GRAPHICS_VER(xe) >= 35 && >>>>>>>>>>>> !IS_DGFX(xe)) >>>>>>>>>>>> + return true; >>>>>>>>>>>> + return false; >>>>>>>>>>>> +} >>>>>>>>>>>> + >>>>>>>>>>>>    void xe_device_l2_flush(struct xe_device *xe) >>>>>>>>>>>> { >>>>>>>>>>>>     struct xe_gt *gt; >>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h >>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.h index >>>>>>>>>>>> 39464650533b..baf386e0e037 >>>>>>>>>>>> 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h >>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h >>>>>>>>>>>> @@ -184,6 +184,7 @@ void >>>>>>>>>>>> xe_device_snapshot_print(struct >>>>>>>>>>>> xe_device *xe, struct drm_printer *p); >>>>>>>>>>>>    u64 xe_device_canonicalize_addr(struct >>>>>>>>>>>> xe_device >>>>>>>>>>>> *xe, u64 >>>>>>>>>>>> address); >>>>>>>>>>>>    u64 xe_device_uncanonicalize_addr(struct >>>>>>>>>>>> xe_device >>>>>>>>>>>> *xe, >>>>>>>>>>>> u64 >>>>>>>>>>>> address); >>>>>>>>>>>> >>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct >>>>>>>>>>>> xe_device >>>>>>>>>>>> *xe); >>>>>>>>>>>>    void xe_device_td_flush(struct xe_device *xe); >>>>>>>>>>>> void >>>>>>>>>>>> xe_device_l2_flush(struct xe_device *xe); >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c >>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c index >>>>>>>>>>>> e120323c43bc..b435ea7f9b66 >>>>>>>>>>>> 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c >>>>>>>>>>>> @@ -114,7 +114,8 @@ static void >>>>>>>>>>>> __vma_userptr_invalidate(struct >>>>>>>>>>>> xe_vm >>>>>>>>>> *vm, struct xe_userptr_vma *uv >>>>>>>>>>>>         false, >>>>>>>>>>>> MAX_SCHEDULE_TIMEOUT); >>>>>>>>>>>>     XE_WARN_ON(err <= 0); >>>>>>>>>>>> >>>>>>>>>>>> - if (xe_vm_in_fault_mode(vm) && userptr- >>>>>>>>>>>>> initial_bind) { >>>>>>>>>>>> + if ((xe_vm_in_fault_mode(vm) || >>>>>>>>>>>> +xe_device_needs_cache_flush(vm- >>>>>>>>>>> xe)) && >>>>>>>>>>>> +     userptr->initial_bind) { >>>>>>>>>> >>>>>>>>>> Same concern with the LR preempt fence as above — the >>>>>>>>>> hardware >>>>>>>>>> will >>>>>>>>>> be interrupted via preempt fences, so it doesn’t seem >>>>>>>>>> necessary >>>>>>>>>> to >>>>>>>>>> invalidate the TLBs but perhaps we need a cflush and >>>>>>>>>> TLB >>>>>>>>>> invalidation is the mechanism for that too? >>>>>>>>>> >>>>>>>>>> Matt >>>>>>>>>> >>>>>>>>>>>>     err = xe_vm_invalidate_vma(vma); >>>>>>>>>>>>     XE_WARN_ON(err); >>>>>>>>>>>>     } >>>>>>>>>>>> -- >>>>>>>>>>>> 2.52.0 >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Matt Roper >>>>>>>>>>> Graphics Software Engineer >>>>>>>>>>> Linux GPU Platform Enablement >>>>>>>>>>> Intel Corporation >>>>>>>> >>>>>>>> -- >>>>>>>> Matt Roper >>>>>>>> Graphics Software Engineer >>>>>>>> Linux GPU Platform Enablement >>>>>>>> Intel Corporation >>>>>