From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8ce35a23-b639-4c4f-acfd-993c4f9d5008@intel.com>
Date: Mon, 16 Feb 2026 10:58:31 +0000
Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
To: Thomas Hellström, Matt Roper, "Souza, Jose"
Cc: "Upadhyay, Tejas", "Mrozek, Michal", "intel-xe@lists.freedesktop.org", "Brost, Matthew"
References: <20260210125120.1329411-5-tejas.upadhyay@intel.com> <20260210125120.1329411-6-tejas.upadhyay@intel.com> <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com> <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com> <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com> <75dcc80b39ed33a7abc620b2614b0e81586a6299.camel@linux.intel.com>
From: Matthew Auld
In-Reply-To: <75dcc80b39ed33a7abc620b2614b0e81586a6299.camel@linux.intel.com>
List-Id: Intel Xe graphics driver

On 16/02/2026 10:23, Thomas Hellström wrote:
> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>> On 13/02/2026 17:16, Matt Roper wrote:
>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Roper, Matthew D
>>>>>> Sent: 12 February 2026 02:41
>>>>>> To: Upadhyay, Tejas
>>>>>> Cc: Brost, Matthew; intel-xe@lists.freedesktop.org; Auld, Matthew;
>>>>>> thomas.hellstrom@linux.intel.com
>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>>> cachelines manually
>>>>>>
>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Brost, Matthew
>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>> To: Roper, Matthew D
>>>>>>>> Cc: Upadhyay, Tejas; intel-xe@lists.freedesktop.org; Auld, Matthew;
>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>>>>> cachelines manually
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>>>>>>>>> the CPU and GPU
>>>>>>>>>
>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are you
>>>>>>>>> seeing this definition?  Everything in the bspec indicates that XA
>>>>>>>>> means "wb - transient app" (similar to how "XD" is "wb - transient
>>>>>>>>> display").  I'm not sure why exactly they picked "X" to refer to
>>>>>>>>> transient in both of these cases, but I've never seen any
>>>>>>>>> documentation that refers to it as "extended."
>>>>>>>>>
>>>>>>>>>> is treated differently from other GPU memory when the Media
>>>>>>>>>> engine is power-gated.
>>>>>>>>>>
>>>>>>>>>> XA is *always* flushed, like at the end-of-submission (and maybe
>>>>>>>>>> other
>>>>>>>>>
>>>>>>>>> I assume you're referring to the fact that the driver performs
>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL or
>>>>>>>>> MI_FLUSH_DW), and that depending on other state/optimizations in
>>>>>>>>> the system, those flushes may flush the entire device cache, or
>>>>>>>>> may only flush the subset of cache data that is not marked as
>>>>>>>>> transient.  The way you worded this was confusing since it makes
>>>>>>>>> it sound like cache flushes happen automatically somewhere in
>>>>>>>>> hardware/firmware.
>>>>>>>>>
>>>>>>>>>> places), just that internally as an optimisation hw doesn't need
>>>>>>>>>> to make that a full flush (which will also include XA) when Media
>>>>>>>>>> is off/powergated, since it doesn't need to worry about GT caches
>>>>>>>>>> vs Media coherency, and only CPU vs GPU coherency, so can make
>>>>>>>>>> that flush a targeted XA flush, since stuff tagged with XA now
>>>>>>>>>> means it's shared with the CPU. The main implication is that we
>>>>>>>>>> now need to somehow flush non-XA before freeing system memory
>>>>>>>>>> pages, otherwise dirty cachelines could be flushed after the free
>>>>>>>>>> (like if Media suddenly turns on and does a full flush)
>>>>>>>>>
>>>>>>>>> This description seems really confusing.
>>>>>>>>> My understanding is that marking something as wb-transient-app
>>>>>>>>> indicates that it might be accessed by something other than our
>>>>>>>>> graphics/media IP (i.e., accessed from the CPU, exported to
>>>>>>>>> another device, etc.), so transient data truly does need to be
>>>>>>>>> flushed at the points in the driver where a flush typically
>>>>>>>>> happens.
>>>>>>>>>
>>>>>>>>> However when something is _not_ transient, then either:
>>>>>>>>>   - it's "private" to the GPU and only our graphics/media IP will
>>>>>>>>>     be accessing it
>>>>>>>>>   - it's bound with a coherent PAT index so that outside observers
>>>>>>>>>     like the CPU can snoop the device cache, even when the cache
>>>>>>>>>     hasn't been flushed
>>>>>>>>>
>>>>>>>>> If media is not active, then there's really no need to include
>>>>>>>>> non-transient data when a device cache flush happens since there's
>>>>>>>>> no real need for the data to get to RAM.  So that enables an
>>>>>>>>> optimization (which comes in your next patch), that allows flushes
>>>>>>>>> to only operate on the subset of the device cache tagged as
>>>>>>>>> "transient" if media is idle.
>>>>>>>
>>>>>>> But what if we have stale non-XA marked pages for userptr, and that
>>>>>>> object moves out and at the same time media comes back, we will end
>>>>>>> up in a full flush and flush the stale entry to RAM.
>>>>>>
>>>>>> What makes userptr special here?  During general, active usage,
>>>>>> userptr would be data that's accessible by the CPU, so it needs to
>>>>>> either be transient (so the CPU can see the data in RAM after
>>>>>> explicit flushes) or it needs to be using a coherent PAT (so that
>>>>>> the CPU can just snoop the GPU cache).
>>>>>> If you marked userptr as both non-XA and non-coherent, then that
>>>>>> sounds likely to be a userspace bug (and probably something we can
>>>>>> catch and reject as an invalid case on any Xe3p or later platforms
>>>>>> that support this) since the CPU wouldn't have any reliable way of
>>>>>> seeing GPU updates.
>>>>>
>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>> For userptr, as explained above, it needs to be either a coherent or
>>>>> an XA pat index, or else the KMD will reject it as an invalid case.
>>>>>
>>>> The coherency restriction is already in the uAPI:
>>>>
>>>> "Note: For userptr and externally imported dma-buf the kernel expects
>>>> either 1WAY or 2WAY for the @pat_index."
>>>>
>>>> Using 1-way is enough as the Xe KMD does a PIPE_CONTROL flushing GPU
>>>> caches at the end of batch buffers.
>>>
>>> But isn't that what we're discussing here?  1-way *won't* necessarily
>>> be enough anymore because PIPE_CONTROL instructions don't flush the
>>> entire cache anymore.  Whenever the GuC determines that media is
>>> inactive and activates the optimization, PIPE_CONTROL, MI_FLUSH_DW,
>>> etc. change behavior to only flush out the subset of data that was
>>> marked as app-transient; anything not marked that way doesn't get
>>> flushed now.  So there's a new requirement here that you ensure you're
>>> using an XA PAT index, or you switch to use 2-way coherency which will
>>> allow the CPU to snoop the GPU's caches.
>>
>> That exactly matches my understanding also.
>
> This only ever affects IGFX, right? Since AFAIU we don't have 2-way
> coherency with DGFX?

Yeah, this should be igpu only.
I seem to also recall that on dgpu, Media is coherent with l2/l3, but
also I don't think system memory can be cached in l2/l3 (only VRAM),
which I assume is why there is the special SMRO (system-memory-read-only)
cache only on dgpu, which is flushed when the fence signals, unlike the
l2/l3.

>
> It sounds like the same PAT restriction is needed also for imported
> dma-buf, right?

Good point. Looks like we are still missing that. Otherwise we can run
into the same issues with stale l2/l3/ppc.

>
> /Thomas
>
>>
>>> Matt
>>>
>>>>>
>>>>>> If something happens that changes the GTT mapping of an object,
>>>>>> then doesn't that already trigger a TLB invalidation when necessary
>>>>>> in the driver today?  It was my understanding that "heavy" TLB
>>>>>> invalidations wait for data values to be globally observable before
>>>>>> starting, so I think that would ensure that any non-XA data makes it
>>>>>> to RAM before any binding changes, object destruction, etc.?  Is
>>>>>> there something special about userptr that makes that case more of
>>>>>> a problem?
>>>>>>
>>>>>> I just found bspec page 74635 which gives an overview of the various
>>>>>> flush and invalidate cases, and I don't see anything there that
>>>>>> makes it obvious to me that userptr would be special.
>>>>>>
>>>>>>>>>
>>>>>>>>> As you said, we eventually do want to force a flush of the
>>>>>>>>> non-transient data as well once we're freeing the underlying
>>>>>>>>> pages.  So how do we do that?  It's not clear to me how the
>>>>>>>>> changes below are accomplishing that.  Is there a way to
>>>>>>>>> explicitly request a full device cache flush (ignoring the
>>>>>>>>> transient vs non-transient tagging)?
>>>>>>>>> Since the GuC handles the optimization in the next patch
>>>>>>>>> (toggling whether flushes are full flushes vs non-transient
>>>>>>>>> flushes depending on whether media is active), I thought there
>>>>>>>>> might be some kind of GuC interface to request "please do one
>>>>>>>>> full flush now, even if media is idle."
>>>>>>>>
>>>>>>>> I'm not an expert here by any means, but everything above from
>>>>>>>> Matt seems like valid concerns. Thomas also raised some concerns
>>>>>>>> in the two previous revisions; again I'm not an expert, but
>>>>>>>> reading through those, it doesn't really seem like he received
>>>>>>>> proper answers to his questions.
>>>>>>>
>>>>>>> It's forcing a flush via the TLB invalidation PPC flag under
>>>>>>> xe_invalidate_vma().
>>>>>>
>>>>>> By the way, what is "PPC"?  It seems like it's another new synonym
>>>>>> for the device cache?  It's already really confusing that some of
>>>>>> our hardware docs use a mix of both "L2" and "L3" to refer to the
>>>>>> same device cache for historical reasons...
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>>
>>>>>>>> A couple of comments below.
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>  4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>
>>>>>>>>>>  		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>  			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>>>> -			continue;
>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>>>> +				continue;
>>>>>
>>>>> Matt R,
>>>>> This flush will still be needed, as there can be non-XA buffers which
>>>>> get evicted while media was off, and their stale entries can then be
>>>>> flushed when media comes back on. This was not the case earlier, as a
>>>>> full flush was happening at regular sync points, and that's where
>>>>> this feature is bringing an optimization now.
>>>>>
>>>>> Tejas
>>>>>
>>>>>>>>
>>>>>>>> This will trigger a TLB invalidation (and I assume a cache flush)
>>>>>>>> every time we move or free memory in the 3D stack if it has a
>>>>>>>> binding. It also performs a synchronous wait on the BO being idle.
>>>>>>>> Both of these are very expensive operations. I can't imagine the
>>>>>>>> granularity we want here is to do this on every move/free with
>>>>>>>> bindings.
>>>>>>>>
>>>>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>>>>> preempt fences during the wait, so a TLB invalidation after this
>>>>>>>> seems unnecessary, though perhaps the cache flush is still
>>>>>>>> required?
>>>>>>>>
>>>>>>>> I think this needs a bit more explanation, because without
>>>>>>>> knowing a lot about the exact requirements, the implementation
>>>>>>>> does not look correct.
>>>>>>>
>>>>>>> The thing is that we are trying to solve a problem with userptr
>>>>>>> with a non-XA pat; consider if that BO got moved while media is
>>>>>>> not active. As soon as media comes back active, stale cached
>>>>>>> entries of that object will be flushed as part of a full flush,
>>>>>>> which may corrupt things.
>>>>>>> The thinking was that with this patch we would at least solve the
>>>>>>> corruption problem, and later the page_reclamation feature would
>>>>>>> help with performance as well. But now that the page reclamation
>>>>>>> feature was merged earlier and is tightly coupled with bind/unbind,
>>>>>>> some cases like those discussed above (which do not unbind
>>>>>>> immediately on move/free) are missed in reclamation.
>>>>>>>
>>>>>>> So the thought was to let this solution go in with a little perf
>>>>>>> hit and discuss with the page reclamation owner to come up with a
>>>>>>> cleaner solution together.
>>>>>>>
>>>>>>> Tejas
>>>>>>>>
>>>>>>>>>>  		}
>>>>>>>>>>
>>>>>>>>>>  		if (!idle) {
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>>>  	}
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>> + *
>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>>>> + */
>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>>>> +{
>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and
>>>>>>>>>> +	 * maybe other places), just that internally as an optimisation
>>>>>>>>>> +	 * hw doesn't need to make that a full flush (which will also
>>>>>>>>>> +	 * include XA) when Media is off/powergated, since it doesn't
>>>>>>>>>> +	 * need to worry about GT caches vs Media coherency, and only
>>>>>>>>>> +	 * CPU vs GPU coherency, so can make that flush a targeted XA
>>>>>>>>>> +	 * flush, since stuff tagged with XA now means it's shared with
>>>>>>>>>> +	 * the CPU.
>>>>>>>>>> +	 * The main implication is that we now need to somehow flush
>>>>>>>>>> +	 * non-XA before freeing system memory pages, otherwise dirty
>>>>>>>>>> +	 * cachelines could be flushed after the free (like if Media
>>>>>>>>>> +	 * suddenly turns on and does a full flush)
>>>>>>>>>> +	 */
>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>>>> +		return true;
>>>>>>>>>> +	return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>  {
>>>>>>>>>>  	struct xe_gt *gt;
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>>>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>
>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>>>  void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>>>  				     false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>  	XE_WARN_ON(err <= 0);
>>>>>>>>>>
>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) ||
>>>>>>>>>> +	     xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>
>>>>>>>> Same concern with the LR preempt fence as above — the hardware
>>>>>>>> will be interrupted via preempt fences, so it doesn't seem
>>>>>>>> necessary to invalidate the TLBs, but perhaps we need a cflush and
>>>>>>>> TLB invalidation is the mechanism for that too?
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>>>  		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>  		XE_WARN_ON(err);
>>>>>>>>>>  	}
>>>>>>>>>> --
>>>>>>>>>> 2.52.0
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Matt Roper
>>>>>>>>> Graphics Software Engineer
>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>> Intel Corporation
>>>>>>
>>>>>> --
>>>>>> Matt Roper
>>>>>> Graphics Software Engineer
>>>>>> Linux GPU Platform Enablement
>>>>>> Intel Corporation