From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 13 Feb 2026 17:31:48 +0000
Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
 cachelines manually
From: Matthew Auld
To: Matt Roper, "Souza, Jose"
Cc: "Upadhyay, Tejas", "Mrozek, Michal",
 "intel-xe@lists.freedesktop.org", "Brost, Matthew",
 "thomas.hellstrom@linux.intel.com"
References: <20260210125120.1329411-5-tejas.upadhyay@intel.com>
 <20260210125120.1329411-6-tejas.upadhyay@intel.com>
 <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com>
 <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com>
 <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com>
In-Reply-To: <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
List-Id: Intel Xe graphics driver
Sender: "Intel-xe"

On 13/02/2026 17:16, Matt Roper wrote:
> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Roper, Matthew D
>>>> Sent: 12 February 2026 02:41
>>>> To: Upadhyay, Tejas
>>>> Cc: Brost, Matthew; intel-xe@lists.freedesktop.org; Auld, Matthew;
>>>> thomas.hellstrom@linux.intel.com
>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>> cachelines manually
>>>>
>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
>>>> wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Brost, Matthew
>>>>>> Sent: 11 February 2026 05:32
>>>>>> To: Roper, Matthew D
>>>>>> Cc: Upadhyay, Tejas; intel-xe@lists.freedesktop.org; Auld, Matthew;
>>>>>> thomas.hellstrom@linux.intel.com
>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>>> cachelines manually
>>>>>>
>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>>>>>>> the CPU and GPU
>>>>>>>
>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are you
>>>>>>> seeing this definition?  Everything in the bspec indicates that XA
>>>>>>> means "wb - transient app" (similar to how "XD" is "wb - transient
>>>>>>> display").  I'm not sure why exactly they picked "X" to refer to
>>>>>>> transient in both of these cases, but I've never seen any
>>>>>>> documentation that refers to it as "extended."
>>>>>>>
>>>>>>>> is treated differently from other GPU memory when the Media
>>>>>>>> engine is power-gated.
>>>>>>>>
>>>>>>>> XA is *always* flushed, like at the end-of-submission (and maybe
>>>>>>>> other
>>>>>>>
>>>>>>> I assume you're referring to the fact that the driver performs
>>>>>>> flushes at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW),
>>>>>>> and that depending on other state/optimizations in the system,
>>>>>>> those flushes may flush the entire device cache, or may only flush
>>>>>>> the subset of cache data that is not marked as transient.  The way
>>>>>>> you worded this was confusing since it makes it sound like cache
>>>>>>> flushes happen automatically somewhere in hardware/firmware.
>>>>>>>
>>>>>>>> places), just that internally as an optimisation hw doesn't need
>>>>>>>> to make that a full flush (which will also include XA) when Media
>>>>>>>> is off/powergated, since it doesn't need to worry about GT caches
>>>>>>>> vs Media coherency, and only CPU vs GPU coherency, so can make
>>>>>>>> that flush a targeted XA flush, since stuff tagged with XA now
>>>>>>>> means it's shared with the CPU. The main implication is that we
>>>>>>>> now need to somehow flush non-XA before freeing system memory
>>>>>>>> pages, otherwise dirty cachelines could be flushed after the free
>>>>>>>> (like if Media suddenly turns on and does a full flush)
>>>>>>>
>>>>>>> This description seems really confusing.  My understanding is that
>>>>>>> marking something as wb-transient-app indicates that it might be
>>>>>>> accessed by something other than our graphics/media IP (i.e.,
>>>>>>> accessed from the CPU, exported to another device, etc.), so
>>>>>>> transient data truly does need to be flushed at the points in the
>>>>>>> driver where a flush typically happens.
>>>>>>>
>>>>>>> However when something is _not_ transient, then either:
>>>>>>>  - it's "private" to the GPU and only our graphics/media IP will be
>>>>>>>    accessing it
>>>>>>>  - it's bound with a coherent PAT index so that outside observers
>>>>>>>    like the CPU can snoop the device cache, even when the cache
>>>>>>>    hasn't been flushed
>>>>>>>
>>>>>>> If media is not active, then there's really no need to include
>>>>>>> non-transient data when a device cache flush happens since there's
>>>>>>> no real need for the data to get to RAM.  So that enables an
>>>>>>> optimization (which comes in your next patch) that allows flushes
>>>>>>> to only operate on the subset of the device cache tagged as
>>>>>>> "transient" if media is idle.
>>>>>
>>>>> But what if we have stale non-XA marked pages for userptr, and that
>>>>> object moves out and at the same time media comes back, we will end
>>>>> up with a full flush that flushes the stale entry to RAM.
>>>>
>>>> What makes userptr special here?  During general, active usage,
>>>> userptr would be data that's accessible by the CPU, so it needs to
>>>> either be transient (so the CPU can see the data in RAM after
>>>> explicit flushes) or it needs to be using a coherent PAT (so that the
>>>> CPU can just snoop the GPU cache).  If you marked userptr as both
>>>> non-XA and non-coherent, then that sounds likely to be a userspace
>>>> bug (and probably something we can catch and reject as an invalid
>>>> case on any Xe3p or later platforms that support this) since the CPU
>>>> wouldn't have any reliable way of seeing GPU updates.
>>>
>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>> For userptr, as explained above, it needs to use either a coherent or
>>> an XA pat index, or else the KMD will reject it as an invalid case.
>>>
>>>
>> The coherency restriction is already in the uAPI:
>>
>> "Note: For userptr and externally imported dma-buf the kernel expects
>> either 1WAY or 2WAY for the @pat_index."
>>
>> Using 1-way is enough as the Xe KMD does a PIPE_CONTROL flushing GPU
>> caches at the end of batch buffers.
>
> But isn't that what we're discussing here?  1-way *won't* necessarily
> be enough anymore because PIPE_CONTROL instructions don't flush the
> entire cache anymore.  Whenever the GuC determines that media is
> inactive and activates the optimization, PIPE_CONTROL, MI_FLUSH_DW,
> etc. change behavior to only flush out the subset of data that was
> marked as app-transient; anything not marked that way doesn't get
> flushed now.  So there's a new requirement here that you ensure you're
> using an XA PAT index, or you switch to use 2-way coherency, which will
> allow the CPU to snoop the GPU's caches.

That exactly matches my understanding also.
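To make the tightened rule concrete, here is a toy sketch of the validity check being described. All names here are invented for illustration (this is not the actual Xe uAPI or KMD code): on platforms where flushes can be narrowed to app-transient (XA) lines while media is idle, a userptr PAT index is only safe if it is XA-tagged or 2-way coherent, whereas the old uAPI rule accepted 1-way as well:

```c
#include <stdbool.h>

/* Hypothetical model of a PAT entry's properties (not the real tables). */
enum coh_mode { COH_NONE, COH_1WAY, COH_2WAY };

struct pat_entry {
	bool xa_tagged;		/* "wb - transient app" */
	enum coh_mode coh;
};

/*
 * transient_flush_opt models the GuC optimization: when enabled, end-of-batch
 * flushes only write back XA-tagged cachelines while media is power-gated.
 */
static bool userptr_pat_valid(const struct pat_entry *pat,
			      bool transient_flush_opt)
{
	if (!transient_flush_opt)
		return pat->coh != COH_NONE;	/* old rule: 1WAY or 2WAY */

	/* new rule: the CPU must see GPU writes via XA flushes or snooping */
	return pat->xa_tagged || pat->coh == COH_2WAY;
}
```

The interesting case is a 1-way, non-XA entry: valid under the old rule, invalid once the optimization can kick in.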
>
>
> Matt
>
>>
>>>
>>>> If something happens that changes the GTT mapping of an object, then
>>>> doesn't that already trigger a TLB invalidation when necessary in the
>>>> driver today?  It was my understanding that "heavy" TLB invalidations
>>>> wait for data values to be globally observable before starting, so I
>>>> think that would ensure that any non-XA data makes it to RAM before
>>>> any binding changes, object destruction, etc.?  Is there something
>>>> special about userptr that makes that case more of a problem?
>>>>
>>>> I just found bspec page 74635 which gives an overview of the various
>>>> flush and invalidate cases, and I don't see anything there that makes
>>>> it obvious to me that userptr would be special.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> As you said, we eventually do want to force a flush of the
>>>>>>> non-transient data as well once we're freeing the underlying
>>>>>>> pages.  So how do we do that?  It's not clear to me how the
>>>>>>> changes below are accomplishing that.  Is there a way to
>>>>>>> explicitly request a full device cache flush (ignoring the
>>>>>>> transient vs non-transient tagging)?  Since the GuC handles the
>>>>>>> optimization in the next patch (toggling whether flushes are full
>>>>>>> flushes vs non-transient flushes depending on whether media is
>>>>>>> active), I thought there might be some kind of GuC interface to
>>>>>>> request "please do one full flush now, even if media is idle."
>>>>>>>
>>>>>>
>>>>>> I'm not an expert here by any means, but everything above from Matt
>>>>>> seems like valid concerns.  Thomas also raised some concerns in the
>>>>>> two previous revisions; again I'm not an expert, but reading
>>>>>> through those, it doesn't really seem like he received proper
>>>>>> answers to his questions.
>>>>>
>>>>> It's forcing a flush via the TLB invalidation PPC flag under
>>>>> xe_vm_invalidate_vma().
>>>>
>>>> By the way, what is "PPC"?  It seems like it's another new synonym
>>>> for the device cache?  It's already really confusing that some of our
>>>> hardware docs use a mix of both "L2" and "L3" to refer to the same
>>>> device cache for historical reasons...
>>>>
>>>>
>>>> Matt
>>>>
>>>>>
>>>>>>
>>>>>> A couple of comments below.
>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>>
>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>
>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>>>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>  4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>
>>>>>>>>  	if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>  		drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>> -		continue;
>>>>>>>> +		if (!xe_device_needs_cache_flush(xe))
>>>>>>>> +			continue;
>>>
>>> Matt R,
>>> This flush will still be needed, as there can be non-XA buffers which
>>> get evicted while media is off, and their stale entries can be flushed
>>> when media comes back on.  This was not the case earlier, since a full
>>> flush was happening at regular sync points, and that's exactly where
>>> this feature brings its optimization now.
>>>
>>> Tejas
>>>
>>>>>>
>>>>>> This will trigger a TLB invalidation (and I assume a cache flush)
>>>>>> every time we move or free memory in the 3D stack if it has a
>>>>>> binding.
>>>>>> It also performs a synchronous wait on the BO being idle.  Both of
>>>>>> these are very expensive operations.  I can't imagine the
>>>>>> granularity we want here is to do this on every move/free with
>>>>>> bindings.
>>>>>>
>>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>>> preempt fences during the wait, so a TLB invalidation after this
>>>>>> seems unnecessary, though perhaps the cache flush is still
>>>>>> required?
>>>>>>
>>>>>> I think this needs a bit more explanation, because without knowing
>>>>>> a lot about the exact requirements, the implementation does not
>>>>>> look correct.
>>>>>
>>>>> The thing is that we are trying to solve a problem with userptr
>>>>> using a non-XA pat: consider if that BO got moved while media is not
>>>>> active.  As soon as media comes back active, stale cached entries of
>>>>> that object will be flushed as part of the full flush, which may
>>>>> corrupt things.
>>>>> The thinking was that with this patch we would at least solve the
>>>>> corruption problem, and later the page_reclamation feature would
>>>>> help with performance as well.  But now that the page reclamation
>>>>> feature was merged earlier and is tightly coupled with bind/unbind,
>>>>> some cases like those discussed above (which do not unbind
>>>>> immediately on move/free) are missed in reclamation.
>>>>>
>>>>> So the thought was to let this solution go in with a small perf hit
>>>>> and discuss with the page reclamation owner to come up with a
>>>>> cleaner solution together.
>>>>>
>>>>> Tejas
>>>>>>
>>>>>>>>  	}
>>>>>>>>
>>>>>>>>  	if (!idle) {
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>  	}
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>> + * @xe: The device to check.
>>>>>>>> + *
>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>> + */
>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>> +{
>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and
>>>>>>>> +	 * maybe other places), just that internally as an optimisation
>>>>>>>> +	 * hw doesn't need to make that a full flush (which will also
>>>>>>>> +	 * include XA) when Media is off/powergated, since it doesn't
>>>>>>>> +	 * need to worry about GT caches vs Media coherency, and only
>>>>>>>> +	 * CPU vs GPU coherency, so can make that flush a targeted XA
>>>>>>>> +	 * flush, since stuff tagged with XA now means it's shared with
>>>>>>>> +	 * the CPU.  The main implication is that we now need to
>>>>>>>> +	 * somehow flush non-XA before freeing system memory pages,
>>>>>>>> +	 * otherwise dirty cachelines could be flushed after the free
>>>>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>>>>> +	 */
>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>> +		return true;
>>>>>>>> +	return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>  {
>>>>>>>>  	struct xe_gt *gt;
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>
>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>  void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>  					    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>  	XE_WARN_ON(err <= 0);
>>>>>>>>
>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>> +	    userptr->initial_bind) {
>>>>>>
>>>>>> Same concern
>>>>>> with the LR preempt fence as above: the hardware will be
>>>>>> interrupted via preempt fences, so it doesn't seem necessary to
>>>>>> invalidate the TLBs, but perhaps we need a cflush, and TLB
>>>>>> invalidation is the mechanism for that too?
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>>  		err = xe_vm_invalidate_vma(vma);
>>>>>>>>  		XE_WARN_ON(err);
>>>>>>>>  	}
>>>>>>>> --
>>>>>>>> 2.52.0
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Matt Roper
>>>>>>> Graphics Software Engineer
>>>>>>> Linux GPU Platform Enablement
>>>>>>> Intel Corporation
>>>>
>>>> --
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
>
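For what it's worth, the free-before-flush hazard this thread keeps circling back to can be modeled in a few lines. This is purely illustrative, none of these names exist in the driver: with media idle, a flush skips dirty non-XA lines; if the backing page is freed in that window and media later powers on, the next full flush writes the stale line back over a page the kernel may have reused.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a device cacheline (invented for illustration). */
struct line {
	bool dirty;
	bool xa;		/* tagged "wb - transient app" */
	bool page_freed;	/* backing system page already freed */
};

/*
 * Model of the flush behavior: with media active it is a full flush;
 * with media power-gated only XA-tagged lines are written back.
 * Returns how many write-backs landed on already-freed pages.
 */
static int flush_cache(struct line *lines, size_t n, bool media_active)
{
	int stale_writes = 0;

	for (size_t i = 0; i < n; i++) {
		if (!lines[i].dirty)
			continue;
		if (!media_active && !lines[i].xa)
			continue;	/* skipped by the transient-only flush */
		if (lines[i].page_freed)
			stale_writes++;	/* write after free: corruption */
		lines[i].dirty = false;
	}
	return stale_writes;
}
```

The sequence "evict/free while media is off, then media powers on" is exactly the case where `stale_writes` goes non-zero, which is what the manual non-XA flush in this patch is trying to prevent.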