From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <intel-xe-bounces@lists.freedesktop.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id CC950E81A2F
	for <intel-xe@archiver.kernel.org>; Mon, 16 Feb 2026 14:55:07 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 7C15C10E13F;
	Mon, 16 Feb 2026 14:55:07 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
	dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="C1zJhwvj";
	dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12])
 by gabe.freedesktop.org (Postfix) with ESMTPS id B2A4E10E13F
 for <intel-xe@lists.freedesktop.org>; Mon, 16 Feb 2026 14:55:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1771253706; x=1802789706;
 h=message-id:date:mime-version:subject:to:cc:references:
 from:in-reply-to:content-transfer-encoding;
 bh=BrFAqswcvo3eAE24XHkURucHaYeaOlCvoyy0C/6X/hk=;
 b=C1zJhwvjV7CbO7c8Cc6qWO+d41jybQy2Np3Uaspmn5kV35YzSgbuJAbU
 CqaJ8fFWcIG7DFo0jgnI2QSiFP/0wXcpsGorXFjIoIkGektBlD2weIFOX
 Z7z07WzfOdpzt0wJokQ9F/ArBCN/0TGSPSXakI+51J+WP9V7uPnhMWKsq
 YWNqHMa/kBMlRMtidUDb57itD1E1ZbMsdBjuh5HjJkXniAv5Ed2MkwsFC
 ny6NZ1zxqcmqhse93Uh7ntP6x/n2beXLdN68ij7QXvsAYRCYTg6rsxddY
 t4q/rnc/hS7fqvhA3APMooWWp+EprgduQ+t3ja4A/P9FuRTRusKglYXq7 g==;
X-CSE-ConnectionGUID: pLesLGP7RlSeuRYNDfhEYg==
X-CSE-MsgGUID: sApLU28ZTGyBP/a4YLqdpQ==
X-IronPort-AV: E=McAfee;i="6800,10657,11702"; a="83771834"
X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="83771834"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
 by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 16 Feb 2026 06:55:05 -0800
X-CSE-ConnectionGUID: B+DJtMjyS2e6b0P9wG8WFg==
X-CSE-MsgGUID: Be0w15kFQZqWoBmpX7LZhw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="213473012"
Received: from ijarvine-mobl1.ger.corp.intel.com (HELO [10.245.244.239])
 ([10.245.244.239])
 by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 16 Feb 2026 06:55:03 -0800
Message-ID: <3da2dfce-9bbf-42cd-a178-24a6c44b18c3@intel.com>
Date: Mon, 16 Feb 2026 14:55:01 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines
 manually
To: =?UTF-8?Q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>,
 Matt Roper <matthew.d.roper@intel.com>, "Souza, Jose" <jose.souza@intel.com>
Cc: "Upadhyay, Tejas" <tejas.upadhyay@intel.com>,
 "Mrozek, Michal" <michal.mrozek@intel.com>,
 "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
 "Brost, Matthew" <matthew.brost@intel.com>
References: <20260210125120.1329411-5-tejas.upadhyay@intel.com>
 <20260210125120.1329411-6-tejas.upadhyay@intel.com>
 <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com>
 <aYvHCeTd5pOFo2p5@lstrano-desk.jf.intel.com>
 <SJ1PR11MB6204CD2B54AD51B175F96F298163A@SJ1PR11MB6204.namprd11.prod.outlook.com>
 <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com>
 <SJ1PR11MB6204B6489254C702FB9A9D478161A@SJ1PR11MB6204.namprd11.prod.outlook.com>
 <f57d25f0131d124bdee35411d15513739c7d17c3.camel@intel.com>
 <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com>
 <a62d59e4-c77d-4e28-8227-be253733dd7b@intel.com>
 <75dcc80b39ed33a7abc620b2614b0e81586a6299.camel@linux.intel.com>
 <8ce35a23-b639-4c4f-acfd-993c4f9d5008@intel.com>
 <9ce0acf02e8ef63bf81cfbb9e1053bfa1437362f.camel@linux.intel.com>
Content-Language: en-GB
From: Matthew Auld <matthew.auld@intel.com>
In-Reply-To: <9ce0acf02e8ef63bf81cfbb9e1053bfa1437362f.camel@linux.intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-BeenThere: intel-xe@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel Xe graphics driver <intel-xe.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-xe>
List-Post: <mailto:intel-xe@lists.freedesktop.org>
List-Help: <mailto:intel-xe-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=subscribe>
Errors-To: intel-xe-bounces@lists.freedesktop.org
Sender: "Intel-xe" <intel-xe-bounces@lists.freedesktop.org>

On 16/02/2026 12:07, Thomas Hellström wrote:
> On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
>> On 16/02/2026 10:23, Thomas Hellström wrote:
>>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>>>> On 13/02/2026 17:16, Matt Roper wrote:
>>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>> Sent: 12 February 2026 02:41
>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>> userptr/shrinker bo
>>>>>>>> cachelines manually
>>>>>>>>
>>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
>>>>>>>>>> intel-
>>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>>>> userptr/shrinker bo
>>>>>>>>>> cachelines manually
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
>>>>>>>>>> wrote:
>>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
>>>>>>>>>>> Upadhyay
>>>>>>>>>>> wrote:
>>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory
>>>>>>>>>>>> shared
>>>>>>>> between
>>>>>>>>>> the
>>>>>>>>>>>> CPU and GPU
>>>>>>>>>>>
>>>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong;
>>>>>>>>>>> where
>>>>>>>>>>> are
>>>>>>>>>>> you
>>>>>>>>>>> seeing this definition?  Everything in the bspec
>>>>>>>>>>> indicates
>>>>>>>>>>> that XA
>>>>>>>>>>> means "wb
>>>>>>>>>>> - transient app" (similar to how "XD" is 'wb -
>>>>>>>>>>> transient
>>>>>>>>>>> display").
>>>>>>>>>>> I'm not sure why exactly they picked "X" to refer
>>>>>>>>>>> to
>>>>>>>>>>> transient in
>>>>>>>>>>> both of these cases, but I've never seen any
>>>>>>>>>>> documentation
>>>>>>>>>>> that
>>>>>>>>>>> refers to it as "extended."
>>>>>>>>>>>
>>>>>>>>>>>> is treated differently from other GPU memory when
>>>>>>>>>>>> the
>>>>>>>>>>>> Media
>>>>>>>>>>>> engine is
>>>>>>>>>> power-gated.
>>>>>>>>>>>>
>>>>>>>>>>>> XA is *always* flushed, like at the end-of-
>>>>>>>>>>>> submssion
>>>>>>>>>>>> (and
>>>>>>>>>>>> maybe
>>>>>>>>>>>> other
>>>>>>>>>>>
>>>>>>>>>>> I assume you're referring to the fact that the
>>>>>>>>>>> driver
>>>>>>>>>>> performs
>>>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL
>>>>>>>>>>> or
>>>>>>>>>>> MI_FLUSH_DW), and that depending on other
>>>>>>>>>>> state/optimizations
>>>>>>>>>>> in
>>>>>>>>>>> the system, those flushes may flush the entire
>>>>>>>>>>> device
>>>>>>>>>>> cache,
>>>>>>>>>>> or
>>>>>>>>>>> may only flush the subset of cache data that is not
>>>>>>>>>>> marked as
>>>>>>>>>>> transient.  The way you worded this was confusing
>>>>>>>>>>> since
>>>>>>>>>>> it
>>>>>>>>>>> makes
>>>>>>>>>>> it sound like cache flushes happen automatically
>>>>>>>>>>> somewhere in
>>>>>>>> hardware/firmware.
>>>>>>>>>>>
>>>>>>>>>>>> places), just that internally as an optimisation
>>>>>>>>>>>> hw
>>>>>>>>>>>> doesn't
>>>>>>>>>>>> need
>>>>>>>>>>>> to make that a full flush (which will also
>>>>>>>>>>>> include
>>>>>>>>>>>> XA) when
>>>>>>>>>>>> Media is off/powergated, since it doesn't need to
>>>>>>>>>>>> worry
>>>>>>>>>>>> about GT
>>>>>>>>>>>> caches vs Media coherency, and only CPU vs GPU
>>>>>>>>>>>> coherency,
>>>>>>>>>>>> so can
>>>>>>>>>>>> make that flush a targeted XA flush, since stuff
>>>>>>>>>>>> tagged
>>>>>>>>>>>> with XA
>>>>>>>>>>>> now means it's shared with the CPU. The main
>>>>>>>>>>>> implication is
>>>>>>>>>>>> that
>>>>>>>>>>>> we now need to somehow flush non-XA before
>>>>>>>>>>>> freeing
>>>>>>>>>>>> system
>>>>>>>>>>>> memory
>>>>>>>>>>>> pages, otherwise dirty cachelines could be
>>>>>>>>>>>> flushed
>>>>>>>>>>>> after
>>>>>>>>>>>> the
>>>>>>>>>>>> free (like if Media suddenly turns on and does a
>>>>>>>>>>>> full
>>>>>>>>>>>> flush)
>>>>>>>>>>>
>>>>>>>>>>> This description seems really confusing.  My
>>>>>>>>>>> understanding is
>>>>>>>>>>> that
>>>>>>>>>>> marking something as wb-transient-app indicates
>>>>>>>>>>> that it
>>>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>> accessed by something other than our graphics/media
>>>>>>>>>>> IP
>>>>>>>>>>> (i.e.,
>>>>>>>>>>> accessed from the CPU, exported to another device,
>>>>>>>>>>> etc.), so
>>>>>>>>>>> transient data truly does need to be flushed at the
>>>>>>>>>>> points in
>>>>>>>>>>> the
>>>>>>>>>>> driver where a flush typically happens.
>>>>>>>>>>>
>>>>>>>>>>> However when something is _not_ transient, then
>>>>>>>>>>> either:
>>>>>>>>>>>     - it's "private" to the GPU and only our
>>>>>>>>>>> graphics/media IP
>>>>>>>>>>> will be
>>>>>>>>>>>       accessing it
>>>>>>>>>>>     - it's bound with a coherent PAT index so that
>>>>>>>>>>> outside
>>>>>>>>>>> observers like
>>>>>>>>>>>       the CPU can snoop the device cache, even when
>>>>>>>>>>> the
>>>>>>>>>>> cache
>>>>>>>>>>> hasn't been
>>>>>>>>>>>       flushed
>>>>>>>>>>>
>>>>>>>>>>> If media is not active, then there's really no need
>>>>>>>>>>> to
>>>>>>>>>>> include
>>>>>>>>>>> non-transient data when an device cache flush
>>>>>>>>>>> happens
>>>>>>>>>>> since
>>>>>>>>>>> there's no real need for the data to get to RAM.
>>>>>>>>>>> So
>>>>>>>>>>> that
>>>>>>>>>>> enables
>>>>>>>>>>> an optimization (which comes in your next patch),
>>>>>>>>>>> that
>>>>>>>>>>> allows
>>>>>>>>>>> flushes to only operate on the subset of the device
>>>>>>>>>>> cache
>>>>>>>>>>> tagged as
>>>>>>>> "transient" if media is idle.
>>>>>>>>>
>>>>>>>>> But what If we have stale non-XA marked pages for
>>>>>>>>> userptr,
>>>>>>>>> and
>>>>>>>>> that
>>>>>>>>> object moves out and at the same time media comes back,
>>>>>>>>> will end
>>>>>>>>> up in
>>>>>>>>> full flush and flush the stale entry to RAM.
>>>>>>>>
>>>>>>>> What makes userptr special here?  During general, active
>>>>>>>> usage,
>>>>>>>> userptr would
>>>>>>>> be data that's accessible by the CPU, so it needs to
>>>>>>>> either
>>>>>>>> be
>>>>>>>> transient (so CPU
>>>>>>>> can see the data in RAM after explicit flushes) or it
>>>>>>>> needs
>>>>>>>> to be
>>>>>>>> using a
>>>>>>>> coherent PAT (so that the CPU can just snoop the GPU
>>>>>>>> cache).
>>>>>>>> If
>>>>>>>> you marked
>>>>>>>> userptr as both non-XA and non-coherent, then that sounds
>>>>>>>> likely to
>>>>>>>> be a
>>>>>>>> userspace bug (and probably something we can catch and
>>>>>>>> reject
>>>>>>>> as an
>>>>>>>> invalid
>>>>>>>> case on any Xe3p or later platforms that support this)
>>>>>>>> since
>>>>>>>> the
>>>>>>>> CPU wouldn't
>>>>>>>> have any reliable way of seeing GPU updates.
>>>>>>>
>>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>>>> For userptr, as explained above, it needs to be either
>>>>>>> coherent
>>>>>>> or XA
>>>>>>> pat index, or else KMD will reject as invalid case.
>>>>>>>
>>>>>>>
>>>>>> The coherency restriction is already in the uAPI:
>>>>>>
>>>>>> "Note: For userptr and externally imported dma-buf the kernel
>>>>>> expects
>>>>>> either 1WAY or 2WAY for the @pat_index."
>>>>>>
>>>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing
>>>>>> GPU
>>>>>> caches
>>>>>> at the end of batch buffers.
>>>>>
>>>>> But isn't that what we're discussing here?  1-way *won't*
>>>>> necessarily be
>>>>> enough anymore because PIPE_CONTROL instructions don't flush
>>>>> the
>>>>> entire
>>>>> cache anymore.  Whenever the GuC determines that media is
>>>>> inactive
>>>>> and
>>>>> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
>>>>> change
>>>>> behavior to only flush out the subset of data that was marked
>>>>> as
>>>>> app-transient; anything not marked that way doesn't get flushed
>>>>> now.  So
>>>>> there's a new requirement here that you ensure you're using an
>>>>> XA
>>>>> PAT
>>>>> index, or you switch to use 2-way coherency which will allow
>>>>> the
>>>>> CPU to
>>>>> snoop the GPU's caches.
>>>>
>>>> That exactly matches my understanding also.
>>>
>>> This only ever affects IGFX, right? Since AFAIU we don't have 2-way
>>> coherency with DGFX?
>>
>> Yeah, this should be igpu only. I seem to also recall that on dgpu,
>> Media is coherent with l2/l3, but also I don't think system memory
>> can
>> be cached in l2/l3 (only VRAM), which I assume is why there is the
>> special SMRO (system-memory-read-only) cache only on dgpu, which is
>> flushed when the fence signals, unlike the l2/l3.
> 
> Yes that sounds reasonable.
> 
>>
>>>
>>> It sounds like the same PAT restriction is needed also for imported
>>> dma-buf, right?
>>
>> Good point. Looks like we are missing that still. Otherwise we can
>> run
>> into the same issues with stale l2/l3/ppc.
> 
> So if this affects only system memory could we instead of relying on 2-
> way coherency or XA, just flush at dma unmap time, because that's
> typically just before releasing the pages.

Yeah, I think we could make it work, from security pov, similar to 
userptr, with the right manual flushes in KMD. Maybe just a question if 
userspace wants such a model? Anything cached in l2/l3 might require 
manual flushing by userspace (if that is even possible)?

> 
> The exception, though, is dma-buf where the exporter can actually
> release memory before all importers have given up their dma-mappings.
> 
> /Thomas
> 
>>
>>>
>>> /Thomas
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> If something happens that changes the GTT mapping of an
>>>>>>>> object,
>>>>>>>> then
>>>>>>>> doesn't that already trigger a TLB invalidation when
>>>>>>>> necessary in
>>>>>>>> the driver
>>>>>>>> today?  It was my understanding that "heavy" TLB
>>>>>>>> invalidations wait
>>>>>>>> for data
>>>>>>>> values to be globally observable before starting, so I
>>>>>>>> think
>>>>>>>> that
>>>>>>>> would ensure
>>>>>>>> that any non-XA data makes it to RAM before any binding
>>>>>>>> changes,
>>>>>>>> object,
>>>>>>>> destruction, etc.?  Is there something special about
>>>>>>>> userptr
>>>>>>>> that
>>>>>>>> makes that
>>>>>>>> case more of a problem?
>>>>>>>>
>>>>>>>> I just found bspec page 74635 which gives an overview of
>>>>>>>> the
>>>>>>>> various flush
>>>>>>>> and invalidate cases, and I don't see anything there that
>>>>>>>> makes it
>>>>>>>> obvious to
>>>>>>>> me that userptr would be special.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As you said, we eventually do want to force a flush
>>>>>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>> non-transient data as well once we're freeing the
>>>>>>>>>>> underlying
>>>>>>>>>>> pages.
>>>>>>>>>>> So how do we do that?  It's not clear to me how the
>>>>>>>>>>> changes
>>>>>>>>>>> below
>>>>>>>>>>> are accomplishing that.  Is there a way to
>>>>>>>>>>> explicitly
>>>>>>>>>>> request
>>>>>>>>>>> a
>>>>>>>>>>> full device cache flush (ignoring the transient vs
>>>>>>>>>>> non-
>>>>>>>>>>> transient tagging)?
>>>>>>>>>>> Since the GuC handles the optimization in the next
>>>>>>>>>>> patch
>>>>>>>>>>> (toggling
>>>>>>>>>>> whether flushes are full flushes vs non-transient
>>>>>>>>>>> flushes
>>>>>>>>>>> depending on whether media is active), I thought
>>>>>>>>>>> there
>>>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>> some kind of GuC interface to request "please do
>>>>>>>>>>> one
>>>>>>>>>>> full
>>>>>>>>>>> flush now, even
>>>>>>>> if media is idle."
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I’m not an expert here by any means, but everything
>>>>>>>>>> above
>>>>>>>>>> from
>>>>>>>>>> Matt
>>>>>>>>>> seems like valid concerns. Thomas also raised some
>>>>>>>>>> concerns in
>>>>>>>>>> the
>>>>>>>>>> two previous revisions; again I’m not an expert, but
>>>>>>>>>> reading
>>>>>>>>>> through
>>>>>>>>>> those, it doesn’t really seem like he received proper
>>>>>>>>>> answers
>>>>>>>>>> to his
>>>>>>>> questions.
>>>>>>>>>
>>>>>>>>> Its forcing flush via tlb invalidation PPC flag under
>>>>>>>>> xe_invalidate_vma( ).
>>>>>>>>
>>>>>>>> By the way, what is "PPC?"  It seems like it's another
>>>>>>>> new
>>>>>>>> synonym
>>>>>>>> for the
>>>>>>>> device cache?  It's already really confusing that some of
>>>>>>>> our
>>>>>>>> hardware docs use
>>>>>>>> a mix of both "L2" and "L3" to refer to the same device
>>>>>>>> cache
>>>>>>>> for
>>>>>>>> historical
>>>>>>>> reasons...
>>>>>>>>
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A couple of comments below.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>>>>>> <tejas.upadhyay@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>     drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>>>     drivers/gpu/drm/xe/xe_device.c  | 23
>>>>>>>>>>>> +++++++++++++++++++++++
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>>>     4 files changed, 28 insertions(+), 2
>>>>>>>>>>>> deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_bo.c index
>>>>>>>>>>>> e9180b01a4e4..4455886b211e
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> @@ -689,7 +689,8 @@ static int
>>>>>>>>>>>> xe_bo_trigger_rebind(struct
>>>>>>>>>>>> xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>>>
>>>>>>>>>>>>     		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>>>     			drm_gpuvm_bo_evict(vm_bo
>>>>>>>>>>>> ,
>>>>>>>>>>>> true);
>>>>>>>>>>>> -			continue;
>>>>>>>>>>>> +			if
>>>>>>>>>>>> (!xe_device_needs_cache_flush(xe))
>>>>>>>>>>>> +				continue;
>>>>>>>
>>>>>>> Matt R,
>>>>>>> This flush will be still needed as there can be non-xa
>>>>>>> buffers
>>>>>>> which
>>>>>>> can be evicted while media was off and stale entries can be
>>>>>>> flushed
>>>>>>> when media comes back on. Which was not case earlier as
>>>>>>> full
>>>>>>> flush
>>>>>>> was happening at regular sync points and that’s where this
>>>>>>> feature is
>>>>>>> bringing optimization now.
>>>>>>>
>>>>>>> Tejas
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This will trigger a TLB invalidation (and I assume a
>>>>>>>>>> cache
>>>>>>>>>> flush)
>>>>>>>>>> every time we move or free memory in the 3D stack if
>>>>>>>>>> it
>>>>>>>>>> has a
>>>>>>>>>> binding. It also performs a synchronous wait on the
>>>>>>>>>> BO
>>>>>>>>>> being
>>>>>>>>>> idle.
>>>>>>>>>> Both of these are very expensive operations. I can’t
>>>>>>>>>> imagine
>>>>>>>>>> the
>>>>>>>>>> granularity we want here is to do this on every
>>>>>>>>>> move/free
>>>>>>>>>> with
>>>>>>>>>> bindings.
>>>>>>>>>>
>>>>>>>>>> Also, for LR compute with preempt fences, we would
>>>>>>>>>> trigger the
>>>>>>>>>> preempt fences during the wait, so a TLB invalidation
>>>>>>>>>> after
>>>>>>>>>> this
>>>>>>>>>> seems unnecessary, though perhaps the cache flush is
>>>>>>>>>> still
>>>>>>>>>> required?
>>>>>>>>>>
>>>>>>>>>> I think this needs a bit more explanation, because
>>>>>>>>>> without
>>>>>>>>>> knowing a
>>>>>>>>>> lot about the exact requirements, the implementation
>>>>>>>>>> does
>>>>>>>>>> not
>>>>>>>>>> look
>>>>>>>> correct.
>>>>>>>>>
>>>>>>>>> The thing is that we are trying to solve problem with
>>>>>>>>> userptr
>>>>>>>>> with non-XA
>>>>>>>> pat, consider if that BO got moved while media is not
>>>>>>>> active.
>>>>>>>> As
>>>>>>>> soon as media
>>>>>>>> will come back active, stale cached entries of that
>>>>>>>> object
>>>>>>>> will be
>>>>>>>> flushed as part
>>>>>>>> of full flush , which may corrupt things.
>>>>>>>>> There was thinking that with this patch we would at
>>>>>>>>> least
>>>>>>>>> solve
>>>>>>>>> the problem
>>>>>>>> of corruption and later when page_reclamation feature
>>>>>>>> comes
>>>>>>>> in will
>>>>>>>> help in
>>>>>>>> performance as well. But now when page reclamation
>>>>>>>> feature is
>>>>>>>> merged earlier
>>>>>>>> and it tightly coupled with bind/unbind some cases like
>>>>>>>> discussed
>>>>>>>> above
>>>>>>>> (which are not doing unbind immediately on move/free) are
>>>>>>>> missed in
>>>>>>>> reclamation.
>>>>>>>>>
>>>>>>>>> So thought was to let this solution go in with little
>>>>>>>>> perf
>>>>>>>>> hit
>>>>>>>>> and discuss with
>>>>>>>> page reclamation owner to come with cleaner solution
>>>>>>>> together.
>>>>>>>>>
>>>>>>>>> Tejas
>>>>>>>>>>
>>>>>>>>>>>>     		}
>>>>>>>>>>>>
>>>>>>>>>>>>     		if (!idle) {
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.c index
>>>>>>>>>>>> 743c18e0c580..da2abed94bc0
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void
>>>>>>>>>>>> tdf_request_sync(struct
>>>>>>>>>>>> xe_device
>>>>>>>>>> *xe)
>>>>>>>>>>>>     	}
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the
>>>>>>>>>>>> cache
>>>>>>>>>>>> needs
>>>>>>>>>>>> to be
>>>>>>>>>>>> +flushed
>>>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Return: true if the device needs cache flush,
>>>>>>>>>>>> false
>>>>>>>>>>>> otherwise.
>>>>>>>>>>>> + */
>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct
>>>>>>>>>>>> xe_device
>>>>>>>>>>>> *xe) {
>>>>>>>>>>>> +	/* XA is *always* flushed, like at the
>>>>>>>>>>>> end-
>>>>>>>>>>>> of-
>>>>>>>>>>>> submssion (and
>>>>>>>>>>>> +maybe
>>>>>>>>>> other
>>>>>>>>>>>> +	 * places), just that internally as an
>>>>>>>>>>>> optimisation hw doesn't
>>>>>>>>>>>> +need to
>>>>>>>>>> make
>>>>>>>>>>>> +	 * that a full flush (which will also
>>>>>>>>>>>> include XA)
>>>>>>>>>>>> when Media is
>>>>>>>>>>>> +	 * off/powergated, since it doesn't need
>>>>>>>>>>>> to
>>>>>>>>>>>> worry
>>>>>>>>>>>> about GT
>>>>>>>>>>>> +caches vs
>>>>>>>>>> Media
>>>>>>>>>>>> +	 * coherency, and only CPU vs GPU
>>>>>>>>>>>> coherency,
>>>>>>>>>>>> so
>>>>>>>>>>>> can make
>>>>>>>> that
>>>>>>>>>>>> +flush
>>>>>>>>>> a
>>>>>>>>>>>> +	 * targeted XA flush, since stuff tagged
>>>>>>>>>>>> with XA
>>>>>>>>>>>> now means
>>>>>>>>>>>> +it's shared
>>>>>>>>>> with
>>>>>>>>>>>> +	 * the CPU. The main implication is that
>>>>>>>>>>>> we
>>>>>>>>>>>> now
>>>>>>>>>>>> need to
>>>>>>>>>>>> +somehow
>>>>>>>>>> flush non-XA before
>>>>>>>>>>>> +	 * freeing system memory pages,
>>>>>>>>>>>> otherwise
>>>>>>>>>>>> dirty
>>>>>>>>>>>> cachelines
>>>>>>>>>>>> +could be
>>>>>>>>>> flushed after the free
>>>>>>>>>>>> +	 * (like if Media suddenly turns on and
>>>>>>>>>>>> does
>>>>>>>>>>>> a
>>>>>>>>>>>> full flush)
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 &&
>>>>>>>>>>>> !IS_DGFX(xe))
>>>>>>>>>>>> +		return true;
>>>>>>>>>>>> +	return false;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>     void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>>> {
>>>>>>>>>>>>     	struct xe_gt *gt;
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.h index
>>>>>>>>>>>> 39464650533b..baf386e0e037
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> @@ -184,6 +184,7 @@ void
>>>>>>>>>>>> xe_device_snapshot_print(struct
>>>>>>>>>>>> xe_device *xe, struct drm_printer *p);
>>>>>>>>>>>>     u64 xe_device_canonicalize_addr(struct
>>>>>>>>>>>> xe_device
>>>>>>>>>>>> *xe, u64
>>>>>>>>>>>> address);
>>>>>>>>>>>>     u64 xe_device_uncanonicalize_addr(struct
>>>>>>>>>>>> xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>>>>>> address);
>>>>>>>>>>>>
>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct
>>>>>>>>>>>> xe_device
>>>>>>>>>>>> *xe);
>>>>>>>>>>>>     void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>>> void
>>>>>>>>>>>> xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c index
>>>>>>>>>>>> e120323c43bc..b435ea7f9b66
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> @@ -114,7 +114,8 @@ static void
>>>>>>>>>>>> __vma_userptr_invalidate(struct
>>>>>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_userptr_vma *uv
>>>>>>>>>>>>     				    false,
>>>>>>>>>>>> MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>>>     	XE_WARN_ON(err <= 0);
>>>>>>>>>>>>
>>>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr-
>>>>>>>>>>>>> initial_bind) {
>>>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) ||
>>>>>>>>>>>> +xe_device_needs_cache_flush(vm-
>>>>>>>>>>> xe)) &&
>>>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>>>
>>>>>>>>>> Same concern with the LR preempt fence as above — the
>>>>>>>>>> hardware
>>>>>>>>>> will
>>>>>>>>>> be interrupted via preempt fences, so it doesn’t seem
>>>>>>>>>> necessary
>>>>>>>>>> to
>>>>>>>>>> invalidate the TLBs but perhaps we need a cflush and
>>>>>>>>>> TLB
>>>>>>>>>> invalidation is the mechanism for that too?
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>>     		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>>>     		XE_WARN_ON(err);
>>>>>>>>>>>>     	}
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.52.0
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Matt Roper
>>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>>> Intel Corporation
>>>>>>>>
>>>>>>>> --
>>>>>>>> Matt Roper
>>>>>>>> Graphics Software Engineer
>>>>>>>> Linux GPU Platform Enablement
>>>>>>>> Intel Corporation
>>>>>