From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B0BEEE7BDA9 for ; Mon, 16 Feb 2026 12:07:28 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 68C0010E329; Mon, 16 Feb 2026 12:07:28 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="aQ3tEc6E"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15]) by gabe.freedesktop.org (Postfix) with ESMTPS id CC18610E303 for ; Mon, 16 Feb 2026 12:07:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1771243647; x=1802779647; h=message-id:subject:from:to:cc:date:in-reply-to: references:content-transfer-encoding:mime-version; bh=NDf8BngrHzSCMu8/f652cHDo+09u8fty+ibjY3td+/Y=; b=aQ3tEc6EjTarZzUFGAfVw7PcO9Vm7es3O8xTRFpU/d8s8Mx5o4Izg8gy qzlVjZ/vxKUV/umkc2CGTiuYnBrWMgyT03x07ET5d+eLyRM9dPYLobUVC ACTj4jE+NRZfu4HEnY/GsID8AS1/3azy2NtHO6aal7/amX9EOVClQQnq+ gTDq8cs8V7VB94BTSf2YaaUJ9bhGBCyCpblKj9She8MghE3tRiW4o2etq jGiEJSkAx0TshRpILpNSuFT7pV3sLXLc+b2QfHBjdlxnOiHLKFdAA5082 wOylaHE7SSc1YL/IL2GPYwXWZjZ50rIcmKeNdxgQrlyAtvxEWc7y+jzSz A==; X-CSE-ConnectionGUID: 7/ZyhL8gQQ+BOyx39Gc6gw== X-CSE-MsgGUID: mJe3ZAs3QGS+p5xX2/1yBg== X-IronPort-AV: E=McAfee;i="6800,10657,11702"; a="75943469" X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="75943469" Received: from fmviesa002.fm.intel.com ([10.60.135.142]) by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Feb 2026 04:07:26 -0800 X-CSE-ConnectionGUID: +TlY5WTvQ2+BYH4vEJ56zQ== X-CSE-MsgGUID: dgLoRtYnS9Ovi+M3sTebSg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,294,1763452800"; d="scan'208";a="236616989" Received: from pgcooper-mobl3.ger.corp.intel.com (HELO [10.245.244.231]) ([10.245.244.231]) by fmviesa002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Feb 2026 04:07:24 -0800 Message-ID: <9ce0acf02e8ef63bf81cfbb9e1053bfa1437362f.camel@linux.intel.com> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually From: Thomas =?ISO-8859-1?Q?Hellstr=F6m?= To: Matthew Auld , Matt Roper , "Souza, Jose" Cc: "Upadhyay, Tejas" , "Mrozek, Michal" , "intel-xe@lists.freedesktop.org" , "Brost, Matthew" Date: Mon, 16 Feb 2026 13:07:22 +0100 In-Reply-To: <8ce35a23-b639-4c4f-acfd-993c4f9d5008@intel.com> References: <20260210125120.1329411-5-tejas.upadhyay@intel.com> <20260210125120.1329411-6-tejas.upadhyay@intel.com> <20260210210525.GC4694@mdroper-desk1.amr.corp.intel.com> <20260211211125.GL4694@mdroper-desk1.amr.corp.intel.com> <20260213171638.GC52346@mdroper-desk1.amr.corp.intel.com> <75dcc80b39ed33a7abc620b2614b0e81586a6299.camel@linux.intel.com> <8ce35a23-b639-4c4f-acfd-993c4f9d5008@intel.com> Organization: Intel Sweden AB, Registration Number: 556189-6027 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.58.2 (3.58.2-1.fc43) MIME-Version: 1.0 X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote: > On 16/02/2026 10:23, Thomas Hellstr=C3=B6m wrote: > > On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote: > > > On 13/02/2026 17:16, Matt Roper wrote: > > > > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote: > > > > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote: > > > > > >=20 > > > > > >=20 > > > > > > > -----Original Message----- > > > > > > > From: Roper, Matthew D > > > > > > > Sent: 12 February 2026 02:41 > > > > > > > To: Upadhyay, Tejas > > > > > > > Cc: Brost, Matthew ; intel- > > > > > > > xe@lists.freedesktop.org; Auld, Matthew > > > > > > > ; > > > > > > > thomas.hellstrom@linux.intel.com > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush > > > > > > > userptr/shrinker bo > > > > > > > cachelines manually > > > > > > >=20 > > > > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas > > > > > > > wrote: > > > > > > > >=20 > > > > > > > >=20 > > > > > > > > > -----Original Message----- > > > > > > > > > From: Brost, Matthew > > > > > > > > > Sent: 11 February 2026 05:32 > > > > > > > > > To: Roper, Matthew D > > > > > > > > > Cc: Upadhyay, Tejas ; > > > > > > > > > intel- > > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew > > > > > > > > > ; > > > > > > > > > thomas.hellstrom@linux.intel.com > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush > > > > > > > > > userptr/shrinker bo > > > > > > > > > cachelines manually > > > > > > > > >=20 > > > > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper > > > > > > > > > wrote: > > > > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas > > > > > > > > > > Upadhyay > > > > > > > > > > wrote: > > > > > > > > > > > "eXtended Architecture" (XA) tagged memory=E2=80=94me= mory > > > > > > > > > > > shared > > > > > > > between > > > > > > > > > the > > > > > > > > > > > CPU and GPU > > > > > > > > > >=20 > > > > > > > > > > I'm pretty sure this expansion of "XA" is wrong; > > > > > > > > > > where > > > > > > > > > > are > > > > > > > > > > you > > > > > > > > > > seeing this definition?=C2=A0 Everything in the bspec > > > > > > > > > > indicates > > > > > > > > > > that XA > > > > > > > > > > means "wb > > > > > > > > > > - transient app" (similar to how "XD" is 'wb - > > > > > > > > > > transient > > > > > > > > > > display"). > > > > > > > > > > I'm not sure why exactly they picked "X" to refer > > > > > > > > > > to > > > > > > > > > > transient in > > > > > > > > > > both of these cases, but I've never seen any > > > > > > > > > > documentation > > > > > > > > > > that > > > > > > > > > > refers to it as "extended." > > > > > > > > > >=20 > > > > > > > > > > > is treated differently from other GPU memory when > > > > > > > > > > > the > > > > > > > > > > > Media > > > > > > > > > > > engine is > > > > > > > > > power-gated. > > > > > > > > > > >=20 > > > > > > > > > > > XA is *always* flushed, like at the end-of- > > > > > > > > > > > submssion > > > > > > > > > > > (and > > > > > > > > > > > maybe > > > > > > > > > > > other > > > > > > > > > >=20 > > > > > > > > > > I assume you're referring to the fact that the > > > > > > > > > > driver > > > > > > > > > > performs > > > > > > > > > > flushes at the end of submission (via PIPE_CONTROL > > > > > > > > > > or > > > > > > > > > > MI_FLUSH_DW), and that depending on other > > > > > > > > > > state/optimizations > > > > > > > > > > in > > > > > > > > > > the system, those flushes may flush the entire > > > > > > > > > > device > > > > > > > > > > cache, > > > > > > > > > > or > > > > > > > > > > may only flush the subset of cache data that is not > > > > > > > > > > marked as > > > > > > > > > > transient.=C2=A0 The way you worded this was confusing > > > > > > > > > > since > > > > > > > > > > it > > > > > > > > > > makes > > > > > > > > > > it sound like cache flushes happen automatically > > > > > > > > > > somewhere in > > > > > > > hardware/firmware. > > > > > > > > > >=20 > > > > > > > > > > > places), just that internally as an optimisation > > > > > > > > > > > hw > > > > > > > > > > > doesn't > > > > > > > > > > > need > > > > > > > > > > > to make that a full flush (which will also > > > > > > > > > > > include > > > > > > > > > > > XA) when > > > > > > > > > > > Media is off/powergated, since it doesn't need to > > > > > > > > > > > worry > > > > > > > > > > > about GT > > > > > > > > > > > caches vs Media coherency, and only CPU vs GPU > > > > > > > > > > > coherency, > > > > > > > > > > > so can > > > > > > > > > > > make that flush a targeted XA flush, since stuff > > > > > > > > > > > tagged > > > > > > > > > > > with XA > > > > > > > > > > > now means it's shared with the CPU. The main > > > > > > > > > > > implication is > > > > > > > > > > > that > > > > > > > > > > > we now need to somehow flush non-XA before > > > > > > > > > > > freeing > > > > > > > > > > > system > > > > > > > > > > > memory > > > > > > > > > > > pages, otherwise dirty cachelines could be > > > > > > > > > > > flushed > > > > > > > > > > > after > > > > > > > > > > > the > > > > > > > > > > > free (like if Media suddenly turns on and does a > > > > > > > > > > > full > > > > > > > > > > > flush) > > > > > > > > > >=20 > > > > > > > > > > This description seems really confusing.=C2=A0 My > > > > > > > > > > understanding is > > > > > > > > > > that > > > > > > > > > > marking something as wb-transient-app indicates > > > > > > > > > > that it > > > > > > > > > > might > > > > > > > > > > be > > > > > > > > > > accessed by something other than our graphics/media > > > > > > > > > > IP > > > > > > > > > > (i.e., > > > > > > > > > > accessed from the CPU, exported to another device, > > > > > > > > > > etc.), so > > > > > > > > > > transient data truly does need to be flushed at the > > > > > > > > > > points in > > > > > > > > > > the > > > > > > > > > > driver where a flush typically happens. > > > > > > > > > >=20 > > > > > > > > > > However when something is _not_ transient, then > > > > > > > > > > either: > > > > > > > > > > =C2=A0=C2=A0=C2=A0- it's "private" to the GPU and only = our > > > > > > > > > > graphics/media IP > > > > > > > > > > will be > > > > > > > > > > =C2=A0=C2=A0=C2=A0=C2=A0 accessing it > > > > > > > > > > =C2=A0=C2=A0=C2=A0- it's bound with a coherent PAT inde= x so that > > > > > > > > > > outside > > > > > > > > > > observers like > > > > > > > > > > =C2=A0=C2=A0=C2=A0=C2=A0 the CPU can snoop the device c= ache, even when > > > > > > > > > > the > > > > > > > > > > cache > > > > > > > > > > hasn't been > > > > > > > > > > =C2=A0=C2=A0=C2=A0=C2=A0 flushed > > > > > > > > > >=20 > > > > > > > > > > If media is not active, then there's really no need > > > > > > > > > > to > > > > > > > > > > include > > > > > > > > > > non-transient data when an device cache flush > > > > > > > > > > happens > > > > > > > > > > since > > > > > > > > > > there's no real need for the data to get to RAM.=C2=A0 > > > > > > > > > > So > > > > > > > > > > that > > > > > > > > > > enables > > > > > > > > > > an optimization (which comes in your next patch), > > > > > > > > > > that > > > > > > > > > > allows > > > > > > > > > > flushes to only operate on the subset of the device > > > > > > > > > > cache > > > > > > > > > > tagged as > > > > > > > "transient" if media is idle. > > > > > > > >=20 > > > > > > > > But what If we have stale non-XA marked pages for > > > > > > > > userptr, > > > > > > > > and > > > > > > > > that > > > > > > > > object moves out and at the same time media comes back, > > > > > > > > will end > > > > > > > > up in > > > > > > > > full flush and flush the stale entry to RAM. > > > > > > >=20 > > > > > > > What makes userptr special here?=C2=A0 During general, active > > > > > > > usage, > > > > > > > userptr would > > > > > > > be data that's accessible by the CPU, so it needs to > > > > > > > either > > > > > > > be > > > > > > > transient (so CPU > > > > > > > can see the data in RAM after explicit flushes) or it > > > > > > > needs > > > > > > > to be > > > > > > > using a > > > > > > > coherent PAT (so that the CPU can just snoop the GPU > > > > > > > cache). > > > > > > > If > > > > > > > you marked > > > > > > > userptr as both non-XA and non-coherent, then that sounds > > > > > > > likely to > > > > > > > be a > > > > > > > userspace bug (and probably something we can catch and > > > > > > > reject > > > > > > > as an > > > > > > > invalid > > > > > > > case on any Xe3p or later platforms that support this) > > > > > > > since > > > > > > > the > > > > > > > CPU wouldn't > > > > > > > have any reliable way of seeing GPU updates. > > > > > >=20 > > > > > > Right. FYI @Mrozek, Michal @Souza, Jose > > > > > > For userptr, as explained above, it needs to be either > > > > > > coherent > > > > > > or XA > > > > > > pat index, or else KMD will reject as invalid case. > > > > > >=20 > > > > > >=20 > > > > > The coherency restriction is already in the uAPI: > > > > >=20 > > > > > "Note: For userptr and externally imported dma-buf the kernel > > > > > expects > > > > > either 1WAY or 2WAY for the @pat_index." > > > > >=20 > > > > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing > > > > > GPU > > > > > caches > > > > > at the end of batch buffers. > > > >=20 > > > > But isn't that what we're discussing here?=C2=A0 1-way *won't* > > > > necessarily be > > > > enough anymore because PIPE_CONTROL instructions don't flush > > > > the > > > > entire > > > > cache anymore.=C2=A0 Whenever the GuC determines that media is > > > > inactive > > > > and > > > > activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. > > > > change > > > > behavior to only flush out the subset of data that was marked > > > > as > > > > app-transient; anything not marked that way doesn't get flushed > > > > now.=C2=A0 So > > > > there's a new requirement here that you ensure you're using an > > > > XA > > > > PAT > > > > index, or you switch to use 2-way coherency which will allow > > > > the > > > > CPU to > > > > snoop the GPU's caches. > > >=20 > > > That exactly matches my understanding also. > >=20 > > This only ever affects IGFX, right? Since AFAIU we don't have 2-way > > coherency with DGFX? >=20 > Yeah, this should be igpu only. I seem to also recall that on dgpu,=20 > Media is coherent with l2/l3, but also I don't think system memory > can=20 > be cached in l2/l3 (only VRAM), which I assume is why there is the=20 > special SMRO (system-memory-read-only) cache only on dgpu, which is=20 > flushed when the fence signals, unlike the l2/l3. Yes that sounds reasonable. >=20 > >=20 > > It sounds like the same PAT restriction is needed also for imported > > dma-buf, right? >=20 > Good point. Looks like we are missing that still. Otherwise we can > run=20 > into the same issues with stale l2/l3/ppc. So if this affects only system memory could we instead of relying on 2- way coherency or XA, just flush at dma unmap time, because that's typically just before releasing the pages. The exception, though, is dma-buf where the exporter can actually release memory before all importers have given up their dma-mappings. /Thomas >=20 > >=20 > > /Thomas > >=20 > >=20 > > >=20 > > > >=20 > > > >=20 > > > > Matt > > > >=20 > > > > >=20 > > > > > >=20 > > > > > > >=20 > > > > > > > If something happens that changes the GTT mapping of an > > > > > > > object, > > > > > > > then > > > > > > > doesn't that already trigger a TLB invalidation when > > > > > > > necessary in > > > > > > > the driver > > > > > > > today?=C2=A0 It was my understanding that "heavy" TLB > > > > > > > invalidations wait > > > > > > > for data > > > > > > > values to be globally observable before starting, so I > > > > > > > think > > > > > > > that > > > > > > > would ensure > > > > > > > that any non-XA data makes it to RAM before any binding > > > > > > > changes, > > > > > > > object, > > > > > > > destruction, etc.?=C2=A0 Is there something special about > > > > > > > userptr > > > > > > > that > > > > > > > makes that > > > > > > > case more of a problem? > > > > > > >=20 > > > > > > > I just found bspec page 74635 which gives an overview of > > > > > > > the > > > > > > > various flush > > > > > > > and invalidate cases, and I don't see anything there that > > > > > > > makes it > > > > > > > obvious to > > > > > > > me that userptr would be special. > > > > > > >=20 > > > > > > >=20 > > > > > > > >=20 > > > > > > > > > >=20 > > > > > > > > > > As you said, we eventually do want to force a flush > > > > > > > > > > of > > > > > > > > > > the > > > > > > > > > > non-transient data as well once we're freeing the > > > > > > > > > > underlying > > > > > > > > > > pages. > > > > > > > > > > So how do we do that?=C2=A0 It's not clear to me how th= e > > > > > > > > > > changes > > > > > > > > > > below > > > > > > > > > > are accomplishing that.=C2=A0 Is there a way to > > > > > > > > > > explicitly > > > > > > > > > > request > > > > > > > > > > a > > > > > > > > > > full device cache flush (ignoring the transient vs > > > > > > > > > > non- > > > > > > > > > > transient tagging)? > > > > > > > > > > Since the GuC handles the optimization in the next > > > > > > > > > > patch > > > > > > > > > > (toggling > > > > > > > > > > whether flushes are full flushes vs non-transient > > > > > > > > > > flushes > > > > > > > > > > depending on whether media is active), I thought > > > > > > > > > > there > > > > > > > > > > might > > > > > > > > > > be > > > > > > > > > > some kind of GuC interface to request "please do > > > > > > > > > > one > > > > > > > > > > full > > > > > > > > > > flush now, even > > > > > > > if media is idle." > > > > > > > > > >=20 > > > > > > > > >=20 > > > > > > > > > I=E2=80=99m not an expert here by any means, but everythi= ng > > > > > > > > > above > > > > > > > > > from > > > > > > > > > Matt > > > > > > > > > seems like valid concerns. Thomas also raised some > > > > > > > > > concerns in > > > > > > > > > the > > > > > > > > > two previous revisions; again I=E2=80=99m not an expert, = but > > > > > > > > > reading > > > > > > > > > through > > > > > > > > > those, it doesn=E2=80=99t really seem like he received pr= oper > > > > > > > > > answers > > > > > > > > > to his > > > > > > > questions. > > > > > > > >=20 > > > > > > > > Its forcing flush via tlb invalidation PPC flag under > > > > > > > > xe_invalidate_vma( ). > > > > > > >=20 > > > > > > > By the way, what is "PPC?"=C2=A0 It seems like it's another > > > > > > > new > > > > > > > synonym > > > > > > > for the > > > > > > > device cache?=C2=A0 It's already really confusing that some o= f > > > > > > > our > > > > > > > hardware docs use > > > > > > > a mix of both "L2" and "L3" to refer to the same device > > > > > > > cache > > > > > > > for > > > > > > > historical > > > > > > > reasons... > > > > > > >=20 > > > > > > >=20 > > > > > > > Matt > > > > > > >=20 > > > > > > > >=20 > > > > > > > > >=20 > > > > > > > > > A couple of comments below. > > > > > > > > >=20 > > > > > > > > > >=20 > > > > > > > > > > Matt > > > > > > > > > >=20 > > > > > > > > > > >=20 > > > > > > > > > > > V2(MattA): Expand commit description > > > > > > > > > > >=20 > > > > > > > > > > > Signed-off-by: Tejas Upadhyay > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > =C2=A0=C2=A0=C2=A0drivers/gpu/drm/xe/xe_bo.c=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 |=C2=A0 3 ++- > > > > > > > > > > > =C2=A0=C2=A0=C2=A0drivers/gpu/drm/xe/xe_device.c=C2= =A0 | 23 > > > > > > > > > > > +++++++++++++++++++++++ > > > > > > > > > > > drivers/gpu/drm/xe/xe_device.h=C2=A0 |=C2=A0 1 + > > > > > > > > > > > drivers/gpu/drm/xe/xe_userptr.c |=C2=A0 3 ++- > > > > > > > > > > > =C2=A0=C2=A0=C2=A04 files changed, 28 insertions(+), = 2 > > > > > > > > > > > deletions(-) > > > > > > > > > > >=20 > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_bo.c index > > > > > > > > > > > e9180b01a4e4..4455886b211e > > > > > > > > > > > 100644 > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c > > > > > > > > > > > @@ -689,7 +689,8 @@ static int > > > > > > > > > > > xe_bo_trigger_rebind(struct > > > > > > > > > > > xe_device *xe, struct xe_bo *bo, > > > > > > > > > > >=20 > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 if (!xe_vm_in_fault_mode(vm)) { > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 drm_gpuvm_bo_evict(vm_bo > > > > > > > > > > > , > > > > > > > > > > > true); > > > > > > > > > > > - continue; > > > > > > > > > > > + if > > > > > > > > > > > (!xe_device_needs_cache_flush(xe)) > > > > > > > > > > > + continue; > > > > > >=20 > > > > > > Matt R, > > > > > > This flush will be still needed as there can be non-xa > > > > > > buffers > > > > > > which > > > > > > can be evicted while media was off and stale entries can be > > > > > > flushed > > > > > > when media comes back on. Which was not case earlier as > > > > > > full > > > > > > flush > > > > > > was happening at regular sync points and that=E2=80=99s where t= his > > > > > > feature is > > > > > > bringing optimization now. > > > > > >=20 > > > > > > Tejas > > > > > >=20 > > > > > > > > >=20 > > > > > > > > > This will trigger a TLB invalidation (and I assume a > > > > > > > > > cache > > > > > > > > > flush) > > > > > > > > > every time we move or free memory in the 3D stack if > > > > > > > > > it > > > > > > > > > has a > > > > > > > > > binding. It also performs a synchronous wait on the > > > > > > > > > BO > > > > > > > > > being > > > > > > > > > idle. > > > > > > > > > Both of these are very expensive operations. I can=E2=80= =99t > > > > > > > > > imagine > > > > > > > > > the > > > > > > > > > granularity we want here is to do this on every > > > > > > > > > move/free > > > > > > > > > with > > > > > > > > > bindings. > > > > > > > > >=20 > > > > > > > > > Also, for LR compute with preempt fences, we would > > > > > > > > > trigger the > > > > > > > > > preempt fences during the wait, so a TLB invalidation > > > > > > > > > after > > > > > > > > > this > > > > > > > > > seems unnecessary, though perhaps the cache flush is > > > > > > > > > still > > > > > > > > > required? > > > > > > > > >=20 > > > > > > > > > I think this needs a bit more explanation, because > > > > > > > > > without > > > > > > > > > knowing a > > > > > > > > > lot about the exact requirements, the implementation > > > > > > > > > does > > > > > > > > > not > > > > > > > > > look > > > > > > > correct. > > > > > > > >=20 > > > > > > > > The thing is that we are trying to solve problem with > > > > > > > > userptr > > > > > > > > with non-XA > > > > > > > pat, consider if that BO got moved while media is not > > > > > > > active. > > > > > > > As > > > > > > > soon as media > > > > > > > will come back active, stale cached entries of that > > > > > > > object > > > > > > > will be > > > > > > > flushed as part > > > > > > > of full flush , which may corrupt things. > > > > > > > > There was thinking that with this patch we would at > > > > > > > > least > > > > > > > > solve > > > > > > > > the problem > > > > > > > of corruption and later when page_reclamation feature > > > > > > > comes > > > > > > > in will > > > > > > > help in > > > > > > > performance as well. But now when page reclamation > > > > > > > feature is > > > > > > > merged earlier > > > > > > > and it tightly coupled with bind/unbind some cases like > > > > > > > discussed > > > > > > > above > > > > > > > (which are not doing unbind immediately on move/free) are > > > > > > > missed in > > > > > > > reclamation. > > > > > > > >=20 > > > > > > > > So thought was to let this solution go in with little > > > > > > > > perf > > > > > > > > hit > > > > > > > > and discuss with > > > > > > > page reclamation owner to come with cleaner solution > > > > > > > together. > > > > > > > >=20 > > > > > > > > Tejas > > > > > > > > >=20 > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 } > > > > > > > > > > >=20 > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 if (!idle) { > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_device.c index > > > > > > > > > > > 743c18e0c580..da2abed94bc0 > > > > > > > > > > > 100644 > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c > > > > > > > > > > > @@ -1097,6 +1097,29 @@ static void > > > > > > > > > > > tdf_request_sync(struct > > > > > > > > > > > xe_device > > > > > > > > > *xe) > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 } > > > > > > > > > > > =C2=A0=C2=A0=C2=A0} > > > > > > > > > > >=20 > > > > > > > > > > > +/** > > > > > > > > > > > + * xe_device_needs_cache_flush - Whether the > > > > > > > > > > > cache > > > > > > > > > > > needs > > > > > > > > > > > to be > > > > > > > > > > > +flushed > > > > > > > > > > > + * @xe: The device to check. > > > > > > > > > > > + * > > > > > > > > > > > + * Return: true if the device needs cache flush, > > > > > > > > > > > false > > > > > > > > > > > otherwise. > > > > > > > > > > > + */ > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct > > > > > > > > > > > xe_device > > > > > > > > > > > *xe) { > > > > > > > > > > > + /* XA is *always* flushed, like at the > > > > > > > > > > > end- > > > > > > > > > > > of- > > > > > > > > > > > submssion (and > > > > > > > > > > > +maybe > > > > > > > > > other > > > > > > > > > > > + * places), just that internally as an > > > > > > > > > > > optimisation hw doesn't > > > > > > > > > > > +need to > > > > > > > > > make > > > > > > > > > > > + * that a full flush (which will also > > > > > > > > > > > include XA) > > > > > > > > > > > when Media is > > > > > > > > > > > + * off/powergated, since it doesn't need > > > > > > > > > > > to > > > > > > > > > > > worry > > > > > > > > > > > about GT > > > > > > > > > > > +caches vs > > > > > > > > > Media > > > > > > > > > > > + * coherency, and only CPU vs GPU > > > > > > > > > > > coherency, > > > > > > > > > > > so > > > > > > > > > > > can make > > > > > > > that > > > > > > > > > > > +flush > > > > > > > > > a > > > > > > > > > > > + * targeted XA flush, since stuff tagged > > > > > > > > > > > with XA > > > > > > > > > > > now means > > > > > > > > > > > +it's shared > > > > > > > > > with > > > > > > > > > > > + * the CPU. The main implication is that > > > > > > > > > > > we > > > > > > > > > > > now > > > > > > > > > > > need to > > > > > > > > > > > +somehow > > > > > > > > > flush non-XA before > > > > > > > > > > > + * freeing system memory pages, > > > > > > > > > > > otherwise > > > > > > > > > > > dirty > > > > > > > > > > > cachelines > > > > > > > > > > > +could be > > > > > > > > > flushed after the free > > > > > > > > > > > + * (like if Media suddenly turns on and > > > > > > > > > > > does > > > > > > > > > > > a > > > > > > > > > > > full flush) > > > > > > > > > > > + */ > > > > > > > > > > > + if (GRAPHICS_VER(xe) >=3D 35 && > > > > > > > > > > > !IS_DGFX(xe)) > > > > > > > > > > > + return true; > > > > > > > > > > > + return false; > > > > > > > > > > > +} > > > > > > > > > > > + > > > > > > > > > > > =C2=A0=C2=A0=C2=A0void xe_device_l2_flush(struct xe_d= evice *xe)=C2=A0 > > > > > > > > > > > { > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 struct xe_gt *gt; > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_device.h index > > > > > > > > > > > 39464650533b..baf386e0e037 > > > > > > > > > > > 100644 > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h > > > > > > > > > > > @@ -184,6 +184,7 @@ void > > > > > > > > > > > xe_device_snapshot_print(struct > > > > > > > > > > > xe_device *xe, struct drm_printer *p); > > > > > > > > > > > =C2=A0=C2=A0=C2=A0u64 xe_device_canonicalize_addr(str= uct > > > > > > > > > > > xe_device > > > > > > > > > > > *xe, u64 > > > > > > > > > > > address); > > > > > > > > > > > =C2=A0=C2=A0=C2=A0u64 xe_device_uncanonicalize_addr(s= truct > > > > > > > > > > > xe_device > > > > > > > > > > > *xe, > > > > > > > > > > > u64 > > > > > > > > > > > address); > > > > > > > > > > >=20 > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct > > > > > > > > > > > xe_device > > > > > > > > > > > *xe); > > > > > > > > > > > =C2=A0=C2=A0=C2=A0void xe_device_td_flush(struct xe_d= evice *xe); > > > > > > > > > > > void > > > > > > > > > > > xe_device_l2_flush(struct xe_device *xe); > > > > > > > > > > >=20 > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_userptr.c index > > > > > > > > > > > e120323c43bc..b435ea7f9b66 > > > > > > > > > > > 100644 > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c > > > > > > > > > > > @@ -114,7 +114,8 @@ static void > > > > > > > > > > > __vma_userptr_invalidate(struct > > > > > > > > > > > xe_vm > > > > > > > > > *vm, struct xe_userptr_vma *uv > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0 false, > > > > > > > > > > > MAX_SCHEDULE_TIMEOUT); > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 XE_WARN_ON(err <=3D 0); > > > > > > > > > > >=20 > > > > > > > > > > > - if (xe_vm_in_fault_mode(vm) && userptr- > > > > > > > > > > > > initial_bind) { > > > > > > > > > > > + if ((xe_vm_in_fault_mode(vm) || > > > > > > > > > > > +xe_device_needs_cache_flush(vm- > > > > > > > > > > xe)) && > > > > > > > > > > > + =C2=A0=C2=A0=C2=A0 userptr->initial_bind) { > > > > > > > > >=20 > > > > > > > > > Same concern with the LR preempt fence as above =E2=80=94= the > > > > > > > > > hardware > > > > > > > > > will > > > > > > > > > be interrupted via preempt fences, so it doesn=E2=80=99t = seem > > > > > > > > > necessary > > > > > > > > > to > > > > > > > > > invalidate the TLBs but perhaps we need a cflush and > > > > > > > > > TLB > > > > > > > > > invalidation is the mechanism for that too? > > > > > > > > >=20 > > > > > > > > > Matt > > > > > > > > >=20 > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 err =3D xe_vm_invalidate_vma(vma)= ; > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 XE_WARN_ON(err); > > > > > > > > > > > =C2=A0=C2=A0=C2=A0 } > > > > > > > > > > > -- > > > > > > > > > > > 2.52.0 > > > > > > > > > > >=20 > > > > > > > > > >=20 > > > > > > > > > > -- > > > > > > > > > > Matt Roper > > > > > > > > > > Graphics Software Engineer > > > > > > > > > > Linux GPU Platform Enablement > > > > > > > > > > Intel Corporation > > > > > > >=20 > > > > > > > -- > > > > > > > Matt Roper > > > > > > > Graphics Software Engineer > > > > > > > Linux GPU Platform Enablement > > > > > > > Intel Corporation > > > >=20