Subject: Re: [PATCH] drm/ttm: WIP limit the TTM pool to 32bit CPUs
From: Thomas Hellström
To: Christian König, dri-devel@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org
Cc: airlied@gmail.com, matthew.brost@intel.com
Date: Mon, 11 Aug 2025 17:16:10 +0200
In-Reply-To: <76b287bf-9698-4df2-ac20-70b178a3a7c1@amd.com>
References: <20250806132838.1831-1-christian.koenig@amd.com>
	<3ff97e0ee433817c0c071c264d3a28622d717dfa.camel@linux.intel.com>
	<76b287bf-9698-4df2-ac20-70b178a3a7c1@amd.com>

On Mon, 2025-08-11 at 13:51 +0200, Christian König wrote:
> On 07.08.25 18:47, Thomas Hellström wrote:
> > > Well it's surprising that even modern CPUs do stuff like that.
> > > That could explain some of the problems we had with uncached
> > > mappings on ARM and RISC-V.
> > 
> > Yeah, I agree. Need to double-check with HW people whether that is
> > gone with Panther Lake. Don't have a confirmation yet on that.
> 
> Going to ask around AMD internally as well.
> 
> > If it wasn't for the writeback of speculative prefetches we
> > could've settled for only having TTM map the user-space mappings
> > without changing the kernel map, just like the i915 driver does
> > for older GPUs.
> 
> We should probably come up with a CPU whitelist or blacklist where
> that is actually needed.
> 
> > > > Second, IIRC vm_insert_pfn_prot() on x86 will override the
> > > > given caching mode with the last caching mode set for the
> > > > kernel linear map, so if you try to set up a write-combined
> > > > GPU mapping without a previous call to set_pages_xxxxx() it
> > > > will actually end up cached. See track_pfn_insert().
> > > 
> > > That is exactly the same incorrect assumption I made as well.
> > > 
> > > It's not the linear mapping where that comes from but a separate
> > > page attribute table, see
> > > /sys/kernel/debug/x86/pat_memtype_list.
> > > 
> > > Question is why the heck should we do this? I mean we keep an
> > > extra rb tree around to overwrite something the driver knows in
> > > the first place?
> > > 
> > > That is basically just tons of extra overhead for nothing as far
> > > as I can see.
> > 
> > IIRC it was the PAT people enforcing the x86 documentation that
> > aliased mappings with conflicting caching attributes are not
> > allowed. But it has proven to work at least on those CPUs not
> > suffering from the clean cache-line writeback mentioned above.
> 
> Makes sense. With the PAT handling even accessing things through
> /dev/mem gives you the right caching.
> 
> Do you have a list of Intel CPUs where it works?

AFAIK LNL is the only one so far where it doesn't work. I need to get
more information internally.

> 
> > FWIW, if I understand the code correctly, i915 bypasses this by
> > setting up user-space mappings not with vm_insert_pfn_prot() but
> > with apply_to_page_range(), mapping the whole bo.
> 
> Yeah, that's probably not something we can do. Even filling in 2MiB
> of address space at a time caused performance problems for TTM.

Wasn't that because of repeated calls to vmf_insert_pfn_prot(),
repeating the caching checks and the page-table walk each time? I
think apply_to_page_range() should be pretty fast.

Also, to avoid regressions due to changing the number of prefaulted
pages, we could perhaps honor the MADV_RANDOM and MADV_SEQUENTIAL
advice for UMDs to use: one faulting a single page only, the other
faulting the whole bo. A rough sketch of what I mean is appended
after my sign-off. But also see below:

> 
> We should probably just drop overriding the attributes in
> vmf_insert_pfn_prot().

I think either solution will see resistance from arch people. We
should probably involve them in the discussion.

Thanks,
/Thomas
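PS: To make the MADV_RANDOM / MADV_SEQUENTIAL idea above a bit more
concrete, below is a minimal sketch of what a driver fault handler
could look like. To be clear, this is not existing TTM code:
ttm_bo_vm_reserve() and ttm_bo_vm_fault_reserved() are today's TTM
helpers, but the madvise()-based prefault policy is the hypothetical
part, and my_bo_vm_fault() is a made-up name:

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/dma-resv.h>
#include <drm/ttm/ttm_bo.h>

static vm_fault_t my_bo_vm_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct ttm_buffer_object *bo = vma->vm_private_data;
	pgoff_t num_prefault;
	vm_fault_t ret;

	/* madvise(MADV_RANDOM) sets VM_RAND_READ on the vma,
	 * madvise(MADV_SEQUENTIAL) sets VM_SEQ_READ.
	 */
	if (vma->vm_flags & VM_RAND_READ)
		num_prefault = 1;
	else if (vma->vm_flags & VM_SEQ_READ)
		num_prefault = PFN_UP(bo->base.size);
	else
		num_prefault = TTM_BO_VM_NUM_PREFAULT;

	ret = ttm_bo_vm_reserve(bo, vmf);
	if (ret)
		return ret;

	ret = ttm_bo_vm_fault_reserved(vmf, vma->vm_page_prot,
				       num_prefault);
	if (ret == VM_FAULT_RETRY &&
	    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
		return ret;

	dma_resv_unlock(bo->base.resv);
	return ret;
}

That keeps today's default prefault behaviour for unmodified
userspace, so the number of prefaulted pages only changes when a UMD
explicitly opts in.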
> 
> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > > 
> > > Thanks for taking a look,
> > > Christian.
> > > 
> > > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > ---
> > > > >  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
> > > > >  1 file changed, 11 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > index baf27c70a419..7487eac29398 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > @@ -38,7 +38,7 @@
> > > > >  #include <linux/highmem.h>
> > > > >  #include <linux/sched/mm.h>
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  #include <asm/set_memory.h>
> > > > >  #endif
> > > > >  
> > > > > @@ -46,6 +46,7 @@
> > > > >  #include <drm/ttm/ttm_pool.h>
> > > > >  #include <drm/ttm/ttm_tt.h>
> > > > >  #include <drm/ttm/ttm_bo.h>
> > > > > +#include <drm/drm_cache.h>
> > > > >  
> > > > >  #include "ttm_module.h"
> > > > >  
> > > > > @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  	struct ttm_pool_dma *dma;
> > > > >  	void *vaddr;
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* We don't care that set_pages_wb is inefficient here. This is only
> > > > >  	 * used when we have to shrink and CPU overhead is irrelevant then.
> > > > >  	 */
> > > > > @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  /* Apply any cpu-caching deferred during page allocation */
> > > > >  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  {
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > >  
> > > > >  	if (!num_pages)
> > > > > @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  	case ttm_uncached:
> > > > >  		return set_pages_array_uc(alloc->caching_divide, num_pages);
> > > > >  	}
> > > > > +
> > > > > +#elif defined(CONFIG_X86_64)
> > > > > +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > > +
> > > > > +	drm_clflush_pages(alloc->caching_divide, num_pages);
> > > > >  #endif
> > > > >  	alloc->caching_divide = alloc->pages;
> > > > >  	return 0;
> > > > > @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> > > > >  	if (pool->use_dma_alloc)
> > > > >  		return &pool->caching[caching].orders[order];
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	switch (caching) {
> > > > >  	case ttm_write_combined:
> > > > >  		if (pool->nid != NUMA_NO_NODE)
> > > > > @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > > >  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
> > > > >  		return -EBUSY;
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* Anything returned to the system needs to be cached. */
> > > > >  	if (tt->caching != ttm_cached)
> > > > >  		set_pages_array_wb(tt->pages, tt->num_pages);
> > > > 
> > > 
> > 
> 
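PPS: On the UMD side, opting in to the prefault policy sketched in
the PS above would then just be an madvise() call on the mapping.
Again a hypothetical sketch: map_bo_with_hint() is a made-up helper,
and the mmap offset is assumed to come from the driver's usual
mmap-offset ioctl:

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

static void *map_bo_with_hint(int drm_fd, uint64_t mmap_offset,
			      size_t bo_size, int random_access)
{
	void *ptr = mmap(NULL, bo_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, drm_fd, (off_t)mmap_offset);

	if (ptr == MAP_FAILED)
		return NULL;

	/* MADV_RANDOM: fault one page at a time.
	 * MADV_SEQUENTIAL: fault the whole bo on first access.
	 * The hint is advisory, so failure here is not fatal.
	 */
	(void)madvise(ptr, bo_size,
		      random_access ? MADV_RANDOM : MADV_SEQUENTIAL);

	return ptr;
}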