dri-devel.lists.freedesktop.org archive mirror
* [PATCH v3 0/5] Improving the worst case TTM large allocation latency
@ 2025-10-08 11:53 Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 1/5] drm/ttm: Add getter for some pool properties Tvrtko Ursulin
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Alex Deucher, Christian König,
	Danilo Krummrich, Dave Airlie, Gerd Hoffmann, Joonas Lahtinen,
	Lucas De Marchi, Lyude Paul, Maarten Lankhorst, Maxime Ripard,
	Rodrigo Vivi, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Hellström, Thomas Zimmermann, Zack Rusin

Disclaimer:
Please note that, as this series includes a patch which touches a good number
of drivers, I will only copy everyone on the cover letter and the respective
patch. The assumption is that people are subscribed to dri-devel and so can
look at the whole series there. I know someone is bound to complain either
way, whether everyone is copied on everything and gets too much email or about
this narrower approach, so please be flexible.

Description:

All drivers which use the TTM pool allocator end up requesting large order
allocations when allocating large buffers. Those can be slow due to memory
pressure and so add latency to buffer creation. But there is often also a size
limit above which contiguous blocks do not bring any performance benefits. This
series allows drivers to say when it is okay for TTM to try a bit less hard.

We do this by allowing drivers to specify this cut-off point when creating the
TTM device and pools. Allocations above this size will skip direct reclaim, so
worst case latency under memory pressure will improve. Background reclaim is
still kicked off, and both before and after memory pressure all the TTM pool
buckets remain in use as they are today.

This is especially interesting if someone has configured MAX_PAGE_ORDER higher
than the default. And even with the default, the last patch in the series makes
use of the new feature in amdgpu by telling TTM that above 2MiB we do not
expect performance benefits, which makes TTM not attempt direct reclaim for the
top bucket (4MiB).
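
For example (a sketch only, based on the amdgpu change at the end of the
series; the flag names are introduced in patches 2-4), a driver opts in with
something along the lines of:

  ttm_device_init(bdev, funcs, dev, mapping, vma_manager,
                  TTM_POOL_USE_DMA32 |
                  TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));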

The end result is that TTM drivers become a tiny bit nicer mm citizens and
users benefit from better worst case buffer creation latencies. As a side
benefit we get rid of two instances of the often very unreadable function
signatures made of multiple nameless booleans.

If this sounds interesting and gets merged, the individual drivers can follow
up with patches configuring their thresholds.

v2:
 * Christian suggested to pass in the new data by changing the function signatures.

v3:
 * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)

v1 thread:
https://lore.kernel.org/dri-devel/20250919131127.90932-1-tvrtko.ursulin@igalia.com/

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sui Jingfeng <suijingfeng@loongson.cn>
Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zack Rusin <zack.rusin@broadcom.com>

Tvrtko Ursulin (5):
  drm/ttm: Add getter for some pool properties
  drm/ttm: Replace multiple booleans with flags in pool init
  drm/ttm: Replace multiple booleans with flags in device init
  drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
  drm/amdgpu: Configure max beneficial TTM pool allocation order

 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
 drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
 drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
 drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
 drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
 drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
 drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
 .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
 drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
 drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
 drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
 drivers/gpu/drm/ttm/ttm_pool.c                | 45 +++++++++++--------
 drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
 drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
 drivers/gpu/drm/xe/xe_device.c                |  2 +-
 include/drm/ttm/ttm_device.h                  |  2 +-
 include/drm/ttm/ttm_pool.h                    | 13 +++---
 21 files changed, 125 insertions(+), 106 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h

-- 
2.48.0


* [PATCH v3 1/5] drm/ttm: Add getter for some pool properties
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
@ 2025-10-08 11:53 ` Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init Tvrtko Ursulin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Christian König,
	Thomas Hellström

No functional change, but this allows for easier refactoring in the future.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c          | 29 +++++++++++++------------
 drivers/gpu/drm/ttm/ttm_pool_internal.h | 19 ++++++++++++++++
 drivers/gpu/drm/ttm/ttm_tt.c            | 10 +++++----
 3 files changed, 40 insertions(+), 18 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index baf27c70a419..ff6fab4122bb 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -48,6 +48,7 @@
 #include <drm/ttm/ttm_bo.h>
 
 #include "ttm_module.h"
+#include "ttm_pool_internal.h"
 
 #ifdef CONFIG_FAULT_INJECTION
 #include <linux/fault-inject.h>
@@ -148,7 +149,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 		gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
 			__GFP_THISNODE;
 
-	if (!pool->use_dma_alloc) {
+	if (!ttm_pool_uses_dma_alloc(pool)) {
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
 		if (p)
 			p->private = order;
@@ -200,7 +201,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
 		set_pages_wb(p, 1 << order);
 #endif
 
-	if (!pool || !pool->use_dma_alloc) {
+	if (!pool || !ttm_pool_uses_dma_alloc(pool)) {
 		__free_pages(p, order);
 		return;
 	}
@@ -243,7 +244,7 @@ static int ttm_pool_map(struct ttm_pool *pool, unsigned int order,
 {
 	dma_addr_t addr;
 
-	if (pool->use_dma_alloc) {
+	if (ttm_pool_uses_dma_alloc(pool)) {
 		struct ttm_pool_dma *dma = (void *)p->private;
 
 		addr = dma->addr;
@@ -265,7 +266,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
 			   unsigned int num_pages)
 {
 	/* Unmapped while freeing the page */
-	if (pool->use_dma_alloc)
+	if (ttm_pool_uses_dma_alloc(pool))
 		return;
 
 	dma_unmap_page(pool->dev, dma_addr, (long)num_pages << PAGE_SHIFT,
@@ -339,7 +340,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 						  enum ttm_caching caching,
 						  unsigned int order)
 {
-	if (pool->use_dma_alloc)
+	if (ttm_pool_uses_dma_alloc(pool))
 		return &pool->caching[caching].orders[order];
 
 #ifdef CONFIG_X86
@@ -348,7 +349,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 		if (pool->nid != NUMA_NO_NODE)
 			return &pool->caching[caching].orders[order];
 
-		if (pool->use_dma32)
+		if (ttm_pool_uses_dma32(pool))
 			return &global_dma32_write_combined[order];
 
 		return &global_write_combined[order];
@@ -356,7 +357,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 		if (pool->nid != NUMA_NO_NODE)
 			return &pool->caching[caching].orders[order];
 
-		if (pool->use_dma32)
+		if (ttm_pool_uses_dma32(pool))
 			return &global_dma32_uncached[order];
 
 		return &global_uncached[order];
@@ -396,7 +397,7 @@ static unsigned int ttm_pool_shrink(void)
 /* Return the allocation order based for a page */
 static unsigned int ttm_pool_page_order(struct ttm_pool *pool, struct page *p)
 {
-	if (pool->use_dma_alloc) {
+	if (ttm_pool_uses_dma_alloc(pool)) {
 		struct ttm_pool_dma *dma = (void *)p->private;
 
 		return dma->vaddr & ~PAGE_MASK;
@@ -719,7 +720,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	if (ctx->gfp_retry_mayfail)
 		gfp_flags |= __GFP_RETRY_MAYFAIL;
 
-	if (pool->use_dma32)
+	if (ttm_pool_uses_dma32(pool))
 		gfp_flags |= GFP_DMA32;
 	else
 		gfp_flags |= GFP_HIGHUSER;
@@ -977,7 +978,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 		return -EINVAL;
 
 	if ((!ttm_backup_bytes_avail() && !flags->purge) ||
-	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
+	    ttm_pool_uses_dma_alloc(pool) || ttm_tt_is_backed_up(tt))
 		return -EBUSY;
 
 #ifdef CONFIG_X86
@@ -1014,7 +1015,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 	if (flags->purge)
 		return shrunken;
 
-	if (pool->use_dma32)
+	if (ttm_pool_uses_dma32(pool))
 		gfp = GFP_DMA32;
 	else
 		gfp = GFP_HIGHUSER;
@@ -1068,7 +1069,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 {
 	unsigned int i, j;
 
-	WARN_ON(!dev && use_dma_alloc);
+	WARN_ON(!dev && ttm_pool_uses_dma_alloc(pool));
 
 	pool->dev = dev;
 	pool->nid = nid;
@@ -1239,7 +1240,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 {
 	unsigned int i;
 
-	if (!pool->use_dma_alloc && pool->nid == NUMA_NO_NODE) {
+	if (!ttm_pool_uses_dma_alloc(pool) && pool->nid == NUMA_NO_NODE) {
 		seq_puts(m, "unused\n");
 		return 0;
 	}
@@ -1250,7 +1251,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		if (!ttm_pool_select_type(pool, i, 0))
 			continue;
-		if (pool->use_dma_alloc)
+		if (ttm_pool_uses_dma_alloc(pool))
 			seq_puts(m, "DMA ");
 		else
 			seq_printf(m, "N%d ", pool->nid);
diff --git a/drivers/gpu/drm/ttm/ttm_pool_internal.h b/drivers/gpu/drm/ttm/ttm_pool_internal.h
new file mode 100644
index 000000000000..3e50d30bd95a
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_pool_internal.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/* Copyright (c) 2025 Valve Corporation */
+
+#ifndef _TTM_POOL_INTERNAL_H_
+#define _TTM_POOL_INTERNAL_H_
+
+#include <drm/ttm/ttm_pool.h>
+
+static inline bool ttm_pool_uses_dma_alloc(struct ttm_pool *pool)
+{
+	return pool->use_dma_alloc;
+}
+
+static inline bool ttm_pool_uses_dma32(struct ttm_pool *pool)
+{
+	return pool->use_dma32;
+}
+
+#endif
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 506e257dfba8..ced0875d0722 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -46,6 +46,7 @@
 #include <drm/ttm/ttm_tt.h>
 
 #include "ttm_module.h"
+#include "ttm_pool_internal.h"
 
 static unsigned long ttm_pages_limit;
 
@@ -93,7 +94,8 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc)
 	 * mapped TT pages need to be decrypted or otherwise the drivers
 	 * will end up sending encrypted mem to the gpu.
 	 */
-	if (bdev->pool.use_dma_alloc && cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+	if (ttm_pool_uses_dma_alloc(&bdev->pool) &&
+	    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
 		page_flags |= TTM_TT_FLAG_DECRYPTED;
 		drm_info_once(ddev, "TT memory decryption enabled.");
 	}
@@ -378,7 +380,7 @@ int ttm_tt_populate(struct ttm_device *bdev,
 
 	if (!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL)) {
 		atomic_long_add(ttm->num_pages, &ttm_pages_allocated);
-		if (bdev->pool.use_dma32)
+		if (ttm_pool_uses_dma32(&bdev->pool))
 			atomic_long_add(ttm->num_pages,
 					&ttm_dma32_pages_allocated);
 	}
@@ -416,7 +418,7 @@ int ttm_tt_populate(struct ttm_device *bdev,
 error:
 	if (!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL)) {
 		atomic_long_sub(ttm->num_pages, &ttm_pages_allocated);
-		if (bdev->pool.use_dma32)
+		if (ttm_pool_uses_dma32(&bdev->pool))
 			atomic_long_sub(ttm->num_pages,
 					&ttm_dma32_pages_allocated);
 	}
@@ -439,7 +441,7 @@ void ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
 
 	if (!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL)) {
 		atomic_long_sub(ttm->num_pages, &ttm_pages_allocated);
-		if (bdev->pool.use_dma32)
+		if (ttm_pool_uses_dma32(&bdev->pool))
 			atomic_long_sub(ttm->num_pages,
 					&ttm_dma32_pages_allocated);
 	}
-- 
2.48.0


* [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 1/5] drm/ttm: Add getter for some pool properties Tvrtko Ursulin
@ 2025-10-08 11:53 ` Tvrtko Ursulin
  2025-10-10 15:10   ` kernel test robot
  2025-10-08 11:53 ` [PATCH v3 3/5] drm/ttm: Replace multiple booleans with flags in device init Tvrtko Ursulin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Alex Deucher, Christian König,
	Thomas Hellström

Multiple consecutive boolean function arguments are usually not very
readable.

Replace the ones in ttm_pool_init() with flags, with the additional benefit
that more data can soon be passed in at the cost of only this one-off code
base churn.
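
For example (taken from the ttm_pool_test.c hunk below), a call such as:

  ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);

becomes:

  ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_POOL_USE_DMA_ALLOC);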

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_device_test.c | 23 +++++++--------------
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c   | 23 ++++++++++-----------
 drivers/gpu/drm/ttm/ttm_device.c            |  4 +++-
 drivers/gpu/drm/ttm/ttm_pool.c              |  8 +++----
 drivers/gpu/drm/ttm/ttm_pool_internal.h     |  4 ++--
 include/drm/ttm/ttm_pool.h                  | 11 +++++-----
 7 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index aa9ee5dffa45..8f6d331e1ea2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
 	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
 		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
 			      adev->gmc.mem_partitions[i].numa.node,
-			      false, false);
+			      0);
 	}
 	return 0;
 }
diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index 1621903818e5..18c58b0fc18c 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -10,8 +10,7 @@
 
 struct ttm_device_test_case {
 	const char *description;
-	bool use_dma_alloc;
-	bool use_dma32;
+	unsigned int flags;
 	bool pools_init_expected;
 };
 
@@ -119,26 +118,21 @@ static void ttm_device_init_no_vma_man(struct kunit *test)
 static const struct ttm_device_test_case ttm_device_cases[] = {
 	{
 		.description = "No DMA allocations, no DMA32 required",
-		.use_dma_alloc = false,
-		.use_dma32 = false,
 		.pools_init_expected = false,
 	},
 	{
 		.description = "DMA allocations, DMA32 required",
-		.use_dma_alloc = true,
-		.use_dma32 = true,
+		.flags = TTM_POOL_USE_DMA_ALLOC | TTM_POOL_USE_DMA32,
 		.pools_init_expected = true,
 	},
 	{
 		.description = "No DMA allocations, DMA32 required",
-		.use_dma_alloc = false,
-		.use_dma32 = true,
+		.flags = TTM_POOL_USE_DMA32,
 		.pools_init_expected = false,
 	},
 	{
 		.description = "DMA allocations, no DMA32 required",
-		.use_dma_alloc = true,
-		.use_dma32 = false,
+		.flags = TTM_POOL_USE_DMA_ALLOC,
 		.pools_init_expected = true,
 	},
 };
@@ -163,15 +157,14 @@ static void ttm_device_init_pools(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
 	err = ttm_device_kunit_init(priv, ttm_dev,
-				    params->use_dma_alloc,
-				    params->use_dma32);
+				    params->flags & TTM_POOL_USE_DMA_ALLOC,
+				    params->flags & TTM_POOL_USE_DMA32);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	pool = &ttm_dev->pool;
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 	KUNIT_EXPECT_PTR_EQ(test, pool->dev, priv->dev);
-	KUNIT_EXPECT_EQ(test, pool->use_dma_alloc, params->use_dma_alloc);
-	KUNIT_EXPECT_EQ(test, pool->use_dma32, params->use_dma32);
+	KUNIT_EXPECT_EQ(test, pool->flags, params->flags);
 
 	if (params->pools_init_expected) {
 		for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
@@ -181,7 +174,7 @@ static void ttm_device_init_pools(struct kunit *test)
 				KUNIT_EXPECT_EQ(test, pt.caching, i);
 				KUNIT_EXPECT_EQ(test, pt.order, j);
 
-				if (params->use_dma_alloc)
+				if (ttm_pool_uses_dma_alloc(pool))
 					KUNIT_ASSERT_FALSE(test,
 							   list_empty(&pt.pages));
 			}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 8ade53371f72..8caf4a587bac 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -12,7 +12,7 @@
 struct ttm_pool_test_case {
 	const char *description;
 	unsigned int order;
-	bool use_dma_alloc;
+	unsigned int flags;
 };
 
 struct ttm_pool_test_priv {
@@ -86,7 +86,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_POOL_USE_DMA_ALLOC);
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -113,12 +113,12 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
 	{
 		.description = "One page, with coherent DMA mappings enabled",
 		.order = 0,
-		.use_dma_alloc = true,
+		.flags = TTM_POOL_USE_DMA_ALLOC,
 	},
 	{
 		.description = "Above the allocation limit, with coherent DMA mappings enabled",
 		.order = MAX_PAGE_ORDER + 1,
-		.use_dma_alloc = true,
+		.flags = TTM_POOL_USE_DMA_ALLOC,
 	},
 };
 
@@ -150,12 +150,11 @@ static void ttm_pool_alloc_basic(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params->use_dma_alloc,
-		      false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params->flags);
 
 	KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
-	KUNIT_ASSERT_EQ(test, pool->use_dma_alloc, params->use_dma_alloc);
+	KUNIT_ASSERT_EQ(test, pool->flags, params->flags);
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -165,14 +164,14 @@ static void ttm_pool_alloc_basic(struct kunit *test)
 	last_page = tt->pages[tt->num_pages - 1];
 
 	if (params->order <= MAX_PAGE_ORDER) {
-		if (params->use_dma_alloc) {
+		if (ttm_pool_uses_dma_alloc(pool)) {
 			KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
 			KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private);
 		} else {
 			KUNIT_ASSERT_EQ(test, fst_page->private, params->order);
 		}
 	} else {
-		if (params->use_dma_alloc) {
+		if (ttm_pool_uses_dma_alloc(pool)) {
 			KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
 			KUNIT_ASSERT_NULL(test, (void *)last_page->private);
 		} else {
@@ -218,7 +217,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_POOL_USE_DMA_ALLOC);
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -348,7 +347,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_POOL_USE_DMA_ALLOC);
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
@@ -379,7 +378,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, 0);
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index c3e2fcbdd2cc..250e9f10145d 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -236,7 +236,9 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 	else
 		nid = NUMA_NO_NODE;
 
-	ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
+	ttm_pool_init(&bdev->pool, dev, nid,
+		      (use_dma_alloc ? TTM_POOL_USE_DMA_ALLOC : 0) |
+		      (use_dma32 ? TTM_POOL_USE_DMA32 : 0));
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index ff6fab4122bb..5d007d1a15f2 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -1059,13 +1059,12 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
  * @pool: the pool to initialize
  * @dev: device for DMA allocations and mappings
  * @nid: NUMA node to use for allocations
- * @use_dma_alloc: true if coherent DMA alloc should be used
- * @use_dma32: true if GFP_DMA32 should be used
+ * @flags: TTM_POOL_ flags
  *
  * Initialize the pool and its pool types.
  */
 void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
-		   int nid, bool use_dma_alloc, bool use_dma32)
+		   int nid, unsigned int flags)
 {
 	unsigned int i, j;
 
@@ -1073,8 +1072,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 
 	pool->dev = dev;
 	pool->nid = nid;
-	pool->use_dma_alloc = use_dma_alloc;
-	pool->use_dma32 = use_dma32;
+	pool->flags = flags;
 
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		for (j = 0; j < NR_PAGE_ORDERS; ++j) {
diff --git a/drivers/gpu/drm/ttm/ttm_pool_internal.h b/drivers/gpu/drm/ttm/ttm_pool_internal.h
index 3e50d30bd95a..91aa156ddded 100644
--- a/drivers/gpu/drm/ttm/ttm_pool_internal.h
+++ b/drivers/gpu/drm/ttm/ttm_pool_internal.h
@@ -8,12 +8,12 @@
 
 static inline bool ttm_pool_uses_dma_alloc(struct ttm_pool *pool)
 {
-	return pool->use_dma_alloc;
+	return pool->flags & TTM_POOL_USE_DMA_ALLOC;
 }
 
 static inline bool ttm_pool_uses_dma32(struct ttm_pool *pool)
 {
-	return pool->use_dma32;
+	return pool->flags & TTM_POOL_USE_DMA32;
 }
 
 #endif
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 54cd34a6e4c0..70db2e5c950f 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -59,21 +59,22 @@ struct ttm_pool_type {
 	struct list_head pages;
 };
 
+#define TTM_POOL_USE_DMA_ALLOC 	BIT(0) /* Use coherent DMA allocations. */
+#define TTM_POOL_USE_DMA32	BIT(1) /* Use GFP_DMA32 allocations. */
+
 /**
  * struct ttm_pool - Pool for all caching and orders
  *
  * @dev: the device we allocate pages for
  * @nid: which numa node to use
- * @use_dma_alloc: if coherent DMA allocations should be used
- * @use_dma32: if GFP_DMA32 should be used
+ * @flags: TTM_POOL_ flags
  * @caching: pools for each caching/order
  */
 struct ttm_pool {
 	struct device *dev;
 	int nid;
 
-	bool use_dma_alloc;
-	bool use_dma32;
+	unsigned int flags;
 
 	struct {
 		struct ttm_pool_type orders[NR_PAGE_ORDERS];
@@ -85,7 +86,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
 void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
-		   int nid, bool use_dma_alloc, bool use_dma32);
+		   int nid, unsigned int flags);
 void ttm_pool_fini(struct ttm_pool *pool);
 
 int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
-- 
2.48.0


* [PATCH v3 3/5] drm/ttm: Replace multiple booleans with flags in device init
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 1/5] drm/ttm: Add getter for some pool properties Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init Tvrtko Ursulin
@ 2025-10-08 11:53 ` Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 4/5] drm/ttm: Allow drivers to specify maximum beneficial TTM pool size Tvrtko Ursulin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Alex Deucher, Christian König,
	Danilo Krummrich, Dave Airlie, Gerd Hoffmann, Joonas Lahtinen,
	Lucas De Marchi, Lyude Paul, Maarten Lankhorst, Maxime Ripard,
	Rodrigo Vivi, Sui Jingfeng, Thomas Hellström,
	Thomas Zimmermann, Zack Rusin

Multiple consecutive boolean function arguments are usually not very
readable.

Replace the ones in ttm_device_init() with flags, with the additional benefit
that more data can soon be passed in at the cost of only a one-off code base
churn.
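
For example (taken from the drm_gem_vram_helper hunk below), the trailing
"false, true" argument pair becomes a single self-descriptive flag:

  ret = ttm_device_init(&vmm->bdev, &bo_driver, dev->dev,
                        dev->anon_inode->i_mapping,
                        dev->vma_offset_manager,
                        TTM_POOL_USE_DMA32);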

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sui Jingfeng <suijingfeng@loongson.cn>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zack Rusin <zack.rusin@broadcom.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  4 ++--
 drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
 drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
 drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 ++--
 drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
 drivers/gpu/drm/radeon/radeon_ttm.c           |  4 ++--
 drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++++++-------
 .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 12 +++++-----
 drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++++++-----------
 drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 ++----
 drivers/gpu/drm/ttm/ttm_device.c              |  9 +++-----
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 ++--
 drivers/gpu/drm/xe/xe_device.c                |  2 +-
 include/drm/ttm/ttm_device.h                  |  2 +-
 16 files changed, 42 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 8f6d331e1ea2..948c6d0a422b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1930,8 +1930,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 	r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
 			       adev_to_drm(adev)->anon_inode->i_mapping,
 			       adev_to_drm(adev)->vma_offset_manager,
-			       adev->need_swiotlb,
-			       dma_addressing_limited(adev->dev));
+			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
+			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
 	if (r) {
 		dev_err(adev->dev,
 			"failed initializing buffer object driver(%d).\n", r);
diff --git a/drivers/gpu/drm/drm_gem_vram_helper.c b/drivers/gpu/drm/drm_gem_vram_helper.c
index 90760d0ca071..386d15f630f5 100644
--- a/drivers/gpu/drm/drm_gem_vram_helper.c
+++ b/drivers/gpu/drm/drm_gem_vram_helper.c
@@ -859,7 +859,7 @@ static int drm_vram_mm_init(struct drm_vram_mm *vmm, struct drm_device *dev,
 	ret = ttm_device_init(&vmm->bdev, &bo_driver, dev->dev,
 				 dev->anon_inode->i_mapping,
 				 dev->vma_offset_manager,
-				 false, true);
+				 TTM_POOL_USE_DMA32);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_region_ttm.c b/drivers/gpu/drm/i915/intel_region_ttm.c
index 04525d92bec5..47a69aad5c3f 100644
--- a/drivers/gpu/drm/i915/intel_region_ttm.c
+++ b/drivers/gpu/drm/i915/intel_region_ttm.c
@@ -34,7 +34,7 @@ int intel_region_ttm_device_init(struct drm_i915_private *dev_priv)
 
 	return ttm_device_init(&dev_priv->bdev, i915_ttm_driver(),
 			       drm->dev, drm->anon_inode->i_mapping,
-			       drm->vma_offset_manager, false, false);
+			       drm->vma_offset_manager, 0);
 }
 
 /**
diff --git a/drivers/gpu/drm/loongson/lsdc_ttm.c b/drivers/gpu/drm/loongson/lsdc_ttm.c
index 2e42c6970c9f..b74dadd1a9c6 100644
--- a/drivers/gpu/drm/loongson/lsdc_ttm.c
+++ b/drivers/gpu/drm/loongson/lsdc_ttm.c
@@ -544,7 +544,7 @@ int lsdc_ttm_init(struct lsdc_device *ldev)
 
 	ret = ttm_device_init(&ldev->bdev, &lsdc_bo_driver, ddev->dev,
 			      ddev->anon_inode->i_mapping,
-			      ddev->vma_offset_manager, false, true);
+			      ddev->vma_offset_manager, TTM_POOL_USE_DMA32);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index 7d2436e5d50d..34649eb2b7c5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -302,8 +302,8 @@ nouveau_ttm_init(struct nouveau_drm *drm)
 	ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
 				  dev->anon_inode->i_mapping,
 				  dev->vma_offset_manager,
-				  drm_need_swiotlb(drm->client.mmu.dmabits),
-				  drm->client.mmu.dmabits <= 32);
+				  (drm_need_swiotlb(drm->client.mmu.dmabits) ? TTM_POOL_USE_DMA_ALLOC : 0 ) |
+				  (drm->client.mmu.dmabits <= 32 ? TTM_POOL_USE_DMA32 : 0));
 	if (ret) {
 		NV_ERROR(drm, "error initialising bo driver, %d\n", ret);
 		return ret;
diff --git a/drivers/gpu/drm/qxl/qxl_ttm.c b/drivers/gpu/drm/qxl/qxl_ttm.c
index 765a144cea14..85d9df48affa 100644
--- a/drivers/gpu/drm/qxl/qxl_ttm.c
+++ b/drivers/gpu/drm/qxl/qxl_ttm.c
@@ -196,7 +196,7 @@ int qxl_ttm_init(struct qxl_device *qdev)
 	r = ttm_device_init(&qdev->mman.bdev, &qxl_bo_driver, NULL,
 			    qdev->ddev.anon_inode->i_mapping,
 			    qdev->ddev.vma_offset_manager,
-			    false, false);
+			    0);
 	if (r) {
 		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
 		return r;
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 616d25c8c2de..95758054db65 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -683,8 +683,8 @@ int radeon_ttm_init(struct radeon_device *rdev)
 	r = ttm_device_init(&rdev->mman.bdev, &radeon_bo_driver, rdev->dev,
 			       rdev_to_drm(rdev)->anon_inode->i_mapping,
 			       rdev_to_drm(rdev)->vma_offset_manager,
-			       rdev->need_swiotlb,
-			       dma_addressing_limited(&rdev->pdev->dev));
+			       (rdev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0 ) |
+			       (dma_addressing_limited(&rdev->pdev->dev) ? TTM_POOL_USE_DMA32 : 0));
 	if (r) {
 		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
 		return r;
diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
index 5426b435f702..d468f8322072 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
@@ -251,7 +251,7 @@ static void ttm_bo_unreserve_basic(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -290,7 +290,7 @@ static void ttm_bo_unreserve_pinned(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -342,7 +342,7 @@ static void ttm_bo_unreserve_bulk(struct kunit *test)
 	resv = kunit_kzalloc(test, sizeof(*resv), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, resv);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -394,7 +394,7 @@ static void ttm_bo_fini_basic(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -437,7 +437,7 @@ static void ttm_bo_fini_shared_resv(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -477,7 +477,7 @@ static void ttm_bo_pin_basic(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -512,7 +512,7 @@ static void ttm_bo_pin_unpin_resource(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
@@ -563,7 +563,7 @@ static void ttm_bo_multiple_pin_one_unpin(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	priv->ttm_dev = ttm_dev;
 
diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
index 3a1eef83190c..17a570af296c 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
@@ -995,7 +995,7 @@ static void ttm_bo_validate_busy_domain_evict(struct kunit *test)
 	 */
 	ttm_device_fini(priv->ttm_dev);
 
-	err = ttm_device_kunit_init_bad_evict(test->priv, priv->ttm_dev, false, false);
+	err = ttm_device_kunit_init_bad_evict(test->priv, priv->ttm_dev);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_mock_manager_init(priv->ttm_dev, mem_type, MANAGER_SIZE);
diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index 18c58b0fc18c..0e0359f1cc0e 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -24,7 +24,7 @@ static void ttm_device_init_basic(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	KUNIT_EXPECT_PTR_EQ(test, ttm_dev->funcs, &ttm_dev_funcs);
@@ -54,7 +54,7 @@ static void ttm_device_init_multiple(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, ttm_devs);
 
 	for (i = 0; i < num_dev; i++) {
-		err = ttm_device_kunit_init(priv, &ttm_devs[i], false, false);
+		err = ttm_device_kunit_init(priv, &ttm_devs[i], 0);
 		KUNIT_ASSERT_EQ(test, err, 0);
 
 		KUNIT_EXPECT_PTR_EQ(test, ttm_devs[i].dev_mapping,
@@ -80,7 +80,7 @@ static void ttm_device_fini_basic(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	man = ttm_manager_type(ttm_dev, TTM_PL_SYSTEM);
@@ -108,7 +108,7 @@ static void ttm_device_init_no_vma_man(struct kunit *test)
 	vma_man = drm->vma_offset_manager;
 	drm->vma_offset_manager = NULL;
 
-	err = ttm_device_kunit_init(priv, ttm_dev, false, false);
+	err = ttm_device_kunit_init(priv, ttm_dev, 0);
 	KUNIT_EXPECT_EQ(test, err, -EINVAL);
 
 	/* Bring the manager back for a graceful cleanup */
@@ -156,9 +156,7 @@ static void ttm_device_init_pools(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(priv, ttm_dev,
-				    params->flags & TTM_POOL_USE_DMA_ALLOC,
-				    params->flags & TTM_POOL_USE_DMA32);
+	err = ttm_device_kunit_init(priv, ttm_dev, params->flags);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	pool = &ttm_dev->pool;
diff --git a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
index 7aaf0d1395ff..4b0fc4f5fcbf 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
@@ -117,8 +117,7 @@ static void bad_evict_flags(struct ttm_buffer_object *bo,
 
 static int ttm_device_kunit_init_with_funcs(struct ttm_test_devices *priv,
 					    struct ttm_device *ttm,
-					    bool use_dma_alloc,
-					    bool use_dma32,
+					    unsigned int pool_flags,
 					    struct ttm_device_funcs *funcs)
 {
 	struct drm_device *drm = priv->drm;
@@ -127,7 +126,7 @@ static int ttm_device_kunit_init_with_funcs(struct ttm_test_devices *priv,
 	err = ttm_device_init(ttm, funcs, drm->dev,
 			      drm->anon_inode->i_mapping,
 			      drm->vma_offset_manager,
-			      use_dma_alloc, use_dma32);
+			      pool_flags);
 
 	return err;
 }
@@ -143,11 +142,10 @@ EXPORT_SYMBOL_GPL(ttm_dev_funcs);
 
 int ttm_device_kunit_init(struct ttm_test_devices *priv,
 			  struct ttm_device *ttm,
-			  bool use_dma_alloc,
-			  bool use_dma32)
+			  unsigned int pool_flags)
 {
-	return ttm_device_kunit_init_with_funcs(priv, ttm, use_dma_alloc,
-						use_dma32, &ttm_dev_funcs);
+	return ttm_device_kunit_init_with_funcs(priv, ttm, pool_flags,
+						&ttm_dev_funcs);
 }
 EXPORT_SYMBOL_GPL(ttm_device_kunit_init);
 
@@ -161,12 +159,10 @@ struct ttm_device_funcs ttm_dev_funcs_bad_evict = {
 EXPORT_SYMBOL_GPL(ttm_dev_funcs_bad_evict);
 
 int ttm_device_kunit_init_bad_evict(struct ttm_test_devices *priv,
-				    struct ttm_device *ttm,
-				    bool use_dma_alloc,
-				    bool use_dma32)
+				    struct ttm_device *ttm)
 {
-	return ttm_device_kunit_init_with_funcs(priv, ttm, use_dma_alloc,
-						use_dma32, &ttm_dev_funcs_bad_evict);
+	return ttm_device_kunit_init_with_funcs(priv, ttm, 0,
+						&ttm_dev_funcs_bad_evict);
 }
 EXPORT_SYMBOL_GPL(ttm_device_kunit_init_bad_evict);
 
@@ -252,7 +248,7 @@ struct ttm_test_devices *ttm_test_devices_all(struct kunit *test)
 	ttm_dev = kunit_kzalloc(test, sizeof(*ttm_dev), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, ttm_dev);
 
-	err = ttm_device_kunit_init(devs, ttm_dev, false, false);
+	err = ttm_device_kunit_init(devs, ttm_dev, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	devs->ttm_dev = ttm_dev;
diff --git a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h
index c7da23232ffa..8112892ba290 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h
+++ b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h
@@ -28,12 +28,9 @@ struct ttm_test_devices {
 /* Building blocks for test-specific init functions */
 int ttm_device_kunit_init(struct ttm_test_devices *priv,
 			  struct ttm_device *ttm,
-			  bool use_dma_alloc,
-			  bool use_dma32);
+			  unsigned int pool_flags);
 int ttm_device_kunit_init_bad_evict(struct ttm_test_devices *priv,
-				    struct ttm_device *ttm,
-				    bool use_dma_alloc,
-				    bool use_dma32);
+				    struct ttm_device *ttm);
 struct ttm_buffer_object *ttm_bo_kunit_init(struct kunit *test,
 					    struct ttm_test_devices *devs,
 					    size_t size,
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 250e9f10145d..43f9cab186b1 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -198,8 +198,7 @@ EXPORT_SYMBOL(ttm_device_swapout);
  * @dev: The core kernel device pointer for DMA mappings and allocations.
  * @mapping: The address space to use for this bo.
  * @vma_manager: A pointer to a vma manager.
- * @use_dma_alloc: If coherent DMA allocation API should be used.
- * @use_dma32: If we should use GFP_DMA32 for device memory allocations.
+ * @pool_flags: TTM_POOL_ flags.
  *
  * Initializes a struct ttm_device:
  * Returns:
@@ -208,7 +207,7 @@ EXPORT_SYMBOL(ttm_device_swapout);
 int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
 		    struct device *dev, struct address_space *mapping,
 		    struct drm_vma_offset_manager *vma_manager,
-		    bool use_dma_alloc, bool use_dma32)
+		    unsigned int pool_flags)
 {
 	struct ttm_global *glob = &ttm_glob;
 	int ret, nid;
@@ -236,9 +235,7 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 	else
 		nid = NUMA_NO_NODE;
 
-	ttm_pool_init(&bdev->pool, dev, nid,
-		      (use_dma_alloc ? TTM_POOL_USE_DMA_ALLOC : 0) |
-		      (use_dma32 ? TTM_POOL_USE_DMA32 : 0));
+	ttm_pool_init(&bdev->pool, dev, nid, pool_flags);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
index 8ff958d119be..b5e0ba3c02b9 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
@@ -1023,8 +1023,8 @@ static int vmw_driver_load(struct vmw_private *dev_priv, u32 pci_id)
 			      dev_priv->drm.dev,
 			      dev_priv->drm.anon_inode->i_mapping,
 			      dev_priv->drm.vma_offset_manager,
-			      dev_priv->map_mode == vmw_dma_alloc_coherent,
-			      false);
+			      (dev_priv->map_mode == vmw_dma_alloc_coherent) ?
+			      TTM_POOL_USE_DMA_ALLOC : 0);
 	if (unlikely(ret != 0)) {
 		drm_err(&dev_priv->drm,
 			"Failed initializing TTM buffer object driver.\n");
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index ab0302d5f25f..86bdf1fc8392 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -437,7 +437,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 
 	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
 			      xe->drm.anon_inode->i_mapping,
-			      xe->drm.vma_offset_manager, false, false);
+			      xe->drm.vma_offset_manager, 0);
 	if (WARN_ON(err))
 		goto err;
 
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index 592b5f802859..baa1c0c3ed86 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -292,7 +292,7 @@ static inline void ttm_set_driver_manager(struct ttm_device *bdev, int type,
 int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
 		    struct device *dev, struct address_space *mapping,
 		    struct drm_vma_offset_manager *vma_manager,
-		    bool use_dma_alloc, bool use_dma32);
+		    unsigned int pool_flags);
 void ttm_device_fini(struct ttm_device *bdev);
 void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
 
-- 
2.48.0


* [PATCH v3 4/5] drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
                   ` (2 preceding siblings ...)
  2025-10-08 11:53 ` [PATCH v3 3/5] drm/ttm: Replace multiple booleans with flags in device init Tvrtko Ursulin
@ 2025-10-08 11:53 ` Tvrtko Ursulin
  2025-10-08 11:53 ` [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order Tvrtko Ursulin
  2025-10-08 12:35 ` [PATCH v3 0/5] Improving the worst case TTM large allocation latency Christian König
  5 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Christian König,
	Thadeu Lima de Souza Cascardo

GPUs typically benefit from contiguous memory via reduced TLB pressure and
improved caching performance, and the maximum contiguous block size which
brings a performance benefit depends on the hardware design.

The TTM pool allocator by default tries (hard) to allocate blocks up to the
system MAX_PAGE_ORDER. This varies by CPU platform and can also be configured
via Kconfig.

If that limit is set higher than the GPU can make extra use of, let the
individual drivers tell TTM above which allocation order the pool allocator
can afford to make a little less effort.

We implement this by disabling direct reclaim for those allocations, which
reduces the allocation latency and lowers the demands on the page
allocator, in cases where expending this effort is not critical for the
GPU in question.
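
The threshold is passed in via the pool flags, for example (a sketch mirroring
the amdgpu patch which follows):

  ttm_pool_init(pool, dev, nid,
                TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));

Allocation attempts for orders above the configured value then simply drop
__GFP_DIRECT_RECLAIM while everything else behaves as before.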

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c          | 8 ++++++++
 drivers/gpu/drm/ttm/ttm_pool_internal.h | 5 +++++
 include/drm/ttm/ttm_pool.h              | 6 ++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 5d007d1a15f2..183e39d0d505 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -136,6 +136,7 @@ static DECLARE_RWSEM(pool_shrink_rwsem);
 static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 					unsigned int order)
 {
+	const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
 	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
 	struct ttm_pool_dma *dma;
 	struct page *p;
@@ -149,6 +150,13 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 		gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
 			__GFP_THISNODE;
 
+	/*
+	 * Do not add latency to the allocation path for allocation orders
+	 * which the device told us bring no additional performance gains.
+	 */
+	if (beneficial_order && order > beneficial_order)
+		gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+
 	if (!ttm_pool_uses_dma_alloc(pool)) {
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
 		if (p)
diff --git a/drivers/gpu/drm/ttm/ttm_pool_internal.h b/drivers/gpu/drm/ttm/ttm_pool_internal.h
index 91aa156ddded..638497fcc8a8 100644
--- a/drivers/gpu/drm/ttm/ttm_pool_internal.h
+++ b/drivers/gpu/drm/ttm/ttm_pool_internal.h
@@ -16,4 +16,9 @@ static inline bool ttm_pool_uses_dma32(struct ttm_pool *pool)
 	return pool->flags & TTM_POOL_USE_DMA32;
 }
 
+static inline unsigned int ttm_pool_beneficial_order(struct ttm_pool *pool)
+{
+	return pool->flags & 0xff;
+}
+
 #endif
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 70db2e5c950f..60753b051c65 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -59,8 +59,10 @@ struct ttm_pool_type {
 	struct list_head pages;
 };
 
-#define TTM_POOL_USE_DMA_ALLOC 	BIT(0) /* Use coherent DMA allocations. */
-#define TTM_POOL_USE_DMA32	BIT(1) /* Use GFP_DMA32 allocations. */
+/* Check helpers in ttm_pool_internal.h when modifying: */
+#define TTM_POOL_BENEFICIAL_ORDER(n)	((n) & 0xff) /* Max order which caller can benefit from */
+#define TTM_POOL_USE_DMA_ALLOC 		BIT(8) /* Use coherent DMA allocations. */
+#define TTM_POOL_USE_DMA32		BIT(9) /* Use GFP_DMA32 allocations. */
 
 /**
  * struct ttm_pool - Pool for all caching and orders
-- 
2.48.0


* [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
                   ` (3 preceding siblings ...)
  2025-10-08 11:53 ` [PATCH v3 4/5] drm/ttm: Allow drivers to specify maximum beneficial TTM pool size Tvrtko Ursulin
@ 2025-10-08 11:53 ` Tvrtko Ursulin
  2025-10-08 23:18   ` Matthew Brost
  2025-10-08 12:35 ` [PATCH v3 0/5] Improving the worst case TTM large allocation latency Christian König
  5 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 11:53 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: kernel-dev, Tvrtko Ursulin, Alex Deucher, Christian König,
	Thadeu Lima de Souza Cascardo

Let the TTM pool allocator know that we can afford for it to expend less
effort when satisfying contiguous allocations larger than 2MiB. That is the
largest relevant page table entry size and the driver and hardware are happy
to get larger blocks only opportunistically.
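
With 4KiB pages get_order(2 * SZ_1M) is 9, so with the default MAX_PAGE_ORDER
of 10 only the top (4MiB) pool bucket ends up skipping direct reclaim.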

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 948c6d0a422b..723b885210a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
 	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
 		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
 			      adev->gmc.mem_partitions[i].numa.node,
-			      0);
+			      TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
 	}
 	return 0;
 }
@@ -1931,7 +1931,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 			       adev_to_drm(adev)->anon_inode->i_mapping,
 			       adev_to_drm(adev)->vma_offset_manager,
 			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
-			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
+			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0) |
+			       TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
 	if (r) {
 		dev_err(adev->dev,
 			"failed initializing buffer object driver(%d).\n", r);
-- 
2.48.0


* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
                   ` (4 preceding siblings ...)
  2025-10-08 11:53 ` [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order Tvrtko Ursulin
@ 2025-10-08 12:35 ` Christian König
  2025-10-08 13:50   ` Tvrtko Ursulin
  5 siblings, 1 reply; 24+ messages in thread
From: Christian König @ 2025-10-08 12:35 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx, dri-devel
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lucas De Marchi, Lyude Paul,
	Maarten Lankhorst, Maxime Ripard, Rodrigo Vivi, Sui Jingfeng,
	Thadeu Lima de Souza Cascardo, Thomas Hellström,
	Thomas Zimmermann, Zack Rusin

On 08.10.25 13:53, Tvrtko Ursulin wrote:
> Disclaimer:
> Please note that, as this series includes a patch which touches a good number
> of drivers, I will only copy everyone on the cover letter and the respective
> patch. The assumption is that people are subscribed to dri-devel and so can
> look at the whole series there. I know someone is bound to complain either
> way, whether everyone is copied on everything and gets too much email or about
> this narrower approach, so please be flexible.
> 
> Description:
> 
> All drivers which use the TTM pool allocator end up requesting large order
> allocations when allocating large buffers. Those can be slow due to memory
> pressure and so add latency to buffer creation. But there is often also a size
> limit above which contiguous blocks do not bring any performance benefits. This
> series allows drivers to say when it is okay for TTM to try a bit less hard.
> 
> We do this by allowing drivers to specify this cut-off point when creating the
> TTM device and pools. Allocations above this size will skip direct reclaim, so
> worst case latency under memory pressure will improve. Background reclaim is
> still kicked off, and both before and after memory pressure all the TTM pool
> buckets remain in use as they are today.
> 
> This is especially interesting if someone has configured MAX_PAGE_ORDER higher
> than the default. And even with the default, the last patch in the series makes
> use of the new feature in amdgpu by telling TTM that above 2MiB we do not
> expect performance benefits, which makes TTM not attempt direct reclaim for the
> top bucket (4MiB).
> 
> The end result is that TTM drivers become a tiny bit nicer mm citizens and
> users benefit from better worst case buffer creation latencies. As a side
> benefit we get rid of two instances of the often very unreadable function
> signatures made of multiple nameless booleans.
> 
> If this sounds interesting and gets merged, the individual drivers can follow
> up with patches configuring their thresholds.
> 
> v2:
>  * Christian suggested to pass in the new data by changing the function signatures.
> 
> v3:
>  * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)

Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.

The rest is Reviewed-by: Christian König <christian.koenig@amd.com>

Thanks,
Christian.

> 
> v1 thread:
> https://lore.kernel.org/dri-devel/20250919131127.90932-1-tvrtko.ursulin@igalia.com/
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Sui Jingfeng <suijingfeng@loongson.cn>
> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Zack Rusin <zack.rusin@broadcom.com>
> 
> Tvrtko Ursulin (5):
>   drm/ttm: Add getter for some pool properties
>   drm/ttm: Replace multiple booleans with flags in pool init
>   drm/ttm: Replace multiple booleans with flags in device init
>   drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
>   drm/amdgpu: Configure max beneficial TTM pool allocation order
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
>  drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
>  drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
>  drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
>  drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
>  drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
>  drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
>  .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
>  drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
>  drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
>  drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
>  drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
>  drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
>  drivers/gpu/drm/ttm/ttm_pool.c                | 45 +++++++++++--------
>  drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
>  drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
>  drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
>  drivers/gpu/drm/xe/xe_device.c                |  2 +-
>  include/drm/ttm/ttm_device.h                  |  2 +-
>  include/drm/ttm/ttm_pool.h                    | 13 +++---
>  21 files changed, 125 insertions(+), 106 deletions(-)
>  create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 12:35 ` [PATCH v3 0/5] Improving the worst case TTM large allocation latency Christian König
@ 2025-10-08 13:50   ` Tvrtko Ursulin
  2025-10-08 14:02     ` Christian König
  2025-10-08 14:34     ` Matthew Auld
  0 siblings, 2 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-08 13:50 UTC (permalink / raw)
  To: Christian König, amd-gfx, Lucas De Marchi, dri-devel,
	Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Hellström, Thomas Zimmermann, Zack Rusin


On 08/10/2025 13:35, Christian König wrote:
> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>> Disclaimer:
>> Please note that as this series includes a patch which touches a good number of
>> drivers I will only copy everyone in the cover letter and the respective patch.
>> Assumption is people are subscribed to dri-devel so can look at the whole series
>> there. I know someone is bound to complain for both the case when everyone is
>> copied on everything for getting too much email, and also for this other case.
>> So please be flexible.
>>
>> Description:
>>
>> All drivers which use the TTM pool allocator end up requesting large order
>> allocations when allocating large buffers. Those can be slow due to memory pressure
>> and so add latency to buffer creation. But there is often also a size limit
>> above which contiguous blocks do not bring any performance benefits. This series
>> allows drivers to say when it is okay for the TTM to try a bit less hard.
>>
>> We do this by allowing drivers to specify this cut off point when creating the
>> TTM device and pools. Allocations above this size will skip direct reclaim so
>> under memory pressure worst case latency will improve. Background reclaim is
>> still kicked off and both before and after the memory pressure all the TTM pool
>> buckets remain to be used as they are today.
>>
>> This is especially interesting if someone has configured MAX_PAGE_ORDER to
>> higher than the default. And even with the default, with amdgpu for example,
>> the last patch in the series makes use of the new feature by telling TTM that
>> above 2MiB we do not expect performance benefits. Which makes TTM not try direct
>> reclaim for the top bucket (4MiB).
>>
>> End result is TTM drivers become a tiny bit nicer mm citizens and users benefit
>> from better worst case buffer creation latencies. As a side benefit we get rid
>> of two instances of those often very unreadable multiple nameless booleans
>> function signatures.
>>
>> If this sounds interesting and gets merged the individual drivers can follow up
>> with patches configuring their thresholds.
>>
>> v2:
>>   * Christian suggested to pass in the new data by changing the function signatures.
>>
>> v3:
>>   * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)
> 
> Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
>
> The rest is Reviewed-by: Christian König <christian.koenig@amd.com>

Thank you!

So I think now I need acks to merge via drm-misc for all the drivers 
which have their own trees. Which seems to be just xe.

Also interesting for other drivers is that when this lands folks can 
start passing in their "max size which leads to performance gains" via 
TTM_POOL_BENEFICIAL_ORDER and get the worst case allocation latency 
improvements.

I am thinking xe also maxes out at 2MiB pages, for others I don't know.
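
For the record, with 4KiB pages get_order(SZ_2M) is 9, so with the default 
MAX_PAGE_ORDER of 10 something along these lines (illustrative sketch only, 
using the macro added in patch 4) makes just the top 4MiB bucket skip direct 
reclaim:

	/* "Above 2MiB contiguous blocks bring no extra performance." */
	pool_flags |= TTM_POOL_BENEFICIAL_ORDER(get_order(SZ_2M));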

Regards,

Tvrtko

>> v1 thread:
>> https://lore.kernel.org/dri-devel/20250919131127.90932-1-tvrtko.ursulin@igalia.com/
>>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Dave Airlie <airlied@redhat.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Cc: Maxime Ripard <mripard@kernel.org>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Sui Jingfeng <suijingfeng@loongson.cn>
>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>> Cc: Zack Rusin <zack.rusin@broadcom.com>
>>
>> Tvrtko Ursulin (5):
>>    drm/ttm: Add getter for some pool properties
>>    drm/ttm: Replace multiple booleans with flags in pool init
>>    drm/ttm: Replace multiple booleans with flags in device init
>>    drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
>>    drm/amdgpu: Configure max beneficial TTM pool allocation order
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
>>   drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
>>   drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
>>   drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
>>   drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
>>   drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
>>   drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
>>   .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
>>   drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
>>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
>>   drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
>>   drivers/gpu/drm/ttm/ttm_pool.c                | 45 +++++++++++--------
>>   drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
>>   drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
>>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>   include/drm/ttm/ttm_device.h                  |  2 +-
>>   include/drm/ttm/ttm_pool.h                    | 13 +++---
>>   21 files changed, 125 insertions(+), 106 deletions(-)
>>   create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 13:50   ` Tvrtko Ursulin
@ 2025-10-08 14:02     ` Christian König
  2025-10-08 14:39       ` Thomas Hellström
  2025-10-08 14:34     ` Matthew Auld
  1 sibling, 1 reply; 24+ messages in thread
From: Christian König @ 2025-10-08 14:02 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx, Lucas De Marchi, dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Hellström, Thomas Zimmermann, Zack Rusin

On 08.10.25 15:50, Tvrtko Ursulin wrote:
> 
> On 08/10/2025 13:35, Christian König wrote:
>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>> Disclaimer:
>>> Please note that as this series includes a patch which touches a good number of
>>> drivers I will only copy everyone in the cover letter and the respective patch.
>>> Assumption is people are subscribed to dri-devel so can look at the whole series
>>> there. I know someone is bound to complain for both the case when everyone is
>>> copied on everything for getting too much email, and also for this other case.
>>> So please be flexible.
>>>
>>> Description:
>>>
>>> All drivers which use the TTM pool allocator end up requesting large order
>>> allocations when allocating large buffers. Those can be slow due to memory pressure
>>> and so add latency to buffer creation. But there is often also a size limit
>>> above which contiguous blocks do not bring any performance benefits. This series
>>> allows drivers to say when it is okay for the TTM to try a bit less hard.
>>>
>>> We do this by allowing drivers to specify this cut off point when creating the
>>> TTM device and pools. Allocations above this size will skip direct reclaim so
>>> under memory pressure worst case latency will improve. Background reclaim is
>>> still kicked off and both before and after the memory pressure all the TTM pool
>>> buckets remain to be used as they are today.
>>>
>>> This is especially interesting if someone has configured MAX_PAGE_ORDER to
>>> higher than the default. And even with the default, with amdgpu for example,
>>> the last patch in the series makes use of the new feature by telling TTM that
>>> above 2MiB we do not expect performance benefits. Which makes TTM not try direct
>>> reclaim for the top bucket (4MiB).
>>>
>>> End result is TTM drivers become a tiny bit nicer mm citizens and users benefit
>>> from better worst case buffer creation latencies. As a side benefit we get rid
>>> of two instances of those often very unreadable multiple nameless booleans
>>> function signatures.
>>>
>>> If this sounds interesting and gets merged the individual drivers can follow up
>>> with patches configuring their thresholds.
>>>
>>> v2:
>>>   * Christian suggested to pass in the new data by changing the function signatures.
>>>
>>> v3:
>>>   * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)
>>
>> Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
>>
>> The rest is Reviewed-by: Christian König <christian.koenig@amd.com>
> 
> Thank you!
> 
> So I think now I need acks to merge via drm-misc for all the drivers which have their own trees. Which seems to be just xe.

I think you should ping the XE guys for their opinion, but since there shouldn't be any functional change for them you can probably go ahead and merge the patches to drm-misc-next when there is no reply in time.

> Also interesting for other drivers is that when this lands folks can start passing in their "max size which leads to performance gains" via TTM_POOL_BENEFICIAL_ORDER and get the worst case allocation latency improvements.

Yeah, as said before if any other driver says they don't need this behavior we should certainly add something.

> I am thinking xe also maxes out at 2MiB pages, for others I don't know.

For AMDGPU it can actually be that this changes on future HW generations, so having it configurable is certainly the right approach.

Regards,
Christian.

> 
> Regards,
> 
> Tvrtko
> 
>>> v1 thread:
>>> https://lore.kernel.org/dri-devel/20250919131127.90932-1-tvrtko.ursulin@igalia.com/
>>>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: Dave Airlie <airlied@redhat.com>
>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Maxime Ripard <mripard@kernel.org>
>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>> Cc: Sui Jingfeng <suijingfeng@loongson.cn>
>>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>>> Cc: Zack Rusin <zack.rusin@broadcom.com>
>>>
>>> Tvrtko Ursulin (5):
>>>    drm/ttm: Add getter for some pool properties
>>>    drm/ttm: Replace multiple booleans with flags in pool init
>>>    drm/ttm: Replace multiple booleans with flags in device init
>>>    drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
>>>    drm/amdgpu: Configure max beneficial TTM pool allocation order
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
>>>   drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
>>>   drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
>>>   drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
>>>   drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
>>>   drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
>>>   drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
>>>   .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
>>>   drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
>>>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
>>>   drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
>>>   drivers/gpu/drm/ttm/ttm_pool.c                | 45 +++++++++++--------
>>>   drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
>>>   drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
>>>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
>>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>>   include/drm/ttm/ttm_device.h                  |  2 +-
>>>   include/drm/ttm/ttm_pool.h                    | 13 +++---
>>>   21 files changed, 125 insertions(+), 106 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h
>>>
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 13:50   ` Tvrtko Ursulin
  2025-10-08 14:02     ` Christian König
@ 2025-10-08 14:34     ` Matthew Auld
  2025-10-09  8:41       ` Tvrtko Ursulin
  1 sibling, 1 reply; 24+ messages in thread
From: Matthew Auld @ 2025-10-08 14:34 UTC (permalink / raw)
  To: Tvrtko Ursulin, Christian König, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Hellström, Thomas Zimmermann, Zack Rusin

On 08/10/2025 14:50, Tvrtko Ursulin wrote:
> 
> On 08/10/2025 13:35, Christian König wrote:
>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>> Disclaimer:
>>> Please note that as this series includes a patch which touches a good 
>>> number of
>>> drivers I will only copy everyone in the cover letter and the 
>>> respective patch.
>>> Assumption is people are subscribed to dri-devel so can look at the 
>>> whole series
>>> there. I know someone is bound to complain for both the case when 
>>> everyone is
>>> copied on everything for getting too much email, and also for this 
>>> other case.
>>> So please be flexible.
>>>
>>> Description:
>>>
>>> All drivers which use the TTM pool allocator end up requesting large 
>>> order
>>> allocations when allocating large buffers. Those can be slow due to
>>> memory pressure
>>> and so add latency to buffer creation. But there is often also a size 
>>> limit
>>> above which contiguous blocks do not bring any performance benefits. 
>>> This series
>>> allows drivers to say when it is okay for the TTM to try a bit less 
>>> hard.
>>>
>>> We do this by allowing drivers to specify this cut off point when 
>>> creating the
>>> TTM device and pools. Allocations above this size will skip direct 
>>> reclaim so
>>> under memory pressure worst case latency will improve. Background 
>>> reclaim is
>>> still kicked off and both before and after the memory pressure all 
>>> the TTM pool
>>> buckets remain to be used as they are today.
>>>
>>> This is especially interesting if someone has configured 
>>> MAX_PAGE_ORDER to
>>> higher than the default. And even with the default, with amdgpu for 
>>> example,
>>> the last patch in the series makes use of the new feature by telling 
>>> TTM that
>>> above 2MiB we do not expect performance benefits. Which makes TTM not 
>>> try direct
>>> reclaim for the top bucket (4MiB).
>>>
>>> End result is TTM drivers become a tiny bit nicer mm citizens and 
>>> users benefit
>>> from better worst case buffer creation latencies. As a side benefit 
>>> we get rid
>>> of two instances of those often very unreadable multiple nameless
>>> booleans
>>> function signatures.
>>>
>>> If this sounds interesting and gets merged the individual drivers can
>>> follow up
>>> with patches configuring their thresholds.
>>>
>>> v2:
>>>   * Christian suggested to pass in the new data by changing the 
>>> function signatures.
>>>
>>> v3:
>>>   * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)
>>
>> Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
>>
>> The rest is Reviewed-by: Christian König <christian.koenig@amd.com>
> 
> Thank you!
> 
> So I think now I need acks to merge via drm-misc for all the drivers 
> which have their own trees. Which seems to be just xe.
> 
> Also interesting for other drivers is that when this lands folks can 
> start passing in their "max size which leads to performance gains" via 
> TTM_POOL_BENEFICIAL_ORDER and get the worst case allocation latency 
> improvements.
> 
> I am thinking xe also maxes out at 2MiB pages, for others I don't know.

Yeah, next level up from 2M GTT page is still 1G GTT page. I think we 
especially need 64K/2M system memory pages on igpu to get some perf back 
when enabling iommu on some platforms IIRC. Not aware of really needing >2M 
so sounds like we might also benefit by maxing out at 2M, if it reduces 
allocation latency in some cases.

> 
> Regards,
> 
> Tvrtko
> 
>>> v1 thread:
>>> https://lore.kernel.org/dri-devel/20250919131127.90932-1- 
>>> tvrtko.ursulin@igalia.com/
>>>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: Dave Airlie <airlied@redhat.com>
>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Maxime Ripard <mripard@kernel.org>
>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>> Cc: Sui Jingfeng <suijingfeng@loongson.cn>
>>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>>> Cc: Zack Rusin <zack.rusin@broadcom.com>
>>>
>>> Tvrtko Ursulin (5):
>>>    drm/ttm: Add getter for some pool properties
>>>    drm/ttm: Replace multiple booleans with flags in pool init
>>>    drm/ttm: Replace multiple booleans with flags in device init
>>>    drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
>>>    drm/amdgpu: Configure max beneficial TTM pool allocation order
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
>>>   drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
>>>   drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
>>>   drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
>>>   drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
>>>   drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
>>>   drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
>>>   .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
>>>   drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
>>>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
>>>   drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
>>>   drivers/gpu/drm/ttm/ttm_pool.c                | 45 +++++++++++--------
>>>   drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
>>>   drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
>>>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
>>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>>   include/drm/ttm/ttm_device.h                  |  2 +-
>>>   include/drm/ttm/ttm_pool.h                    | 13 +++---
>>>   21 files changed, 125 insertions(+), 106 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h
>>>
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 14:02     ` Christian König
@ 2025-10-08 14:39       ` Thomas Hellström
  2025-10-09  8:53         ` Tvrtko Ursulin
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Hellström @ 2025-10-08 14:39 UTC (permalink / raw)
  To: Christian König, Tvrtko Ursulin, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin

On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
> On 08.10.25 15:50, Tvrtko Ursulin wrote:
> > 
> > On 08/10/2025 13:35, Christian König wrote:
> > > On 08.10.25 13:53, Tvrtko Ursulin wrote:
> > > > Disclaimer:
> > > > Please note that as this series includes a patch which touches
> > > > a good number of
> > > > drivers I will only copy everyone in the cover letter and the
> > > > respective patch.
> > > > Assumption is people are subscribed to dri-devel so can look at
> > > > the whole series
> > > > there. I know someone is bound to complain for both the case
> > > > when everyone is
> > > > copied on everything for getting too much email, and also for
> > > > this other case.
> > > > So please be flexible.
> > > > 
> > > > Description:
> > > > 
> > > > All drivers which use the TTM pool allocator end up requesting
> > > > large order
> > > > allocations when allocating large buffers. Those can be slow
> > > > due to memory pressure
> > > > and so add latency to buffer creation. But there is often also
> > > > a size limit
> > > > above which contiguous blocks do not bring any performance
> > > > benefits. This series
> > > > allows drivers to say when it is okay for the TTM to try a bit
> > > > less hard.
> > > > 
> > > > We do this by allowing drivers to specify this cut off point
> > > > when creating the
> > > > TTM device and pools. Allocations above this size will skip
> > > > direct reclaim so
> > > > under memory pressure worst case latency will improve.
> > > > Background reclaim is
> > > > still kicked off and both before and after the memory pressure
> > > > all the TTM pool
> > > > buckets remain to be used as they are today.
> > > > 
> > > > This is especially interesting if someone has configured
> > > > MAX_PAGE_ORDER to
> > > > higher than the default. And even with the default, with amdgpu
> > > > for example,
> > > > the last patch in the series makes use of the new feature by
> > > > telling TTM that
> > > > above 2MiB we do not expect performance benefits. Which makes
> > > > TTM not try direct
> > > > reclaim for the top bucket (4MiB).
> > > > 
> > > > End result is TTM drivers become a tiny bit nicer mm citizens
> > > > and users benefit
> > > > from better worst case buffer creation latencies. As a side
> > > > benefit we get rid
> > > > of two instances of those often very unreadable multiple
> > > > nameless booleans
> > > > function signatures.
> > > > 
> > > > If this sounds interesting and gets merged the individual drivers
> > > > can follow up
> > > > with patches configuring their thresholds.
> > > > 
> > > > v2:
> > > >   * Christian suggested to pass in the new data by changing the
> > > > function signatures.
> > > > 
> > > > v3:
> > > >   * Moved ttm pool helpers into new ttm_pool_internal.h.
> > > > (Christian)
> > > 
> > > Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
> > > 
> > > The rest is Reviewed-by: Christian König
> > > <christian.koenig@amd.com>
> > 
> > Thank you!
> > 
> > So I think now I need acks to merge via drm-misc for all the
> > drivers which have their own trees. Which seems to be just xe.
> 
> I think you should ping the XE guys for their opinion, but since
> there shouldn't be any functional change for them you can probably go
> ahead and merge the patches to drm-misc-next when there is no reply
> in time.

I will try to do a review tonight. One thing that comes up though, is
the change to ttm_device_init() where you add pool_flags. I had another
patch series a number of months ago that added a struct with flags
there instead to select the return value given when OOM. Now that we're
adding an argument, should we try to use a struct instead so that we
can use it for more than pool behavior?
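
Something like this, purely as an illustration (the names here are made up, 
not taken from either series):

	struct ttm_device_init_args {
		unsigned int pool_flags;	/* the TTM_POOL_* flags from this series */
		bool oom_returns_enospc;	/* OOM return value selection */
	};

	int ttm_device_init(struct ttm_device *bdev,
			    const struct ttm_device_funcs *funcs,
			    struct device *dev, struct address_space *mapping,
			    struct drm_vma_offset_manager *vma_manager,
			    const struct ttm_device_init_args *args);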


I'll be able to find a pointer to that series later today.

/Thomas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order
  2025-10-08 11:53 ` [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order Tvrtko Ursulin
@ 2025-10-08 23:18   ` Matthew Brost
  2025-10-09  8:58     ` Tvrtko Ursulin
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2025-10-08 23:18 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: amd-gfx, dri-devel, kernel-dev, Alex Deucher,
	Christian König, Thadeu Lima de Souza Cascardo,
	thomas.hellstrom

On Wed, Oct 08, 2025 at 12:53:14PM +0100, Tvrtko Ursulin wrote:
> Let the TTM pool allocator know that we can afford for it to expend less
> effort for satisfying contiguous allocations larger than 2MiB. The latter
> is the maximum relevant PTE entry size and the driver and hardware are
> happy to get larger blocks only opportunistically.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>

+Thomas - Seems like we'd want to do this in Xe too?

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 948c6d0a422b..723b885210a7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>  	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>  		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>  			      adev->gmc.mem_partitions[i].numa.node,
> -			      0);
> +			      TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));

SZ_2M btw.

Matt

>  	}
>  	return 0;
>  }
> @@ -1931,7 +1931,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>  			       adev_to_drm(adev)->anon_inode->i_mapping,
>  			       adev_to_drm(adev)->vma_offset_manager,
>  			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
> -			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
> +			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0) |
> +			       TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
>  	if (r) {
>  		dev_err(adev->dev,
>  			"failed initializing buffer object driver(%d).\n", r);
> -- 
> 2.48.0
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 14:34     ` Matthew Auld
@ 2025-10-09  8:41       ` Tvrtko Ursulin
  0 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-09  8:41 UTC (permalink / raw)
  To: Matthew Auld, Christian König, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Hellström, Thomas Zimmermann, Zack Rusin


On 08/10/2025 15:34, Matthew Auld wrote:
> On 08/10/2025 14:50, Tvrtko Ursulin wrote:
>>
>> On 08/10/2025 13:35, Christian König wrote:
>>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>>> Disclaimer:
>>>> Please note that as this series includes a patch which touches a 
>>>> good number of
>>>> drivers I will only copy everyone in the cover letter and the 
>>>> respective patch.
>>>> Assumption is people are subscribed to dri-devel so can look at the 
>>>> whole series
>>>> there. I know someone is bound to complain for both the case when 
>>>> everyone is
>>>> copied on everything for getting too much email, and also for this 
>>>> other case.
>>>> So please be flexible.
>>>>
>>>> Description:
>>>>
>>>> All drivers which use the TTM pool allocator end up requesting large 
>>>> order
>>>> allocations when allocating large buffers. Those can be slow due to
>>>> memory pressure
>>>> and so add latency to buffer creation. But there is often also a 
>>>> size limit
>>>> above which contiguous blocks do not bring any performance benefits. 
>>>> This series
>>>> allows drivers to say when it is okay for the TTM to try a bit less 
>>>> hard.
>>>>
>>>> We do this by allowing drivers to specify this cut off point when 
>>>> creating the
>>>> TTM device and pools. Allocations above this size will skip direct 
>>>> reclaim so
>>>> under memory pressure worst case latency will improve. Background 
>>>> reclaim is
>>>> still kicked off and both before and after the memory pressure all 
>>>> the TTM pool
>>>> buckets remain to be used as they are today.
>>>>
>>>> This is especially interesting if someone has configured 
>>>> MAX_PAGE_ORDER to
>>>> higher than the default. And even with the default, with amdgpu for 
>>>> example,
>>>> the last patch in the series makes use of the new feature by telling 
>>>> TTM that
>>>> above 2MiB we do not expect performance benefits. Which makes TTM 
>>>> not try direct
>>>> reclaim for the top bucket (4MiB).
>>>>
>>>> End result is TTM drivers become a tiny bit nicer mm citizens and 
>>>> users benefit
>>>> from better worst case buffer creation latencies. As a side benefit 
>>>> we get rid
>>>> of two instances of those often very unreadable multiple nameless
>>>> booleans
>>>> function signatures.
>>>>
>>>> If this sounds interesting and gets merged the individual drivers can
>>>> follow up
>>>> with patches configuring their thresholds.
>>>>
>>>> v2:
>>>>   * Christian suggested to pass in the new data by changing the 
>>>> function signatures.
>>>>
>>>> v3:
>>>>   * Moved ttm pool helpers into new ttm_pool_internal.h. (Christian)
>>>
>>> Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
>>>
>>> The rest is Reviewed-by: Christian König <christian.koenig@amd.com>
>>
>> Thank you!
>>
>> So I think now I need acks to merge via drm-misc for all the drivers 
>> which have their own trees. Which seems to be just xe.
>>
>> Also interesting for other drivers is that when this lands folks can 
>> start passing in their "max size which leads to performance gains" via 
>> TTM_POOL_BENEFICIAL_ORDER and get the worst case allocation latency 
>> improvements.
>>
>> I am thinking xe also maxes out at 2MiB pages, for others I don't know.
> 
> Yeah, next level up from 2M GTT page is still 1G GTT page. I think we 
> especially need 64K/2M system memory pages on igpu to get some perf back 
> when enabling iommu on some platforms IIRC. Not aware of really needing >2M 
> so sounds like we might also benefit by maxing out at 2M, if it 
> reduces allocation latency in some cases.

To clarify a bit, the current semantics of the series are not to max out 
at the order specified by the driver as the "max beneficial", but to 
just skip doing direct reclaim above it. Otherwise the TTM pool 
allocator keeps the current behaviour of trying from MAX_PAGE_ORDER and 
down. As that is 4MiB by default on x86 (and configurable on some 
platforms via Kconfig), the idea is not to pay the latency cost of 
direct reclaim for sizes which bring no additional performance benefit 
to the GPU. And since it still kicks off background reclaim, both past 
and future allocations can still get the larger order blocks.

If in the future we want to actually max out at the driver specified 
size that could be discussed. But for now I wanted to have the smallest 
change in behaviour possible.
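
In pseudo code the heart of it is no more than this (illustrative only, the 
helper name is made up and it is not the literal patch):

	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;

	if (order > ttm_pool_beneficial_order(pool))
		/* Skip direct reclaim but still wake up kswapd. */
		gfp &= ~__GFP_DIRECT_RECLAIM;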

Regards,

Tvrtko

>>>> v1 thread:
>>>> https://lore.kernel.org/dri-devel/20250919131127.90932-1- 
>>>> tvrtko.ursulin@igalia.com/
>>>>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: Dave Airlie <airlied@redhat.com>
>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> Cc: Maxime Ripard <mripard@kernel.org>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>> Cc: Sui Jingfeng <suijingfeng@loongson.cn>
>>>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>>>> Cc: Zack Rusin <zack.rusin@broadcom.com>
>>>>
>>>> Tvrtko Ursulin (5):
>>>>    drm/ttm: Add getter for some pool properties
>>>>    drm/ttm: Replace multiple booleans with flags in pool init
>>>>    drm/ttm: Replace multiple booleans with flags in device init
>>>>    drm/ttm: Allow drivers to specify maximum beneficial TTM pool size
>>>>    drm/amdgpu: Configure max beneficial TTM pool allocation order
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  7 +--
>>>>   drivers/gpu/drm/drm_gem_vram_helper.c         |  2 +-
>>>>   drivers/gpu/drm/i915/intel_region_ttm.c       |  2 +-
>>>>   drivers/gpu/drm/loongson/lsdc_ttm.c           |  2 +-
>>>>   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  4 +-
>>>>   drivers/gpu/drm/qxl/qxl_ttm.c                 |  2 +-
>>>>   drivers/gpu/drm/radeon/radeon_ttm.c           |  4 +-
>>>>   drivers/gpu/drm/ttm/tests/ttm_bo_test.c       | 16 +++----
>>>>   .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |  2 +-
>>>>   drivers/gpu/drm/ttm/tests/ttm_device_test.c   | 31 +++++--------
>>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 22 ++++-----
>>>>   drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.h |  7 +--
>>>>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c     | 23 +++++-----
>>>>   drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
>>>>   drivers/gpu/drm/ttm/ttm_pool.c                | 45 ++++++++++ 
>>>> +--------
>>>>   drivers/gpu/drm/ttm/ttm_pool_internal.h       | 24 ++++++++++
>>>>   drivers/gpu/drm/ttm/ttm_tt.c                  | 10 +++--
>>>>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  4 +-
>>>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>>>   include/drm/ttm/ttm_device.h                  |  2 +-
>>>>   include/drm/ttm/ttm_pool.h                    | 13 +++---
>>>>   21 files changed, 125 insertions(+), 106 deletions(-)
>>>>   create mode 100644 drivers/gpu/drm/ttm/ttm_pool_internal.h
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-08 14:39       ` Thomas Hellström
@ 2025-10-09  8:53         ` Tvrtko Ursulin
  2025-10-10 14:11           ` Thomas Hellström
  0 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-09  8:53 UTC (permalink / raw)
  To: Thomas Hellström, Christian König, amd-gfx,
	Lucas De Marchi, dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin


On 08/10/2025 15:39, Thomas Hellström wrote:
> On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
>> On 08.10.25 15:50, Tvrtko Ursulin wrote:
>>>
>>> On 08/10/2025 13:35, Christian König wrote:
>>>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>>>> Disclaimer:
>>>>> Please note that as this series includes a patch which touches
>>>>> a good number of
>>>>> drivers I will only copy everyone in the cover letter and the
>>>>> respective patch.
>>>>> Assumption is people are subscribed to dri-devel so can look at
>>>>> the whole series
>>>>> there. I know someone is bound to complain for both the case
>>>>> when everyone is
>>>>> copied on everything for getting too much email, and also for
>>>>> this other case.
>>>>> So please be flexible.
>>>>>
>>>>> Description:
>>>>>
>>>>> All drivers which use the TTM pool allocator end up requesting
>>>>> large order
>>>>> allocations when allocating large buffers. Those can be slow
>>>>> due to memory pressure
>>>>> and so add latency to buffer creation. But there is often also
>>>>> a size limit
>>>>> above which contiguous blocks do not bring any performance
>>>>> benefits. This series
>>>>> allows drivers to say when it is okay for the TTM to try a bit
>>>>> less hard.
>>>>>
>>>>> We do this by allowing drivers to specify this cut off point
>>>>> when creating the
>>>>> TTM device and pools. Allocations above this size will skip
>>>>> direct reclaim so
>>>>> under memory pressure worst case latency will improve.
>>>>> Background reclaim is
>>>>> still kicked off and both before and after the memory pressure
>>>>> all the TTM pool
>>>>> buckets remain to be used as they are today.
>>>>>
>>>>> This is especially interesting if someone has configured
>>>>> MAX_PAGE_ORDER to
>>>>> higher than the default. And even with the default, with amdgpu
>>>>> for example,
>>>>> the last patch in the series makes use of the new feature by
>>>>> telling TTM that
>>>>> above 2MiB we do not expect performance benefits. Which makes
>>>>> TTM not try direct
>>>>> reclaim for the top bucket (4MiB).
>>>>>
>>>>> End result is TTM drivers become a tiny bit nicer mm citizens
>>>>> and users benefit
>>>>> from better worst case buffer creation latencies. As a side
>>>>> benefit we get rid
>>>>> of two instances of those often very unreadable multiple
>>>>> nameless booleans
>>>>> function signatures.
>>>>>
>>>>> If this sounds interesting and gets merged the individual drivers
>>>>> can follow up
>>>>> with patches configuring their thresholds.
>>>>>
>>>>> v2:
>>>>>    * Christian suggested to pass in the new data by changing the
>>>>> function signatures.
>>>>>
>>>>> v3:
>>>>>    * Moved ttm pool helpers into new ttm_pool_internal.h.
>>>>> (Christian)
>>>>
>>>> Patch #3 is Acked-by: Christian König <christian.koenig@amd.com>.
>>>>
>>>> The rest is Reviewed-by: Christian König
>>>> <christian.koenig@amd.com>
>>>
>>> Thank you!
>>>
>>> So I think now I need acks to merge via drm-misc for all the
>>> drivers which have their own trees. Which seems to be just xe.
>>
>> I think you should ping the XE guys for their opinion, but since
>> there shouldn't be any functional change for them you can probably go
>> ahead and merge the patches to drm-misc-next when there is no reply
>> in time.
> 
> I will try to do a review tonight. One thing that comes up though, is
> the change to ttm_device_init() where you add pool_flags. I had another
> patch series a number of months ago that added a struct with flags
> there instead to select the return value given when OOM. Now that we're
> adding an argument, should we try to use a struct instead so that we
> can use it for more than pool behavior?
> 
> 
> I'll be able to find a pointer to that series later today.

Found it: 
https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/

Glad to see in that thread it isn't just me permanently slowed down by 
"false, false" and similar. :)

I considered using a struct too and I guess there wasn't too much of a 
sway that I went with flags. I thought not to overcomplicate with the on 
stack struct which is mostly not needed for something so low level, and 
to stick with the old school C visual patterns.

Since you only needed a single boolean in your series I suppose you 
could just follow up on my series if you find it acceptable. Or I can go 
with yours, no problem either.

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order
  2025-10-08 23:18   ` Matthew Brost
@ 2025-10-09  8:58     ` Tvrtko Ursulin
  2025-10-10 14:14       ` Thomas Hellström
  0 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-09  8:58 UTC (permalink / raw)
  To: Matthew Brost
  Cc: amd-gfx, dri-devel, kernel-dev, Alex Deucher,
	Christian König, Thadeu Lima de Souza Cascardo,
	thomas.hellstrom


On 09/10/2025 00:18, Matthew Brost wrote:
> On Wed, Oct 08, 2025 at 12:53:14PM +0100, Tvrtko Ursulin wrote:
>> Let the TTM pool allocator know that we can afford for it to expend less
>> effort for satisfying contiguous allocations larger than 2MiB. The latter
>> is the maximum relevant PTE entry size and the driver and hardware are
>> happy to get larger blocks only opportunistically.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> 
> +Thomas - Seems like we'd want to do this in Xe too?
> 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 948c6d0a422b..723b885210a7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>>   	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>   		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>   			      adev->gmc.mem_partitions[i].numa.node,
>> -			      0);
>> +			      TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
> 
> SZ_2M btw.

I thought I grepped exactly to see if that existed but apparently I did 
not, thanks!
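
I.e. the hunks would then read TTM_POOL_BENEFICIAL_ORDER(get_order(SZ_2M)), 
which is equivalent since SZ_2M == 2 * SZ_1M.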

Regards,

Tvrtko

> 
> Matt
> 
>>   	}
>>   	return 0;
>>   }
>> @@ -1931,7 +1931,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>   			       adev_to_drm(adev)->anon_inode->i_mapping,
>>   			       adev_to_drm(adev)->vma_offset_manager,
>>   			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
>> -			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
>> +			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0) |
>> +			       TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
>>   	if (r) {
>>   		dev_err(adev->dev,
>>   			"failed initializing buffer object driver(%d).\n", r);
>> -- 
>> 2.48.0
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-09  8:53         ` Tvrtko Ursulin
@ 2025-10-10 14:11           ` Thomas Hellström
  2025-10-10 14:18             ` Thomas Hellström
  2025-10-11  8:00             ` Tvrtko Ursulin
  0 siblings, 2 replies; 24+ messages in thread
From: Thomas Hellström @ 2025-10-10 14:11 UTC (permalink / raw)
  To: Tvrtko Ursulin, Christian König, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin

On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
> 
> On 08/10/2025 15:39, Thomas Hellström wrote:
> > On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
> > > On 08.10.25 15:50, Tvrtko Ursulin wrote:
> > > > 
> > > > On 08/10/2025 13:35, Christian König wrote:
> > > > > On 08.10.25 13:53, Tvrtko Ursulin wrote:
> > > > > > Disclaimer:
> > > > > > Please note that as this series includes a patch which
> > > > > > touches
> > > > > > a good number of
> > > > > > drivers I will only copy everyone in the cover letter and
> > > > > > the
> > > > > > respective patch.
> > > > > > Assumption is people are subscribed to dri-devel so can
> > > > > > look at
> > > > > > the whole series
> > > > > > there. I know someone is bound to complain for both the
> > > > > > case
> > > > > > when everyone is
> > > > > > copied on everything for getting too much email, and also
> > > > > > for
> > > > > > this other case.
> > > > > > So please be flexible.
> > > > > > 
> > > > > > Description:
> > > > > > 
> > > > > > All drivers which use the TTM pool allocator end up
> > > > > > requesting
> > > > > > large order
> > > > > > allocations when allocating large buffers. Those can be
> > > > > > slow
> > > > > > due to memory pressure
> > > > > > and so add latency to buffer creation. But there is often
> > > > > > also
> > > > > > a size limit
> > > > > > above which contiguous blocks do not bring any performance
> > > > > > benefits. This series
> > > > > > allows drivers to say when it is okay for the TTM to try a
> > > > > > bit
> > > > > > less hard.
> > > > > > 
> > > > > > We do this by allowing drivers to specify this cut off
> > > > > > point
> > > > > > when creating the
> > > > > > TTM device and pools. Allocations above this size will skip
> > > > > > direct reclaim so
> > > > > > under memory pressure worst case latency will improve.
> > > > > > Background reclaim is
> > > > > > still kicked off and both before and after the memory
> > > > > > pressure
> > > > > > all the TTM pool
> > > > > > buckets remain to be used as they are today.
> > > > > > 
> > > > > > This is especially interesting if someone has configured
> > > > > > MAX_PAGE_ORDER to
> > > > > > higher than the default. And even with the default, with
> > > > > > amdgpu
> > > > > > for example,
> > > > > > the last patch in the series makes use of the new feature
> > > > > > by
> > > > > > telling TTM that
> > > > > > above 2MiB we do not expect performance benefits. Which
> > > > > > makes
> > > > > > TTM not try direct
> > > > > > reclaim for the top bucket (4MiB).
> > > > > > 
> > > > > > End result is TTM drivers become a tiny bit nicer mm
> > > > > > citizens
> > > > > > and users benefit
> > > > > > from better worst case buffer creation latencies. As a side
> > > > > > benefit we get rid
> > > > > > of two instances of those often very unreadable multiple
> > > > > > nameless booleans
> > > > > > function signatures.
> > > > > > 
> > > > > > If this sounds interesting and gets merged the individual
> > > > > > drivers
> > > > > > can follow up
> > > > > > with patches configuring their thresholds.
> > > > > > 
> > > > > > v2:
> > > > > >    * Christian suggested to pass in the new data by
> > > > > > changing the
> > > > > > function signatures.
> > > > > > 
> > > > > > v3:
> > > > > >    * Moved ttm pool helpers into new ttm_pool_internal.h.
> > > > > > (Christian)
> > > > > 
> > > > > Patch #3 is Acked-by: Christian König
> > > > > <christian.koenig@amd.com>.
> > > > > 
> > > > > The rest is Reviewed-by: Christian König
> > > > > <christian.koenig@amd.com>
> > > > 
> > > > Thank you!
> > > > 
> > > > So I think now I need acks to merge via drm-misc for all the
> > > > drivers which have their own trees. Which seems to be just xe.
> > > 
> > > I think you should ping the XE guys for their opinion, but since
> > > there shouldn't be any functional change for them you can
> > > probably go
> > > ahead and merge the patches to drm-misc-next when there is no
> > > reply
> > > in time.
> > 
> > I will try to do a review tonight. One thing that comes up though,
> > is
> > the change to ttm_device_init() where you add pool_flags. I had
> > another
> > patch series a number of months ago that added a struct with flags
> > there instead to select the return value given when OOM. Now that
> > we're
> > adding an argument, should we try to use a struct instead so that
> > we
> > can use it for more than pool behavior?
> > 
> > 
> > I'll be able to find a pointer to that series later today.
> 
> Found it: 
> https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
> 
> Glad to see in that thread it isn't just me permanently slowed down
> by 
> "false, false" and similar. :)
> 
> I considered using a struct too and I guess there wasn't too much of
> a 
> sway that I went with flags. I thought not to overcomplicate with the
> on 
> stack struct which is mostly not needed for something so low level,
> and 
> to stick with the old school C visual patterns.
> 
> Since you only needed a single boolean in your series I suppose you 
> could just follow up on my series if you find it acceptable. Or I can
> go 
> with yours, no problem either.

It seems yours has the most momentum ATM. I can follow up on yours. It
would be great if we could perhaps change the naming of "pool_flags" to
something more generic.

Thanks,
Thomas


> 
> Regards,
> 
> Tvrtko
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order
  2025-10-09  8:58     ` Tvrtko Ursulin
@ 2025-10-10 14:14       ` Thomas Hellström
  2025-10-13  7:03         ` Christian König
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Hellström @ 2025-10-10 14:14 UTC (permalink / raw)
  To: Tvrtko Ursulin, Matthew Brost
  Cc: amd-gfx, dri-devel, kernel-dev, Alex Deucher,
	Christian König, Thadeu Lima de Souza Cascardo

On Thu, 2025-10-09 at 09:58 +0100, Tvrtko Ursulin wrote:
> 
> On 09/10/2025 00:18, Matthew Brost wrote:
> > On Wed, Oct 08, 2025 at 12:53:14PM +0100, Tvrtko Ursulin wrote:
> > > Let the TTM pool allocator know that we can afford for it to
> > > expend less
> > > effort for satisfying contiguous allocations larger than 2MiB.
> > > The latter
> > > is the maximum relevant PTE entry size and the driver and
> > > hardware are
> > > happy to get larger blocks only opportunistically.
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> > 
> > +Thomas - Seems like we'd want to do this in Xe too?

Yeah, indeed.

While IIRC we can actually set up 1GiB PTEs, I'm not aware of any huge
benefits from using that compared to 2MiB ones. We certainly want the
allocator to try pretty hard for 2MiB ones, though.
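
So a follow up for xe could be as small as something like this (untested 
sketch against this series; the exact xe_device.c call site details are from 
memory, so take them with a grain of salt):

	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
			      xe->drm.anon_inode->i_mapping,
			      xe->drm.vma_offset_manager,
			      /* existing xe flags, if any, OR-ed with: */
			      TTM_POOL_BENEFICIAL_ORDER(get_order(SZ_2M)));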

Thanks,
Thomas


> > 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++--
> > >   1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > > index 948c6d0a422b..723b885210a7 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > > @@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
> > >   	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
> > >   		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
> > >   			      adev->gmc.mem_partitions[i].numa.node,
> > > -			      0);
> > > +			      TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
> > 
> > SZ_2M btw.
> 
> I thought I grepped exactly to see if that existed but apparently I
> did 
> not, thanks!
> 
> Regards,
> 
> Tvrtko
> 
> > 
> > Matt
> > 
> > >   	}
> > >   	return 0;
> > >   }
> > > @@ -1931,7 +1931,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
> > >   			       adev_to_drm(adev)->anon_inode->i_mapping,
> > >   			       adev_to_drm(adev)->vma_offset_manager,
> > >   			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
> > > -			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
> > > +			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0) |
> > > +			       TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
> > >   	if (r) {
> > >   		dev_err(adev->dev,
> > >   			"failed initializing buffer object driver(%d).\n", r);
> > > -- 
> > > 2.48.0
> > > 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-10 14:11           ` Thomas Hellström
@ 2025-10-10 14:18             ` Thomas Hellström
  2025-10-11  8:00             ` Tvrtko Ursulin
  1 sibling, 0 replies; 24+ messages in thread
From: Thomas Hellström @ 2025-10-10 14:18 UTC (permalink / raw)
  To: Tvrtko Ursulin, Christian König, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin

On Fri, 2025-10-10 at 16:11 +0200, Thomas Hellström wrote:
> On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
> > 
> > On 08/10/2025 15:39, Thomas Hellström wrote:
> > > On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
> > > > On 08.10.25 15:50, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 08/10/2025 13:35, Christian König wrote:
> > > > > > On 08.10.25 13:53, Tvrtko Ursulin wrote:
> > > > > > > Disclaimer:
> > > > > > > Please note that as this series includes a patch which
> > > > > > > touches
> > > > > > > a good number of
> > > > > > > drivers I will only copy everyone in the cover letter and
> > > > > > > the
> > > > > > > respective patch.
> > > > > > > Assumption is people are subscribed to dri-devel so can
> > > > > > > look at
> > > > > > > the whole series
> > > > > > > there. I know someone is bound to complain for both the
> > > > > > > case
> > > > > > > when everyone is
> > > > > > > copied on everything for getting too much email, and also
> > > > > > > for
> > > > > > > this other case.
> > > > > > > So please be flexible.
> > > > > > > 
> > > > > > > Description:
> > > > > > > 
> > > > > > > All drivers which use the TTM pool allocator end up
> > > > > > > requesting
> > > > > > > large order
> > > > > > > allocations when allocating large buffers. Those can be
> > > > > > > slow
> > > > > > > due to memory pressure
> > > > > > > and so add latency to buffer creation. But there is often
> > > > > > > also
> > > > > > > a size limit
> > > > > > > above which contiguous blocks do not bring any
> > > > > > > performance
> > > > > > > benefits. This series
> > > > > > > allows drivers to say when it is okay for the TTM to try
> > > > > > > a
> > > > > > > bit
> > > > > > > less hard.
> > > > > > > 
> > > > > > > We do this by allowing drivers to specify this cut off
> > > > > > > point
> > > > > > > when creating the
> > > > > > > TTM device and pools. Allocations above this size will
> > > > > > > skip
> > > > > > > direct reclaim so
> > > > > > > under memory pressure worst case latency will improve.
> > > > > > > Background reclaim is
> > > > > > > still kicked off and both before and after the memory
> > > > > > > pressure
> > > > > > > all the TTM pool
> > > > > > > buckets remain to be used as they are today.
> > > > > > > 
> > > > > > > This is especially interesting if someone has configured
> > > > > > > MAX_PAGE_ORDER to
> > > > > > > higher than the default. And even with the default, with
> > > > > > > amdgpu
> > > > > > > for example,
> > > > > > > the last patch in the series makes use of the new feature
> > > > > > > by
> > > > > > > telling TTM that
> > > > > > > above 2MiB we do not expect performance benefits. Which
> > > > > > > makes
> > > > > > > TTM not try direct
> > > > > > > reclaim for the top bucket (4MiB).
> > > > > > > 
> > > > > > > End result is TTM drivers become a tiny bit nicer mm
> > > > > > > citizens
> > > > > > > and users benefit
> > > > > > > from better worst case buffer creation latencies. As a
> > > > > > > side
> > > > > > > benefit we get rid
> > > > > > > of two instances of those often very unreadable multiple
> > > > > > > nameless booleans
> > > > > > > function signatures.
> > > > > > > 
> > > > > > > If this sounds interesting and gets merged the individual
> > > > > > > drivers
> > > > > > > can follow up
> > > > > > > with patches configuring their thresholds.
> > > > > > > 
> > > > > > > v2:
> > > > > > >    * Christian suggested to pass in the new data by
> > > > > > > changing the
> > > > > > > function signatures.
> > > > > > > 
> > > > > > > v3:
> > > > > > >    * Moved ttm pool helpers into new ttm_pool_internal.h.
> > > > > > > (Christian)
> > > > > > 
> > > > > > Patch #3 is Acked-by: Christian König
> > > > > > <christian.koenig@amd.com>.
> > > > > > 
> > > > > > The rest is Reviewed-by: Christian König
> > > > > > <christian.koenig@amd.com>
> > > > > 
> > > > > Thank you!
> > > > > 
> > > > > So I think now I need acks to merge via drm-misc for all the
> > > > > drivers which have their own trees. Which seems to be just
> > > > > xe.
> > > > 
> > > > I think you should ping the XE guys for their opinion, but
> > > > since
> > > > there shouldn't be any functional change for them you can
> > > > probably go
> > > > ahead and merge the patches to drm-misc-next when there is no
> > > > reply
> > > > in time.
> > > 
> > > I will try to do a review tonight. One thing that comes up
> > > though,
> > > is
> > > the change to ttm_device_init() where you add pool_flags. I had
> > > another
> > > patch series a number of months ago that added a struct with
> > > flags
> > > there instead to select the return value given when OOM. Now that
> > > we're
> > > adding an argument, should we try to use a struct instead so that
> > > we
> > > can use it for more that pool behavior?
> > > 
> > > 
> > > I'll be able to find a pointer to that series later today.
> > 
> > Found it: 
> > https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
> > 
> > Glad to see in that thread it isn't just me permanently slowed down
> > by 
> > "false, false" and similar. :)
> > 
> > I considered using a struct too and I guess there wasn't too much
> > of
> > a 
> > sway that I went with flags. I thought not to overcomplicate with
> > the
> > on 
> > stack struct which is mostly not needed for something so low level,
> > and 
> > to stick with the old school C visual patterns.
> > 
> > Since you only needed a single boolean in your series I suppose you
> > could just follow up on my series if you find it acceptable. Or I
> > can
> > go 
> > with yours, no problem either.
> 
> It seems yours has the most momentum ATM. I can follow up on yours.
> It
> would be great if we could perhaps change the naming of "pool_flags"
> to
> something more generic.

Oh, BTW we started using that 'const struct' flags argument style in xe
and in the TTM shrinker helper (ttm_bo_shrink()) as well, so there is a
precedent. I'll let you decide :)
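
For illustration, roughly what that style looks like; the names below
are made up for the example and are not the actual xe or
ttm_bo_shrink() definitions:

	#include <linux/types.h>

	/* Hypothetical sketch only. */
	struct shrink_flags {
		bool purge;
		bool writeback;
	};

	static long shrink_example(const struct shrink_flags flags)
	{
		/*
		 * Call sites name the options, e.g.
		 *   shrink_example((struct shrink_flags){ .writeback = true });
		 * instead of the ambiguous shrink_example(false, true).
		 */
		if (flags.purge)
			return 0;
		return flags.writeback ? 1 : 2;
	}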

Thanks,
Thomas


> 
> Thanks,
> Thomas
> 
> 
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init
  2025-10-08 11:53 ` [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init Tvrtko Ursulin
@ 2025-10-10 15:10   ` kernel test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2025-10-10 15:10 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx, dri-devel
  Cc: oe-kbuild-all, kernel-dev, Tvrtko Ursulin, Alex Deucher,
	Christian König, Thomas Hellström

Hi Tvrtko,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm/drm-next]
[also build test ERROR on drm-i915/for-linux-next drm-i915/for-linux-next-fixes drm-xe/drm-xe-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-misc/drm-misc-next drm-tip/drm-tip linus/master v6.17 next-20251009]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tvrtko-Ursulin/drm-ttm-Add-getter-for-some-pool-properties/20251010-052711
base:   git://anongit.freedesktop.org/drm/drm drm-next
patch link:    https://lore.kernel.org/r/20251008115314.55438-3-tvrtko.ursulin%40igalia.com
patch subject: [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init
config: sh-randconfig-002-20251010 (https://download.01.org/0day-ci/archive/20251010/202510102220.inEYOJoK-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251010/202510102220.inEYOJoK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510102220.inEYOJoK-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/ttm/tests/ttm_pool_test.c: In function 'ttm_pool_alloc_basic':
>> drivers/gpu/drm/ttm/tests/ttm_pool_test.c:167:21: error: implicit declaration of function 'ttm_pool_uses_dma_alloc'; did you mean 'ttm_pool_restore_and_alloc'? [-Wimplicit-function-declaration]
     167 |                 if (ttm_pool_uses_dma_alloc(pool)) {
         |                     ^~~~~~~~~~~~~~~~~~~~~~~
         |                     ttm_pool_restore_and_alloc
--
   drivers/gpu/drm/ttm/tests/ttm_device_test.c: In function 'ttm_device_init_pools':
>> drivers/gpu/drm/ttm/tests/ttm_device_test.c:177:37: error: implicit declaration of function 'ttm_pool_uses_dma_alloc'; did you mean 'ttm_pool_restore_and_alloc'? [-Wimplicit-function-declaration]
     177 |                                 if (ttm_pool_uses_dma_alloc(pool))
         |                                     ^~~~~~~~~~~~~~~~~~~~~~~
         |                                     ttm_pool_restore_and_alloc


vim +167 drivers/gpu/drm/ttm/tests/ttm_pool_test.c

   130	
   131	KUNIT_ARRAY_PARAM(ttm_pool_alloc_basic, ttm_pool_basic_cases,
   132			  ttm_pool_alloc_case_desc);
   133	
   134	static void ttm_pool_alloc_basic(struct kunit *test)
   135	{
   136		struct ttm_pool_test_priv *priv = test->priv;
   137		struct ttm_test_devices *devs = priv->devs;
   138		const struct ttm_pool_test_case *params = test->param_value;
   139		struct ttm_tt *tt;
   140		struct ttm_pool *pool;
   141		struct page *fst_page, *last_page;
   142		enum ttm_caching caching = ttm_uncached;
   143		unsigned int expected_num_pages = 1 << params->order;
   144		size_t size = expected_num_pages * PAGE_SIZE;
   145		int err;
   146	
   147		tt = ttm_tt_kunit_init(test, 0, caching, size);
   148		KUNIT_ASSERT_NOT_NULL(test, tt);
   149	
   150		pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
   151		KUNIT_ASSERT_NOT_NULL(test, pool);
   152	
   153		ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params->flags);
   154	
   155		KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
   156		KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
   157		KUNIT_ASSERT_EQ(test, pool->flags, params->flags);
   158	
   159		err = ttm_pool_alloc(pool, tt, &simple_ctx);
   160		KUNIT_ASSERT_EQ(test, err, 0);
   161		KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
   162	
   163		fst_page = tt->pages[0];
   164		last_page = tt->pages[tt->num_pages - 1];
   165	
   166		if (params->order <= MAX_PAGE_ORDER) {
 > 167			if (ttm_pool_uses_dma_alloc(pool)) {
   168				KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
   169				KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private);
   170			} else {
   171				KUNIT_ASSERT_EQ(test, fst_page->private, params->order);
   172			}
   173		} else {
   174			if (ttm_pool_uses_dma_alloc(pool)) {
   175				KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
   176				KUNIT_ASSERT_NULL(test, (void *)last_page->private);
   177			} else {
   178				/*
   179				 * We expect to alloc one big block, followed by
   180				 * order 0 blocks
   181				 */
   182				KUNIT_ASSERT_EQ(test, fst_page->private,
   183						min_t(unsigned int, MAX_PAGE_ORDER,
   184						      params->order));
   185				KUNIT_ASSERT_EQ(test, last_page->private, 0);
   186			}
   187		}
   188	
   189		ttm_pool_free(pool, tt);
   190		ttm_tt_fini(tt);
   191		ttm_pool_fini(pool);
   192	}
   193	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-10 14:11           ` Thomas Hellström
  2025-10-10 14:18             ` Thomas Hellström
@ 2025-10-11  8:00             ` Tvrtko Ursulin
  2025-10-13  8:48               ` Thomas Hellström
  1 sibling, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-11  8:00 UTC (permalink / raw)
  To: Thomas Hellström, Christian König, amd-gfx,
	Lucas De Marchi, dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin


On 10/10/2025 15:11, Thomas Hellström wrote:
> On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
>>
>> On 08/10/2025 15:39, Thomas Hellström wrote:
>>> On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
>>>> On 08.10.25 15:50, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 08/10/2025 13:35, Christian König wrote:
>>>>>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>>>>>> Disclaimer:
>>>>>>> Please note that as this series includes a patch which
>>>>>>> touches
>>>>>>> a good number of
>>>>>>> drivers I will only copy everyone in the cover letter and
>>>>>>> the
>>>>>>> respective patch.
>>>>>>> Assumption is people are subscribed to dri-devel so can
>>>>>>> look at
>>>>>>> the whole series
>>>>>>> there. I know someone is bound to complain for both the
>>>>>>> case
>>>>>>> when everyone is
>>>>>>> copied on everything for getting too much email, and also
>>>>>>> for
>>>>>>> this other case.
>>>>>>> So please be flexible.
>>>>>>>
>>>>>>> Description:
>>>>>>>
>>>>>>> All drivers which use the TTM pool allocator end up
>>>>>>> requesting
>>>>>>> large order
>>>>>>> allocations when allocating large buffers. Those can be
>>>>>>> slow
>>>>>>> due memory pressure
>>>>>>> and so add latency to buffer creation. But there is often
>>>>>>> also
>>>>>>> a size limit
>>>>>>> above which contiguous blocks do not bring any performance
>>>>>>> benefits. This series
>>>>>>> allows drivers to say when it is okay for the TTM to try a
>>>>>>> bit
>>>>>>> less hard.
>>>>>>>
>>>>>>> We do this by allowing drivers to specify this cut off
>>>>>>> point
>>>>>>> when creating the
>>>>>>> TTM device and pools. Allocations above this size will skip
>>>>>>> direct reclaim so
>>>>>>> under memory pressure worst case latency will improve.
>>>>>>> Background reclaim is
>>>>>>> still kicked off and both before and after the memory
>>>>>>> pressure
>>>>>>> all the TTM pool
>>>>>>> buckets remain to be used as they are today.
>>>>>>>
>>>>>>> This is especially interesting if someone has configured
>>>>>>> MAX_PAGE_ORDER to
>>>>>>> higher than the default. And even with the default, with
>>>>>>> amdgpu
>>>>>>> for example,
>>>>>>> the last patch in the series makes use of the new feature
>>>>>>> by
>>>>>>> telling TTM that
>>>>>>> above 2MiB we do not expect performance benefits. Which
>>>>>>> makes
>>>>>>> TTM not try direct
>>>>>>> reclaim for the top bucket (4MiB).
>>>>>>>
>>>>>>> End result is TTM drivers become a tiny bit nicer mm
>>>>>>> citizens
>>>>>>> and users benefit
>>>>>>> from better worst case buffer creation latencies. As a side
>>>>>>> benefit we get rid
>>>>>>> of two instances of those often very unreadable mutliple
>>>>>>> nameless booleans
>>>>>>> function signatures.
>>>>>>>
>>>>>>> If this sounds interesting and gets merge the invidual
>>>>>>> drivers
>>>>>>> can follow up
>>>>>>> with patches configuring their thresholds.
>>>>>>>
>>>>>>> v2:
>>>>>>>     * Christian suggested to pass in the new data by
>>>>>>> changing the
>>>>>>> function signatures.
>>>>>>>
>>>>>>> v3:
>>>>>>>     * Moved ttm pool helpers into new ttm_pool_internal.h.
>>>>>>> (Christian)
>>>>>>
>>>>>> Patch #3 is Acked-by: Christian König
>>>>>> <christian.koenig@amd.com>.
>>>>>>
>>>>>> The rest is Reviewed-by: Christian König
>>>>>> <christian.koenig@amd.com>
>>>>>
>>>>> Thank you!
>>>>>
>>>>> So I think now I need acks to merge via drm-misc for all the
>>>>> drivers which have their own trees. Which seems to be just xe.
>>>>
>>>> I think you should ping the XE guys for their opinion, but since
>>>> there shouldn't be any functional change for them you can
>>>> probably go
>>>> ahead and merge the patches to drm-misc-next when there is no
>>>> reply
>>>> in time.
>>>
>>> I will try to do a review tonight. One thing that comes up though,
>>> is
>>> the change to ttm_device_init() where you add pool_flags. I had
>>> another
>>> patch series a number of months ago that added a struct with flags
>>> there instead to select the return value given when OOM. Now that
>>> we're
>>> adding an argument, should we try to use a struct instead so that
>>> we
>>> can use it for more that pool behavior?
>>>
>>>
>>> I'll be able to find a pointer to that series later today.
>>
>> Found it:
>> https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
>>
>> Glad to see in that thread it isn't just me permanently slowed down
>> by
>> "false, false" and similar. :)
>>
>> I considered using a struct too and I guess there wasn't too much of
>> a
>> sway that I went with flags. I thought not to overcomplicate with the
>> on
>> stack struct which is mostly not needed for something so low level,
>> and
>> to stick with the old school C visual patterns.
>>
>> Since you only needed a single boolean in your series I suppose you
>> could just follow up on my series if you find it acceptable. Or I can
>> go
>> with yours, no problem either.
> 
> It seems yours has the most momentum ATM. I can follow up on yours. It
> would be great if we could perhaps change the naming of "pool_flags" to
> something more generic.

Do you have a name in mind? For ttm_device_init pool_flags made sense to 
signify they relate only to the pool.

I need to respin anyway since I forgot to include the new header in 
the unit tests.
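
For reference, the respin will roughly just add the new include to the
two kunit test files the robot flagged; the relative path below is an
assumption based on where ttm_pool_internal.h is expected to live:

	/* In drivers/gpu/drm/ttm/tests/ttm_pool_test.c and ttm_device_test.c
	 * (path assumed): */
	#include "../ttm_pool_internal.h"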

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order
  2025-10-10 14:14       ` Thomas Hellström
@ 2025-10-13  7:03         ` Christian König
  0 siblings, 0 replies; 24+ messages in thread
From: Christian König @ 2025-10-13  7:03 UTC (permalink / raw)
  To: Thomas Hellström, Tvrtko Ursulin, Matthew Brost
  Cc: amd-gfx, dri-devel, kernel-dev, Alex Deucher,
	Thadeu Lima de Souza Cascardo

On 10.10.25 16:14, Thomas Hellström wrote:
> On Thu, 2025-10-09 at 09:58 +0100, Tvrtko Ursulin wrote:
>>
>> On 09/10/2025 00:18, Matthew Brost wrote:
>>> On Wed, Oct 08, 2025 at 12:53:14PM +0100, Tvrtko Ursulin wrote:
>>>> Let the TTM pool allocator know that we can afford for it to
>>>> expend less
>>>> effort for satisfying contiguous allocations larger than 2MiB.
>>>> The latter
>>>> is the maximum relevant PTE entry size and the driver and
>>>> hardware are
>>>> happy to get larger blocks only opportunistically.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>
>>> +Thomas - Seems like we'd want to do this in Xe too?
> 
> Yeah, Indeed.
> 
> While IIRC we can actually set up 1GiB PTEs, I'm not aware of any huge
> benefits from using that compared to 2MiB ones. We certainly want the
> allocator to try pretty hard for 2MiB ones, though.

Yeah, same here for AMD GPUs.

There are some obscure use cases for 1GiB on amdgpu, but only for VRAM.
And even for VRAM we only guarantee that on a best-effort basis.

Regards,
Christian.

> 
> Thanks,
> Thomas
> 
> 
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++--
>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index 948c6d0a422b..723b885210a7 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1837,7 +1837,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>>>>   	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>>>   		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>>>   			      adev->gmc.mem_partitions[i].numa.node,
>>>> -			      0);
>>>> +			      TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
>>>
>>> SZ_2M btw.
>>
>> I thought I grepped exactly to see if that existed but apparently I
>> did 
>> not, thanks!
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Matt
>>>
>>>>   	}
>>>>   	return 0;
>>>>   }
>>>> @@ -1931,7 +1931,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>>   			       adev_to_drm(adev)->anon_inode->i_mapping,
>>>>   			       adev_to_drm(adev)->vma_offset_manager,
>>>>   			       (adev->need_swiotlb ? TTM_POOL_USE_DMA_ALLOC : 0) |
>>>> -			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0));
>>>> +			       (dma_addressing_limited(adev->dev) ? TTM_POOL_USE_DMA32 : 0) |
>>>> +			       TTM_POOL_BENEFICIAL_ORDER(get_order(2 * SZ_1M)));
>>>>   	if (r) {
>>>>   		dev_err(adev->dev,
>>>>   			"failed initializing buffer object driver(%d).\n", r);
>>>> -- 
>>>> 2.48.0
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-11  8:00             ` Tvrtko Ursulin
@ 2025-10-13  8:48               ` Thomas Hellström
  2025-10-13  9:17                 ` Tvrtko Ursulin
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Hellström @ 2025-10-13  8:48 UTC (permalink / raw)
  To: Tvrtko Ursulin, Christian König, amd-gfx, Lucas De Marchi,
	dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin

Hi, Tvrtko,

On Sat, 2025-10-11 at 09:00 +0100, Tvrtko Ursulin wrote:
> 
> On 10/10/2025 15:11, Thomas Hellström wrote:
> > On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 08/10/2025 15:39, Thomas Hellström wrote:
> > > > On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
> > > > > On 08.10.25 15:50, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > On 08/10/2025 13:35, Christian König wrote:
> > > > > > > On 08.10.25 13:53, Tvrtko Ursulin wrote:
> > > > > > > > Disclaimer:
> > > > > > > > Please note that as this series includes a patch which
> > > > > > > > touches
> > > > > > > > a good number of
> > > > > > > > drivers I will only copy everyone in the cover letter
> > > > > > > > and
> > > > > > > > the
> > > > > > > > respective patch.
> > > > > > > > Assumption is people are subscribed to dri-devel so can
> > > > > > > > look at
> > > > > > > > the whole series
> > > > > > > > there. I know someone is bound to complain for both the
> > > > > > > > case
> > > > > > > > when everyone is
> > > > > > > > copied on everything for getting too much email, and
> > > > > > > > also
> > > > > > > > for
> > > > > > > > this other case.
> > > > > > > > So please be flexible.
> > > > > > > > 
> > > > > > > > Description:
> > > > > > > > 
> > > > > > > > All drivers which use the TTM pool allocator end up
> > > > > > > > requesting
> > > > > > > > large order
> > > > > > > > allocations when allocating large buffers. Those can be
> > > > > > > > slow
> > > > > > > > due memory pressure
> > > > > > > > and so add latency to buffer creation. But there is
> > > > > > > > often
> > > > > > > > also
> > > > > > > > a size limit
> > > > > > > > above which contiguous blocks do not bring any
> > > > > > > > performance
> > > > > > > > benefits. This series
> > > > > > > > allows drivers to say when it is okay for the TTM to
> > > > > > > > try a
> > > > > > > > bit
> > > > > > > > less hard.
> > > > > > > > 
> > > > > > > > We do this by allowing drivers to specify this cut off
> > > > > > > > point
> > > > > > > > when creating the
> > > > > > > > TTM device and pools. Allocations above this size will
> > > > > > > > skip
> > > > > > > > direct reclaim so
> > > > > > > > under memory pressure worst case latency will improve.
> > > > > > > > Background reclaim is
> > > > > > > > still kicked off and both before and after the memory
> > > > > > > > pressure
> > > > > > > > all the TTM pool
> > > > > > > > buckets remain to be used as they are today.
> > > > > > > > 
> > > > > > > > This is especially interesting if someone has
> > > > > > > > configured
> > > > > > > > MAX_PAGE_ORDER to
> > > > > > > > higher than the default. And even with the default,
> > > > > > > > with
> > > > > > > > amdgpu
> > > > > > > > for example,
> > > > > > > > the last patch in the series makes use of the new
> > > > > > > > feature
> > > > > > > > by
> > > > > > > > telling TTM that
> > > > > > > > above 2MiB we do not expect performance benefits. Which
> > > > > > > > makes
> > > > > > > > TTM not try direct
> > > > > > > > reclaim for the top bucket (4MiB).
> > > > > > > > 
> > > > > > > > End result is TTM drivers become a tiny bit nicer mm
> > > > > > > > citizens
> > > > > > > > and users benefit
> > > > > > > > from better worst case buffer creation latencies. As a
> > > > > > > > side
> > > > > > > > benefit we get rid
> > > > > > > > of two instances of those often very unreadable
> > > > > > > > mutliple
> > > > > > > > nameless booleans
> > > > > > > > function signatures.
> > > > > > > > 
> > > > > > > > If this sounds interesting and gets merge the invidual
> > > > > > > > drivers
> > > > > > > > can follow up
> > > > > > > > with patches configuring their thresholds.
> > > > > > > > 
> > > > > > > > v2:
> > > > > > > >     * Christian suggested to pass in the new data by
> > > > > > > > changing the
> > > > > > > > function signatures.
> > > > > > > > 
> > > > > > > > v3:
> > > > > > > >     * Moved ttm pool helpers into new
> > > > > > > > ttm_pool_internal.h.
> > > > > > > > (Christian)
> > > > > > > 
> > > > > > > Patch #3 is Acked-by: Christian König
> > > > > > > <christian.koenig@amd.com>.
> > > > > > > 
> > > > > > > The rest is Reviewed-by: Christian König
> > > > > > > <christian.koenig@amd.com>
> > > > > > 
> > > > > > Thank you!
> > > > > > 
> > > > > > So I think now I need acks to merge via drm-misc for all
> > > > > > the
> > > > > > drivers which have their own trees. Which seems to be just
> > > > > > xe.
> > > > > 
> > > > > I think you should ping the XE guys for their opinion, but
> > > > > since
> > > > > there shouldn't be any functional change for them you can
> > > > > probably go
> > > > > ahead and merge the patches to drm-misc-next when there is no
> > > > > reply
> > > > > in time.
> > > > 
> > > > I will try to do a review tonight. One thing that comes up
> > > > though,
> > > > is
> > > > the change to ttm_device_init() where you add pool_flags. I had
> > > > another
> > > > patch series a number of months ago that added a struct with
> > > > flags
> > > > there instead to select the return value given when OOM. Now
> > > > that
> > > > we're
> > > > adding an argument, should we try to use a struct instead so
> > > > that
> > > > we
> > > > can use it for more that pool behavior?
> > > > 
> > > > 
> > > > I'll be able to find a pointer to that series later today.
> > > 
> > > Found it:
> > > https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
> > > 
> > > Glad to see in that thread it isn't just me permanently slowed
> > > down
> > > by
> > > "false, false" and similar. :)
> > > 
> > > I considered using a struct too and I guess there wasn't too much
> > > of
> > > a
> > > sway that I went with flags. I thought not to overcomplicate with
> > > the
> > > on
> > > stack struct which is mostly not needed for something so low
> > > level,
> > > and
> > > to stick with the old school C visual patterns.
> > > 
> > > Since you only needed a single boolean in your series I suppose
> > > you
> > > could just follow up on my series if you find it acceptable. Or I
> > > can
> > > go
> > > with yours, no problem either.
> > 
> > It seems yours has the most momentum ATM. I can follow up on yours.
> > It
> > would be great if we could perhaps change the naming of
> > "pool_flags" to
> > something more generic.
> 
> Do you have a name in mind? For ttm_device_init pool_flags made sense
> to 
> signify they relate only to the poll.

Well, what I had in mind would have been "flags" or
"device_init_flags".

Really one could change this once the flags start to have other
meanings as well, like the return value change I was proposing.
But the reason I was suggesting to do this now is to avoid adding yet
another parameter to ttm_device_init later, since obtaining an ack from
all TTM driver maintainers is typically time-consuming, if possible at
all.

When adding functionality to allocation functions, for example, the use
of the ttm_allocation_ctx has proven easier since it is easily
extensible, typically without changes to drivers.
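
Roughly what I mean, with made-up names rather than the actual TTM
code; a struct argument can grow a new member that defaults to zero
without touching existing callers:

	/* Hypothetical sketch, not the TTM API. */
	struct example_init_args {
		unsigned long pool_flags;
		int oom_errno;	/* added later; 0 keeps today's behaviour */
	};

	static int example_device_init(const struct example_init_args *args)
	{
		/* Callers that never set .oom_errno are unaffected. */
		return args->oom_errno ? -args->oom_errno : 0;
	}

	/*
	 * An existing call site keeps compiling unchanged when the new
	 * member is added:
	 *   example_device_init(&(struct example_init_args){ .pool_flags = 0 });
	 */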

Thanks,
Thomas



> 
> I need to respin anyway since I forgot to include the new header from
> unit tests.
> 
> Regards,
> 
> Tvrtko
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-13  8:48               ` Thomas Hellström
@ 2025-10-13  9:17                 ` Tvrtko Ursulin
  2025-10-13  9:23                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-13  9:17 UTC (permalink / raw)
  To: Thomas Hellström, Christian König, amd-gfx,
	Lucas De Marchi, dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin


On 13/10/2025 09:48, Thomas Hellström wrote:
> Hi, Tvrtko,
> 
> On Sat, 2025-10-11 at 09:00 +0100, Tvrtko Ursulin wrote:
>>
>> On 10/10/2025 15:11, Thomas Hellström wrote:
>>> On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 08/10/2025 15:39, Thomas Hellström wrote:
>>>>> On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
>>>>>> On 08.10.25 15:50, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 08/10/2025 13:35, Christian König wrote:
>>>>>>>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>>>>>>>> Disclaimer:
>>>>>>>>> Please note that as this series includes a patch which
>>>>>>>>> touches
>>>>>>>>> a good number of
>>>>>>>>> drivers I will only copy everyone in the cover letter
>>>>>>>>> and
>>>>>>>>> the
>>>>>>>>> respective patch.
>>>>>>>>> Assumption is people are subscribed to dri-devel so can
>>>>>>>>> look at
>>>>>>>>> the whole series
>>>>>>>>> there. I know someone is bound to complain for both the
>>>>>>>>> case
>>>>>>>>> when everyone is
>>>>>>>>> copied on everything for getting too much email, and
>>>>>>>>> also
>>>>>>>>> for
>>>>>>>>> this other case.
>>>>>>>>> So please be flexible.
>>>>>>>>>
>>>>>>>>> Description:
>>>>>>>>>
>>>>>>>>> All drivers which use the TTM pool allocator end up
>>>>>>>>> requesting
>>>>>>>>> large order
>>>>>>>>> allocations when allocating large buffers. Those can be
>>>>>>>>> slow
>>>>>>>>> due memory pressure
>>>>>>>>> and so add latency to buffer creation. But there is
>>>>>>>>> often
>>>>>>>>> also
>>>>>>>>> a size limit
>>>>>>>>> above which contiguous blocks do not bring any
>>>>>>>>> performance
>>>>>>>>> benefits. This series
>>>>>>>>> allows drivers to say when it is okay for the TTM to
>>>>>>>>> try a
>>>>>>>>> bit
>>>>>>>>> less hard.
>>>>>>>>>
>>>>>>>>> We do this by allowing drivers to specify this cut off
>>>>>>>>> point
>>>>>>>>> when creating the
>>>>>>>>> TTM device and pools. Allocations above this size will
>>>>>>>>> skip
>>>>>>>>> direct reclaim so
>>>>>>>>> under memory pressure worst case latency will improve.
>>>>>>>>> Background reclaim is
>>>>>>>>> still kicked off and both before and after the memory
>>>>>>>>> pressure
>>>>>>>>> all the TTM pool
>>>>>>>>> buckets remain to be used as they are today.
>>>>>>>>>
>>>>>>>>> This is especially interesting if someone has
>>>>>>>>> configured
>>>>>>>>> MAX_PAGE_ORDER to
>>>>>>>>> higher than the default. And even with the default,
>>>>>>>>> with
>>>>>>>>> amdgpu
>>>>>>>>> for example,
>>>>>>>>> the last patch in the series makes use of the new
>>>>>>>>> feature
>>>>>>>>> by
>>>>>>>>> telling TTM that
>>>>>>>>> above 2MiB we do not expect performance benefits. Which
>>>>>>>>> makes
>>>>>>>>> TTM not try direct
>>>>>>>>> reclaim for the top bucket (4MiB).
>>>>>>>>>
>>>>>>>>> End result is TTM drivers become a tiny bit nicer mm
>>>>>>>>> citizens
>>>>>>>>> and users benefit
>>>>>>>>> from better worst case buffer creation latencies. As a
>>>>>>>>> side
>>>>>>>>> benefit we get rid
>>>>>>>>> of two instances of those often very unreadable
>>>>>>>>> mutliple
>>>>>>>>> nameless booleans
>>>>>>>>> function signatures.
>>>>>>>>>
>>>>>>>>> If this sounds interesting and gets merge the invidual
>>>>>>>>> drivers
>>>>>>>>> can follow up
>>>>>>>>> with patches configuring their thresholds.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>      * Christian suggested to pass in the new data by
>>>>>>>>> changing the
>>>>>>>>> function signatures.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>>      * Moved ttm pool helpers into new
>>>>>>>>> ttm_pool_internal.h.
>>>>>>>>> (Christian)
>>>>>>>>
>>>>>>>> Patch #3 is Acked-by: Christian König
>>>>>>>> <christian.koenig@amd.com>.
>>>>>>>>
>>>>>>>> The rest is Reviewed-by: Christian König
>>>>>>>> <christian.koenig@amd.com>
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> So I think now I need acks to merge via drm-misc for all
>>>>>>> the
>>>>>>> drivers which have their own trees. Which seems to be just
>>>>>>> xe.
>>>>>>
>>>>>> I think you should ping the XE guys for their opinion, but
>>>>>> since
>>>>>> there shouldn't be any functional change for them you can
>>>>>> probably go
>>>>>> ahead and merge the patches to drm-misc-next when there is no
>>>>>> reply
>>>>>> in time.
>>>>>
>>>>> I will try to do a review tonight. One thing that comes up
>>>>> though,
>>>>> is
>>>>> the change to ttm_device_init() where you add pool_flags. I had
>>>>> another
>>>>> patch series a number of months ago that added a struct with
>>>>> flags
>>>>> there instead to select the return value given when OOM. Now
>>>>> that
>>>>> we're
>>>>> adding an argument, should we try to use a struct instead so
>>>>> that
>>>>> we
>>>>> can use it for more that pool behavior?
>>>>>
>>>>>
>>>>> I'll be able to find a pointer to that series later today.
>>>>
>>>> Found it:
>>>> https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
>>>>
>>>> Glad to see in that thread it isn't just me permanently slowed
>>>> down
>>>> by
>>>> "false, false" and similar. :)
>>>>
>>>> I considered using a struct too and I guess there wasn't too much
>>>> of
>>>> a
>>>> sway that I went with flags. I thought not to overcomplicate with
>>>> the
>>>> on
>>>> stack struct which is mostly not needed for something so low
>>>> level,
>>>> and
>>>> to stick with the old school C visual patterns.
>>>>
>>>> Since you only needed a single boolean in your series I suppose
>>>> you
>>>> could just follow up on my series if you find it acceptable. Or I
>>>> can
>>>> go
>>>> with yours, no problem either.
>>>
>>> It seems yours has the most momentum ATM. I can follow up on yours.
>>> It
>>> would be great if we could perhaps change the naming of
>>> "pool_flags" to
>>> something more generic.
>>
>> Do you have a name in mind? For ttm_device_init pool_flags made sense
>> to
>> signify they relate only to the pool.
> 
> Well, what I had in mind would have been "flags" or
> "device_init_flags".
> 
> Really one could change this once flags starts to have other meanings
> as well, like the return value change I was proposing.
> But the reason I was suggesting to do this now is to avoid yet another
> added parameter to ttm_device_init, since obtaining an ack from all TTM
> driver maintainers is typically time-consuming if at all possible.
> 
> When adding functionality to allocation functions, for example the use
> of the ttm_allocation_ctx has proven easier to use since it's easily
> extendible typically without changes to drivers.

I understood from your previous reply you were okay with flags.

If I rename pool_flags to allocation_flags would that work for you?

Regards,

Tvrtko

>> I need to respin anyway since I forgot to include the new header from
>> unit tests.
>>
>> Regards,
>>
>> Tvrtko
>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/5] Improving the worst case TTM large allocation latency
  2025-10-13  9:17                 ` Tvrtko Ursulin
@ 2025-10-13  9:23                   ` Tvrtko Ursulin
  0 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2025-10-13  9:23 UTC (permalink / raw)
  To: Thomas Hellström, Christian König, amd-gfx,
	Lucas De Marchi, dri-devel, Rodrigo Vivi
  Cc: kernel-dev, Alex Deucher, Danilo Krummrich, Dave Airlie,
	Gerd Hoffmann, Joonas Lahtinen, Lyude Paul, Maarten Lankhorst,
	Maxime Ripard, Sui Jingfeng, Thadeu Lima de Souza Cascardo,
	Thomas Zimmermann, Zack Rusin


On 13/10/2025 10:17, Tvrtko Ursulin wrote:
> 
> On 13/10/2025 09:48, Thomas Hellström wrote:
>> Hi, Tvrtko,
>>
>> On Sat, 2025-10-11 at 09:00 +0100, Tvrtko Ursulin wrote:
>>>
>>> On 10/10/2025 15:11, Thomas Hellström wrote:
>>>> On Thu, 2025-10-09 at 09:53 +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 08/10/2025 15:39, Thomas Hellström wrote:
>>>>>> On Wed, 2025-10-08 at 16:02 +0200, Christian König wrote:
>>>>>>> On 08.10.25 15:50, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 08/10/2025 13:35, Christian König wrote:
>>>>>>>>> On 08.10.25 13:53, Tvrtko Ursulin wrote:
>>>>>>>>>> Disclaimer:
>>>>>>>>>> Please note that as this series includes a patch which
>>>>>>>>>> touches
>>>>>>>>>> a good number of
>>>>>>>>>> drivers I will only copy everyone in the cover letter
>>>>>>>>>> and
>>>>>>>>>> the
>>>>>>>>>> respective patch.
>>>>>>>>>> Assumption is people are subscribed to dri-devel so can
>>>>>>>>>> look at
>>>>>>>>>> the whole series
>>>>>>>>>> there. I know someone is bound to complain for both the
>>>>>>>>>> case
>>>>>>>>>> when everyone is
>>>>>>>>>> copied on everything for getting too much email, and
>>>>>>>>>> also
>>>>>>>>>> for
>>>>>>>>>> this other case.
>>>>>>>>>> So please be flexible.
>>>>>>>>>>
>>>>>>>>>> Description:
>>>>>>>>>>
>>>>>>>>>> All drivers which use the TTM pool allocator end up
>>>>>>>>>> requesting
>>>>>>>>>> large order
>>>>>>>>>> allocations when allocating large buffers. Those can be
>>>>>>>>>> slow
>>>>>>>>>> due memory pressure
>>>>>>>>>> and so add latency to buffer creation. But there is
>>>>>>>>>> often
>>>>>>>>>> also
>>>>>>>>>> a size limit
>>>>>>>>>> above which contiguous blocks do not bring any
>>>>>>>>>> performance
>>>>>>>>>> benefits. This series
>>>>>>>>>> allows drivers to say when it is okay for the TTM to
>>>>>>>>>> try a
>>>>>>>>>> bit
>>>>>>>>>> less hard.
>>>>>>>>>>
>>>>>>>>>> We do this by allowing drivers to specify this cut off
>>>>>>>>>> point
>>>>>>>>>> when creating the
>>>>>>>>>> TTM device and pools. Allocations above this size will
>>>>>>>>>> skip
>>>>>>>>>> direct reclaim so
>>>>>>>>>> under memory pressure worst case latency will improve.
>>>>>>>>>> Background reclaim is
>>>>>>>>>> still kicked off and both before and after the memory
>>>>>>>>>> pressure
>>>>>>>>>> all the TTM pool
>>>>>>>>>> buckets remain to be used as they are today.
>>>>>>>>>>
>>>>>>>>>> This is especially interesting if someone has
>>>>>>>>>> configured
>>>>>>>>>> MAX_PAGE_ORDER to
>>>>>>>>>> higher than the default. And even with the default,
>>>>>>>>>> with
>>>>>>>>>> amdgpu
>>>>>>>>>> for example,
>>>>>>>>>> the last patch in the series makes use of the new
>>>>>>>>>> feature
>>>>>>>>>> by
>>>>>>>>>> telling TTM that
>>>>>>>>>> above 2MiB we do not expect performance benefits. Which
>>>>>>>>>> makes
>>>>>>>>>> TTM not try direct
>>>>>>>>>> reclaim for the top bucket (4MiB).
>>>>>>>>>>
>>>>>>>>>> End result is TTM drivers become a tiny bit nicer mm
>>>>>>>>>> citizens
>>>>>>>>>> and users benefit
>>>>>>>>>> from better worst case buffer creation latencies. As a
>>>>>>>>>> side
>>>>>>>>>> benefit we get rid
>>>>>>>>>> of two instances of those often very unreadable
>>>>>>>>>> mutliple
>>>>>>>>>> nameless booleans
>>>>>>>>>> function signatures.
>>>>>>>>>>
>>>>>>>>>> If this sounds interesting and gets merge the invidual
>>>>>>>>>> drivers
>>>>>>>>>> can follow up
>>>>>>>>>> with patches configuring their thresholds.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>>      * Christian suggested to pass in the new data by
>>>>>>>>>> changing the
>>>>>>>>>> function signatures.
>>>>>>>>>>
>>>>>>>>>> v3:
>>>>>>>>>>      * Moved ttm pool helpers into new
>>>>>>>>>> ttm_pool_internal.h.
>>>>>>>>>> (Christian)
>>>>>>>>>
>>>>>>>>> Patch #3 is Acked-by: Christian König
>>>>>>>>> <christian.koenig@amd.com>.
>>>>>>>>>
>>>>>>>>> The rest is Reviewed-by: Christian König
>>>>>>>>> <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> So I think now I need acks to merge via drm-misc for all
>>>>>>>> the
>>>>>>>> drivers which have their own trees. Which seems to be just
>>>>>>>> xe.
>>>>>>>
>>>>>>> I think you should ping the XE guys for their opinion, but
>>>>>>> since
>>>>>>> there shouldn't be any functional change for them you can
>>>>>>> probably go
>>>>>>> ahead and merge the patches to drm-misc-next when there is no
>>>>>>> reply
>>>>>>> in time.
>>>>>>
>>>>>> I will try to do a review tonight. One thing that comes up
>>>>>> though,
>>>>>> is
>>>>>> the change to ttm_device_init() where you add pool_flags. I had
>>>>>> another
>>>>>> patch series a number of months ago that added a struct with
>>>>>> flags
>>>>>> there instead to select the return value given when OOM. Now
>>>>>> that
>>>>>> we're
>>>>>> adding an argument, should we try to use a struct instead so
>>>>>> that
>>>>>> we
>>>>>> can use it for more that pool behavior?
>>>>>>
>>>>>>
>>>>>> I'll be able to find a pointer to that series later today.
>>>>>
>>>>> Found it:
>>>>> https://lore.kernel.org/dri-devel/20241002122422.287276-1-thomas.hellstrom@linux.intel.com/
>>>>>
>>>>> Glad to see in that thread it isn't just me permanently slowed
>>>>> down
>>>>> by
>>>>> "false, false" and similar. :)
>>>>>
>>>>> I considered using a struct too and I guess there wasn't too much
>>>>> of
>>>>> a
>>>>> sway that I went with flags. I thought not to overcomplicate with
>>>>> the
>>>>> on
>>>>> stack struct which is mostly not needed for something so low
>>>>> level,
>>>>> and
>>>>> to stick with the old school C visual patterns.
>>>>>
>>>>> Since you only needed a single boolean in your series I suppose
>>>>> you
>>>>> could just follow up on my series if you find it acceptable. Or I
>>>>> can
>>>>> go
>>>>> with yours, no problem either.
>>>>
>>>> It seems yours has the most momentum ATM. I can follow up on yours.
>>>> It
>>>> would be great if we could perhaps change the naming of
>>>> "pool_flags" to
>>>> something more generic.
>>>
>>> Do you have a name in mind? For ttm_device_init pool_flags made sense
>>> to
>>> signify they relate only to the pool.
>>
>> Well, what I had in mind would have been "flags" or
>> "device_init_flags".
>>
>> Really one could change this once the flags start to have other
>> meanings as well, like the return value change I was proposing.
>> But the reason I was suggesting to do this now is to avoid adding yet
>> another parameter to ttm_device_init later, since obtaining an ack from
>> all TTM driver maintainers is typically time-consuming, if possible at
>> all.
>>
>> When adding functionality to allocation functions, for example, the use
>> of the ttm_allocation_ctx has proven easier since it is easily
>> extensible, typically without changes to drivers.
> 
> I understood from your previous reply you were okay with flags.
> 
> If I rename pool_flags to allocation_flags would that work for you?

P.S. Or if you want to carry on with a struct that is fine by me. I am 
not a huge fan of a syntax like:

  	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
  			      xe->drm.anon_inode->i_mapping,
  			      xe->drm.vma_offset_manager,
  			      (struct ttm_device_init_flags){});

since I don't find it very readable. But as my flags approach is also 
not perfect, I have no strong feelings either way.
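
If we did end up with the struct, a named zero-initialised constant
would arguably read a bit better than the bare compound literal (still
using the hypothetical ttm_device_init_flags name from above):

  	static const struct ttm_device_init_flags no_init_flags; /* all zero */

  	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
  			      xe->drm.anon_inode->i_mapping,
  			      xe->drm.vma_offset_manager,
  			      no_init_flags);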

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-10-13  9:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-08 11:53 [PATCH v3 0/5] Improving the worst case TTM large allocation latency Tvrtko Ursulin
2025-10-08 11:53 ` [PATCH v3 1/5] drm/ttm: Add getter for some pool properties Tvrtko Ursulin
2025-10-08 11:53 ` [PATCH v3 2/5] drm/ttm: Replace multiple booleans with flags in pool init Tvrtko Ursulin
2025-10-10 15:10   ` kernel test robot
2025-10-08 11:53 ` [PATCH v3 3/5] drm/ttm: Replace multiple booleans with flags in device init Tvrtko Ursulin
2025-10-08 11:53 ` [PATCH v3 4/5] drm/ttm: Allow drivers to specify maximum beneficial TTM pool size Tvrtko Ursulin
2025-10-08 11:53 ` [PATCH v3 5/5] drm/amdgpu: Configure max beneficial TTM pool allocation order Tvrtko Ursulin
2025-10-08 23:18   ` Matthew Brost
2025-10-09  8:58     ` Tvrtko Ursulin
2025-10-10 14:14       ` Thomas Hellström
2025-10-13  7:03         ` Christian König
2025-10-08 12:35 ` [PATCH v3 0/5] Improving the worst case TTM large allocation latency Christian König
2025-10-08 13:50   ` Tvrtko Ursulin
2025-10-08 14:02     ` Christian König
2025-10-08 14:39       ` Thomas Hellström
2025-10-09  8:53         ` Tvrtko Ursulin
2025-10-10 14:11           ` Thomas Hellström
2025-10-10 14:18             ` Thomas Hellström
2025-10-11  8:00             ` Tvrtko Ursulin
2025-10-13  8:48               ` Thomas Hellström
2025-10-13  9:17                 ` Tvrtko Ursulin
2025-10-13  9:23                   ` Tvrtko Ursulin
2025-10-08 14:34     ` Matthew Auld
2025-10-09  8:41       ` Tvrtko Ursulin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).