* [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
@ 2025-09-18 20:09 Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 1/3] ttm: pool: allow requests to prefer latency over throughput Thadeu Lima de Souza Cascardo
` (3 more replies)
From: Thadeu Lima de Souza Cascardo @ 2025-09-18 20:09 UTC (permalink / raw)
To: Christian Koenig, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Tvrtko Ursulin,
Sergey Senozhatsky, Thadeu Lima de Souza Cascardo
On certain workloads, like on ChromeOS when opening multiple tabs and
windows and switching desktops, memory pressure can build up, and latency
is observed as high-order allocations trigger memory reclaim. This was
observed when running on an amdgpu device.
This is caused by TTM pool allocations; turning off direct reclaim for
those higher-order allocations leads to lower memory pressure.
Since turning direct reclaim off might also lead to lower throughput,
make it tunable, both as a module parameter that can be changed through
sysfs and as a flag when allocating a GEM object.
A latency option will avoid direct reclaim for higher-order allocations.
The throughput option could later be used to compact or reclaim pages more
aggressively, by not using __GFP_NORETRY.
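Concretely, the intended gfp handling for order > 0 allocations looks
roughly like this (a sketch only; prefer_latency/prefer_throughput are
placeholder names, the latency side is what patch 1 implements and the
throughput side is only a possible follow-up):

	gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
		     __GFP_THISNODE;
	if (prefer_latency)
		gfp_flags &= ~__GFP_DIRECT_RECLAIM; /* don't stall in direct reclaim */
	else if (prefer_throughput)
		gfp_flags &= ~__GFP_NORETRY; /* allow retries/compaction (future) */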
Other drivers can later opt to use this mechanism too.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
Changes in v2:
- Make disabling direct reclaim an option.
- Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
---
Thadeu Lima de Souza Cascardo (3):
ttm: pool: allow requests to prefer latency over throughput
ttm: pool: add a module parameter to set latency preference
drm/amdgpu: allow allocation preferences when creating GEM object
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
include/drm/ttm/ttm_bo.h | 5 +++++
include/drm/ttm/ttm_pool.h | 2 +-
include/drm/ttm/ttm_tt.h | 2 +-
include/uapi/drm/amdgpu_drm.h | 9 +++++++++
8 files changed, 38 insertions(+), 11 deletions(-)
---
base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
Best regards,
--
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
* [PATCH RFC v2 1/3] ttm: pool: allow requests to prefer latency over throughput
2025-09-18 20:09 [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Thadeu Lima de Souza Cascardo
@ 2025-09-18 20:09 ` Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 2/3] ttm: pool: add a module parameter to set latency preference Thadeu Lima de Souza Cascardo
` (2 subsequent siblings)
From: Thadeu Lima de Souza Cascardo @ 2025-09-18 20:09 UTC (permalink / raw)
To: Christian Koenig, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Tvrtko Ursulin,
Sergey Senozhatsky, Thadeu Lima de Souza Cascardo
The TTM pool allocator prefers to allocate higher-order pages so that the
GPU spends less time walking page tables, providing better throughput.
There have been cases where too much memory fragmentation led to a 30%
change in the throughput of a given GPU workload in a datacenter.
On a desktop workload on a low-memory system, though, allocating such
higher-order pages might put the system under memory pressure, triggering
direct reclaim and causing latency in certain desktop operations, while
allocating lower-order pages would be possible and would avoid such
reclaims.
This was seen on ChromeOS when opening multiple tabs and switching
desktops, leading to high latency in such operations.
Add an option to the TTM operation context that allows the behavior to be
set system-wide or per TTM object.
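For illustration, a minimal sketch of how a caller might request the
latency preference through the operation context. The field and enum value
follow this patch; ttm_tt_populate() is just an assumed call site, and
bdev/tt are assumed to be in scope:

	struct ttm_operation_ctx ctx = {
		.interruptible = true,
		.no_wait_gpu = false,
		/* skip direct reclaim for order > 0 allocations */
		.alloc_method = ttm_op_alloc_latency,
	};

	ret = ttm_tt_populate(bdev, tt, &ctx);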
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 11 +++++++----
include/drm/ttm/ttm_bo.h | 5 +++++
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index baf27c70a4193a121fbc8b4e67cd6feb4c612b85..02c622a103fcece003bd70ce6b5833ada70f5228 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -133,7 +133,8 @@ static DECLARE_RWSEM(pool_shrink_rwsem);
/* Allocate pages of size 1 << order with the given gfp_flags */
static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
- unsigned int order)
+ unsigned int order,
+ const struct ttm_operation_ctx *ctx)
{
unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
struct ttm_pool_dma *dma;
@@ -144,9 +145,12 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
* Mapping pages directly into an userspace process and calling
* put_page() on a TTM allocated page is illegal.
*/
- if (order)
+ if (order) {
gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
__GFP_THISNODE;
+ if (ctx->alloc_method == ttm_op_alloc_latency)
+ gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+ }
if (!pool->use_dma_alloc) {
p = alloc_pages_node(pool->nid, gfp_flags, order);
@@ -745,7 +749,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
if (!p) {
page_caching = ttm_cached;
allow_pools = false;
- p = ttm_pool_alloc_page(pool, gfp_flags, order);
+ p = ttm_pool_alloc_page(pool, gfp_flags, order, ctx);
}
/* If that fails, lower the order if possible and retry. */
if (!p) {
@@ -815,7 +819,6 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
return -EINVAL;
ttm_pool_alloc_state_init(tt, &alloc);
-
return __ttm_pool_alloc(pool, tt, ctx, &alloc, NULL);
}
EXPORT_SYMBOL(ttm_pool_alloc);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 479b7ed075c0ffba21df971db7fef914c531a51d..8531f8e8bb9b079927d0e4759a12819303542f62 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -184,6 +184,11 @@ struct ttm_operation_ctx {
bool no_wait_gpu;
bool gfp_retry_mayfail;
bool allow_res_evict;
+ enum {
+ ttm_op_alloc_default = 0,
+ ttm_op_alloc_latency = 2,
+ ttm_op_alloc_throughput = 3,
+ } alloc_method;
struct dma_resv *resv;
uint64_t bytes_moved;
};
--
2.47.3
* [PATCH RFC v2 2/3] ttm: pool: add a module parameter to set latency preference
2025-09-18 20:09 [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 1/3] ttm: pool: allow requests to prefer latency over throughput Thadeu Lima de Souza Cascardo
@ 2025-09-18 20:09 ` Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 3/3] drm/amdgpu: allow allocation preferences when creating GEM object Thadeu Lima de Souza Cascardo
2025-09-19 6:46 ` [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Christian König
From: Thadeu Lima de Souza Cascardo @ 2025-09-18 20:09 UTC (permalink / raw)
To: Christian Koenig, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Tvrtko Ursulin,
Sergey Senozhatsky, Thadeu Lima de Souza Cascardo
This allows a system-wide setting so that allocations of higher-order
pages do not use direct reclaim. The default keeps the existing behavior,
allowing direct reclaim when allocating higher-order pages.
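As a hedged illustration only: assuming the parameter is exposed under the
ttm module as proposed below, userspace could toggle it at runtime roughly
like this (the sysfs path follows from module_param_named(); error
handling kept minimal):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	/* Sketch: write "1" (latency) or "0" (throughput, the default) to
	 * the proposed /sys/module/ttm/parameters/alloc_method knob. */
	static int set_ttm_alloc_method(const char *val)
	{
		int fd = open("/sys/module/ttm/parameters/alloc_method", O_WRONLY);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = write(fd, val, strlen(val));
		close(fd);
		return n < 0 ? -1 : 0;
	}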
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 12 ++++++++++--
drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
include/drm/ttm/ttm_pool.h | 2 +-
include/drm/ttm/ttm_tt.h | 2 +-
4 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 02c622a103fcece003bd70ce6b5833ada70f5228..39203f2c247a36b0389682d7fb841088f4c8a95b 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -56,6 +56,11 @@ static DECLARE_FAULT_ATTR(backup_fault_inject);
#define should_fail(...) false
#endif
+static unsigned int ttm_alloc_method;
+
+MODULE_PARM_DESC(alloc_method, "TTM allocation method (0 - throughput, 1 - latency)");
+module_param_named(alloc_method, ttm_alloc_method, uint, 0644);
+
/**
* struct ttm_pool_dma - Helper object for coherent DMA mappings
*
@@ -702,7 +707,7 @@ static unsigned int ttm_pool_alloc_find_order(unsigned int highest,
}
static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
- const struct ttm_operation_ctx *ctx,
+ struct ttm_operation_ctx *ctx,
struct ttm_pool_alloc_state *alloc,
struct ttm_pool_tt_restore *restore)
{
@@ -717,6 +722,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
WARN_ON(alloc->dma_addr && !pool->dev);
+ if (ctx->alloc_method == ttm_op_alloc_default && ttm_alloc_method == 1)
+ ctx->alloc_method = ttm_op_alloc_latency;
+
if (tt->page_flags & TTM_TT_FLAG_ZERO_ALLOC)
gfp_flags |= __GFP_ZERO;
@@ -837,7 +845,7 @@ EXPORT_SYMBOL(ttm_pool_alloc);
* Returns: 0 on success, negative error code otherwise.
*/
int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
- const struct ttm_operation_ctx *ctx)
+ struct ttm_operation_ctx *ctx)
{
struct ttm_pool_alloc_state alloc;
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 506e257dfba8501815f8416e808f437e5f17aa8f..e1975d740b948f9b7fe1d35d913a458026d2c783 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -294,7 +294,7 @@ long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt,
}
int ttm_tt_restore(struct ttm_device *bdev, struct ttm_tt *tt,
- const struct ttm_operation_ctx *ctx)
+ struct ttm_operation_ctx *ctx)
{
int ret = ttm_pool_restore_and_alloc(&bdev->pool, tt, ctx);
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 54cd34a6e4c0ac5e17844b50fd08e72143b460c1..08f9a1388754fac352058ac2beb2b59bb944477c 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -95,7 +95,7 @@ void ttm_pool_drop_backed_up(struct ttm_tt *tt);
long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *ttm,
const struct ttm_backup_flags *flags);
int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
- const struct ttm_operation_ctx *ctx);
+ struct ttm_operation_ctx *ctx);
int ttm_pool_mgr_init(unsigned long num_pages);
void ttm_pool_mgr_fini(void);
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 406437ad674bf1a96527b45c5a81c58a747271c1..3575e20b77f3ccbc3d9aad0afbb762055b3cb139 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -296,7 +296,7 @@ long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt,
const struct ttm_backup_flags flags);
int ttm_tt_restore(struct ttm_device *bdev, struct ttm_tt *tt,
- const struct ttm_operation_ctx *ctx);
+ struct ttm_operation_ctx *ctx);
int ttm_tt_setup_backup(struct ttm_tt *tt);
--
2.47.3
* [PATCH RFC v2 3/3] drm/amdgpu: allow allocation preferences when creating GEM object
2025-09-18 20:09 [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 1/3] ttm: pool: allow requests to prefer latency over throughput Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 2/3] ttm: pool: add a module parameter to set latency preference Thadeu Lima de Souza Cascardo
@ 2025-09-18 20:09 ` Thadeu Lima de Souza Cascardo
2025-09-19 6:46 ` [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Christian König
From: Thadeu Lima de Souza Cascardo @ 2025-09-18 20:09 UTC (permalink / raw)
To: Christian Koenig, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Tvrtko Ursulin,
Sergey Senozhatsky, Thadeu Lima de Souza Cascardo
When creating a GEM object on amdgpu, userspace may specify whether
allocation latency or processing throughput should be preferred.
The preference is propagated to the TTM operation context: when throughput
is preferred, direct reclaim is used for higher-order pages, even if the
system is configured to prefer latency.
When latency is preferred, no direct reclaim will be used for higher-order
pages, which might lead to more lower-order pages being used, and that can
in turn compromise throughput.
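A sketch of how userspace might opt in, assuming the flag lands as
proposed here (AMDGPU_GEM_ALLOCATION_LATENCY is this series' proposal,
not existing UAPI):

	#include <string.h>
	#include <xf86drm.h>
	#include <drm/amdgpu_drm.h>

	/* Sketch: create a GTT BO that prefers allocation latency. */
	static int create_latency_bo(int fd, __u32 *handle)
	{
		union drm_amdgpu_gem_create args;
		int ret;

		memset(&args, 0, sizeof(args));
		args.in.bo_size = 8 * 1024 * 1024;
		args.in.alignment = 4096;
		args.in.domains = AMDGPU_GEM_DOMAIN_GTT;
		args.in.domain_flags = AMDGPU_GEM_ALLOCATION_LATENCY;

		ret = drmIoctl(fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args);
		if (!ret)
			*handle = args.out.handle;
		return ret;
	}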
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
include/uapi/drm/amdgpu_drm.h | 9 +++++++++
3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index d1ccbfcf21fa62a8d4fe1b8f020cf00d34efe1ab..0a0333e7ed1a45de63fedfbc161094f6de7fda00 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -451,7 +451,8 @@ int amdgpu_gem_create_ioctl(struct drm_device *dev, void *data,
AMDGPU_GEM_CREATE_EXPLICIT_SYNC |
AMDGPU_GEM_CREATE_ENCRYPTED |
AMDGPU_GEM_CREATE_GFX12_DCC |
- AMDGPU_GEM_CREATE_DISCARDABLE))
+ AMDGPU_GEM_CREATE_DISCARDABLE |
+ AMDGPU_GEM_ALLOCATION_MASK))
return -EINVAL;
/* reject invalid gem domains */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 122a882948839464dc197d40ff8e46cf161f7b42..54350460bb41e4bc057eb61d7bb6014457e56c6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -632,7 +632,8 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
/* We opt to avoid OOM on system pages allocations */
.gfp_retry_mayfail = true,
.allow_res_evict = bp->type != ttm_bo_type_kernel,
- .resv = bp->resv
+ .resv = bp->resv,
+ .alloc_method = AMDGPU_GEM_ALLOCATION(bp->flags)
};
struct amdgpu_bo *bo;
unsigned long page_align, size = bp->size;
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index bdedbaccf776db0c86cec939725a435c37f09f77..b796744abeba2bf4b14556251b36938ba0905c1e 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -180,6 +180,15 @@ extern "C" {
/* Set PTE.D and recompress during GTT->VRAM moves according to TILING flags. */
#define AMDGPU_GEM_CREATE_GFX12_DCC (1 << 16)
+/* Prioritize allocation latency or high-order allocations that favor
+ * throughput */
+#define AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT (17)
+#define AMDGPU_GEM_ALLOCATION_DEFAULT (0 << AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT)
+#define AMDGPU_GEM_ALLOCATION_LATENCY (2 << AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT)
+#define AMDGPU_GEM_ALLOCATION_THROUGHPUT (3 << AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT)
+#define AMDGPU_GEM_ALLOCATION_MASK (3 << AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT)
+#define AMDGPU_GEM_ALLOCATION(flags) ((flags & AMDGPU_GEM_ALLOCATION_MASK) >> AMDGPU_GEM_OVERRIDE_ALLOCATION_SHIFT)
+
struct drm_amdgpu_gem_create_in {
/** the requested memory size */
__u64 bo_size;
--
2.47.3
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-18 20:09 [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Thadeu Lima de Souza Cascardo
` (2 preceding siblings ...)
2025-09-18 20:09 ` [PATCH RFC v2 3/3] drm/amdgpu: allow allocation preferences when creating GEM object Thadeu Lima de Souza Cascardo
@ 2025-09-19 6:46 ` Christian König
2025-09-19 7:43 ` Tvrtko Ursulin
From: Christian König @ 2025-09-19 6:46 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo, Michel Dänzer, Huang Rui,
Matthew Auld, Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Tvrtko Ursulin,
Sergey Senozhatsky
On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
> On certain workloads, like on ChromeOS when opening multiple tabs and
> windows, and switching desktops, memory pressure can build up and latency
> is observed as high order allocations result in memory reclaim. This was
> observed when running on an amdgpu.
>
> This is caused by TTM pool allocations and turning off direct reclaim when
> doing those higher order allocations leads to lower memory pressure.
>
> Since turning direct reclaim off might also lead to lower throughput,
> make it tunable, both as a module parameter that can be changed in sysfs
> and as a flag when allocating a GEM object.
>
> A latency option will avoid direct reclaim for higher order allocations.
>
> The throughput option could be later used to more agressively compact pages
> or reclaim, by not using __GFP_NORETRY.
Well, I can only repeat it: at least for amdgpu this is a clear NAK from my side.
The behavior of allocating huge pages is a must-have for the driver.
The alternative I can offer is to disable the fallback, which in your case would trigger the OOM killer.
Regards,
Christian.
>
> Other drivers can later opt to use this mechanism too.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> ---
> Changes in v2:
> - Make disabling direct reclaim an option.
> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>
> ---
> Thadeu Lima de Souza Cascardo (3):
> ttm: pool: allow requests to prefer latency over throughput
> ttm: pool: add a module parameter to set latency preference
> drm/amdgpu: allow allocation preferences when creating GEM object
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
> include/drm/ttm/ttm_bo.h | 5 +++++
> include/drm/ttm/ttm_pool.h | 2 +-
> include/drm/ttm/ttm_tt.h | 2 +-
> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
> 8 files changed, 38 insertions(+), 11 deletions(-)
> ---
> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>
> Best regards,
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 6:46 ` [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Christian König
@ 2025-09-19 7:43 ` Tvrtko Ursulin
2025-09-19 8:01 ` Christian König
From: Tvrtko Ursulin @ 2025-09-19 7:43 UTC (permalink / raw)
To: Christian König, Thadeu Lima de Souza Cascardo,
Michel Dänzer, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19/09/2025 07:46, Christian König wrote:
> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>> On certain workloads, like on ChromeOS when opening multiple tabs and
>> windows, and switching desktops, memory pressure can build up and latency
>> is observed as high order allocations result in memory reclaim. This was
>> observed when running on an amdgpu.
>>
>> This is caused by TTM pool allocations and turning off direct reclaim when
>> doing those higher order allocations leads to lower memory pressure.
>>
>> Since turning direct reclaim off might also lead to lower throughput,
>> make it tunable, both as a module parameter that can be changed in sysfs
>> and as a flag when allocating a GEM object.
>>
>> A latency option will avoid direct reclaim for higher order allocations.
>>
>> The throughput option could be later used to more agressively compact pages
>> or reclaim, by not using __GFP_NORETRY.
>
> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>
> The behavior to allocate huge pages is a must have for the driver.
Disclaimer that I wouldn't go system-wide but per device - so somewhere
in sysfs rather than a modparam. That kind of a toggle would not sound
problematic to me since it leaves the policy outside the kernel and
allows people to tune to their liking.
One side question though - does AMD benefit from larger than 2MiB
contiguous blocks? IIUC the maximum PTE is 2MiB, so maybe not? In which
case it may make sense to add some TTM API letting drivers tell the pool
allocator the maximum order worth bothering with. Larger than that may
have diminishing benefit relative to the disproportionate pressure it puts
on the memory allocator and reclaim.
Regards,
Tvrtko
> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>
> Regards,
> Christian.
>
>>
>> Other drivers can later opt to use this mechanism too.
>>
>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>> ---
>> Changes in v2:
>> - Make disabling direct reclaim an option.
>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>
>> ---
>> Thadeu Lima de Souza Cascardo (3):
>> ttm: pool: allow requests to prefer latency over throughput
>> ttm: pool: add a module parameter to set latency preference
>> drm/amdgpu: allow allocation preferences when creating GEM object
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>> include/drm/ttm/ttm_bo.h | 5 +++++
>> include/drm/ttm/ttm_pool.h | 2 +-
>> include/drm/ttm/ttm_tt.h | 2 +-
>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>> 8 files changed, 38 insertions(+), 11 deletions(-)
>> ---
>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>
>> Best regards,
>
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 7:43 ` Tvrtko Ursulin
@ 2025-09-19 8:01 ` Christian König
2025-09-19 8:46 ` Tvrtko Ursulin
2025-09-19 11:13 ` Thadeu Lima de Souza Cascardo
From: Christian König @ 2025-09-19 8:01 UTC (permalink / raw)
To: Tvrtko Ursulin, Thadeu Lima de Souza Cascardo, Michel Dänzer,
Huang Rui, Matthew Auld, Matthew Brost, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19.09.25 09:43, Tvrtko Ursulin wrote:
> On 19/09/2025 07:46, Christian König wrote:
>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>> windows, and switching desktops, memory pressure can build up and latency
>>> is observed as high order allocations result in memory reclaim. This was
>>> observed when running on an amdgpu.
>>>
>>> This is caused by TTM pool allocations and turning off direct reclaim when
>>> doing those higher order allocations leads to lower memory pressure.
>>>
>>> Since turning direct reclaim off might also lead to lower throughput,
>>> make it tunable, both as a module parameter that can be changed in sysfs
>>> and as a flag when allocating a GEM object.
>>>
>>> A latency option will avoid direct reclaim for higher order allocations.
>>>
>>> The throughput option could be later used to more agressively compact pages
>>> or reclaim, by not using __GFP_NORETRY.
>>
>> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>>
>> The behavior to allocate huge pages is a must have for the driver.
>
> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
Yeah, I've also written before that when this is somehow beneficial for nouveau (for example), then I don't have any problem with making the policy device dependent.
But for amdgpu we have had so many bad experiences with this approach that I absolutely can't accept it.
> One side question thought - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator what is the maximum order to bother with. Larger than that may have diminishing benefit for the disproportionate pressure on the memory allocator and reclaim.
Using 1GiB allocations would allow the page tables to skip another layer on AMD GPUs, but the most benefit is between 4KiB and 2MiB, since that range can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for the L2.
Apart from performance, on AMD GPUs there are also some HW features which only work with huge pages; e.g. on some laptops you can get flickering on the display if the scanout buffer is backed by too many small pages.
NVidia used to work on 1GiB allocations, which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading publicly available documentation I have the impression that NVidia GPUs work more or less the same as AMD GPUs regarding the TLB.
Another alternative would be to add a WARN_ONCE() when we have to fall back to lower-order pages, but that wouldn't help the end user either. It would just make it more obvious that you need more memory for a specific use case, without triggering the OOM killer.
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>
>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Other drivers can later opt to use this mechanism too.
>>>
>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>> ---
>>> Changes in v2:
>>> - Make disabling direct reclaim an option.
>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>
>>> ---
>>> Thadeu Lima de Souza Cascardo (3):
>>> ttm: pool: allow requests to prefer latency over throughput
>>> ttm: pool: add a module parameter to set latency preference
>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>> include/drm/ttm/ttm_pool.h | 2 +-
>>> include/drm/ttm/ttm_tt.h | 2 +-
>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>> ---
>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>
>>> Best regards,
>>
>
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 8:01 ` Christian König
@ 2025-09-19 8:46 ` Tvrtko Ursulin
2025-09-19 10:17 ` Christian König
2025-09-19 11:13 ` Thadeu Lima de Souza Cascardo
From: Tvrtko Ursulin @ 2025-09-19 8:46 UTC (permalink / raw)
To: Christian König, Thadeu Lima de Souza Cascardo,
Michel Dänzer, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19/09/2025 09:01, Christian König wrote:
> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>> On 19/09/2025 07:46, Christian König wrote:
>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>>> windows, and switching desktops, memory pressure can build up and latency
>>>> is observed as high order allocations result in memory reclaim. This was
>>>> observed when running on an amdgpu.
>>>>
>>>> This is caused by TTM pool allocations and turning off direct reclaim when
>>>> doing those higher order allocations leads to lower memory pressure.
>>>>
>>>> Since turning direct reclaim off might also lead to lower throughput,
>>>> make it tunable, both as a module parameter that can be changed in sysfs
>>>> and as a flag when allocating a GEM object.
>>>>
>>>> A latency option will avoid direct reclaim for higher order allocations.
>>>>
>>>> The throughput option could be later used to more agressively compact pages
>>>> or reclaim, by not using __GFP_NORETRY.
>>>
>>> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>>>
>>> The behavior to allocate huge pages is a must have for the driver.
>>
>> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>
> Yeah I've also wrote before when that is somehow beneficial for nouveau (for example) then I don't have any problem with making the policy device dependent.
>
> But for amdgpu we have so many so bad experiences with this approach that I absolutely can't accept that.
>
>> One side question thought - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator what is the maximum order to bother with. Larger than that may have diminishing benefit for the disproportionate pressure on the memory allocator and reclaim.
>
> Using 1GiB allocations would allow for the page tables to skip another layer on AMD GPUs, but the most benefit is between 4kiB and 2MiB since that can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for L2.
>
> Apart from performance for AMD GPUs there are also some HW features which only work with huge pages, e.g. on some laptops you can get for example flickering on the display if the scanout buffer is back by to many small pages.
>
> NVidia used to work on 1GiB allocations which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading public available documentation I have the impression that NVidia GPUs works more or less the same as AMD GPUs regarding the TLB.
1GiB is beyond the TTM pool allocator scope, right?
From what you wrote it sounds like my idea would actually be okay: a
very gentle approach (minimal change in behaviour) that only disables
direct reclaim above a threshold set by the driver. Along the lines of:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 428265046815..06b243f05edd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
adev->gmc.mem_partitions[i].numa.node,
- false, false);
+ false, false, get_order(2 * SZ_1M));
}
return 0;
}
@@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
adev_to_drm(adev)->anon_inode->i_mapping,
adev_to_drm(adev)->vma_offset_manager,
adev->need_swiotlb,
- dma_addressing_limited(adev->dev));
+ dma_addressing_limited(adev->dev),
+ get_order(2 * SZ_1M));
if (r) {
dev_err(adev->dev,
"failed initializing buffer object driver(%d).\n", r);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index baf27c70a419..5d54e8373230 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
page_caching = tt->caching;
allow_pools = true;
- for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
- alloc->remaining_pages;
+
+ order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
+ if (order > pool->max_beneficial_order)
+ gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+
+ for (; alloc->remaining_pages;
order = ttm_pool_alloc_find_order(order, alloc)) {
struct ttm_pool_type *pt;
@@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
if (!p) {
page_caching = ttm_cached;
allow_pools = false;
+ if (order <= pool->max_beneficial_order)
+ gfp_flags |= __GFP_DIRECT_RECLAIM;
p = ttm_pool_alloc_page(pool, gfp_flags, order);
}
/* If that fails, lower the order if possible and retry. */
@@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
* Initialize the pool and its pool types.
*/
void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
- int nid, bool use_dma_alloc, bool use_dma32)
+ int nid, bool use_dma_alloc, bool use_dma32,
+ unsigned int max_beneficial_order)
{
unsigned int i, j;
@@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
pool->nid = nid;
pool->use_dma_alloc = use_dma_alloc;
pool->use_dma32 = use_dma32;
+ pool->max_beneficial_order = max_beneficial_order;
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
for (j = 0; j < NR_PAGE_ORDERS; ++j) {
That should have the page allocator working less hard and lower the
latency with large buffers.
Then a more aggressive change on top could be:
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 5d54e8373230..152164f79927 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
page_caching = tt->caching;
allow_pools = true;
-
- order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
- if (order > pool->max_beneficial_order)
- gfp_flags &= ~__GFP_DIRECT_RECLAIM;
-
- for (; alloc->remaining_pages;
+ for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, alloc);
+ alloc->remaining_pages;
order = ttm_pool_alloc_find_order(order, alloc)) {
struct ttm_pool_type *pt;
@@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
if (!p) {
page_caching = ttm_cached;
allow_pools = false;
- if (order <= pool->max_beneficial_order)
- gfp_flags |= __GFP_DIRECT_RECLAIM;
p = ttm_pool_alloc_page(pool, gfp_flags, order);
}
/* If that fails, lower the order if possible and retry. */
I.e. don't even bother trying to allocate orders above what the driver
says is useful. Could be made a driver's choice as well.
And all could be combined with some sort of a sysfs control, as Cascardo
was suggesting, to disable direct reclaim completely if someone wants that.
Regards,
Tvrtko
> Another alternative would be that we add a WARN_ONCE() when we have to fallback to lower order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case without triggering the OOM killer.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Other drivers can later opt to use this mechanism too.
>>>>
>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>> ---
>>>> Changes in v2:
>>>> - Make disabling direct reclaim an option.
>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>>
>>>> ---
>>>> Thadeu Lima de Souza Cascardo (3):
>>>> ttm: pool: allow requests to prefer latency over throughput
>>>> ttm: pool: add a module parameter to set latency preference
>>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>>> include/drm/ttm/ttm_pool.h | 2 +-
>>>> include/drm/ttm/ttm_tt.h | 2 +-
>>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>>> ---
>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>
>>>> Best regards,
>>>
>>
>
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 8:46 ` Tvrtko Ursulin
@ 2025-09-19 10:17 ` Christian König
2025-09-19 10:42 ` Tvrtko Ursulin
From: Christian König @ 2025-09-19 10:17 UTC (permalink / raw)
To: Tvrtko Ursulin, Thadeu Lima de Souza Cascardo, Michel Dänzer,
Huang Rui, Matthew Auld, Matthew Brost, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19.09.25 10:46, Tvrtko Ursulin wrote:
>
> On 19/09/2025 09:01, Christian König wrote:
>> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>>> On 19/09/2025 07:46, Christian König wrote:
>>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>>>> windows, and switching desktops, memory pressure can build up and latency
>>>>> is observed as high order allocations result in memory reclaim. This was
>>>>> observed when running on an amdgpu.
>>>>>
>>>>> This is caused by TTM pool allocations and turning off direct reclaim when
>>>>> doing those higher order allocations leads to lower memory pressure.
>>>>>
>>>>> Since turning direct reclaim off might also lead to lower throughput,
>>>>> make it tunable, both as a module parameter that can be changed in sysfs
>>>>> and as a flag when allocating a GEM object.
>>>>>
>>>>> A latency option will avoid direct reclaim for higher order allocations.
>>>>>
>>>>> The throughput option could be later used to more agressively compact pages
>>>>> or reclaim, by not using __GFP_NORETRY.
>>>>
>>>> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>>>>
>>>> The behavior to allocate huge pages is a must have for the driver.
>>>
>>> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>>
>> Yeah I've also wrote before when that is somehow beneficial for nouveau (for example) then I don't have any problem with making the policy device dependent.
>>
>> But for amdgpu we have so many so bad experiences with this approach that I absolutely can't accept that.
>>
>>> One side question thought - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator what is the maximum order to bother with. Larger than that may have diminishing benefit for the disproportionate pressure on the memory allocator and reclaim.
>>
>> Using 1GiB allocations would allow for the page tables to skip another layer on AMD GPUs, but the most benefit is between 4kiB and 2MiB since that can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for L2.
>>
>> Apart from performance for AMD GPUs there are also some HW features which only work with huge pages, e.g. on some laptops you can get for example flickering on the display if the scanout buffer is back by to many small pages.
>>
>> NVidia used to work on 1GiB allocations which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading public available documentation I have the impression that NVidia GPUs works more or less the same as AMD GPUs regarding the TLB.
>
> 1GiB is beyond the TTM pool allocator scope, right?
Yes, on x86 64bit the pool allocator can allocate at maximum 2MiB by default IIRC.
>
> From what you wrote it sounds like my idea would actually be okay. A very gentle approach (minimal change in behaviour) to only disable direct reclaim above the threshold set by the driver.
Well, the problem is that the threshold set by amdgpu would be 2MiB, and by default there isn't anything above it on x86. So that would be a no-op. On ARM64 the idea could potentially help, maybe.
I could look into the HW documentation again to see what we would need as a minimum for functional correctness, but there are quite a number of use cases, and lowering from 2MiB to something like 256KiB or 512KiB potentially won't really help and would still cause a number of performance issues in the L2.
Regards,
Christian.
> Along the lines of:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 428265046815..06b243f05edd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
> for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
> ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
> adev->gmc.mem_partitions[i].numa.node,
> - false, false);
> + false, false, get_order(2 * SZ_1M));
> }
> return 0;
> }
> @@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
> adev_to_drm(adev)->anon_inode->i_mapping,
> adev_to_drm(adev)->vma_offset_manager,
> adev->need_swiotlb,
> - dma_addressing_limited(adev->dev));
> + dma_addressing_limited(adev->dev),
> + get_order(2 * SZ_1M));
> if (r) {
> dev_err(adev->dev,
> "failed initializing buffer object driver(%d).\n", r);
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index baf27c70a419..5d54e8373230 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>
> page_caching = tt->caching;
> allow_pools = true;
> - for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
> - alloc->remaining_pages;
> +
> + order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
> + if (order > pool->max_beneficial_order)
> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> +
> + for (; alloc->remaining_pages;
> order = ttm_pool_alloc_find_order(order, alloc)) {
> struct ttm_pool_type *pt;
>
> @@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> if (!p) {
> page_caching = ttm_cached;
> allow_pools = false;
> + if (order <= pool->max_beneficial_order)
> + gfp_flags |= __GFP_DIRECT_RECLAIM;
> p = ttm_pool_alloc_page(pool, gfp_flags, order);
> }
> /* If that fails, lower the order if possible and retry. */
> @@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> * Initialize the pool and its pool types.
> */
> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> - int nid, bool use_dma_alloc, bool use_dma32)
> + int nid, bool use_dma_alloc, bool use_dma32,
> + unsigned int max_beneficial_order)
> {
> unsigned int i, j;
>
> @@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> pool->nid = nid;
> pool->use_dma_alloc = use_dma_alloc;
> pool->use_dma32 = use_dma32;
> + pool->max_beneficial_order = max_beneficial_order;
>
> for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
> for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>
>
> That should have the page allocator working less hard and lower the latency with large buffers.
>
> Then a more aggressive change on top could be:
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 5d54e8373230..152164f79927 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>
> page_caching = tt->caching;
> allow_pools = true;
> -
> - order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
> - if (order > pool->max_beneficial_order)
> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> -
> - for (; alloc->remaining_pages;
> + for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, alloc);
> + alloc->remaining_pages;
> order = ttm_pool_alloc_find_order(order, alloc)) {
> struct ttm_pool_type *pt;
>
> @@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> if (!p) {
> page_caching = ttm_cached;
> allow_pools = false;
> - if (order <= pool->max_beneficial_order)
> - gfp_flags |= __GFP_DIRECT_RECLAIM;
> p = ttm_pool_alloc_page(pool, gfp_flags, order);
> }
> /* If that fails, lower the order if possible and retry. */
>
> Ie. don't even bother trying to allocate orders above what the driver says is useful. Could be made a drivers choice as well.
>
> And all could be combined with some sort of a sysfs control, as Cascardo was suggesting, to disable direct reclaim completely if someone wants that.
>
> Regards,
>
> Tvrtko
>
>> Another alternative would be that we add a WARN_ONCE() when we have to fallback to lower order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case without triggering the OOM killer.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Other drivers can later opt to use this mechanism too.
>>>>>
>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>>> ---
>>>>> Changes in v2:
>>>>> - Make disabling direct reclaim an option.
>>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>>>
>>>>> ---
>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>> ttm: pool: allow requests to prefer latency over throughput
>>>>> ttm: pool: add a module parameter to set latency preference
>>>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>>>
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>>>> include/drm/ttm/ttm_pool.h | 2 +-
>>>>> include/drm/ttm/ttm_tt.h | 2 +-
>>>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>>>> ---
>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>
>>>>> Best regards,
>>>>
>>>
>>
>
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 10:17 ` Christian König
@ 2025-09-19 10:42 ` Tvrtko Ursulin
2025-09-19 12:04 ` Christian König
From: Tvrtko Ursulin @ 2025-09-19 10:42 UTC (permalink / raw)
To: Christian König, Thadeu Lima de Souza Cascardo,
Michel Dänzer, Huang Rui, Matthew Auld, Matthew Brost,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19/09/2025 11:17, Christian König wrote:
> On 19.09.25 10:46, Tvrtko Ursulin wrote:
>>
>> On 19/09/2025 09:01, Christian König wrote:
>>> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>>>> On 19/09/2025 07:46, Christian König wrote:
>>>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>>>>> windows, and switching desktops, memory pressure can build up and latency
>>>>>> is observed as high order allocations result in memory reclaim. This was
>>>>>> observed when running on an amdgpu.
>>>>>>
>>>>>> This is caused by TTM pool allocations and turning off direct reclaim when
>>>>>> doing those higher order allocations leads to lower memory pressure.
>>>>>>
>>>>>> Since turning direct reclaim off might also lead to lower throughput,
>>>>>> make it tunable, both as a module parameter that can be changed in sysfs
>>>>>> and as a flag when allocating a GEM object.
>>>>>>
>>>>>> A latency option will avoid direct reclaim for higher order allocations.
>>>>>>
>>>>>> The throughput option could be later used to more agressively compact pages
>>>>>> or reclaim, by not using __GFP_NORETRY.
>>>>>
>>>>> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>>>>>
>>>>> The behavior to allocate huge pages is a must have for the driver.
>>>>
>>>> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>>>
>>> Yeah I've also wrote before when that is somehow beneficial for nouveau (for example) then I don't have any problem with making the policy device dependent.
>>>
>>> But for amdgpu we have so many so bad experiences with this approach that I absolutely can't accept that.
>>>
>>>> One side question thought - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator what is the maximum order to bother with. Larger than that may have diminishing benefit for the disproportionate pressure on the memory allocator and reclaim.
>>>
>>> Using 1GiB allocations would allow for the page tables to skip another layer on AMD GPUs, but the most benefit is between 4kiB and 2MiB since that can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for L2.
>>>
>>> Apart from performance for AMD GPUs there are also some HW features which only work with huge pages, e.g. on some laptops you can get for example flickering on the display if the scanout buffer is back by to many small pages.
>>>
>>> NVidia used to work on 1GiB allocations which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading public available documentation I have the impression that NVidia GPUs works more or less the same as AMD GPUs regarding the TLB.
>>
>> 1GiB is beyond the TTM pool allocator scope, right?
>
> Yes, on x86 64bit the pool allocator can allocate at maximum 2MiB by default IIRC.
I think 10 is the max order, so 4MiB. So it wouldn't be much relief for
the allocator, but better than nothing(tm)?
>> From what you wrote it sounds like my idea would actually be okay. A very gentle approach (minimal change in behaviour) to only disable direct reclaim above the threshold set by the driver.
>
> Well the problem is that the threshold set by amdgpu would be 2MiB and by default there isn't anything above it on x86. So that would be a no-op. On ARM64 that idea could potentially help maybe.
Some architectures appear to default to more than 10, and some offer
Kconfig to change the default.
I think this means in the patch I proposed I am missing a
min(MAX_PAGE_ORDER, max_beneficial_order) when setting the pool property.
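Something like this in ttm_pool_init(), sketched against the patch above:

	/* Sketch: clamp the driver-provided order to what the page
	 * allocator supports. */
	pool->max_beneficial_order = min_t(unsigned int, MAX_PAGE_ORDER,
					   max_beneficial_order);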
> I could look into the HW documentation again what we would need as minimum for functional correctness, but there are quite a number of use cases and lowering from 2MiB to something like 256KiB or 512KiB potentially won't really help and still cause a number of performance issues in the L2.
It would be very good if you could check the requirements regarding
functional correctness. Could that also differ per generation/part? If
so, maybe it should be made configurable in the ttm_pool API as well, as
an order below which it is better to fail instead of moving to a lower
order?
Regards,
Tvrtko
>> Along the lines of:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 428265046815..06b243f05edd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>> for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>> ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>> adev->gmc.mem_partitions[i].numa.node,
>> - false, false);
>> + false, false, get_order(2 * SZ_1M));
>> }
>> return 0;
>> }
>> @@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>> adev_to_drm(adev)->anon_inode->i_mapping,
>> adev_to_drm(adev)->vma_offset_manager,
>> adev->need_swiotlb,
>> - dma_addressing_limited(adev->dev));
>> + dma_addressing_limited(adev->dev),
>> + get_order(2 * SZ_1M));
>> if (r) {
>> dev_err(adev->dev,
>> "failed initializing buffer object driver(%d).\n", r);
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>> index baf27c70a419..5d54e8373230 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>
>> page_caching = tt->caching;
>> allow_pools = true;
>> - for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>> - alloc->remaining_pages;
>> +
>> + order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>> + if (order > pool->max_beneficial_order)
>> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>> +
>> + for (; alloc->remaining_pages;
>> order = ttm_pool_alloc_find_order(order, alloc)) {
>> struct ttm_pool_type *pt;
>>
>> @@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>> if (!p) {
>> page_caching = ttm_cached;
>> allow_pools = false;
>> + if (order <= pool->max_beneficial_order)
>> + gfp_flags |= __GFP_DIRECT_RECLAIM;
>> p = ttm_pool_alloc_page(pool, gfp_flags, order);
>> }
>> /* If that fails, lower the order if possible and retry. */
>> @@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>> * Initialize the pool and its pool types.
>> */
>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> - int nid, bool use_dma_alloc, bool use_dma32)
>> + int nid, bool use_dma_alloc, bool use_dma32,
>> + unsigned int max_beneficial_order)
>> {
>> unsigned int i, j;
>>
>> @@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> pool->nid = nid;
>> pool->use_dma_alloc = use_dma_alloc;
>> pool->use_dma32 = use_dma32;
>> + pool->max_beneficial_order = max_beneficial_order;
>>
>> for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>> for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>
>>
>> That should have the page allocator working less hard and lower the latency with large buffers.
>>
>> Then a more aggressive change on top could be:
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>> index 5d54e8373230..152164f79927 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>
>> page_caching = tt->caching;
>> allow_pools = true;
>> -
>> - order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>> - if (order > pool->max_beneficial_order)
>> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>> -
>> - for (; alloc->remaining_pages;
>> + for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, alloc);
>> + alloc->remaining_pages;
>> order = ttm_pool_alloc_find_order(order, alloc)) {
>> struct ttm_pool_type *pt;
>>
>> @@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>> if (!p) {
>> page_caching = ttm_cached;
>> allow_pools = false;
>> - if (order <= pool->max_beneficial_order)
>> - gfp_flags |= __GFP_DIRECT_RECLAIM;
>> p = ttm_pool_alloc_page(pool, gfp_flags, order);
>> }
>> /* If that fails, lower the order if possible and retry. */
>>
>> Ie. don't even bother trying to allocate orders above what the driver says is useful. Could be made a drivers choice as well.
>>
>> And all could be combined with some sort of a sysfs control, as Cascardo was suggesting, to disable direct reclaim completely if someone wants that.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Another alternative would be that we add a WARN_ONCE() when we have to fallback to lower order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case without triggering the OOM killer.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Other drivers can later opt to use this mechanism too.
>>>>>>
>>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>>>> ---
>>>>>> Changes in v2:
>>>>>> - Make disabling direct reclaim an option.
>>>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>>>>
>>>>>> ---
>>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>>> ttm: pool: allow requests to prefer latency over throughput
>>>>>> ttm: pool: add a module parameter to set latency preference
>>>>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>>>>
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>>>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>>>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>>>>> include/drm/ttm/ttm_pool.h | 2 +-
>>>>>> include/drm/ttm/ttm_tt.h | 2 +-
>>>>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>>>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>>>>> ---
>>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>>
>>>>>> Best regards,
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 8:01 ` Christian König
2025-09-19 8:46 ` Tvrtko Ursulin
@ 2025-09-19 11:13 ` Thadeu Lima de Souza Cascardo
2025-09-19 11:39 ` Christian König
1 sibling, 1 reply; 13+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2025-09-19 11:13 UTC (permalink / raw)
To: Christian König
Cc: Tvrtko Ursulin, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, amd-gfx,
dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On Fri, Sep 19, 2025 at 10:01:26AM +0200, Christian König wrote:
> On 19.09.25 09:43, Tvrtko Ursulin wrote:
> > On 19/09/2025 07:46, Christian König wrote:
> >> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
> >>> On certain workloads, like on ChromeOS when opening multiple tabs and
> >>> windows, and switching desktops, memory pressure can build up and latency
> >>> is observed as high order allocations result in memory reclaim. This was
> >>> observed when running on an amdgpu.
> >>>
> >>> This is caused by TTM pool allocations; turning off direct reclaim when
> >>> doing those higher order allocations leads to lower memory pressure.
> >>>
> >>> Since turning direct reclaim off might also lead to lower throughput,
> >>> make it tunable, both as a module parameter that can be changed in sysfs
> >>> and as a flag when allocating a GEM object.
> >>>
> >>> A latency option will avoid direct reclaim for higher order allocations.
> >>>
> >>> The throughput option could be later used to more aggressively compact pages
> >>> or reclaim, by not using __GFP_NORETRY.
> >>
> >> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
> >>
> >> The behavior to allocate huge pages is a must have for the driver.
> >
> > Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>
> Yeah, I've also written before that if it is somehow beneficial for nouveau (for example), then I don't have any problem with making the policy device-dependent.
>
> But for amdgpu we have had so many bad experiences with this approach that I absolutely can't accept that.
The mechanism here allows it to be set per device. I even considered that
as a patch in the RFC, but I opted to get it out sooner so we could have
this discussion.
>
> > One side question though - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB, so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator the maximum order worth bothering with. Larger than that may bring diminishing benefit relative to the disproportionate pressure on the memory allocator and reclaim.
>
> Using 1GiB allocations would allow for the page tables to skip another layer on AMD GPUs, but the most benefit is between 4kiB and 2MiB since that can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for L2.
>
> Apart from performance, on AMD GPUs there are also some HW features which only work with huge pages, e.g. on some laptops you can get flickering on the display if the scanout buffer is backed by too many small pages.
>
> NVidia used to work on 1GiB allocations, which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading publicly available documentation I have the impression that NVidia GPUs work more or less the same as AMD GPUs regarding the TLB.
>
> Another alternative would be that we add a WARN_ONCE() when we have to fall back to lower order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case, without triggering the OOM killer.
>
> Regards,
> Christian.
>
> >
> > Regards,
> >
> > Tvrtko
> >
> >> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
> >>
Warning could be as simple as removing __GFP_NOWARN. But I don't think we
want either a warning or to trigger the OOM killer while allocating lower
order pages is still possible. That will already happen when we get to
0-order pages, where there is no fallback available anymore, and then it
makes sense to try harder and warn if no page can be allocated.
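As a sketch of what I mean at that point (illustrative only: it assumes
the __GFP_NORETRY/__GFP_NOWARN flags used for the higher order attempts
are still set in gfp_flags here, which depends on how the flags are
plumbed through the pool code):

	if (!order) {
		/*
		 * No lower order left to fall back to: retry harder and
		 * let an eventual allocation failure be visible.
		 */
		gfp_flags &= ~(__GFP_NORETRY | __GFP_NOWARN);
	}
	p = ttm_pool_alloc_page(pool, gfp_flags, order);
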
Under my current workload, the balance skews towards 0-order pages,
reducing the number of order-10 and order-9 pages by half when comparing
runs with and without direct reclaim. So I understand your concern with
respect to the impact on the GPU TLB and potential flickering.

Is there a way we can measure it on the devices we are using? And if it
does not turn out to be a problem on those devices, would making this a
per-device setting be acceptable to you? That way userspace could carry a
list of devices where it is okay to prefer not reclaiming over getting
huge pages, and set it when the workload prefers lower latency in those
allocations.
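To make the question concrete, here is a rough sketch of the kind of
per-device knob I am thinking of (every name below is made up for
illustration and the lookup from the sysfs device to the pool is
hand-waved; none of it is part of this series):

	static ssize_t prefer_latency_show(struct device *dev,
					   struct device_attribute *attr,
					   char *buf)
	{
		struct ttm_pool *pool = ttm_pool_from_dev(dev); /* hypothetical */

		return sysfs_emit(buf, "%d\n", pool->prefer_latency);
	}

	static ssize_t prefer_latency_store(struct device *dev,
					    struct device_attribute *attr,
					    const char *buf, size_t count)
	{
		struct ttm_pool *pool = ttm_pool_from_dev(dev); /* hypothetical */
		bool val;
		int ret;

		ret = kstrtobool(buf, &val);
		if (ret)
			return ret;

		/* hypothetical field consulted when picking gfp_flags */
		pool->prefer_latency = val;
		return count;
	}
	static DEVICE_ATTR_RW(prefer_latency);

The store side would only flip a flag; the actual gfp_flags decision
would stay in __ttm_pool_alloc(), so the policy lives in userspace and
the mechanism in TTM.
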
Thanks.
Cascardo.
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Other drivers can later opt to use this mechanism too.
> >>>
> >>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> >>> ---
> >>> Changes in v2:
> >>> - Make disabling direct reclaim an option.
> >>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
> >>>
> >>> ---
> >>> Thadeu Lima de Souza Cascardo (3):
> >>> ttm: pool: allow requests to prefer latency over throughput
> >>> ttm: pool: add a module parameter to set latency preference
> >>> drm/amdgpu: allow allocation preferences when creating GEM object
> >>>
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
> >>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
> >>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
> >>> include/drm/ttm/ttm_bo.h | 5 +++++
> >>> include/drm/ttm/ttm_pool.h | 2 +-
> >>> include/drm/ttm/ttm_tt.h | 2 +-
> >>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
> >>> 8 files changed, 38 insertions(+), 11 deletions(-)
> >>> ---
> >>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
> >>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
> >>>
> >>> Best regards,
> >>
> >
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 11:13 ` Thadeu Lima de Souza Cascardo
@ 2025-09-19 11:39 ` Christian König
0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2025-09-19 11:39 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Tvrtko Ursulin, Michel Dänzer, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, amd-gfx,
dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19.09.25 13:13, Thadeu Lima de Souza Cascardo wrote:
>>>
>>>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>>>
>
> Warning could be as simple as removing __GFP_NOWARN. But I don't think we
> want either a warning or to trigger the OOM killer while allocating lower
> order pages is still possible. That will already happen when we get to
> 0-order pages, where there is no fallback available anymore, and then it
> makes sense to try harder and warn if no page can be allocated.
I don't think you understand the problem.
Allocating lower order pages is not really an alternative. You run into a lot of technical issues with that.
The reason we have it is to prevent crashes in OOM situations - in other words, to still allow displaying warning messages, for example.
> Under my current workload, the balance skews towards 0-order pages,
> reducing the number of order-10 and order-9 pages by half when comparing
> runs with and without direct reclaim.
That pretty much completely disqualifies this approach.
This is a clear indicator that your system simply doesn't have enough memory for the workload you are trying to run.
> So I understand your concern with
> respect to the impact on the GPU TLB and potential flickering.
>
> Is there a way we can measure it on the devices we are using? And if it
> does not turn out to be a problem on those devices, would making this a
> per-device setting be acceptable to you? That way userspace could carry a
> list of devices where it is okay to prefer not reclaiming over getting
> huge pages, and set it when the workload prefers lower latency in those
> allocations.
No, you are clearly trying to run a use case which as far as I can see we can't really support without running into a lot of trouble sooner or later.
Regards,
Christian.
>
> Thanks.
> Cascardo.
>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Other drivers can later opt to use this mechanism too.
>>>>>
>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>>> ---
>>>>> Changes in v2:
>>>>> - Make disabling direct reclaim an option.
>>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>>>
>>>>> ---
>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>> ttm: pool: allow requests to prefer latency over throughput
>>>>> ttm: pool: add a module parameter to set latency preference
>>>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>>>
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>>>> include/drm/ttm/ttm_pool.h | 2 +-
>>>>> include/drm/ttm/ttm_tt.h | 2 +-
>>>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>>>> ---
>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>
>>>>> Best regards,
>>>>
>>>
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped
2025-09-19 10:42 ` Tvrtko Ursulin
@ 2025-09-19 12:04 ` Christian König
0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2025-09-19 12:04 UTC (permalink / raw)
To: Tvrtko Ursulin, Thadeu Lima de Souza Cascardo, Michel Dänzer,
Huang Rui, Matthew Auld, Matthew Brost, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: amd-gfx, dri-devel, linux-kernel, kernel-dev, Sergey Senozhatsky
On 19.09.25 12:42, Tvrtko Ursulin wrote:
>
> On 19/09/2025 11:17, Christian König wrote:
>> On 19.09.25 10:46, Tvrtko Ursulin wrote:
>>>
>>> On 19/09/2025 09:01, Christian König wrote:
>>>> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>>>>> On 19/09/2025 07:46, Christian König wrote:
>>>>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>>>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>>>>>> windows, and switching desktops, memory pressure can build up and latency
>>>>>>> is observed as high order allocations result in memory reclaim. This was
>>>>>>> observed when running on an amdgpu.
>>>>>>>
>>>>>>> This is caused by TTM pool allocations; turning off direct reclaim when
>>>>>>> doing those higher order allocations leads to lower memory pressure.
>>>>>>>
>>>>>>> Since turning direct reclaim off might also lead to lower throughput,
>>>>>>> make it tunable, both as a module parameter that can be changed in sysfs
>>>>>>> and as a flag when allocating a GEM object.
>>>>>>>
>>>>>>> A latency option will avoid direct reclaim for higher order allocations.
>>>>>>>
>>>>>>> The throughput option could be later used to more aggressively compact pages
>>>>>>> or reclaim, by not using __GFP_NORETRY.
>>>>>>
>>>>>> Well I can only repeat it, at least for amdgpu that is a clear NAK from my side to this.
>>>>>>
>>>>>> The behavior to allocate huge pages is a must have for the driver.
>>>>>
>>>>> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>>>>
>>>> Yeah, I've also written before that if it is somehow beneficial for nouveau (for example), then I don't have any problem with making the policy device-dependent.
>>>>
>>>> But for amdgpu we have had so many bad experiences with this approach that I absolutely can't accept that.
>>>>
>>>>> One side question though - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB, so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator the maximum order worth bothering with. Larger than that may bring diminishing benefit relative to the disproportionate pressure on the memory allocator and reclaim.
>>>>
>>>> Using 1GiB allocations would allow for the page tables to skip another layer on AMD GPUs, but the most benefit is between 4kiB and 2MiB since that can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for L2.
>>>>
>>>> Apart from performance, on AMD GPUs there are also some HW features which only work with huge pages, e.g. on some laptops you can get flickering on the display if the scanout buffer is backed by too many small pages.
>>>>
>>>> NVidia used to work on 1GiB allocations, which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading publicly available documentation I have the impression that NVidia GPUs work more or less the same as AMD GPUs regarding the TLB.
>>>
>>> 1GiB is beyond the TTM pool allocator scope, right?
>>
>> Yes, on x86 64bit the pool allocator can allocate at maximum 2MiB by default IIRC.
>
> I think 10 is the max order, so 2^10 pages x 4 KiB = 4 MiB. So it wouldn't be much relief to the allocator, but better than nothing(tm)?
Good point, that can certainly be.
>>> From what you wrote it sounds like my idea would actually be okay. A very gentle approach (minimal change in behaviour) to only disable direct reclaim above the threshold set by the driver.
>>
>> Well the problem is that the threshold set by amdgpu would be 2MiB and by default there isn't anything above it on x86. So that would be a no-op. On ARM64 that idea could potentially help.
>
> Some architectures appear to default to more than 10, and some offer a Kconfig option to change the default.
x86 also has a Kconfig option for that IIRC. So yeah, your idea is not that bad if somebody has adjusted that setting.
>
> I think this means in the patch I proposed I am missing a min(MAX_PAGE_ORDER, max_beneficial_order) when setting the pool property.
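>
> A minimal sketch of that clamp, reusing the max_beneficial_order
> parameter from the hunk quoted above:
>
> 	pool->max_beneficial_order = min_t(unsigned int, MAX_PAGE_ORDER,
> 					   max_beneficial_order);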
>
>> I could look into the HW documentation again to see what we would need as a minimum for functional correctness, but there are quite a number of use cases, and lowering from 2MiB to something like 256KiB or 512KiB potentially won't really help and will still cause a number of performance issues in the L2.
>
> It would be very good if you could check the requirements regarding functional correctness. Could that also differ per generation/part, and if so, maybe it should be made configurable in the ttm_pool API as well, as an order below which it is better to fail instead of moving to a lower one?
I do remember that it used to be 256KiB on some really old parts, but that is for >10 year old HW. For anything newer, 2MiB has always been what we have been testing with.
Failing lower order allocations is also something we have tested, but that resulted in a lot of unhappy people as well.
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>
>>> Along the lines of:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index 428265046815..06b243f05edd 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>>> for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>> ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>> adev->gmc.mem_partitions[i].numa.node,
>>> - false, false);
>>> + false, false, get_order(2 * SZ_1M));
>>> }
>>> return 0;
>>> }
>>> @@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>> adev_to_drm(adev)->anon_inode->i_mapping,
>>> adev_to_drm(adev)->vma_offset_manager,
>>> adev->need_swiotlb,
>>> - dma_addressing_limited(adev->dev));
>>> + dma_addressing_limited(adev->dev),
>>> + get_order(2 * SZ_1M));
>>> if (r) {
>>> dev_err(adev->dev,
>>> "failed initializing buffer object driver(%d).\n", r);
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index baf27c70a419..5d54e8373230 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>
>>> page_caching = tt->caching;
>>> allow_pools = true;
>>> - for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> - alloc->remaining_pages;
>>> +
>>> + order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> + if (order > pool->max_beneficial_order)
>>> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> +
>>> + for (; alloc->remaining_pages;
>>> order = ttm_pool_alloc_find_order(order, alloc)) {
>>> struct ttm_pool_type *pt;
>>>
>>> @@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>> if (!p) {
>>> page_caching = ttm_cached;
>>> allow_pools = false;
>>> + if (order <= pool->max_beneficial_order)
>>> + gfp_flags |= __GFP_DIRECT_RECLAIM;
>>> p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>> }
>>> /* If that fails, lower the order if possible and retry. */
>>> @@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>>> * Initialize the pool and its pool types.
>>> */
>>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> - int nid, bool use_dma_alloc, bool use_dma32)
>>> + int nid, bool use_dma_alloc, bool use_dma32,
>>> + unsigned int max_beneficial_order)
>>> {
>>> unsigned int i, j;
>>>
>>> @@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> pool->nid = nid;
>>> pool->use_dma_alloc = use_dma_alloc;
>>> pool->use_dma32 = use_dma32;
>>> + pool->max_beneficial_order = max_beneficial_order;
>>>
>>> for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>> for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>>
>>>
>>> That should have the page allocator working less hard and lower the latency with large buffers.
>>>
>>> Then a more aggressive change on top could be:
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index 5d54e8373230..152164f79927 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>
>>> page_caching = tt->caching;
>>> allow_pools = true;
>>> -
>>> - order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> - if (order > pool->max_beneficial_order)
>>> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> -
>>> - for (; alloc->remaining_pages;
>>> + for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, alloc);
>>> + alloc->remaining_pages;
>>> order = ttm_pool_alloc_find_order(order, alloc)) {
>>> struct ttm_pool_type *pt;
>>>
>>> @@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>> if (!p) {
>>> page_caching = ttm_cached;
>>> allow_pools = false;
>>> - if (order <= pool->max_beneficial_order)
>>> - gfp_flags |= __GFP_DIRECT_RECLAIM;
>>> p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>> }
>>> /* If that fails, lower the order if possible and retry. */
>>>
>>> I.e. don't even bother trying to allocate orders above what the driver says is useful. Could be made a driver's choice as well.
>>>
>>> And all could be combined with some sort of a sysfs control, as Cascardo was suggesting, to disable direct reclaim completely if someone wants that.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Another alternative would be that we add a WARN_ONCE() when we have to fall back to lower order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case, without triggering the OOM killer.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> The alternative I can offer is to disable the fallback which in your case would trigger the OOM killer.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Other drivers can later opt to use this mechanism too.
>>>>>>>
>>>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
>>>>>>> ---
>>>>>>> Changes in v2:
>>>>>>> - Make disabling direct reclaim an option.
>>>>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f80fa@igalia.com
>>>>>>>
>>>>>>> ---
>>>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>>>> ttm: pool: allow requests to prefer latency over throughput
>>>>>>> ttm: pool: add a module parameter to set latency preference
>>>>>>> drm/amdgpu: allow allocation preferences when creating GEM object
>>>>>>>
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 ++-
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>>>> drivers/gpu/drm/ttm/ttm_pool.c | 23 +++++++++++++++++------
>>>>>>> drivers/gpu/drm/ttm/ttm_tt.c | 2 +-
>>>>>>> include/drm/ttm/ttm_bo.h | 5 +++++
>>>>>>> include/drm/ttm/ttm_pool.h | 2 +-
>>>>>>> include/drm/ttm/ttm_tt.h | 2 +-
>>>>>>> include/uapi/drm/amdgpu_drm.h | 9 +++++++++
>>>>>>> 8 files changed, 38 insertions(+), 11 deletions(-)
>>>>>>> ---
>>>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>>>
>>>>>>> Best regards,
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread
Thread overview: 13+ messages
2025-09-18 20:09 [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 1/3] ttm: pool: allow requests to prefer latency over throughput Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 2/3] ttm: pool: add a module parameter to set latency preference Thadeu Lima de Souza Cascardo
2025-09-18 20:09 ` [PATCH RFC v2 3/3] drm/amdgpu: allow allocation preferences when creating GEM object Thadeu Lima de Souza Cascardo
2025-09-19 6:46 ` [PATCH RFC v2 0/3] drm/ttm: allow direct reclaim to be skipped Christian König
2025-09-19 7:43 ` Tvrtko Ursulin
2025-09-19 8:01 ` Christian König
2025-09-19 8:46 ` Tvrtko Ursulin
2025-09-19 10:17 ` Christian König
2025-09-19 10:42 ` Tvrtko Ursulin
2025-09-19 12:04 ` Christian König
2025-09-19 11:13 ` Thadeu Lima de Souza Cascardo
2025-09-19 11:39 ` Christian König