* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-01 2:08 [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure Daniel Colascione
@ 2026-04-01 7:35 ` Thomas Hellström
2026-04-01 10:16 ` Christian König
2026-04-06 21:02 ` Matthew Brost
2 siblings, 0 replies; 9+ messages in thread
From: Thomas Hellström @ 2026-04-01 7:35 UTC (permalink / raw)
To: Daniel Colascione, dri-devel, intel-xe, Christian Koenig,
Huang Rui, Matthew Auld, Matthew Brost, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
linux-kernel
On Tue, 2026-03-31 at 22:08 -0400, Daniel Colascione wrote:
> TTM seems to be too eager to kick off reclaim while kwin is drawing
>
> I've noticed that in 7.0-rc6, and since at least 6.17, kwin_wayland
> stalls in DRM ioctls to xe when the system is under memory pressure,
> causing missed frames, cursor-movement stutter, and general
> sluggishness. The root cause seems to be synchronous and asynchronous
> reclaim in ttm_pool_alloc_page as TTM tries, and fails, to allocate
> progressively lower-order pages in response to pool-cache misses when
> allocating graphics buffers.
>
> Memory is fragmented enough that the compaction fails (as I can see in
> compact_fail and compact_stall in /proc/vmstat; extfrag says the normal
> pool is unusable for large allocations too). Additionally, compaction
> seems to be emptying the ttm pool, since page_pool in TTM debugfs
> reports all the buckets are empty while I'm seeing the
> kwin_wayland sluggishness.
>
> In profiles, I see time dominated by copy_pages and clear_pages in the
> TTM paging code. kswapd runs constantly despite the system as a whole
> having plenty of free memory.
>
> I can reproduce the problem on my 32GB-RAM X1C Gen 13 by booting with
> kernelcore=8G (not needed, but makes the repro happen sooner), running a
> find / >/dev/null (to fragment memory), and doing general web
> browsing. The stalls seem self-perpetuating once it gets started; it
> persists even after killing the find. I've noticed this stall in
> ordinary use too, even without the kernelcore= zone tweak, but without
> kernelcore, it usually takes a while (hours?) after boot for memory to
> become fragmented enough that higher-order allocations fail.
>
> The patch below fixes the issue for me. TBC, I'm not sure it's the
> _right_ fix, but it works for me. I'm guessing that even if the approach
> is right, a new module parameter isn't warranted.
>
> With the patch below, when I set my new max_reclaim_order ttm module
> parameter to zero, the kwin_wayland stalls under memory pressure
> stop. (TBC, this setting inhibits sync or async reclaim except for
> order-zero pages.) TTM allocation occurs in latency-critical paths
> (e.g. Wayland frame commit): do you think we _should_ reclaim here?
Could you elaborate on what exactly fixes this? You say that when you
set max_reclaim_order to 0 the kwin_wayland stalls stop, but OTOH the
default is already 0, and you also say the patch fixes the issue?
>
> BTW, I also tried having xe pass a beneficial order of 9, but it didn't
> help: we end up doing a lot of compaction work below this order anyway.
>
> Signed-off-by: Daniel Colascione <dancol@dancol.org>
Interesting. The xe bo shrinker is actually splitting pages to avoid
dipping too far into the kernel reserves when swapping stuff out,
perhaps contributing to the fragmentation. Could you check what happens
if you turn that shrinker off by disabling swap? Does that improve on
the situation?
sudo /sbin/swapoff -a
Another thing that appears bad is that if compaction fails and starts
shrinking the lower-order pools, we might end up in a pathological
situation where lower-order WC allocations split higher-order pages and
those are immediately reclaimed.
It sounds like we also need to investigate why buffer object
allocations are made in latency-critical paths.
Thanks,
Thomas
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index c0d95559197c..fd255914c0d3 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -115,9 +115,13 @@ struct ttm_pool_tt_restore {
> };
>
> static unsigned long page_pool_size;
> +static unsigned int max_reclaim_order;
>
> MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> module_param(page_pool_size, ulong, 0644);
> +MODULE_PARM_DESC(max_reclaim_order,
> +	"Maximum order that keeps upstream reclaim behavior");
> +module_param(max_reclaim_order, uint, 0644);
>
> static atomic_long_t allocated_pages;
>
> @@ -146,16 +150,14 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> * Mapping pages directly into an userspace process and calling
> * put_page() on a TTM allocated page is illegal.
> */
> - if (order)
> + if (order) {
> gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
> __GFP_THISNODE;
> -
> - /*
> - * Do not add latency to the allocation path for allocations orders
> - * device tolds us do not bring them additional performance gains.
> - */
> - if (beneficial_order && order > beneficial_order)
> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (beneficial_order && order > beneficial_order)
> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (order > max_reclaim_order)
> + gfp_flags &= ~__GFP_RECLAIM;
> + }
>
> if (!ttm_pool_uses_dma_alloc(pool)) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-01 2:08 [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure Daniel Colascione
2026-04-01 7:35 ` Thomas Hellström
@ 2026-04-01 10:16 ` Christian König
2026-04-06 21:02 ` Matthew Brost
2 siblings, 0 replies; 9+ messages in thread
From: Christian König @ 2026-04-01 10:16 UTC (permalink / raw)
To: Daniel Colascione, dri-devel, intel-xe, Huang Rui, Matthew Auld,
Matthew Brost, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter,
Thomas Hellström, linux-kernel
On 4/1/26 04:08, Daniel Colascione wrote:
> TTM seems to be too eager to kick off reclaim while kwin is drawing
>
> I've noticed that in 7.0-rc6, and since at least 6.17, kwin_wayland
> stalls in DRM ioctls to xe when the system is under memory pressure,
> causing missed frames, cursor-movement stutter, and general
> sluggishness. The root cause seems to be synchronous and asynchronous
> reclaim in ttm_pool_alloc_page as TTM tries, and fails, to allocate
> progressively lower-order pages in response to pool-cache misses when
> allocating graphics buffers.
>
> Memory is fragmented enough that the compaction fails (as I can see in
> compact_fail and compact_stall in /proc/vmstat; extfrag says the normal
> pool is unusable for large allocations too). Additionally, compaction
> seems to be emptying the ttm pool, since page_pool in TTM debugfs
> reports all the buckets are empty while I'm seeing the
> kwin_wayland sluggishness.
>
> In profiles, I see time dominated by copy_pages and clear_pages in the
> TTM paging code. kswapd runs constantly despite the system as a whole
> having plenty of free memory.
>
> I can reproduce the problem on my 32GB-RAM X1C Gen 13 by booting with
> kernelcore=8G (not needed, but makes the repro happen sooner), running a
> find / >/dev/null (to fragment memory), and doing general web
> browsing. The stalls seem self-perpetuating once it gets started; it
> persists even after killing the find. I've noticed this stall in
> ordinary use too, even without the kernelcore= zone tweak, but without
> kernelcore, it usually takes a while (hours?) after boot for memory to
> become fragmented enough that higher-order allocations fail.
>
> The patch below fixes the issue for me. TBC, I'm not sure it's the
> _right_ fix, but it works for me. I'm guessing that even if the approach
> is right, a new module parameter isn't warranted.
Yeah the module parameter is probably good for testing but really won't fly.
>
> With the patch below, when I set my new max_reclaim_order ttm module
> parameter to zero, the kwin_wayland stalls under memory pressure
> stop. (TBC, this setting inhibits sync or async reclaim except for
> order-zero pages.) TTM allocation occurs in latency-critical paths
> (e.g. Wayland frame commit): do you think we _should_ reclaim here?
>
> BTW, I also tried having xe pass a beneficial order of 9, but it didn't
> help: we end up doing a lot of compaction work below this order anyway.
Well as far as I can tell that allocation behavior is completely intentional and just the lesser evil.
I can't say much for Intel HW, but for AMD HW the difference between higher order pages (2MiB) and anything below usually comes with a 20-30% performance drop. So falling back to anything below 2MiB is actually only the last resort to avoid the OOM killer.
The real question is where all that heavy fragmentation and sluggishness is coming from. Even when the find / > /dev/null creates a lot of 4KiB allocations, the kernel should be able to reclaim them and create larger pages again.
And then finally I agree with Thomas that userspace shouldn't make that many allocations on a normal desktop. That should also be a good place to start investigating what happens here.
Regards,
Christian.
>
> Signed-off-by: Daniel Colascione <dancol@dancol.org>
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index c0d95559197c..fd255914c0d3 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -115,9 +115,13 @@ struct ttm_pool_tt_restore {
> };
>
> static unsigned long page_pool_size;
> +static unsigned int max_reclaim_order;
>
> MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> module_param(page_pool_size, ulong, 0644);
> +MODULE_PARM_DESC(max_reclaim_order,
> + "Maximum order that keeps upstream reclaim behavior");
> +module_param(max_reclaim_order, uint, 0644);
>
> static atomic_long_t allocated_pages;
>
> @@ -146,16 +150,14 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> * Mapping pages directly into an userspace process and calling
> * put_page() on a TTM allocated page is illegal.
> */
> - if (order)
> + if (order) {
> gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
> __GFP_THISNODE;
> -
> - /*
> - * Do not add latency to the allocation path for allocations orders
> - * device tolds us do not bring them additional performance gains.
> - */
> - if (beneficial_order && order > beneficial_order)
> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (beneficial_order && order > beneficial_order)
> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (order > max_reclaim_order)
> + gfp_flags &= ~__GFP_RECLAIM;
> + }
>
> if (!ttm_pool_uses_dma_alloc(pool)) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-01 2:08 [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure Daniel Colascione
2026-04-01 7:35 ` Thomas Hellström
2026-04-01 10:16 ` Christian König
@ 2026-04-06 21:02 ` Matthew Brost
2026-04-06 21:53 ` Matthew Brost
2026-04-07 7:43 ` Christian König
2 siblings, 2 replies; 9+ messages in thread
From: Matthew Brost @ 2026-04-06 21:02 UTC (permalink / raw)
To: Daniel Colascione
Cc: dri-devel, intel-xe, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Thomas Hellström, linux-kernel
On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
> TTM seems to be too eager to kick off reclaim while kwin is drawing
>
> I've noticed that in 7.0-rc6, and since at least 6.17, kwin_wayland
> stalls in DRM ioctls to xe when the system is under memory pressure,
> causing missed frames, cursor-movement stutter, and general
> sluggishness. The root cause seems to be synchronous and asynchronous
> reclaim in ttm_pool_alloc_page as TTM tries, and fails, to allocate
> progressively lower-order pages in response to pool-cache misses when
> allocating graphics buffers.
>
> Memory is fragmented enough that the compaction fails (as I can see in
> compact_fail and compact_stall in /proc/vmstat; extfrag says the normal
> pool is unusable for large allocations too). Additionally, compaction
> seems to be emptying the ttm pool, since page_pool in TTM debugfs
> reports all the buckets are empty while I'm seeing the
> kwin_wayland sluggishness.
>
> In profiles, I see time dominated by copy_pages and clear_pages in the
> TTM paging code. kswapd runs constantly despite the system as a whole
> having plenty of free memory.
>
> I can reproduce the problem on my 32GB-RAM X1C Gen 13 by booting with
> kernelcore=8G (not needed, but makes the repro happen sooner), running a
> find / >/dev/null (to fragment memory), and doing general web
> browsing. The stalls seem self-perpetuating once it gets started; it
> persists even after killing the find. I've noticed this stall in
> ordinary use too, even without the kernelcore= zone tweak, but without
> kernelcore, it usually takes a while (hours?) after boot for memory to
> become fragmented enough that higher-order allocations fail.
>
> The patch below fixes the issue for me. TBC, I'm not sure it's the
> _right_ fix, but it works for me. I'm guessing that even if the approach
> is right, a new module parameter isn't warranted.
>
> With the patch below, when I set my new max_reclaim_order ttm module
> parameter to zero, the kwin_wayland stalls under memory pressure
> stop. (TBC, this setting inhibits sync or async reclaim except for
> order-zero pages.) TTM allocation occurs in latency-critical paths
> (e.g. Wayland frame commit): do you think we _should_ reclaim here?
>
> BTW, I also tried having xe pass a beneficial order of 9, but it didn't
> help: we end up doing a lot of compaction work below this order anyway.
I was going to suggest changing Xe to align with what AMDGPU is doing [1].
Unfortunately this didn’t help.
[1] https://elixir.bootlin.com/linux/v6.19.11/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L1795
>
> Signed-off-by: Daniel Colascione <dancol@dancol.org>
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index c0d95559197c..fd255914c0d3 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -115,9 +115,13 @@ struct ttm_pool_tt_restore {
> };
>
> static unsigned long page_pool_size;
> +static unsigned int max_reclaim_order;
>
> MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> module_param(page_pool_size, ulong, 0644);
> +MODULE_PARM_DESC(max_reclaim_order,
> + "Maximum order that keeps upstream reclaim behavior");
> +module_param(max_reclaim_order, uint, 0644);
>
> static atomic_long_t allocated_pages;
>
> @@ -146,16 +150,14 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> * Mapping pages directly into an userspace process and calling
> * put_page() on a TTM allocated page is illegal.
> */
> - if (order)
> + if (order) {
> gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
> __GFP_THISNODE;
> -
> - /*
> - * Do not add latency to the allocation path for allocations orders
> - * device tolds us do not bring them additional performance gains.
> - */
> - if (beneficial_order && order > beneficial_order)
> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (beneficial_order && order > beneficial_order)
> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> + if (order > max_reclaim_order)
> + gfp_flags &= ~__GFP_RECLAIM;
I’m not very familiar with this code, but at first glance it doesn’t
seem quite right.
Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
diff, help?
If I’m understanding this correctly, we would try a single allocation
attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
still attempt allocations from the pools, and then finally fall back to
allocating single pages one at a time.
Matt
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index aa41099c5ecf..f1f430aba0c1 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
struct ttm_pool_alloc_state *alloc,
struct ttm_pool_tt_restore *restore)
{
+ const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
enum ttm_caching page_caching;
gfp_t gfp_flags = GFP_USER;
pgoff_t caching_divide;
@@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
if (!p) {
page_caching = ttm_cached;
allow_pools = false;
- p = ttm_pool_alloc_page(pool, gfp_flags, order);
+ if (!order || order >= beneficial_order)
+ p = ttm_pool_alloc_page(pool, gfp_flags, order);
}
/* If that fails, lower the order if possible and retry. */
if (!p) {
> + }
>
> if (!ttm_pool_uses_dma_alloc(pool)) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-06 21:02 ` Matthew Brost
@ 2026-04-06 21:53 ` Matthew Brost
2026-04-07 7:43 ` Christian König
1 sibling, 0 replies; 9+ messages in thread
From: Matthew Brost @ 2026-04-06 21:53 UTC (permalink / raw)
To: Daniel Colascione
Cc: dri-devel, intel-xe, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Thomas Hellström, linux-kernel
On Mon, Apr 06, 2026 at 02:02:44PM -0700, Matthew Brost wrote:
> On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
> > TTM seems to be too eager to kick off reclaim while kwin is drawing
> >
> > I've noticed that in 7.0-rc6, and since at least 6.17, kwin_wayland
> > stalls in DRM ioctls to xe when the system is under memory pressure,
> > causing missed frames, cursor-movement stutter, and general
> > sluggishness. The root cause seems to be synchronous and asynchronous
> > reclaim in ttm_pool_alloc_page as TTM tries, and fails, to allocate
> > progressively lower-order pages in response to pool-cache misses when
> > allocating graphics buffers.
> >
> > Memory is fragmented enough that the compaction fails (as I can see in
> > compact_fail and compact_stall in /proc/vmstat; extfrag says the normal
> > pool is unusable for large allocations too). Additionally, compaction
> > seems to be emptying the ttm pool, since page_pool in TTM debugfs
> > reports all the buckets are empty while I'm seeing the
> > kwin_wayland sluggishness.
> >
> > In profiles, I see time dominated by copy_pages and clear_pages in the
> > TTM paging code. kswapd runs constantly despite the system as a whole
> > having plenty of free memory.
> >
> > I can reproduce the problem on my 32GB-RAM X1C Gen 13 by booting with
> > kernelcore=8G (not needed, but makes the repro happen sooner), running a
> > find / >/dev/null (to fragment memory), and doing general web
> > browsing. The stalls seem self-perpetuating once it gets started; it
> > persists even after killing the find. I've noticed this stall in
> > ordinary use too, even without the kernelcore= zone tweak, but without
> > kernelcore, it usually takes a while (hours?) after boot for memory to
> > become fragmented enough that higher-order allocations fail.
> >
> > The patch below fixes the issue for me. TBC, I'm not sure it's the
> > _right_ fix, but it works for me. I'm guessing that even if the approach
> > is right, a new module parameter isn't warranted.
> >
> > With the patch below, when I set my new max_reclaim_order ttm module
> > parameter to zero, the kwin_wayland stalls under memory pressure
> > stop. (TBC, this setting inhibits sync or async reclaim except for
> > order-zero pages.) TTM allocation occurs in latency-critical paths
> > (e.g. Wayland frame commit): do you think we _should_ reclaim here?
> >
> > BTW, I also tried having xe pass a beneficial order of 9, but it didn't
> > help: we end up doing a lot of compaction work below this order anyway.
>
> I was going to suggest changing Xe to align with what AMDGPU is doing [1].
>
> Unfortunate this didn’t help.
>
> [1] https://elixir.bootlin.com/linux/v6.19.11/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L1795
>
> >
> > Signed-off-by: Daniel Colascione <dancol@dancol.org>
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > index c0d95559197c..fd255914c0d3 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -115,9 +115,13 @@ struct ttm_pool_tt_restore {
> > };
> >
> > static unsigned long page_pool_size;
> > +static unsigned int max_reclaim_order;
> >
> > MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> > module_param(page_pool_size, ulong, 0644);
> > +MODULE_PARM_DESC(max_reclaim_order,
> > + "Maximum order that keeps upstream reclaim behavior");
> > +module_param(max_reclaim_order, uint, 0644);
> >
> > static atomic_long_t allocated_pages;
> >
> > @@ -146,16 +150,14 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> > * Mapping pages directly into an userspace process and calling
> > * put_page() on a TTM allocated page is illegal.
> > */
> > - if (order)
> > + if (order) {
> > gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
> > __GFP_THISNODE;
> > -
> > - /*
> > - * Do not add latency to the allocation path for allocations orders
> > - * device tolds us do not bring them additional performance gains.
> > - */
> > - if (beneficial_order && order > beneficial_order)
> > - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> > + if (beneficial_order && order > beneficial_order)
> > + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> > + if (order > max_reclaim_order)
> > + gfp_flags &= ~__GFP_RECLAIM;
>
> I’m not very familiar with this code, but at first glance it doesn’t
> seem quite right.
>
> Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
> diff, help?
>
> If I’m understanding this correctly, we would try a single allocation
> attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
> still attempt allocations from the pools, and then finally fall back to
> allocating single pages one at a time.
>
> Matt
>
I think I'm actually missing another part of this...
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index aa41099c5ecf..19a163334756 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -154,8 +154,12 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
* Do not add latency to the allocation path for allocations orders
* device tolds us do not bring them additional performance gains.
*/
- if (beneficial_order && order > beneficial_order)
- gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+ if (beneficial_order) {
+ if (order == beneficial_order)
+ gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+ else if (order)
+ gfp_flags &= ~__GFP_RECLAIM;
+ }
We’d need buy-in from everyone in the TTM community, but to me this
makes sense: only kick off kswapd at the orders we actually care about,
and only attempt non-order-0 allocations at or above them as well. If
you’re running a recent Xe Mesa version, the smallest page size we ever
allocate in Mesa is 2MB because of suballocation in Mesa; I suspect most
other vendors’ Mesa implementations do this too, or are moving in that
direction.
Matt
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index aa41099c5ecf..f1f430aba0c1 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> struct ttm_pool_alloc_state *alloc,
> struct ttm_pool_tt_restore *restore)
> {
> + const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
> enum ttm_caching page_caching;
> gfp_t gfp_flags = GFP_USER;
> pgoff_t caching_divide;
> @@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> if (!p) {
> page_caching = ttm_cached;
> allow_pools = false;
> - p = ttm_pool_alloc_page(pool, gfp_flags, order);
> + if (!order || order >= beneficial_order)
> + p = ttm_pool_alloc_page(pool, gfp_flags, order);
> }
> /* If that fails, lower the order if possible and retry. */
> if (!p) {
>
>
> > + }
> >
> > if (!ttm_pool_uses_dma_alloc(pool)) {
> > p = alloc_pages_node(pool->nid, gfp_flags, order);
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-06 21:02 ` Matthew Brost
2026-04-06 21:53 ` Matthew Brost
@ 2026-04-07 7:43 ` Christian König
2026-04-07 17:34 ` Matthew Brost
1 sibling, 1 reply; 9+ messages in thread
From: Christian König @ 2026-04-07 7:43 UTC (permalink / raw)
To: Matthew Brost, Daniel Colascione
Cc: dri-devel, intel-xe, Huang Rui, Matthew Auld, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Thomas Hellström, linux-kernel
On 4/6/26 23:02, Matthew Brost wrote:
> On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
...
>> -
>> - /*
>> - * Do not add latency to the allocation path for allocations orders
>> - * device tolds us do not bring them additional performance gains.
>> - */
>> - if (beneficial_order && order > beneficial_order)
>> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>> + if (beneficial_order && order > beneficial_order)
>> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>> + if (order > max_reclaim_order)
>> + gfp_flags &= ~__GFP_RECLAIM;
>
> I’m not very familiar with this code, but at first glance it doesn’t
> seem quite right.
>
> Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
> diff, help?
No, not really. The problem is that giving 9 as the beneficial order only saves us direct reclaim for order 10 (>=11 is usually not used in an x86 Linux kernel anyway).
>
> If I’m understanding this correctly, we would try a single allocation
> attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
> still attempt allocations from the pools, and then finally fall back to
> allocating single pages one at a time.
Well the code is a bit broken, but the general idea is not so bad.
What we could do is to use beneficial_order as sweet spot and set __GFP_DIRECT_RECLAIM only for the allocations with that order.
This would skip setting it for order 1..8, which are nice to have as well but not so necessary that we always need to trigger reclaim for them.
Regards,
Christian.
>
> Matt
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index aa41099c5ecf..f1f430aba0c1 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> struct ttm_pool_alloc_state *alloc,
> struct ttm_pool_tt_restore *restore)
> {
> + const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
> enum ttm_caching page_caching;
> gfp_t gfp_flags = GFP_USER;
> pgoff_t caching_divide;
> @@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> if (!p) {
> page_caching = ttm_cached;
> allow_pools = false;
> - p = ttm_pool_alloc_page(pool, gfp_flags, order);
> + if (!order || order >= beneficial_order)
> + p = ttm_pool_alloc_page(pool, gfp_flags, order);
> }
> /* If that fails, lower the order if possible and retry. */
> if (!p) {
>
>
>> + }
>>
>> if (!ttm_pool_uses_dma_alloc(pool)) {
>> p = alloc_pages_node(pool->nid, gfp_flags, order);
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-07 7:43 ` Christian König
@ 2026-04-07 17:34 ` Matthew Brost
2026-04-08 8:00 ` Christian König
0 siblings, 1 reply; 9+ messages in thread
From: Matthew Brost @ 2026-04-07 17:34 UTC (permalink / raw)
To: Christian König
Cc: Daniel Colascione, dri-devel, intel-xe, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Thomas Hellström, linux-kernel
On Tue, Apr 07, 2026 at 09:43:30AM +0200, Christian König wrote:
> On 4/6/26 23:02, Matthew Brost wrote:
> > On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
> ...
> >> -
> >> - /*
> >> - * Do not add latency to the allocation path for allocations orders
> >> - * device tolds us do not bring them additional performance gains.
> >> - */
> >> - if (beneficial_order && order > beneficial_order)
> >> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> >> + if (beneficial_order && order > beneficial_order)
> >> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> >> + if (order > max_reclaim_order)
> >> + gfp_flags &= ~__GFP_RECLAIM;
> >
> > I’m not very familiar with this code, but at first glance it doesn’t
> > seem quite right.
> >
> > Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
> > diff, help?
>
> No, not really. The problem is that giving 9 as beneficial order only saves us avoiding direct reclaim for 10 (>=11 is usually not used in a x86 linux kernel anyway).
>
Yes, the first snippet was a bit incomplete. I adjusted it in a
self-reply, but that likely still isn’t exactly right either. I’ll also
take a look at how reclaim works at higher orders and how kswapd behaves
there—I’m shooting from the hip a bit at the moment.
> >
> > If I’m understanding this correctly, we would try a single allocation
> > attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
> > still attempt allocations from the pools, and then finally fall back to
> > allocating single pages one at a time.
>
> Well the code is a bit broken, but the general idea is not so bad.
>
> What we could do is to use beneficial_order as sweet spot and set __GFP_DIRECT_RECLAIM only for the allocations with that order.
>
That’s roughly what my follow-up snippet did, but with
__GFP_DIRECT_RECLAIM replaced by __GFP_KSWAPD_RECLAIM. I’m really not
sure what the correct policy should be here. But in general I agree
beneficial_order should be the sweet spot where we trigger some sort of
reclaim.
> This would skip setting it for order 1..8, which are nice to have as well but not so necessary that we always need to trigger reclaim for them.
>
This has made me think a bit further. I’m not really sure the current
approach of TTM setting policy is actually the right choice—it might be
better to give drivers more control so they can tune this themselves.
Rough idea...
struct ttm_pool_order_policy {
bool enable; /* Should I call ttm_pool_alloc_page for an order */
gfp_t reclaim_mask; /* Used in ttm_pool_alloc_page &= ~reclaim_mask; */
};
Then, in ttm_pool_init, we could optionally pass in a table (0 →
MAX_PAGE_ORDER) that controls the allocation pipeline in
__ttm_pool_alloc.
This may be overkill, and it still wouldn’t provide per-BO control,
which might be desirable for cases like compositors versus compute
workloads, etc.
What do you think?
Matt
> Regards,
> Christian.
>
> >
> > Matt
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > index aa41099c5ecf..f1f430aba0c1 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> > struct ttm_pool_alloc_state *alloc,
> > struct ttm_pool_tt_restore *restore)
> > {
> > + const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
> > enum ttm_caching page_caching;
> > gfp_t gfp_flags = GFP_USER;
> > pgoff_t caching_divide;
> > @@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> > if (!p) {
> > page_caching = ttm_cached;
> > allow_pools = false;
> > - p = ttm_pool_alloc_page(pool, gfp_flags, order);
> > + if (!order || order >= beneficial_order)
> > + p = ttm_pool_alloc_page(pool, gfp_flags, order);
> > }
> > /* If that fails, lower the order if possible and retry. */
> > if (!p) {
> >
> >
> >> + }
> >>
> >> if (!ttm_pool_uses_dma_alloc(pool)) {
> >> p = alloc_pages_node(pool->nid, gfp_flags, order);
>
^ permalink raw reply [flat|nested] 9+ messages in thread

* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-07 17:34 ` Matthew Brost
@ 2026-04-08 8:00 ` Christian König
2026-04-09 5:12 ` Matthew Brost
0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2026-04-08 8:00 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Colascione, dri-devel, intel-xe, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Thomas Hellström, linux-kernel
On 4/7/26 19:34, Matthew Brost wrote:
> On Tue, Apr 07, 2026 at 09:43:30AM +0200, Christian König wrote:
>> On 4/6/26 23:02, Matthew Brost wrote:
>>> On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
>> ...
>>>> -
>>>> - /*
>>>> - * Do not add latency to the allocation path for allocations orders
>>>> - * device tolds us do not bring them additional performance gains.
>>>> - */
>>>> - if (beneficial_order && order > beneficial_order)
>>>> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>>> + if (beneficial_order && order > beneficial_order)
>>>> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>>> + if (order > max_reclaim_order)
>>>> + gfp_flags &= ~__GFP_RECLAIM;
>>>
>>> I’m not very familiar with this code, but at first glance it doesn’t
>>> seem quite right.
>>>
>>> Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
>>> diff, help?
>>
>> No, not really. The problem is that giving 9 as the beneficial order only avoids direct reclaim for order 10 (>=11 is usually not used in an x86 Linux kernel anyway).
>>
>
> Yes, the first snippet was a bit incomplete. I adjusted it in a
> self-reply, but that likely still isn’t exactly right either. I’ll also
> take a look at how reclaim works at higher orders and how kswapd behaves
> there—I’m shooting from the hip a bit at the moment.
>
>>>
>>> If I’m understanding this correctly, we would try a single allocation
>>> attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
>>> still attempt allocations from the pools, and then finally fall back to
>>> allocating single pages one at a time.
>>
>> Well the code is a bit broken, but the general idea is not so bad.
>>
>> What we could do is use beneficial_order as the sweet spot and set __GFP_DIRECT_RECLAIM only for allocations of that order.
>>
>
> That’s roughly what my follow-up snippet did, but with
> __GFP_DIRECT_RECLAIM replaced by __GFP_KSWAPD_RECLAIM. I’m really not
> sure what the correct policy should be here. But in general I agree
> beneficial_order should be the sweet spot where we trigger some sort of
> reclaim.
>
>> This would skip setting it for order 1..8, which are nice to have as well but not so necessary that we always need to trigger reclaim for them.
>>
>
> This has made me think a bit further. I’m not really sure the current
> approach of TTM setting policy is actually the right choice—it might be
> better to give drivers more control so they can tune this themselves.
>
> Rough idea...
>
> struct ttm_pool_order_policy {
> bool enable; /* Should I call ttm_pool_alloc_page for an order */
> gfp_t reclaim_mask; /* Used in ttm_pool_alloc_page &= ~reclaim_mask; */
> };
>
> Then, in ttm_pool_init, we could optionally pass in a table (0 →
> MAX_PAGE_ORDER) that controls the allocation pipeline in
> __ttm_pool_alloc.
>
> This may be overkill, and it still wouldn’t provide per-BO control,
> which might be desirable for cases like compositors versus compute
> workloads, etc.
>
> What do you think?
From my experience, it's rather unlikely that you'd need to completely disable allocation of a specific order.
Different HW has different sweet spots it wants for allocation, e.g. 64k, 256k, 2M etc., but in general it has proven to always be beneficial to try to allocate large pages first, just to speed up allocation (calling GFP once for a 2M page compared to 512 times for 4k pages makes a huge difference).
I also don't want to overload the driver->TTM interface with too much information, so just giving the sweet spot, or maybe a mask for the most desired orders, should potentially do it.
Christian.
>
> Matt
>
>> Regards,
>> Christian.
>>
>>>
>>> Matt
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index aa41099c5ecf..f1f430aba0c1 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>> struct ttm_pool_alloc_state *alloc,
>>> struct ttm_pool_tt_restore *restore)
>>> {
>>> + const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
>>> enum ttm_caching page_caching;
>>> gfp_t gfp_flags = GFP_USER;
>>> pgoff_t caching_divide;
>>> @@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>> if (!p) {
>>> page_caching = ttm_cached;
>>> allow_pools = false;
>>> - p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>> + if (!order || order >= beneficial_order)
>>> + p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>> }
>>> /* If that fails, lower the order if possible and retry. */
>>> if (!p) {
>>>
>>>
>>>> + }
>>>>
>>>> if (!ttm_pool_uses_dma_alloc(pool)) {
>>>> p = alloc_pages_node(pool->nid, gfp_flags, order);
>>
* Re: [RFC PATCH] Limit reclaim to avoid TTM desktop stutter under mem pressure
2026-04-08 8:00 ` Christian König
@ 2026-04-09 5:12 ` Matthew Brost
0 siblings, 0 replies; 9+ messages in thread
From: Matthew Brost @ 2026-04-09 5:12 UTC (permalink / raw)
To: Christian König
Cc: Daniel Colascione, dri-devel, intel-xe, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Thomas Hellström, linux-kernel
On Wed, Apr 08, 2026 at 10:00:26AM +0200, Christian König wrote:
> On 4/7/26 19:34, Matthew Brost wrote:
> > On Tue, Apr 07, 2026 at 09:43:30AM +0200, Christian König wrote:
> >> On 4/6/26 23:02, Matthew Brost wrote:
> >>> On Tue, Mar 31, 2026 at 10:08:58PM -0400, Daniel Colascione wrote:
> >> ...
> >>>> -
> >>>> - /*
> >>>> - * Do not add latency to the allocation path for allocations orders
> >>>> - * device tolds us do not bring them additional performance gains.
> >>>> - */
> >>>> - if (beneficial_order && order > beneficial_order)
> >>>> - gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> >>>> + if (beneficial_order && order > beneficial_order)
> >>>> + gfp_flags &= ~__GFP_DIRECT_RECLAIM;
> >>>> + if (order > max_reclaim_order)
> >>>> + gfp_flags &= ~__GFP_RECLAIM;
> >>>
> >>> I’m not very familiar with this code, but at first glance it doesn’t
> >>> seem quite right.
> >>>
> >>> Would setting Xe’s beneficial to 9, similar to AMD’s, along with this
> >>> diff, help?
> >>
> >> No, not really. The problem is that giving 9 as the beneficial order only avoids direct reclaim for order 10 (>=11 is usually not used in an x86 Linux kernel anyway).
> >>
> >
> > Yes, the first snippet was a bit incomplete. I adjusted it in a
> > self-reply, but that likely still isn’t exactly right either. I’ll also
> > take a look at how reclaim works at higher orders and how kswapd behaves
> > there—I’m shooting from the hip a bit at the moment.
> >
> >>>
> >>> If I’m understanding this correctly, we would try a single allocation
> >>> attempt with __GFP_DIRECT_RECLAIM cleared for the size we care about,
> >>> still attempt allocations from the pools, and then finally fall back to
> >>> allocating single pages one at a time.
> >>
> >> Well the code is a bit broken, but the general idea is not so bad.
> >>
> >> What we could do is to use beneficial_order as sweet spot and set __GFP_DIRECT_RECLAIM only for the allocations with that order.
> >>
> >
> > That’s roughly what my follow-up snippet did, but with
> > __GFP_DIRECT_RECLAIM replaced by __GFP_KSWAPD_RECLAIM. I’m really not
> > sure what the correct policy should be here. But in general I agree
> > beneficial_order should be the sweet spot where we trigger some sort of
> > reclaim.
> >
> >> This would skip setting it for order 1..8, which are nice to have as well but not so necessary that we always need to trigger reclaim for them.
> >>
> >
> > This has made me think a bit further. I’m not really sure the current
> > approach of TTM setting policy is actually the right choice—it might be
> > better to give drivers more control so they can tune this themselves.
> >
> > Rough idea...
> >
> > struct ttm_pool_order_policy {
> > bool enable; /* Should I call ttm_pool_alloc_page for an order */
> > gfp_t reclaim_mask; /* Used in ttm_pool_alloc_page &= ~reclaim_mask; */
> > };
> >
> > Then, in ttm_pool_init, we could optionally pass in a table (0 →
> > MAX_PAGE_ORDER) that controls the allocation pipeline in
> > __ttm_pool_alloc.
> >
> > This may be overkill, and it still wouldn’t provide per-BO control,
> > which might be desirable for cases like compositors versus compute
> > workloads, etc.
> >
> > What do you think?
>
> From my experience, it's rather unlikely that you'd need to completely disable allocation of a specific order.
>
That might be true; I haven't really dug in here.
> Different HW has different sweet spots it wants for allocation, e.g. 64k, 256k, 2M etc., but in general it has proven to always be beneficial to try to allocate large pages first, just to speed up allocation (calling GFP once for a 2M page compared to 512 times for 4k pages makes a huge difference).
Yes, I agree that '2M page compared to 512 times for 4k pages makes a huge
difference'; likewise, dma-mapping 2M pages helps a ton vs 4k.
>
> I also don't want to overload the driver->TTM interface with too much information, so just giving the sweet spot, or maybe a mask for the most desired orders, should potentially do it.
>
I think this is a good place to start. I suspect direct reclaim on the
beneficial order, plus direct reclaim at order 0, is enough.
The description of this issue is a bit confusing, but it really looks
like many smaller pages are being held onto somewhere, which completely
throws kswapd into a loop.
Matt
> Christian.
>
> >
> > Matt
> >
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>>
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> >>> index aa41099c5ecf..f1f430aba0c1 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> >>> @@ -714,6 +714,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> >>> struct ttm_pool_alloc_state *alloc,
> >>> struct ttm_pool_tt_restore *restore)
> >>> {
> >>> + const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
> >>> enum ttm_caching page_caching;
> >>> gfp_t gfp_flags = GFP_USER;
> >>> pgoff_t caching_divide;
> >>> @@ -757,7 +758,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> >>> if (!p) {
> >>> page_caching = ttm_cached;
> >>> allow_pools = false;
> >>> - p = ttm_pool_alloc_page(pool, gfp_flags, order);
> >>> + if (!order || order >= beneficial_order)
> >>> + p = ttm_pool_alloc_page(pool, gfp_flags, order);
> >>> }
> >>> /* If that fails, lower the order if possible and retry. */
> >>> if (!p) {
> >>>
> >>>
> >>>> + }
> >>>>
> >>>> if (!ttm_pool_uses_dma_alloc(pool)) {
> >>>> p = alloc_pages_node(pool->nid, gfp_flags, order);
> >>
>