public inbox for stable@vger.kernel.org
* [PATCH] drm/ttm/pool: back up at native page order
@ 2026-05-04  4:26 Matthew Brost
  2026-05-04  8:35 ` Thomas Hellström
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Brost @ 2026-05-04  4:26 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	linux-kernel, stable, Thomas Hellström

ttm_pool_split_for_swap() splits high-order pool pages into order-0
pages during backup so each 4K page can be released to the system as
soon as it has been written to shmem. While this minimizes the
allocator's working set during reclaim, it actively fragments memory:
every TTM-backed compound page that the shrinker touches is shattered
into order-0 pages, even when the rest of the system would prefer that
the high-order block stay intact. Under sustained kswapd pressure this
is enough to drive other parts of MM into recovery loops from which
they cannot easily escape, because the memory TTM just freed is no
longer contiguous.

Stop splitting on the backup path and back up each compound atomically
at its native order in ttm_pool_backup():

  - For each non-handle slot, read the order from the head page and
    back up all 1<<order subpages to consecutive shmem indices,
    writing the resulting handles into tt->pages[] as we go.
  - On any per-subpage backup failure, drop the handles we just wrote
    for this compound and restore the original page pointers, so the
    compound is left fully intact and may be retried later. shrunken
    is only incremented once the whole compound succeeds.
  - On success, the compound is freed once at its native order. No
    split_page(), no per-4K refcount juggling, no fragmentation
    introduced from this path.
  - Slots that already hold a backup handle from a previous partial
    attempt are skipped. A compound that would extend past a
    fault-injection-truncated num_pages is skipped rather than split.

The restore-side leftover-page branch in ttm_pool_restore_commit() is
left as-is for now: that path can still split a previously-retained
compound, but in practice it is unreachable under realistic workloads
(per profiling we have not been able to trigger it), so it is not
worth complicating the restore state machine to avoid the split there.
If it ever becomes a problem in practice it can be addressed
independently.

ttm_pool_split_for_swap() itself is retained for the restore path's
sole remaining caller. The DMA-mapped pre-backup unmap loop, the
purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free() already
operate at native order and are unchanged.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages")
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

A follow-up should attempt writeback to shmem at folio order as well,
but the API for doing so is unclear and may be incomplete.

This patch is related to the pending series [1] and significantly
reduces the likelihood of Xe entering a kswapd loop under fragmentation.
The kswapd → shrinker → Xe shrinker → TTM backup path is still
exercised; however, with this change the backup path no longer worsens
fragmentation, which previously amplified reclaim pressure and
reinforced the kswapd loop.

Nonetheless, the pathological case that [1] aims to address still exists
and requires a proper solution. Even with this patch, a kswapd loop due
to severe fragmentation can still be triggered, although it is now
substantially harder to reproduce.

[1] https://patchwork.freedesktop.org/series/165330/
---
 drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 278bbe7a11ad..5ead0aba4bb7 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 {
 	struct file *backup = tt->backup;
 	struct page *page;
-	unsigned long handle;
 	gfp_t alloc_gfp;
 	gfp_t gfp;
 	int ret = 0;
 	pgoff_t shrunken = 0;
-	pgoff_t i, num_pages;
+	pgoff_t i, num_pages, npages;
 
 	if (WARN_ON(ttm_tt_is_backed_up(tt)))
 		return -EINVAL;
@@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 	if (IS_ENABLED(CONFIG_FAULT_INJECTION) && should_fail(&backup_fault_inject, 1))
 		num_pages = DIV_ROUND_UP(num_pages, 2);
 
-	for (i = 0; i < num_pages; ++i) {
-		s64 shandle;
+	for (i = 0; i < num_pages; i += npages) {
+		unsigned int order;
+		pgoff_t j;
 
+		npages = 1;
 		page = tt->pages[i];
 		if (unlikely(!page))
 			continue;
 
-		ttm_pool_split_for_swap(pool, page);
+		/* Already-handled entry from a previous attempt. */
+		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
+			continue;
 
-		shandle = ttm_backup_backup_page(backup, page, flags->writeback, i,
-						 gfp, alloc_gfp);
-		if (shandle < 0) {
-			/* We allow partially shrunken tts */
-			ret = shandle;
+		order = ttm_pool_page_order(pool, page);
+		npages = 1UL << order;
+
+		/*
+		 * Back up the compound atomically at its native order. If
+		 * fault injection truncated num_pages mid-compound, skip
+		 * the partial tail rather than splitting.
+		 */
+		if (unlikely(i + npages > num_pages))
 			break;
+
+		for (j = 0; j < npages; ++j) {
+			unsigned long handle;
+			s64 shandle;
+
+			if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
+			    should_fail(&backup_fault_inject, 1))
+				shandle = -1;
+			else
+				shandle = ttm_backup_backup_page(backup, page + j,
+								 flags->writeback,
+								 i + j, gfp,
+								 alloc_gfp);
+
+			if (unlikely(shandle < 0)) {
+				pgoff_t k;
+
+				ret = shandle;
+				/*
+				 * Roll back: drop the handles we just wrote
+				 * and restore the original page pointers so
+				 * the compound remains intact and may be
+				 * retried later.
+				 */
+				for (k = 0; k < j; ++k) {
+					handle = ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
+					ttm_backup_drop(backup, handle);
+					tt->pages[i + k] = page + k;
+				}
+
+				goto out;
+			}
+			handle = shandle;
+			tt->pages[i + j] = ttm_backup_handle_to_page_ptr(shandle);
 		}
-		handle = shandle;
-		tt->pages[i] = ttm_backup_handle_to_page_ptr(handle);
-		__free_pages_gpu_account(page, 0, false);
-		shrunken++;
+
+		/* Compound fully backed up; free at native order. */
+		page->private = 0;
+		__free_pages_gpu_account(page, order, false);
+		shrunken += npages;
 	}
 
+out:
 	return shrunken ? shrunken : ret;
 }
 
-- 
2.34.1



* Re: [PATCH] drm/ttm/pool: back up at native page order
  2026-05-04  4:26 [PATCH] drm/ttm/pool: back up at native page order Matthew Brost
@ 2026-05-04  8:35 ` Thomas Hellström
  2026-05-04 14:30   ` Matthew Brost
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Hellström @ 2026-05-04  8:35 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	linux-kernel, stable

Hi, Matt,

On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> ttm_pool_split_for_swap() splits high-order pool pages into order-0
> pages during backup so each 4K page can be released to the system as
> soon as it has been written to shmem. While this minimizes the
> allocator's working set during reclaim, it actively fragments memory:
> every TTM-backed compound page that the shrinker touches is shattered
> into order-0 pages, even when the rest of the system would prefer
> that
> the high-order block stay intact. Under sustained kswapd pressure
> this
> is enough to drive other parts of MM into recovery loops from which
> they cannot easily escape, because the memory TTM just freed is no
> longer contiguous.
> 
> Stop splitting on the backup path and back up each compound
> atomically
> at its native order in ttm_pool_backup():
> 
>   - For each non-handle slot, read the order from the head page and
>     back up all 1<<order subpages to consecutive shmem indices,
>     writing the resulting handles into tt->pages[] as we go.
>   - On any per-subpage backup failure, drop the handles we just wrote
>     for this compound and restore the original page pointers, so the
>     compound is left fully intact and may be retried later. shrunken
>     is only incremented once the whole compound succeeds.
>   - On success, the compound is freed once at its native order. No
>     split_page(), no per-4K refcount juggling, no fragmentation
>     introduced from this path.
>   - Slots that already hold a backup handle from a previous partial
>     attempt are skipped. A compound that would extend past a
>     fault-injection-truncated num_pages is skipped rather than split.
> 
> The restore-side leftover-page branch in ttm_pool_restore_commit() is
> left as-is for now: that path can still split a previously-retained
> compound, but in practice it is unreachable under realistic workloads
> (per profiling we have not been able to trigger it), so it is not
> worth complicating the restore state machine to avoid the split
> there.
> If it ever becomes a problem in practice it can be addressed
> independently.
> 
> ttm_pool_split_for_swap() itself is retained for the restore path's
> sole remaining caller. The DMA-mapped pre-backup unmap loop, the
> purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free() already
> operate at native order and are unchanged.

This split is intentional in that without it, we'd need to first
allocate 1 << order pages from the kernel's *reserves* in order to
later free 2 << order pages, making the shrinker much more likely to
fail in true OOM situations. (I believe this was one of the reasons the
initial shrinker attempts from AMD didn't work as expected).
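
As a rough illustration (my own numbers, not from the patch; assuming 4K
PAGE_SIZE and an order-9, i.e. 2M, pool compound), the peak net
allocation the shrinker needs before any pool memory comes back looks
like this:

/*
 * Rough illustration only: net pages that must be obtained from the
 * allocator (ultimately the reserves) before the pool compound can be
 * freed, for the split path vs. the native-order path in this patch.
 */
#include <stdio.h>

int main(void)
{
	unsigned int order = 9;              /* 2M compound with 4K pages */
	unsigned long npages = 1UL << order;

	/* Split path (current code): one shmem page is allocated, then one
	 * order-0 pool page is freed, so the net demand stays at ~1 page. */
	printf("split path: peak net demand ~1 page (4 KiB)\n");

	/* Native-order path (this patch): all shmem pages for the compound
	 * must exist before the 2M pool compound is freed in one go. */
	printf("native-order path: peak net demand %lu pages (%lu KiB)\n",
	       npages, npages * 4);
	return 0;
}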


I believe the solution here is in the ttm_backup layer. We should
introduce a ttm_backup_backup_folio() function and either insert the
page directly into the shmem object (zero-copy) or even directly into
the swap cache. Then we should completely restrict xe page allocations
to only allow THP and PAGE_SIZE (possibly 64K pages, but they'd either
need a split or perhaps they are small enough to be backed up using a
one-go copy, similar to this patch, but in the backup layer). FWIW, at
the time the shrinker was put together, AFAIU shmem split large pages
on swapping anyway, but since that appears to have changed, we need to
catch up.

A WIP for inserting directly into the swap cache is here, rebased on a
recent kernel (this is an old idea that has actually been out on RFC
once). It needs a core mm bugfix (also in the branch), but I'm not sure
the swap cache is the right place to do this, at least not if we don't
immediately schedule a write to disk; it looks like current users don't
want to keep pages in the swap cache for very long (related to that
bug):
https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2

Inserting directly into shmem (a fairly recent idea that is mostly
untested):
https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
Since shmem schedules writeout immediately when pages are moved to the
swap cache, it's not as susceptible to the above bug: swap-cache
entries are not typically held for folios for which writeout hasn't
been scheduled.

We should try to solicit feedback from mm people on these two
approaches.

/Thomas

> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to
> shrink pages")
> Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> ---
> 
> A follow-up should attempt writeback to shmem at folio order as well,
> but the API for doing so is unclear and may be incomplete.
> 
> This patch is related to the pending series [1] and significantly
> reduces the likelihood of Xe entering a kswapd loop under
> fragmentation.
> The kswapd → shrinker → Xe shrinker → TTM backup path is still
> exercised; however, with this change the backup path no longer
> worsens
> fragmentation, which previously amplified reclaim pressure and
> reinforced the kswapd loop.
> 
> Nonetheless, the pathological case that [1] aims to address still
> exists
> and requires a proper solution. Even with this patch, a kswapd loop
> due
> to severe fragmentation can still be triggered, although it is now
> substantially harder to reproduce.
> 
> [1] https://patchwork.freedesktop.org/series/165330/
> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-----
> --
>  1 file changed, 57 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> b/drivers/gpu/drm/ttm/ttm_pool.c
> index 278bbe7a11ad..5ead0aba4bb7 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool *pool,
> struct ttm_tt *tt,
>  {
>  	struct file *backup = tt->backup;
>  	struct page *page;
> -	unsigned long handle;
>  	gfp_t alloc_gfp;
>  	gfp_t gfp;
>  	int ret = 0;
>  	pgoff_t shrunken = 0;
> -	pgoff_t i, num_pages;
> +	pgoff_t i, num_pages, npages;
>  
>  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
>  		return -EINVAL;
> @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool *pool,
> struct ttm_tt *tt,
>  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> should_fail(&backup_fault_inject, 1))
>  		num_pages = DIV_ROUND_UP(num_pages, 2);
>  
> -	for (i = 0; i < num_pages; ++i) {
> -		s64 shandle;
> +	for (i = 0; i < num_pages; i += npages) {
> +		unsigned int order;
> +		pgoff_t j;
>  
> +		npages = 1;
>  		page = tt->pages[i];
>  		if (unlikely(!page))
>  			continue;
>  
> -		ttm_pool_split_for_swap(pool, page);
> +		/* Already-handled entry from a previous attempt. */
> +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> +			continue;
>  
> -		shandle = ttm_backup_backup_page(backup, page,
> flags->writeback, i,
> -						 gfp, alloc_gfp);
> -		if (shandle < 0) {
> -			/* We allow partially shrunken tts */
> -			ret = shandle;
> +		order = ttm_pool_page_order(pool, page);
> +		npages = 1UL << order;
> +
> +		/*
> +		 * Back up the compound atomically at its native
> order. If
> +		 * fault injection truncated num_pages mid-compound,
> skip
> +		 * the partial tail rather than splitting.
> +		 */
> +		if (unlikely(i + npages > num_pages))
>  			break;
> +
> +		for (j = 0; j < npages; ++j) {
> +			unsigned long handle;
> +			s64 shandle;
> +
> +			if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> +			    should_fail(&backup_fault_inject, 1))
> +				shandle = -1;
> +			else
> +				shandle =
> ttm_backup_backup_page(backup, page + j,
> +								
> flags->writeback,
> +								 i +
> j, gfp,
> +								
> alloc_gfp);
> +
> +			if (unlikely(shandle < 0)) {
> +				pgoff_t k;
> +
> +				ret = shandle;
> +				/*
> +				 * Roll back: drop the handles we
> just wrote
> +				 * and restore the original page
> pointers so
> +				 * the compound remains intact and
> may be
> +				 * retried later.
> +				 */
> +				for (k = 0; k < j; ++k) {
> +					handle =
> ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> +					ttm_backup_drop(backup,
> handle);
> +					tt->pages[i + k] = page + k;
> +				}
> +
> +				goto out;
> +			}
> +			handle = shandle;
> +			tt->pages[i + j] =
> ttm_backup_handle_to_page_ptr(shandle);
>  		}
> -		handle = shandle;
> -		tt->pages[i] =
> ttm_backup_handle_to_page_ptr(handle);
> -		__free_pages_gpu_account(page, 0, false);
> -		shrunken++;
> +
> +		/* Compound fully backed up; free at native order.
> */
> +		page->private = 0;
> +		__free_pages_gpu_account(page, order, false);
> +		shrunken += npages;
>  	}
>  
> +out:
>  	return shrunken ? shrunken : ret;
>  }
>  


* Re: [PATCH] drm/ttm/pool: back up at native page order
  2026-05-04  8:35 ` Thomas Hellström
@ 2026-05-04 14:30   ` Matthew Brost
  2026-05-04 15:19     ` Thomas Hellström
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Brost @ 2026-05-04 14:30 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, linux-kernel, stable

On Mon, May 04, 2026 at 10:35:23AM +0200, Thomas Hellström wrote:
> Hi, Matt,
> 
> On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> > ttm_pool_split_for_swap() splits high-order pool pages into order-0
> > pages during backup so each 4K page can be released to the system as
> > soon as it has been written to shmem. While this minimizes the
> > allocator's working set during reclaim, it actively fragments memory:
> > every TTM-backed compound page that the shrinker touches is shattered
> > into order-0 pages, even when the rest of the system would prefer
> > that
> > the high-order block stay intact. Under sustained kswapd pressure
> > this
> > is enough to drive other parts of MM into recovery loops from which
> > they cannot easily escape, because the memory TTM just freed is no
> > longer contiguous.
> > 
> > Stop splitting on the backup path and back up each compound
> > atomically
> > at its native order in ttm_pool_backup():
> > 
> >   - For each non-handle slot, read the order from the head page and
> >     back up all 1<<order subpages to consecutive shmem indices,
> >     writing the resulting handles into tt->pages[] as we go.
> >   - On any per-subpage backup failure, drop the handles we just wrote
> >     for this compound and restore the original page pointers, so the
> >     compound is left fully intact and may be retried later. shrunken
> >     is only incremented once the whole compound succeeds.
> >   - On success, the compound is freed once at its native order. No
> >     split_page(), no per-4K refcount juggling, no fragmentation
> >     introduced from this path.
> >   - Slots that already hold a backup handle from a previous partial
> >     attempt are skipped. A compound that would extend past a
> >     fault-injection-truncated num_pages is skipped rather than split.
> > 
> > The restore-side leftover-page branch in ttm_pool_restore_commit() is
> > left as-is for now: that path can still split a previously-retained
> > compound, but in practice it is unreachable under realistic workloads
> > (per profiling we have not been able to trigger it), so it is not
> > worth complicating the restore state machine to avoid the split
> > there.
> > If it ever becomes a problem in practice it can be addressed
> > independently.
> > 
> > ttm_pool_split_for_swap() itself is retained for the restore path's
> > sole remaining caller. The DMA-mapped pre-backup unmap loop, the
> > purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free() already
> > operate at native order and are unchanged.
> 
> This split is intentional in that without it, we'd need to first
> allocate 1 << order pages from the kernel's *reserves* in order to
> later free 2 << order pages, making the shrinker much more likely to
> fail in true OOM situations. (I believe this was one of the reasons the
> initial shrinker attempts from AMD didn't work as expected).
> 

So where exactly is the allocation done, shmem_read_folio_gfp() or
shmem_writeout()? I did notice, and called out in the commit message,
that those interfaces are a bit confusing with respect to whether they
actually work with higher-order allocations.

Also, FWIW, this patch by itself seems to greatly help with
fragmentation, and I haven’t seen the OOM killer kick in. I’ve done
things like running WebGL in a bunch of Chrome tabs, then running
bonnie++ (which basically uses all memory), or running IGTs, which also
use all available memory. Based on that, I’m leaning toward this patch
alone working as designed.

> 
> I believe the solution here is in the ttm_backup layer, We should
> introduce a ttm_backup_backup_folio function and either insert the page

I think something like ttm_backup_backup_folio() makes sense; I called
that out in the commit message as well.

> directly into the shmem object (zero-copy) or even directly into the
> swap cache. Then we should completely restrict xe page allocations to
> only allow THP and PAGE_SIZE (Possibly 64K pages, but they'd either
> need a split or perhaps they are small enough to be backed-up using

Yes, I did raise something like this with Christian too [1]. IMO the
driver should be able to dictate to TTM the orders it is likely to
allocate at.

[1] https://patchwork.freedesktop.org/patch/716362/?series=164338&rev=1

> one-go copy, similar to this patch, but in the backup layer). FWIW at
> the time the shrinker was put together, AFAIU SHMEM split large pages
> on swapping anyway, but since that appears to have changed, we need to
> catch up.
> 
> Inserting directly into the swap-cache WIP is here, rebased on a recent
> kernel (This is an old idea that has actually been out on RFC once).
> This needs a core mm bugfix (also in the branch), but I'm not sure the
> swap cache is the right place to do this, at least not if we don't
> immediately schedule a write to disc, it looks like current users don't
> want to keep pages in swap-cache for very long (related to that bug)
> https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2
> 
> Inserting directly into shmem (A fairly recent idea that is mostly
> untested)
> https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
> Since SHMEM schedules writeout immediately when pages are moved to the
> swap-cache, it's not as susceptible to the above bug, since swap-cache
> entries are not typically held for folios for which we haven't
> scheduled writeout.
> 

Let me take a look at these branches today.

> We should try to solicit feedback from mm people on these two
> approaches.

+1, but I think we should stop here if this patch, as-is, is OK to go
in (ideally as a fix), since, based on my testing, it seems to help
quite a bit and the current upstream shrinker is badly broken.

Matt

> 
> /Thomas
> 
> > 
> > Cc: Christian Koenig <christian.koenig@amd.com>
> > Cc: Huang Rui <ray.huang@amd.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: stable@vger.kernel.org
> > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to
> > shrink pages")
> > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Assisted-by: Claude:claude-opus-4.6
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > ---
> > 
> > A follow-up should attempt writeback to shmem at folio order as well,
> > but the API for doing so is unclear and may be incomplete.
> > 
> > This patch is related to the pending series [1] and significantly
> > reduces the likelihood of Xe entering a kswapd loop under
> > fragmentation.
> > The kswapd → shrinker → Xe shrinker → TTM backup path is still
> > exercised; however, with this change the backup path no longer
> > worsens
> > fragmentation, which previously amplified reclaim pressure and
> > reinforced the kswapd loop.
> > 
> > Nonetheless, the pathological case that [1] aims to address still
> > exists
> > and requires a proper solution. Even with this patch, a kswapd loop
> > due
> > to severe fragmentation can still be triggered, although it is now
> > substantially harder to reproduce.
> > 
> > [1] https://patchwork.freedesktop.org/series/165330/
> > ---
> >  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-----
> > --
> >  1 file changed, 57 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > b/drivers/gpu/drm/ttm/ttm_pool.c
> > index 278bbe7a11ad..5ead0aba4bb7 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  {
> >  	struct file *backup = tt->backup;
> >  	struct page *page;
> > -	unsigned long handle;
> >  	gfp_t alloc_gfp;
> >  	gfp_t gfp;
> >  	int ret = 0;
> >  	pgoff_t shrunken = 0;
> > -	pgoff_t i, num_pages;
> > +	pgoff_t i, num_pages, npages;
> >  
> >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> >  		return -EINVAL;
> > @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > should_fail(&backup_fault_inject, 1))
> >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> >  
> > -	for (i = 0; i < num_pages; ++i) {
> > -		s64 shandle;
> > +	for (i = 0; i < num_pages; i += npages) {
> > +		unsigned int order;
> > +		pgoff_t j;
> >  
> > +		npages = 1;
> >  		page = tt->pages[i];
> >  		if (unlikely(!page))
> >  			continue;
> >  
> > -		ttm_pool_split_for_swap(pool, page);
> > +		/* Already-handled entry from a previous attempt. */
> > +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > +			continue;
> >  
> > -		shandle = ttm_backup_backup_page(backup, page,
> > flags->writeback, i,
> > -						 gfp, alloc_gfp);
> > -		if (shandle < 0) {
> > -			/* We allow partially shrunken tts */
> > -			ret = shandle;
> > +		order = ttm_pool_page_order(pool, page);
> > +		npages = 1UL << order;
> > +
> > +		/*
> > +		 * Back up the compound atomically at its native
> > order. If
> > +		 * fault injection truncated num_pages mid-compound,
> > skip
> > +		 * the partial tail rather than splitting.
> > +		 */
> > +		if (unlikely(i + npages > num_pages))
> >  			break;
> > +
> > +		for (j = 0; j < npages; ++j) {
> > +			unsigned long handle;
> > +			s64 shandle;
> > +
> > +			if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > +			    should_fail(&backup_fault_inject, 1))
> > +				shandle = -1;
> > +			else
> > +				shandle =
> > ttm_backup_backup_page(backup, page + j,
> > +								
> > flags->writeback,
> > +								 i +
> > j, gfp,
> > +								
> > alloc_gfp);
> > +
> > +			if (unlikely(shandle < 0)) {
> > +				pgoff_t k;
> > +
> > +				ret = shandle;
> > +				/*
> > +				 * Roll back: drop the handles we
> > just wrote
> > +				 * and restore the original page
> > pointers so
> > +				 * the compound remains intact and
> > may be
> > +				 * retried later.
> > +				 */
> > +				for (k = 0; k < j; ++k) {
> > +					handle =
> > ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> > +					ttm_backup_drop(backup,
> > handle);
> > +					tt->pages[i + k] = page + k;
> > +				}
> > +
> > +				goto out;
> > +			}
> > +			handle = shandle;
> > +			tt->pages[i + j] =
> > ttm_backup_handle_to_page_ptr(shandle);
> >  		}
> > -		handle = shandle;
> > -		tt->pages[i] =
> > ttm_backup_handle_to_page_ptr(handle);
> > -		__free_pages_gpu_account(page, 0, false);
> > -		shrunken++;
> > +
> > +		/* Compound fully backed up; free at native order.
> > */
> > +		page->private = 0;
> > +		__free_pages_gpu_account(page, order, false);
> > +		shrunken += npages;
> >  	}
> >  
> > +out:
> >  	return shrunken ? shrunken : ret;
> >  }
> >  


* Re: [PATCH] drm/ttm/pool: back up at native page order
  2026-05-04 14:30   ` Matthew Brost
@ 2026-05-04 15:19     ` Thomas Hellström
  2026-05-04 17:35       ` Matthew Brost
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Hellström @ 2026-05-04 15:19 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, linux-kernel, stable

On Mon, 2026-05-04 at 07:30 -0700, Matthew Brost wrote:
> On Mon, May 04, 2026 at 10:35:23AM +0200, Thomas Hellström wrote:
> > Hi, Matt,
> > 
> > On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> > > ttm_pool_split_for_swap() splits high-order pool pages into
> > > order-0
> > > pages during backup so each 4K page can be released to the system
> > > as
> > > soon as it has been written to shmem. While this minimizes the
> > > allocator's working set during reclaim, it actively fragments
> > > memory:
> > > every TTM-backed compound page that the shrinker touches is
> > > shattered
> > > into order-0 pages, even when the rest of the system would prefer
> > > that
> > > the high-order block stay intact. Under sustained kswapd pressure
> > > this
> > > is enough to drive other parts of MM into recovery loops from
> > > which
> > > they cannot easily escape, because the memory TTM just freed is
> > > no
> > > longer contiguous.
> > > 
> > > Stop splitting on the backup path and back up each compound
> > > atomically
> > > at its native order in ttm_pool_backup():
> > > 
> > >   - For each non-handle slot, read the order from the head page
> > > and
> > >     back up all 1<<order subpages to consecutive shmem indices,
> > >     writing the resulting handles into tt->pages[] as we go.
> > >   - On any per-subpage backup failure, drop the handles we just
> > > wrote
> > >     for this compound and restore the original page pointers, so
> > > the
> > >     compound is left fully intact and may be retried later.
> > > shrunken
> > >     is only incremented once the whole compound succeeds.
> > >   - On success, the compound is freed once at its native order.
> > > No
> > >     split_page(), no per-4K refcount juggling, no fragmentation
> > >     introduced from this path.
> > >   - Slots that already hold a backup handle from a previous
> > > partial
> > >     attempt are skipped. A compound that would extend past a
> > >     fault-injection-truncated num_pages is skipped rather than
> > > split.
> > > 
> > > The restore-side leftover-page branch in
> > > ttm_pool_restore_commit() is
> > > left as-is for now: that path can still split a previously-
> > > retained
> > > compound, but in practice it is unreachable under realistic
> > > workloads
> > > (per profiling we have not been able to trigger it), so it is not
> > > worth complicating the restore state machine to avoid the split
> > > there.
> > > If it ever becomes a problem in practice it can be addressed
> > > independently.
> > > 
> > > ttm_pool_split_for_swap() itself is retained for the restore
> > > path's
> > > sole remaining caller. The DMA-mapped pre-backup unmap loop, the
> > > purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free()
> > > already
> > > operate at native order and are unchanged.
> > 
> > This split is intentional in that without it, we'd need to first
> > allocate 1 << order pages from the kernel's *reserves* in order to
> > later free 2 << order pages, making the shrinker much more likely
> > to
> > fail in true OOM situations. (I believe this was one of the reasons
> > the
> > initial shrinker attempts from AMD didn't work as expected).
> > 
> 
> So where exactly is allocation done—shmem_read_folio_gfp or
> shmem_writeout? I did notice and called out, in the commit message,
> that
> those interfaces are a bit confusing with respect to whether they
> actually work with higher-order allocations.

The interesting one is in shmem_read_folio_gfp(). This used to be 4K-
page only (but i915 had some tricks to make this allocate 2M folios).
My understanding (to be verified) is that this recently was changed to
allow 2M by default, and also to allow 2M folio writeout. Writeout
moves the folio from the page-cache to the swap-cache and then starts a
fs writeout operation. Pages are put back on the LRU and are freed when
writeout completes.

As I understand it, shmem_read_folio_gfp() will also potentially
allocate memory for the shmem object radix tree.
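
If we want to check what a given config actually hands back, something
like this rough, untested sketch could be wired into ttm_backup
temporarily. ttm_backup_probe_order() is a made-up name, and the
includes are just my guess at the usual suspects:

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/shmem_fs.h>

/* Untested debugging sketch only: report the folio order that shmem
 * hands back for one index of the backup file. */
static void ttm_backup_probe_order(struct file *backup, pgoff_t idx,
				   gfp_t alloc_gfp)
{
	struct address_space *mapping = file_inode(backup)->i_mapping;
	struct folio *folio;

	folio = shmem_read_folio_gfp(mapping, idx, alloc_gfp);
	if (IS_ERR(folio))
		return;

	pr_info("ttm backup: shmem returned an order-%u folio at index %lu\n",
		folio_order(folio), (unsigned long)idx);
	folio_put(folio);
}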

> 
> Also, FWIW, this patch by itself seems to greatly help with
> fragmentation, and I haven’t seen the OOM killer kick in. I’ve done
> things like running WebGL in a bunch of Chrome tabs, then running
> bonnie++ (which basically uses all memory), or running IGTs, which
> also
> use all available memory. Based on that, I’m leaning toward this
> patch
> alone working as designed.

Good to know. Perhaps it would feel safer if we completely restrict the
xe TTM order to 2M and below (if we haven't already).

> 
> > 
> > I believe the solution here is in the ttm_backup layer, We should
> > introduce a ttm_backup_backup_folio function and either insert the
> > page
> 
> I think something like ttm_backup_backup_folio() makes sense, again I
> called out in commit message.
> 
> > directly into the shmem object (zero-copy) or even directly into
> > the
> > swap cache. Then we should completely restrict xe page allocations
> > to
> > only allow THP and PAGE_SIZE (Possibly 64K pages, but they'd either
> > need a split or perhaps they are small enough to be backed-up using
> 
> Yes, I did raise something like with Christian too [1]. IMO the
> driver
> should be able dictate to TTM the orders it likely to allocate at.
> 
> [1]
> https://patchwork.freedesktop.org/patch/716362/?series=164338&rev=1
> 
> > one-go copy, similar to this patch, but in the backup layer). FWIW
> > at
> > the time the shrinker was put together, AFAIU SHMEM split large
> > pages
> > on swapping anyway, but since that appears to have changed, we need
> > to
> > catch up.
> > 
> > Inserting directly into the swap-cache WIP is here, rebased on a
> > recent
> > kernel (This is an old idea that has actually been out on RFC
> > once).
> > This needs a core mm bugfix (also in the branch), but I'm not sure
> > the
> > swap cache is the right place to do this, at least not if we don't
> > immediately schedule a write to disc, it looks like current users
> > don't
> > want to keep pages in swap-cache for very long (related to that
> > bug)
> > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2
> > 
> > Inserting directly into shmem (A fairly recent idea that is mostly
> > untested)
> > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
> > Since SHMEM schedules writeout immediately when pages are moved to
> > the
> > swap-cache, it's not as susceptible to the above bug, since swap-
> > cache
> > entries are not typically held for folios for which we haven't
> > scheduled writeout.
> > 
> 
> Let me take a look at these branches today.
> 
> > We should try to solicit feedback from mm people on these two
> > approaches.
> 
> +1, but I think we should stop here if this patch, as‑is, is OK to go
> in—ideally as a fix—since, based on my testing, it seems to help
> quite a
> bit and current upstream shrinker is badly broken.

Well I think the problem with testing shrinking behavior is that we
haven't had good test-cases, so we don't really know if this change
would break something that currently works. In the shmem documentation
there's even some wording about concerns that the shmem radix tree
allocations could accumulate and drain the kernel reserves. 

According to Google AI, the kernel reserves are around 2MiB times the
number of zones, controlled by vm.min_free_kbytes.

But I think if we push this or something similar, then we should

*) Move the ttm_backup interface to be folio-based.
*) Restrict order to 2M
*) Craft a test case that triggers a shmem_read_folio_gfp() error in
the backup path and verify that it is handled gracefully.

And then as follow-ups:

a) Investigate direct shmem insertion.
b) Address any remaining flaws from partially backed-up bos.
c) Cgroups integration, following up on airlied's work ensuring that
the evictee is charged for the shmem memory.


Thanks,
Thomas



> Matt
> 
> > 
> > /Thomas
> > 
> > > 
> > > Cc: Christian Koenig <christian.koenig@amd.com>
> > > Cc: Huang Rui <ray.huang@amd.com>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Maxime Ripard <mripard@kernel.org>
> > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: stable@vger.kernel.org
> > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper
> > > to
> > > shrink pages")
> > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Assisted-by: Claude:claude-opus-4.6
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > 
> > > ---
> > > 
> > > A follow-up should attempt writeback to shmem at folio order as
> > > well,
> > > but the API for doing so is unclear and may be incomplete.
> > > 
> > > This patch is related to the pending series [1] and significantly
> > > reduces the likelihood of Xe entering a kswapd loop under
> > > fragmentation.
> > > The kswapd → shrinker → Xe shrinker → TTM backup path is still
> > > exercised; however, with this change the backup path no longer
> > > worsens
> > > fragmentation, which previously amplified reclaim pressure and
> > > reinforced the kswapd loop.
> > > 
> > > Nonetheless, the pathological case that [1] aims to address still
> > > exists
> > > and requires a proper solution. Even with this patch, a kswapd
> > > loop
> > > due
> > > to severe fragmentation can still be triggered, although it is
> > > now
> > > substantially harder to reproduce.
> > > 
> > > [1] https://patchwork.freedesktop.org/series/165330/
> > > ---
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-
> > > ----
> > > --
> > >  1 file changed, 57 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > > b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index 278bbe7a11ad..5ead0aba4bb7 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool
> > > *pool,
> > > struct ttm_tt *tt,
> > >  {
> > >  	struct file *backup = tt->backup;
> > >  	struct page *page;
> > > -	unsigned long handle;
> > >  	gfp_t alloc_gfp;
> > >  	gfp_t gfp;
> > >  	int ret = 0;
> > >  	pgoff_t shrunken = 0;
> > > -	pgoff_t i, num_pages;
> > > +	pgoff_t i, num_pages, npages;
> > >  
> > >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> > >  		return -EINVAL;
> > > @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool
> > > *pool,
> > > struct ttm_tt *tt,
> > >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > should_fail(&backup_fault_inject, 1))
> > >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> > >  
> > > -	for (i = 0; i < num_pages; ++i) {
> > > -		s64 shandle;
> > > +	for (i = 0; i < num_pages; i += npages) {
> > > +		unsigned int order;
> > > +		pgoff_t j;
> > >  
> > > +		npages = 1;
> > >  		page = tt->pages[i];
> > >  		if (unlikely(!page))
> > >  			continue;
> > >  
> > > -		ttm_pool_split_for_swap(pool, page);
> > > +		/* Already-handled entry from a previous
> > > attempt. */
> > > +		if
> > > (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > > +			continue;
> > >  
> > > -		shandle = ttm_backup_backup_page(backup, page,
> > > flags->writeback, i,
> > > -						 gfp,
> > > alloc_gfp);
> > > -		if (shandle < 0) {
> > > -			/* We allow partially shrunken tts */
> > > -			ret = shandle;
> > > +		order = ttm_pool_page_order(pool, page);
> > > +		npages = 1UL << order;
> > > +
> > > +		/*
> > > +		 * Back up the compound atomically at its native
> > > order. If
> > > +		 * fault injection truncated num_pages mid-
> > > compound,
> > > skip
> > > +		 * the partial tail rather than splitting.
> > > +		 */
> > > +		if (unlikely(i + npages > num_pages))
> > >  			break;
> > > +
> > > +		for (j = 0; j < npages; ++j) {
> > > +			unsigned long handle;
> > > +			s64 shandle;
> > > +
> > > +			if (IS_ENABLED(CONFIG_FAULT_INJECTION)
> > > &&
> > > +			    should_fail(&backup_fault_inject,
> > > 1))
> > > +				shandle = -1;
> > > +			else
> > > +				shandle =
> > > ttm_backup_backup_page(backup, page + j,
> > > +								
> > > flags->writeback,
> > > +								
> > > i +
> > > j, gfp,
> > > +								
> > > alloc_gfp);
> > > +
> > > +			if (unlikely(shandle < 0)) {
> > > +				pgoff_t k;
> > > +
> > > +				ret = shandle;
> > > +				/*
> > > +				 * Roll back: drop the handles
> > > we
> > > just wrote
> > > +				 * and restore the original page
> > > pointers so
> > > +				 * the compound remains intact
> > > and
> > > may be
> > > +				 * retried later.
> > > +				 */
> > > +				for (k = 0; k < j; ++k) {
> > > +					handle =
> > > ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> > > +					ttm_backup_drop(backup,
> > > handle);
> > > +					tt->pages[i + k] = page
> > > + k;
> > > +				}
> > > +
> > > +				goto out;
> > > +			}
> > > +			handle = shandle;
> > > +			tt->pages[i + j] =
> > > ttm_backup_handle_to_page_ptr(shandle);
> > >  		}
> > > -		handle = shandle;
> > > -		tt->pages[i] =
> > > ttm_backup_handle_to_page_ptr(handle);
> > > -		__free_pages_gpu_account(page, 0, false);
> > > -		shrunken++;
> > > +
> > > +		/* Compound fully backed up; free at native
> > > order.
> > > */
> > > +		page->private = 0;
> > > +		__free_pages_gpu_account(page, order, false);
> > > +		shrunken += npages;
> > >  	}
> > >  
> > > +out:
> > >  	return shrunken ? shrunken : ret;
> > >  }
> > >  


* Re: [PATCH] drm/ttm/pool: back up at native page order
  2026-05-04 15:19     ` Thomas Hellström
@ 2026-05-04 17:35       ` Matthew Brost
  0 siblings, 0 replies; 5+ messages in thread
From: Matthew Brost @ 2026-05-04 17:35 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, linux-kernel, stable

On Mon, May 04, 2026 at 05:19:42PM +0200, Thomas Hellström wrote:
> On Mon, 2026-05-04 at 07:30 -0700, Matthew Brost wrote:
> > On Mon, May 04, 2026 at 10:35:23AM +0200, Thomas Hellström wrote:
> > > Hi, Matt,
> > > 
> > > On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> > > > ttm_pool_split_for_swap() splits high-order pool pages into
> > > > order-0
> > > > pages during backup so each 4K page can be released to the system
> > > > as
> > > > soon as it has been written to shmem. While this minimizes the
> > > > allocator's working set during reclaim, it actively fragments
> > > > memory:
> > > > every TTM-backed compound page that the shrinker touches is
> > > > shattered
> > > > into order-0 pages, even when the rest of the system would prefer
> > > > that
> > > > the high-order block stay intact. Under sustained kswapd pressure
> > > > this
> > > > is enough to drive other parts of MM into recovery loops from
> > > > which
> > > > they cannot easily escape, because the memory TTM just freed is
> > > > no
> > > > longer contiguous.
> > > > 
> > > > Stop splitting on the backup path and back up each compound
> > > > atomically
> > > > at its native order in ttm_pool_backup():
> > > > 
> > > >   - For each non-handle slot, read the order from the head page
> > > > and
> > > >     back up all 1<<order subpages to consecutive shmem indices,
> > > >     writing the resulting handles into tt->pages[] as we go.
> > > >   - On any per-subpage backup failure, drop the handles we just
> > > > wrote
> > > >     for this compound and restore the original page pointers, so
> > > > the
> > > >     compound is left fully intact and may be retried later.
> > > > shrunken
> > > >     is only incremented once the whole compound succeeds.
> > > >   - On success, the compound is freed once at its native order.
> > > > No
> > > >     split_page(), no per-4K refcount juggling, no fragmentation
> > > >     introduced from this path.
> > > >   - Slots that already hold a backup handle from a previous
> > > > partial
> > > >     attempt are skipped. A compound that would extend past a
> > > >     fault-injection-truncated num_pages is skipped rather than
> > > > split.
> > > > 
> > > > The restore-side leftover-page branch in
> > > > ttm_pool_restore_commit() is
> > > > left as-is for now: that path can still split a previously-
> > > > retained
> > > > compound, but in practice it is unreachable under realistic
> > > > workloads
> > > > (per profiling we have not been able to trigger it), so it is not
> > > > worth complicating the restore state machine to avoid the split
> > > > there.
> > > > If it ever becomes a problem in practice it can be addressed
> > > > independently.
> > > > 
> > > > ttm_pool_split_for_swap() itself is retained for the restore
> > > > path's
> > > > sole remaining caller. The DMA-mapped pre-backup unmap loop, the
> > > > purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free()
> > > > already
> > > > operate at native order and are unchanged.
> > > 
> > > This split is intentional in that without it, we'd need to first
> > > allocate 1 << order pages from the kernel's *reserves* in order to
> > > later free 2 << order pages, making the shrinker much more likely
> > > to
> > > fail in true OOM situations. (I believe this was one of the reasons
> > > the
> > > initial shrinker attempts from AMD didn't work as expected).
> > > 
> > 
> > So where exactly is allocation done—shmem_read_folio_gfp or
> > shmem_writeout? I did notice and called out, in the commit message,
> > that
> > those interfaces are a bit confusing with respect to whether they
> > actually work with higher-order allocations.
> 
> The interesting one is in shmem_read_folio_gfp(). This used to be 4K-
> page only (but i915 had some tricks to make this allocate 2M folios).


It looks like to_folio() in ttm_backup_backup_page() (the output from
shmem_read_folio_gfp()) is always order-0, at least with how I have the
kernel configured.

I see what you’re saying here: we need to allocate however many order-0
pages are required in shmem_read_folio_gfp() before we call
__free_pages_gpu_account() on the higher-order folio we’re backing up.
In the worst case (again, with my kernel configuration on x86), this is
order 10 (4MB). I think certain Kconfig options can make this larger,
and on platforms like ARM these higher orders can represent huge amounts
of memory.

> My understanding (to be verified) is that this recently was changed to
> allow 2M by default, and also to allow 2M folio writeout. Writeout
> moves the folio from the page-cache to the swap-cache and then starts a
> fs writeout operation. Pages are put back on the LRU and are freed when
> writeout completes.
> 
> As I understand it, shmem_read_folio_gfp() will also potentially
> allocate memory for the shmem object radix tree.
> 
> > 
> > Also, FWIW, this patch by itself seems to greatly help with
> > fragmentation, and I haven’t seen the OOM killer kick in. I’ve done
> > things like running WebGL in a bunch of Chrome tabs, then running
> > bonnie++ (which basically uses all memory), or running IGTs, which
> > also
> > use all available memory. Based on that, I’m leaning toward this
> > patch
> > alone working as designed.
> 
> Good to know. Perhaps it would feel safer if we completely restrict the
> xe TTM order to 2M and below (if we haven't already).
> 

We don’t do this today, but that seems like a reasonable idea, although
the default order-10 on x86 isn’t really all that bad either.

What if ttm_backup_backup_page() fails while we're trying to back up a
higher-order page? We could then split the page, free the pages up to
the current point of iteration, and then retry to make forward
progress. If I recall correctly, if it fails again, the shrinker
gracefully handles partial backups.
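
Something along these lines, maybe. This is only a rough, untested
sketch on top of this patch's inner loop (it replaces the rollback
branch and reuses the helpers already used there); instead of rolling
back, it splits and keeps the progress made so far, leaving the rest of
the former compound for a later pass:

		if (unlikely(shandle < 0)) {
			pgoff_t k;

			ret = shandle;
			/*
			 * Untested idea: rather than dropping the handles we
			 * just wrote, split the compound and free the
			 * subpages that already made it to shmem. The
			 * remaining order-0 pages stay in tt->pages[] and
			 * partially shrunken tts are already handled.
			 */
			if (order) {
				ttm_pool_split_for_swap(pool, page);
				for (k = 0; k < j; ++k) {
					__free_pages_gpu_account(page + k, 0, false);
					shrunken++;
				}
			}
			goto out;
		}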

> > 
> > > 
> > > I believe the solution here is in the ttm_backup layer, We should
> > > introduce a ttm_backup_backup_folio function and either insert the
> > > page
> > 
> > I think something like ttm_backup_backup_folio() makes sense, again I
> > called out in commit message.
> > 
> > > directly into the shmem object (zero-copy) or even directly into
> > > the
> > > swap cache. Then we should completely restrict xe page allocations
> > > to
> > > only allow THP and PAGE_SIZE (Possibly 64K pages, but they'd either
> > > need a split or perhaps they are small enough to be backed-up using
> > 
> > Yes, I did raise something like with Christian too [1]. IMO the
> > driver
> > should be able dictate to TTM the orders it likely to allocate at.
> > 
> > [1]
> > https://patchwork.freedesktop.org/patch/716362/?series=164338&rev=1
> > 
> > > one-go copy, similar to this patch, but in the backup layer). FWIW
> > > at
> > > the time the shrinker was put together, AFAIU SHMEM split large
> > > pages
> > > on swapping anyway, but since that appears to have changed, we need
> > > to
> > > catch up.
> > > 
> > > Inserting directly into the swap-cache WIP is here, rebased on a
> > > recent
> > > kernel (This is an old idea that has actually been out on RFC
> > > once).
> > > This needs a core mm bugfix (also in the branch), but I'm not sure
> > > the
> > > swap cache is the right place to do this, at least not if we don't
> > > immediately schedule a write to disc, it looks like current users
> > > don't
> > > want to keep pages in swap-cache for very long (related to that
> > > bug)
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2
> > > 
> > > Inserting directly into shmem (A fairly recent idea that is mostly
> > > untested)
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
> > > Since SHMEM schedules writeout immediately when pages are moved to
> > > the
> > > swap-cache, it's not as susceptible to the above bug, since swap-
> > > cache
> > > entries are not typically held for folios for which we haven't
> > > scheduled writeout.
> > > 
> > 
> > Let me take a look at these branches today.
> > 
> > > We should try to solicit feedback from mm people on these two
> > > approaches.
> > 
> > +1, but I think we should stop here if this patch, as‑is, is OK to go
> > in—ideally as a fix—since, based on my testing, it seems to help
> > quite a
> > bit and current upstream shrinker is badly broken.
> 
> Well I think the problem with testing shrinking behavior is that we
> haven't had good test-cases, so we don't really know if this change

I agree that we don't have great test cases. We can try to get some
better IGTs, but I really think we need some end-to-end testing like
what I've been manually doing (e.g., opening a bunch of WebGL tabs on a
system without a lot of memory to trigger shrinking, and/or running
memory-heavy workloads on the CLI at the same time; compiling the
kernel with a large number of threads is likely a decent option).

> would break something that currently works. In the shmem documentation
> there's even some wording about concerns that the shmem radix tree
> allocations could accumulate and drain the kernel reserves. 
> 
> According to Google AI, the kernel reserves are around 2MiB times the
> number of zones, controlled by vm.min_free_kbytes.
> 
> But I think if we would push this or something similar but then we
> should 
> 
> *) Move the ttm_backup interface to be folio-based.
> *) Restrict order to 2M

I can 'Restrict order to 2M' in this series if you think it is a good
idea.

> *) Craft a test-case that triggers a shmem_read_folio_gfp() error in
> the backup path and verify that they do behave gracefully.
> 

I think, as a stop-gap and backportable fix, this patch plus a fallback
to splitting with error injection (I have error injection in this
series and I have run it with errors injected) is a reasonable option.
Longer term, yes, moving ttm_backup to be folio-based makes sense.

> And then as follow-ups:
> 
> a) Investigate direct shmem insertion.
> b) Address any remaining flaws from partially backed-up bos.
> c) Cgroups integration, following up on airlied's work ensuring that
> the evictee is charged for the shmem memory.
> 

And then also +1 on exploring these options.

Matt

> 
> Thanks,
> Thomas
> 
> 
> 
> > Matt
> > 
> > > 
> > > /Thomas
> > > 
> > > > 
> > > > Cc: Christian Koenig <christian.koenig@amd.com>
> > > > Cc: Huang Rui <ray.huang@amd.com>
> > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Maxime Ripard <mripard@kernel.org>
> > > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > > Cc: David Airlie <airlied@gmail.com>
> > > > Cc: Simona Vetter <simona@ffwll.ch>
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper
> > > > to
> > > > shrink pages")
> > > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Assisted-by: Claude:claude-opus-4.6
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > 
> > > > ---
> > > > 
> > > > A follow-up should attempt writeback to shmem at folio order as
> > > > well,
> > > > but the API for doing so is unclear and may be incomplete.
> > > > 
> > > > This patch is related to the pending series [1] and significantly
> > > > reduces the likelihood of Xe entering a kswapd loop under
> > > > fragmentation.
> > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still
> > > > exercised; however, with this change the backup path no longer
> > > > worsens
> > > > fragmentation, which previously amplified reclaim pressure and
> > > > reinforced the kswapd loop.
> > > > 
> > > > Nonetheless, the pathological case that [1] aims to address still
> > > > exists
> > > > and requires a proper solution. Even with this patch, a kswapd
> > > > loop
> > > > due
> > > > to severe fragmentation can still be triggered, although it is
> > > > now
> > > > substantially harder to reproduce.
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > ---
> > > >  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-
> > > > ----
> > > > --
> > > >  1 file changed, 57 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > index 278bbe7a11ad..5ead0aba4bb7 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool
> > > > *pool,
> > > > struct ttm_tt *tt,
> > > >  {
> > > >  	struct file *backup = tt->backup;
> > > >  	struct page *page;
> > > > -	unsigned long handle;
> > > >  	gfp_t alloc_gfp;
> > > >  	gfp_t gfp;
> > > >  	int ret = 0;
> > > >  	pgoff_t shrunken = 0;
> > > > -	pgoff_t i, num_pages;
> > > > +	pgoff_t i, num_pages, npages;
> > > >  
> > > >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> > > >  		return -EINVAL;
> > > > @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool
> > > > *pool,
> > > > struct ttm_tt *tt,
> > > >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > > should_fail(&backup_fault_inject, 1))
> > > >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> > > >  
> > > > -	for (i = 0; i < num_pages; ++i) {
> > > > -		s64 shandle;
> > > > +	for (i = 0; i < num_pages; i += npages) {
> > > > +		unsigned int order;
> > > > +		pgoff_t j;
> > > >  
> > > > +		npages = 1;
> > > >  		page = tt->pages[i];
> > > >  		if (unlikely(!page))
> > > >  			continue;
> > > >  
> > > > -		ttm_pool_split_for_swap(pool, page);
> > > > +		/* Already-handled entry from a previous
> > > > attempt. */
> > > > +		if
> > > > (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > > > +			continue;
> > > >  
> > > > -		shandle = ttm_backup_backup_page(backup, page,
> > > > flags->writeback, i,
> > > > -						 gfp,
> > > > alloc_gfp);
> > > > -		if (shandle < 0) {
> > > > -			/* We allow partially shrunken tts */
> > > > -			ret = shandle;
> > > > +		order = ttm_pool_page_order(pool, page);
> > > > +		npages = 1UL << order;
> > > > +
> > > > +		/*
> > > > +		 * Back up the compound atomically at its native
> > > > order. If
> > > > +		 * fault injection truncated num_pages mid-
> > > > compound,
> > > > skip
> > > > +		 * the partial tail rather than splitting.
> > > > +		 */
> > > > +		if (unlikely(i + npages > num_pages))
> > > >  			break;
> > > > +
> > > > +		for (j = 0; j < npages; ++j) {
> > > > +			unsigned long handle;
> > > > +			s64 shandle;
> > > > +
> > > > +			if (IS_ENABLED(CONFIG_FAULT_INJECTION)
> > > > &&
> > > > +			    should_fail(&backup_fault_inject,
> > > > 1))
> > > > +				shandle = -1;
> > > > +			else
> > > > +				shandle =
> > > > ttm_backup_backup_page(backup, page + j,
> > > > +								
> > > > flags->writeback,
> > > > +								
> > > > i +
> > > > j, gfp,
> > > > +								
> > > > alloc_gfp);
> > > > +
> > > > +			if (unlikely(shandle < 0)) {
> > > > +				pgoff_t k;
> > > > +
> > > > +				ret = shandle;
> > > > +				/*
> > > > +				 * Roll back: drop the handles
> > > > we
> > > > just wrote
> > > > +				 * and restore the original page
> > > > pointers so
> > > > +				 * the compound remains intact
> > > > and
> > > > may be
> > > > +				 * retried later.
> > > > +				 */
> > > > +				for (k = 0; k < j; ++k) {
> > > > +					handle =
> > > > ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> > > > +					ttm_backup_drop(backup,
> > > > handle);
> > > > +					tt->pages[i + k] = page
> > > > + k;
> > > > +				}
> > > > +
> > > > +				goto out;
> > > > +			}
> > > > +			handle = shandle;
> > > > +			tt->pages[i + j] =
> > > > ttm_backup_handle_to_page_ptr(shandle);
> > > >  		}
> > > > -		handle = shandle;
> > > > -		tt->pages[i] =
> > > > ttm_backup_handle_to_page_ptr(handle);
> > > > -		__free_pages_gpu_account(page, 0, false);
> > > > -		shrunken++;
> > > > +
> > > > +		/* Compound fully backed up; free at native
> > > > order.
> > > > */
> > > > +		page->private = 0;
> > > > +		__free_pages_gpu_account(page, order, false);
> > > > +		shrunken += npages;
> > > >  	}
> > > >  
> > > > +out:
> > > >  	return shrunken ? shrunken : ret;
> > > >  }
> > > >  

