From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>,
	Christian Koenig <christian.koenig@amd.com>,
	Huang Rui <ray.huang@amd.com>,
	Matthew Auld <matthew.auld@intel.com>,
	Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	Maxime Ripard <mripard@kernel.org>,
	Thomas Zimmermann <tzimmermann@suse.de>,
	David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
	<linux-kernel@vger.kernel.org>, <stable@vger.kernel.org>
Subject: Re: [PATCH] drm/ttm/pool: back up at native page order
Date: Mon, 4 May 2026 10:35:56 -0700
Message-ID: <afjY/OHiSovekTTX@gsse-cloud1.jf.intel.com>
In-Reply-To: <017d25edabb2e4f60da7421278bffa20f51b0142.camel@linux.intel.com>

On Mon, May 04, 2026 at 05:19:42PM +0200, Thomas Hellström wrote:
> On Mon, 2026-05-04 at 07:30 -0700, Matthew Brost wrote:
> > On Mon, May 04, 2026 at 10:35:23AM +0200, Thomas Hellström wrote:
> > > Hi, Matt,
> > > 
> > > On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> > > > ttm_pool_split_for_swap() splits high-order pool pages into
> > > > order-0 pages during backup so each 4K page can be released to
> > > > the system as soon as it has been written to shmem. While this
> > > > minimizes the allocator's working set during reclaim, it actively
> > > > fragments memory: every TTM-backed compound page that the shrinker
> > > > touches is shattered into order-0 pages, even when the rest of the
> > > > system would prefer that the high-order block stay intact. Under
> > > > sustained kswapd pressure this is enough to drive other parts of
> > > > MM into recovery loops from which they cannot easily escape,
> > > > because the memory TTM just freed is no longer contiguous.
> > > > 
> > > > Stop splitting on the backup path and back up each compound
> > > > atomically at its native order in ttm_pool_backup():
> > > > 
> > > >   - For each non-handle slot, read the order from the head page
> > > >     and back up all 1<<order subpages to consecutive shmem
> > > >     indices, writing the resulting handles into tt->pages[] as we
> > > >     go.
> > > >   - On any per-subpage backup failure, drop the handles we just
> > > >     wrote for this compound and restore the original page
> > > >     pointers, so the compound is left fully intact and may be
> > > >     retried later. shrunken is only incremented once the whole
> > > >     compound succeeds.
> > > >   - On success, the compound is freed once at its native order.
> > > >     No split_page(), no per-4K refcount juggling, no fragmentation
> > > >     introduced from this path.
> > > >   - Slots that already hold a backup handle from a previous
> > > >     partial attempt are skipped. A compound that would extend past
> > > >     a fault-injection-truncated num_pages is skipped rather than
> > > >     split.
> > > > 
> > > > The restore-side leftover-page branch in ttm_pool_restore_commit()
> > > > is left as-is for now: that path can still split a
> > > > previously-retained compound, but in practice it is unreachable
> > > > under realistic workloads (per profiling we have not been able to
> > > > trigger it), so it is not worth complicating the restore state
> > > > machine to avoid the split there. If it ever becomes a problem in
> > > > practice it can be addressed independently.
> > > > 
> > > > ttm_pool_split_for_swap() itself is retained for the restore
> > > > path's sole remaining caller. The DMA-mapped pre-backup unmap
> > > > loop, the purge path, ttm_pool_free_*, and
> > > > ttm_pool_unmap_and_free() already operate at native order and are
> > > > unchanged.
> > > 
> > > This split is intentional in that without it, we'd need to first
> > > allocate 1 << order pages from the kernel's *reserves* in order to
> > > later free 2 << order pages, making the shrinker much more likely
> > > to fail in true OOM situations. (I believe this was one of the
> > > reasons the initial shrinker attempts from AMD didn't work as
> > > expected).
> > > 
> > 
> > So where exactly is the allocation done, shmem_read_folio_gfp() or
> > shmem_writeout()? I did notice, and called out in the commit message,
> > that those interfaces are a bit confusing with respect to whether
> > they actually work with higher-order allocations.
> 
> The interesting one is in shmem_read_folio_gfp(). This used to be 4K-
> page only (but i915 had some tricks to make this allocate 2M folios).

It looks like to_folio in ttm_backup_backup_page() (the folio returned
by shmem_read_folio_gfp()) is always order-0, at least with how I have
the kernel configured.

I see what you’re saying here: we need to allocate however many order-0
pages are required in shmem_read_folio_gfp() before we call
__free_pages_gpu_account() on the higher-order folio we’re backing up.
In the worst case (again, with my kernel configuration on x86), this is
order 10 (4MB). I think certain Kconfig options can make this larger,
and on platforms like ARM these higher orders can represent huge amounts
of memory.
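
Putting numbers on that worst case (a worked example for my config;
PAGE_SIZE and the maximum order obviously vary by arch and Kconfig):

	1 << 10 subpages * 4 KiB/page = 4 MiB per compound

i.e., shmem_read_folio_gfp() may have to find up to 1024 order-0
folios (~4 MiB, plus shmem radix-tree nodes) before we ever call
__free_pages_gpu_account() on the 4 MiB compound we're backing up.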

> My understanding (to be verified) is that this recently was changed to
> allow 2M by default, and also to allow 2M folio writeout. Writeout
> moves the folio from the page-cache to the swap-cache and then starts a
> fs writeout operation. Pages are put back on the LRU and are freed when
> writeout completes.
> 
> As I understand it, shmem_read_folio_gfp() will also potentially
> allocate memory for the shmem object radix tree.
> 
> > 
> > Also, FWIW, this patch by itself seems to greatly help with
> > fragmentation, and I haven't seen the OOM killer kick in. I've done
> > things like running WebGL in a bunch of Chrome tabs, then running
> > bonnie++ (which basically uses all memory), or running IGTs, which
> > also use all available memory. Based on that, I'm leaning toward
> > this patch alone working as designed.
> 
> Good to know. Perhaps it would feel safer if we completely restricted
> the xe TTM order to 2M and below (if we haven't already).
> 

We don’t do this today, but that seems like a reasonable idea, although
the default order-10 on x86 isn’t really all that bad either.

What if ttm_backup_backup_page() fails while we're trying to back up a
higher-order page? We could then split the page, free the subpages
backed up so far, and retry to make forward progress (rough sketch
below). If I recall correctly, the shrinker gracefully handles partial
backups if it fails again.
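
Untested sketch, not in the posted patch; this would replace the
rollback in the new inner loop, with pool/page/order/i/j/shandle as in
the patch, and relies on tt->pages[] already holding handles for the
subpages that succeeded:

	if (unlikely(shandle < 0)) {
		pgoff_t k;

		/*
		 * Fallback sketch: shatter the compound so the subpages
		 * already written to shmem can be released individually
		 * at order 0.
		 */
		ttm_pool_split_for_swap(pool, page);
		for (k = 0; k < j; ++k) {
			__free_pages_gpu_account(page + k, 0, false);
			shrunken++;
		}

		/*
		 * Leave the remaining subpages as order-0 entries in
		 * tt->pages[] and fall back to the existing
		 * partial-backup handling, which retries on the next
		 * shrinker pass.
		 */
		ret = shandle;
		goto out;
	}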

> > 
> > > 
> > > I believe the solution here is in the ttm_backup layer. We should
> > > introduce a ttm_backup_backup_folio function and either insert the
> > > page
> > 
> > I think something like ttm_backup_backup_folio() makes sense; again,
> > I called this out in the commit message.
> > 
> > > directly into the shmem object (zero-copy) or even directly into
> > > the
> > > swap cache. Then we should completely restrict xe page allocations
> > > to
> > > only allow THP and PAGE_SIZE (Possibly 64K pages, but they'd either
> > > need a split or perhaps they are small enough to be backed-up using
> > 
> > Yes, I raised something like this with Christian too [1]. IMO the
> > driver should be able to dictate to TTM the orders it is likely to
> > allocate at.
> > 
> > [1]
> > https://patchwork.freedesktop.org/patch/716362/?series=164338&rev=1
> > 
> > > one-go copy, similar to this patch, but in the backup layer). FWIW
> > > at
> > > the time the shrinker was put together, AFAIU SHMEM split large
> > > pages
> > > on swapping anyway, but since that appears to have changed, we need
> > > to
> > > catch up.
> > > 
> > > A WIP for inserting directly into the swap-cache is here, rebased
> > > on a recent kernel (this is an old idea that has actually been out
> > > on RFC once). It needs a core mm bugfix (also in the branch), but
> > > I'm not sure the swap cache is the right place to do this, at least
> > > not if we don't immediately schedule a write to disk; it looks like
> > > current users don't want to keep pages in the swap-cache for very
> > > long (related to that bug):
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2
> > > 
> > > Inserting directly into shmem (a fairly recent idea that is mostly
> > > untested):
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
> > > Since SHMEM schedules writeout immediately when pages are moved to
> > > the
> > > swap-cache, it's not as susceptible to the above bug, since swap-
> > > cache
> > > entries are not typically held for folios for which we haven't
> > > scheduled writeout.
> > > 
> > 
> > Let me take a look at these branches today.
> > 
> > > We should try to solicit feedback from mm people on these two
> > > approaches.
> > 
> > +1, but I think we should stop here if this patch, as-is, is OK to
> > go in (ideally as a fix) since, based on my testing, it seems to
> > help quite a bit and the current upstream shrinker is badly broken.
> 
> Well, I think the problem with testing shrinking behavior is that we
> haven't had good test-cases, so we don't really know if this change

I agree that we don't have great test cases. We can try to get some
better IGTs, but I really think we need some end-to-end testing like
what I've been doing manually (e.g., opening a bunch of WebGL tabs on
a system without a lot of memory to trigger shrinking, and/or running
memory-heavy workloads on the CLI at the same time; compiling the
kernel with a large number of threads is likely a decent option).

> would break something that currently works. In the shmem documentation
> there's even some wording about concerns that the shmem radix tree
> allocations could accumulate and drain the kernel reserves. 
> 
> According to Google AI, the kernel reserves are around 2MiB times the
> number of zones, controlled by vm.min_free_kbytes.
> 
> But I think if we push this or something similar, then we
> should:
> 
> *) Move the ttm_backup interface to be folio-based.
> *) Restrict order to 2M

I can 'Restrict order to 2M' in this series if you think it is a good
idea. Something like the below is what I have in mind.
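
(A sketch only: TTM_MAX_BACKUP_ORDER is an invented name, and I'm
going from memory on ttm_pool_alloc()'s starting-order computation, so
treat it as approximate.)

	/* Invented cap; get_order(SZ_2M) == 9 with 4K pages. */
	#define TTM_MAX_BACKUP_ORDER	get_order(SZ_2M)

	/* In ttm_pool_alloc(), clamp the starting order: */
	order = min3((unsigned int)MAX_PAGE_ORDER,
		     (unsigned int)TTM_MAX_BACKUP_ORDER,
		     (unsigned int)__fls(num_pages));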

> *) Craft a test-case that triggers a shmem_read_folio_gfp() error in
> the backup path and verify that it is handled gracefully.
> 

I think, as a stop-gap and backportable fix, this patch plus a fallback
to splitting with error injection (I have error injection in this series
but I have ran and injected errors) is a reasonable option. Longer term,
yes, moving ttm_backup to be folio-based makes sense.

> And then as follow-ups:
> 
> a) Investigate direct shmem insertion.
> b) Address any remaining flaws from partially backed-up bos.
> c) Cgroups integration, following up on airlied's work ensuring that
> the evictee is charged for the shmem memory.
> 

And +1 on exploring these options as follow-ups.

Matt

> 
> Thanks,
> Thomas
> 
> 
> 
> > Matt
> > 
> > > 
> > > /Thomas
> > > 
> > > > 
> > > > Cc: Christian Koenig <christian.koenig@amd.com>
> > > > Cc: Huang Rui <ray.huang@amd.com>
> > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Maxime Ripard <mripard@kernel.org>
> > > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > > Cc: David Airlie <airlied@gmail.com>
> > > > Cc: Simona Vetter <simona@ffwll.ch>
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages")
> > > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Assisted-by: Claude:claude-opus-4.6
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > 
> > > > ---
> > > > 
> > > > A follow-up should attempt writeback to shmem at folio order as
> > > > well, but the API for doing so is unclear and may be incomplete.
> > > > 
> > > > This patch is related to the pending series [1] and significantly
> > > > reduces the likelihood of Xe entering a kswapd loop under
> > > > fragmentation. The kswapd → shrinker → Xe shrinker → TTM backup
> > > > path is still exercised; however, with this change the backup path
> > > > no longer worsens fragmentation, which previously amplified reclaim
> > > > pressure and reinforced the kswapd loop.
> > > > 
> > > > Nonetheless, the pathological case that [1] aims to address still
> > > > exists and requires a proper solution. Even with this patch, a
> > > > kswapd loop due to severe fragmentation can still be triggered,
> > > > although it is now substantially harder to reproduce.
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > ---
> > > >  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-------
> > > >  1 file changed, 57 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > index 278bbe7a11ad..5ead0aba4bb7 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > >  {
> > > >  	struct file *backup = tt->backup;
> > > >  	struct page *page;
> > > > -	unsigned long handle;
> > > >  	gfp_t alloc_gfp;
> > > >  	gfp_t gfp;
> > > >  	int ret = 0;
> > > >  	pgoff_t shrunken = 0;
> > > > -	pgoff_t i, num_pages;
> > > > +	pgoff_t i, num_pages, npages;
> > > >  
> > > >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> > > >  		return -EINVAL;
> > > > @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) && should_fail(&backup_fault_inject, 1))
> > > >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> > > >  
> > > > -	for (i = 0; i < num_pages; ++i) {
> > > > -		s64 shandle;
> > > > +	for (i = 0; i < num_pages; i += npages) {
> > > > +		unsigned int order;
> > > > +		pgoff_t j;
> > > >  
> > > > +		npages = 1;
> > > >  		page = tt->pages[i];
> > > >  		if (unlikely(!page))
> > > >  			continue;
> > > >  
> > > > -		ttm_pool_split_for_swap(pool, page);
> > > > +		/* Already-handled entry from a previous attempt. */
> > > > +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > > > +			continue;
> > > >  
> > > > -		shandle = ttm_backup_backup_page(backup, page, flags->writeback, i,
> > > > -						 gfp, alloc_gfp);
> > > > -		if (shandle < 0) {
> > > > -			/* We allow partially shrunken tts */
> > > > -			ret = shandle;
> > > > +		order = ttm_pool_page_order(pool, page);
> > > > +		npages = 1UL << order;
> > > > +
> > > > +		/*
> > > > +		 * Back up the compound atomically at its native order. If
> > > > +		 * fault injection truncated num_pages mid-compound, skip
> > > > +		 * the partial tail rather than splitting.
> > > > +		 */
> > > > +		if (unlikely(i + npages > num_pages))
> > > >  			break;
> > > > +
> > > > +		for (j = 0; j < npages; ++j) {
> > > > +			unsigned long handle;
> > > > +			s64 shandle;
> > > > +
> > > > +			if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > > +			    should_fail(&backup_fault_inject, 1))
> > > > +				shandle = -1;
> > > > +			else
> > > > +				shandle = ttm_backup_backup_page(backup, page + j,
> > > > +								 flags->writeback,
> > > > +								 i + j, gfp,
> > > > +								 alloc_gfp);
> > > > +
> > > > +			if (unlikely(shandle < 0)) {
> > > > +				pgoff_t k;
> > > > +
> > > > +				ret = shandle;
> > > > +				/*
> > > > +				 * Roll back: drop the handles we just wrote
> > > > +				 * and restore the original page pointers so
> > > > +				 * the compound remains intact and may be
> > > > +				 * retried later.
> > > > +				 */
> > > > +				for (k = 0; k < j; ++k) {
> > > > +					handle = ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> > > > +					ttm_backup_drop(backup, handle);
> > > > +					tt->pages[i + k] = page + k;
> > > > +				}
> > > > +
> > > > +				goto out;
> > > > +			}
> > > > +			handle = shandle;
> > > > +			tt->pages[i + j] = ttm_backup_handle_to_page_ptr(shandle);
> > > >  		}
> > > > -		handle = shandle;
> > > > -		tt->pages[i] = ttm_backup_handle_to_page_ptr(handle);
> > > > -		__free_pages_gpu_account(page, 0, false);
> > > > -		shrunken++;
> > > > +
> > > > +		/* Compound fully backed up; free at native order. */
> > > > +		page->private = 0;
> > > > +		__free_pages_gpu_account(page, order, false);
> > > > +		shrunken += npages;
> > > >  	}
> > > >  
> > > > +out:
> > > >  	return shrunken ? shrunken : ret;
> > > >  }
> > > >  
