From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
<dri-devel@lists.freedesktop.org>,
Christian Koenig <christian.koenig@amd.com>,
Huang Rui <ray.huang@amd.com>,
Matthew Auld <matthew.auld@intel.com>,
Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
Maxime Ripard <mripard@kernel.org>,
Thomas Zimmermann <tzimmermann@suse.de>,
David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
<linux-kernel@vger.kernel.org>, <stable@vger.kernel.org>
Subject: Re: [PATCH] drm/ttm/pool: back up at native page order
Date: Mon, 4 May 2026 07:30:29 -0700
Message-ID: <afitheEYRoy7VoPu@gsse-cloud1.jf.intel.com>
In-Reply-To: <58ea6837e2aa808bf9f3ba304395058a2d08b8d0.camel@linux.intel.com>
On Mon, May 04, 2026 at 10:35:23AM +0200, Thomas Hellström wrote:
> Hi, Matt,
>
> On Sun, 2026-05-03 at 21:26 -0700, Matthew Brost wrote:
> > ttm_pool_split_for_swap() splits high-order pool pages into order-0
> > pages during backup so each 4K page can be released to the system as
> > soon as it has been written to shmem. While this minimizes the
> > allocator's working set during reclaim, it actively fragments memory:
> > every TTM-backed compound page that the shrinker touches is shattered
> > into order-0 pages, even when the rest of the system would prefer that
> > the high-order block stay intact. Under sustained kswapd pressure this
> > is enough to drive other parts of MM into recovery loops from which
> > they cannot easily escape, because the memory TTM just freed is no
> > longer contiguous.
> >
> > Stop splitting on the backup path and back up each compound atomically
> > at its native order in ttm_pool_backup():
> >
> > - For each non-handle slot, read the order from the head page and
> >   back up all 1 << order subpages to consecutive shmem indices,
> >   writing the resulting handles into tt->pages[] as we go.
> > - On any per-subpage backup failure, drop the handles we just wrote
> >   for this compound and restore the original page pointers, so the
> >   compound is left fully intact and may be retried later. shrunken
> >   is only incremented once the whole compound succeeds.
> > - On success, the compound is freed once at its native order. No
> >   split_page(), no per-4K refcount juggling, no fragmentation
> >   introduced from this path.
> > - Slots that already hold a backup handle from a previous partial
> >   attempt are skipped. A compound that would extend past a
> >   fault-injection-truncated num_pages is skipped rather than split.
> >
> > The restore-side leftover-page branch in ttm_pool_restore_commit() is
> > left as-is for now: that path can still split a previously-retained
> > compound, but in practice it is unreachable under realistic workloads
> > (per profiling we have not been able to trigger it), so it is not
> > worth complicating the restore state machine to avoid the split there.
> > If it ever becomes a problem in practice it can be addressed
> > independently.
> >
> > ttm_pool_split_for_swap() itself is retained for the restore path's
> > sole remaining caller. The DMA-mapped pre-backup unmap loop, the
> > purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free() already
> > operate at native order and are unchanged.
>
> This split is intentional in that without it, we'd need to first
> allocate 1 << order pages from the kernel's *reserves* in order to
> later free 2 << order pages, making the shrinker much more likely to
> fail in true OOM situations. (I believe this was one of the reasons the
> initial shrinker attempts from AMD didn't work as expected).
>
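Right, I see the concern: backing up, say, an order-9 THP atomically
means the backend has to hold all 1 << 9 = 512 shmem pages for the
compound before TTM frees anything, whereas the split path releases each
4K page as soon as it has been written.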
So where exactly does that allocation happen, in shmem_read_folio_gfp()
or in shmem_writeout()? I did notice, and called out in the commit
message, that those interfaces are a bit confusing with respect to
whether they actually work with higher-order allocations.
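
To make the question concrete, here is a minimal sketch of my mental
model of the per-page backup step (the helper name and locals are mine;
only the shmem call is the real API). The allocation I'm asking about
would be the one inside shmem_read_folio_gfp(), which has to find or
allocate a folio in the shmem object before we can copy into it:

static s64 backup_one_page_sketch(struct address_space *mapping,
				  struct page *page, pgoff_t idx, gfp_t gfp)
{
	struct folio *folio;
	void *src, *dst;

	/* Allocation point: shmem finds or allocates the folio here. */
	folio = shmem_read_folio_gfp(mapping, idx, gfp);
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	src = kmap_local_page(page);
	dst = kmap_local_folio(folio, 0);
	memcpy(dst, src, PAGE_SIZE);
	kunmap_local(dst);
	kunmap_local(src);

	folio_mark_dirty(folio);
	folio_put(folio);

	return idx;	/* a real handle would encode this shmem index */
}

If that is roughly right, the order-0 vs high-order question is entirely
about what shmem_read_folio_gfp() is willing to allocate for a given
index.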
Also, FWIW, this patch by itself seems to help greatly with
fragmentation, and I haven't seen the OOM killer kick in. I've done
things like running WebGL in a bunch of Chrome tabs and then running
bonnie++ (which basically uses all memory), or running IGTs, which also
use all available memory. Based on that, I'm leaning toward this patch
alone working as designed.
>
> I believe the solution here is in the ttm_backup layer. We should
> introduce a ttm_backup_backup_folio function and either insert the page
I think something like ttm_backup_backup_folio() makes sense; again, I
called this out in the commit message.
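Concretely, I'm imagining something along these lines (a hypothetical
prototype to make the idea discussable; the name and signature are my
guess, mirroring ttm_backup_backup_page()):

/*
 * Hypothetical: back up a whole folio in one call so the backend can
 * keep it at native order, e.g. by splicing it into shmem zero-copy
 * or by inserting it directly into the swap cache.
 */
s64 ttm_backup_backup_folio(struct file *backup, struct folio *folio,
			    bool writeback, pgoff_t idx, gfp_t page_gfp,
			    gfp_t alloc_gfp);

Then ttm_pool_backup() would call it once per compound instead of once
per 4K page.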
> directly into the shmem object (zero-copy) or even directly into the
> swap cache. Then we should completely restrict xe page allocations to
> only allow THP and PAGE_SIZE (possibly 64K pages, but they'd either
> need a split, or perhaps they are small enough to be backed up using
Yes, I did raise something like this with Christian too [1]. IMO the
driver should be able to dictate to TTM the orders it is likely to
allocate at.
[1] https://patchwork.freedesktop.org/patch/716362/?series=164338&rev=1
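To make that concrete, a driver-supplied order mask could look something
like this (hypothetical sketch; these names don't exist today, and [1]
doesn't literally propose this):

/*
 * Hypothetical: let the driver restrict pool allocations to the orders
 * it actually wants, e.g. only PMD_ORDER and 0, so the backup path
 * never has to deal with awkward intermediate orders.
 */
struct ttm_pool_order_policy {
	unsigned long allowed_orders;	/* bit n set => order n allowed */
};

void ttm_pool_set_order_policy(struct ttm_pool *pool,
			       const struct ttm_pool_order_policy *policy);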
> one-go copy, similar to this patch, but in the backup layer). FWIW at
> the time the shrinker was put together, AFAIU SHMEM split large pages
> on swapping anyway, but since that appears to have changed, we need to
> catch up.
>
> The WIP for inserting directly into the swap cache is here, rebased on
> a recent kernel (this is an old idea that has actually been out on RFC
> once). It needs a core mm bugfix (also in the branch), but I'm not sure
> the swap cache is the right place to do this, at least not if we don't
> immediately schedule a write to disk; it looks like current users don't
> want to keep pages in the swap cache for very long (related to that bug):
> https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/thp_swapping2
>
> Inserting directly into shmem (a fairly recent idea that is mostly
> untested):
> https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/insert_shmem?ref_type=heads
> Because SHMEM schedules writeout immediately when pages are moved to
> the swap cache, it's not as susceptible to the above bug: swap-cache
> entries are not typically held for folios for which writeout hasn't
> been scheduled.
>
Let me take a look at these branches today.
> We should try to solicit feedback from mm people on these two
> approaches.
+1, but I think we should stop here if this patch, as-is, is OK to go in
(ideally as a fix), since, based on my testing, it helps quite a bit and
the current upstream shrinker is badly broken.
Matt
>
> /Thomas
>
> >
> > Cc: Christian Koenig <christian.koenig@amd.com>
> > Cc: Huang Rui <ray.huang@amd.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: stable@vger.kernel.org
> > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages")
> > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Assisted-by: Claude:claude-opus-4.6
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >
> > ---
> >
> > A follow-up should attempt writeback to shmem at folio order as well,
> > but the API for doing so is unclear and may be incomplete.
> >
> > This patch is related to the pending series [1] and significantly
> > reduces the likelihood of Xe entering a kswapd loop under
> > fragmentation. The kswapd → shrinker → Xe shrinker → TTM backup path
> > is still exercised; however, with this change the backup path no
> > longer worsens fragmentation, which previously amplified reclaim
> > pressure and reinforced the kswapd loop.
> >
> > Nonetheless, the pathological case that [1] aims to address still
> > exists and requires a proper solution. Even with this patch, a kswapd
> > loop due to severe fragmentation can still be triggered, although it
> > is now substantially harder to reproduce.
> >
> > [1] https://patchwork.freedesktop.org/series/165330/
> > ---
> >  drivers/gpu/drm/ttm/ttm_pool.c | 71 +++++++++++++++++++++++++++-------
> >  1 file changed, 57 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > index 278bbe7a11ad..5ead0aba4bb7 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -1036,12 +1036,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> >  {
> >  	struct file *backup = tt->backup;
> >  	struct page *page;
> > -	unsigned long handle;
> >  	gfp_t alloc_gfp;
> >  	gfp_t gfp;
> >  	int ret = 0;
> >  	pgoff_t shrunken = 0;
> > -	pgoff_t i, num_pages;
> > +	pgoff_t i, num_pages, npages;
> >  
> >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> >  		return -EINVAL;
> > @@ -1097,28 +1096,72 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> >  	    should_fail(&backup_fault_inject, 1))
> >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> >  
> > -	for (i = 0; i < num_pages; ++i) {
> > -		s64 shandle;
> > +	for (i = 0; i < num_pages; i += npages) {
> > +		unsigned int order;
> > +		pgoff_t j;
> >  
> > +		npages = 1;
> >  		page = tt->pages[i];
> >  		if (unlikely(!page))
> >  			continue;
> >  
> > -		ttm_pool_split_for_swap(pool, page);
> > +		/* Already-handled entry from a previous attempt. */
> > +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > +			continue;
> >  
> > -		shandle = ttm_backup_backup_page(backup, page, flags->writeback, i,
> > -						 gfp, alloc_gfp);
> > -		if (shandle < 0) {
> > -			/* We allow partially shrunken tts */
> > -			ret = shandle;
> > +		order = ttm_pool_page_order(pool, page);
> > +		npages = 1UL << order;
> > +
> > +		/*
> > +		 * Back up the compound atomically at its native order. If
> > +		 * fault injection truncated num_pages mid-compound, skip
> > +		 * the partial tail rather than splitting.
> > +		 */
> > +		if (unlikely(i + npages > num_pages))
> >  			break;
> > +
> > +		for (j = 0; j < npages; ++j) {
> > +			unsigned long handle;
> > +			s64 shandle;
> > +
> > +			if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > +			    should_fail(&backup_fault_inject, 1))
> > +				shandle = -1;
> > +			else
> > +				shandle = ttm_backup_backup_page(backup, page + j,
> > +								 flags->writeback,
> > +								 i + j, gfp,
> > +								 alloc_gfp);
> > +
> > +			if (unlikely(shandle < 0)) {
> > +				pgoff_t k;
> > +
> > +				ret = shandle;
> > +				/*
> > +				 * Roll back: drop the handles we just wrote
> > +				 * and restore the original page pointers so
> > +				 * the compound remains intact and may be
> > +				 * retried later.
> > +				 */
> > +				for (k = 0; k < j; ++k) {
> > +					handle = ttm_backup_page_ptr_to_handle(tt->pages[i + k]);
> > +					ttm_backup_drop(backup, handle);
> > +					tt->pages[i + k] = page + k;
> > +				}
> > +
> > +				goto out;
> > +			}
> > +			handle = shandle;
> > +			tt->pages[i + j] = ttm_backup_handle_to_page_ptr(shandle);
> >  		}
> > -		handle = shandle;
> > -		tt->pages[i] = ttm_backup_handle_to_page_ptr(handle);
> > -		__free_pages_gpu_account(page, 0, false);
> > -		shrunken++;
> > +
> > +		/* Compound fully backed up; free at native order. */
> > +		page->private = 0;
> > +		__free_pages_gpu_account(page, order, false);
> > +		shrunken += npages;
> >  	}
> >  
> > +out:
> >  	return shrunken ? shrunken : ret;
> >  }
> > 