From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Christian Koenig <christian.koenig@amd.com>,
Huang Rui <ray.huang@amd.com>,
Matthew Auld <matthew.auld@intel.com>,
Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
Maxime Ripard <mripard@kernel.org>,
Thomas Zimmermann <tzimmermann@suse.de>,
David Airlie <airlied@gmail.com>,
Simona Vetter <simona@ffwll.ch>,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order
Date: Wed, 06 May 2026 18:26:43 +0200 [thread overview]
Message-ID: <906de072af1f6744aed1eb914ff196f0f5e00016.camel@linux.intel.com> (raw)
In-Reply-To: <afto1UzV/yfNrv1S@gsse-cloud1.jf.intel.com>
On Wed, 2026-05-06 at 09:14 -0700, Matthew Brost wrote:
> On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote:
> > Hi, Matt
> >
> > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote:
> > > ttm_pool_split_for_swap() splits high-order pool pages into order-0
> > > pages during backup so each 4K page can be released to the system as
> > > soon as it has been written to shmem. While this minimizes the
> > > allocator's working set during reclaim, it actively fragments memory:
> > > every TTM-backed compound page that the shrinker touches is shattered
> > > into order-0 pages, even when the rest of the system would prefer
> > > that the high-order block stay intact. Under sustained kswapd
> > > pressure this is enough to drive other parts of MM into recovery
> > > loops from which they cannot easily escape, because the memory TTM
> > > just freed is no longer contiguous.
> > >
> > > Stop unconditionally splitting on the backup path and back up each
> > > compound at its native order in ttm_pool_backup():
> > >
> > > - For each non-handle slot, read the order from the head page and
> > >   back up all 1 << order subpages to consecutive shmem indices,
> > >   writing the resulting handles into tt->pages[] as we go.
> > > - On success, the compound is freed once at its native order. No
> > >   split_page(), no per-4K refcount juggling, no fragmentation
> > >   introduced from this path.
> > > - Slots that already hold a backup handle from a previous partial
> > >   attempt are skipped. A compound that would extend past a
> > >   fault-injection-truncated num_pages is skipped rather than split.
> > >
> > > A per-subpage backup failure cannot be made fully atomic: backing up
> > > a subpage allocates a shmem folio before the source page can be
> > > released, so under true OOM any subpage in a compound (not just the
> > > first) may fail to be backed up with the rest of the source compound
> > > still live and contiguous. To make forward progress in that case,
> > > fall back to splitting the source compound and backing up its
> > > remaining subpages individually:
> > >
> > > - On the first per-subpage failure for a compound (and only if
> > >   order > 0), call ttm_pool_split_for_swap() to split the source
> > >   compound, release the subpages whose contents already live in
> > >   shmem (their handles in tt->pages stay valid), and retry the
> > >   failing subpage at order 0.
> > > - Subsequent successful subpage backups in the now-split compound
> > >   free their source page individually as soon as the handle is
> > >   written.
> > > - A second failure after splitting terminates the loop with partial
> > >   progress; the remaining order-0 subpages stay in tt->pages as
> > >   plain page pointers and are cleaned up by the normal
> > >   ttm_pool_drop_backed_up() / ttm_pool_free_range() paths.
> > >
> > > This restores the original split-on-OOM fallback behavior while
> > > keeping the common, non-OOM case fragmentation-free. It also
> > > preserves the "partial backup is allowed" contract: shrunken is
> > > incremented per backed-up subpage so the caller still sees forward
> > > progress when a compound only partially succeeds.
> > >
> > > The restore-side leftover-page branch in ttm_pool_restore_commit()
> > > is left as-is for now: that path can still split a
> > > previously-retained compound, but in practice it is unreachable
> > > under realistic workloads (per profiling we have not been able to
> > > trigger it), so it is not worth complicating the restore state
> > > machine to avoid the split there. If it ever becomes a problem in
> > > practice it can be addressed independently.
> > >
> > > ttm_pool_split_for_swap() itself is retained both for the OOM
> > > fallback above and for the restore path's remaining caller. The
> > > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*,
> > > and ttm_pool_unmap_and_free() already operate at native order and
> > > are unchanged.
> > >
> > > Cc: Christian Koenig <christian.koenig@amd.com>
> > > Cc: Huang Rui <ray.huang@amd.com>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Maxime Ripard <mripard@kernel.org>
> > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: stable@vger.kernel.org
> > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages")
> > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Assisted-by: Claude:claude-opus-4.6
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > >
> > > ---
> > >
> > > A follow-up should attempt writeback to shmem at folio order as
> > > well, but the API for doing so is unclear and may be incomplete.
> > >
> > > This patch is related to the pending series [1] and significantly
> > > reduces the likelihood of Xe entering a kswapd loop under
> > > fragmentation. The kswapd → shrinker → Xe shrinker → TTM backup path
> > > is still exercised; however, with this change the backup path no
> > > longer worsens fragmentation, which previously amplified reclaim
> > > pressure and reinforced the kswapd loop.
> > >
> > > Nonetheless, the pathological case that [1] aims to address still
> > > exists and requires a proper solution. Even with this patch, a
> > > kswapd loop due to severe fragmentation can still be triggered,
> > > although it is now substantially harder to reproduce.
> > >
> > > v2:
> > > - Split pages and free immediately if backup fails at higher order
> > >   (Thomas)
> > > v3:
> > > - Skip handles in purge path (sashiko)
> > > v5:
> > > - Refactor into ttm_pool_backup_folio (Thomas)
> > >
> > > [1] https://patchwork.freedesktop.org/series/165330/
> > > ---
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++-----
> > >  1 file changed, 94 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index d380a3c7fe40..78efc8524133 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt *tt)
> > >  	ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt->num_pages);
> > >  }
> > >
> > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct ttm_tt *tt,
> > > +				 struct file *backup, struct folio *folio,
> > > +				 unsigned int order, bool writeback,
> > > +				 pgoff_t idx, gfp_t page_gfp, gfp_t alloc_gfp)
> >
> > I don't really understand why we can't end up with a
> > ttm_backup_backup_folio(), which I believe is the proper layering,
> > already at this point? Please see a suggestion at
> >
> > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads
> >
> > Here the splitting logic is kept in the ttm_pool, but ttm_backup
> > supports handing large folios to it.
> >
> > Although the cumulative diffstat becomes larger, the end code
> > becomes
> > smaller and IMO easier to read, and we don't need to introduce code
> > that we immediately have to refactor.
>
> That version looks fine too. If that is preference no issue.
Cool. Note that there is a bug in that we don't pass the folio order
into ttm_backup_backup_folio(). I'm force-pushing a fix for that.
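For the archives, the fix amounts to something like the below. This is
a sketch only: ttm_backup_backup_folio() is the helper from the branch
above, and the exact parameter order is whatever ends up there.

	s64 shandle;

	/*
	 * Hypothetical call site in ttm_pool: hand the whole folio plus
	 * its order to the backup layer, so it can try to allocate a
	 * matching large shmem folio instead of always going to order 0.
	 */
	shandle = ttm_backup_backup_folio(backup, folio, order, writeback,
					  idx, page_gfp, alloc_gfp);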
>
> My goal with this series is to get something that can reasonably be
> backported to LTS kernels so the desktop doesn't frequently enter a
> kswapd loop because of fragmentation. We now have at least 3 reports
> of this being an issue.
>
> This is a larger fix [1] which works in tandem but seems unlikely to
> be backportable given it adds new concepts to the core MM.
>
> [1] https://patchwork.freedesktop.org/series/165329/
>
> >
> > But I'm starting to question the general approach: Even if the
> > *shrinker* can recover from a total kernel memory reserve
> > depletion, it
> > can't really be considered a reasonable practice, since if we
> > frequently deplete the reserves, *other* important allocations in
> > the
> > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail
> > and
> > people will have a hard time finding out why.
> >
>
> Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and
> eventually make progress—i.e., take the split path if needed? I’m not
> 100% sure, but my initial reaction is that this concern may not be
> valid; however, MM is hard to reason about.
No, GFP_ATOMIC just uses what's available without any reclaim at all.
It's more aggressive than GFP_NOWAIT in that it allows dipping into the
kernel reserves.
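Roughly, from include/linux/gfp_types.h (reserve-level modifiers
omitted; the exact definitions vary a bit between kernel versions):

	#define GFP_ATOMIC	(__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)

Neither GFP_ATOMIC nor GFP_NOWAIT sets __GFP_DIRECT_RECLAIM, so
neither can block or call into the shrinkers; __GFP_HIGH is what lets
GFP_ATOMIC consume the reserves, which is why depleting them hurts.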
>
> Again, FWIW, I’ve tried a lot of things to trigger OOM—for example,
> running WebGL tabs and then kicking off various very memory-intensive
> workloads from the CLI—and I still haven’t hit OOM or seen memory
> allocation failures or warnings.
>
> > So I actually don't think we can avoid the splitting without
> > direct insertion. FWIW, up until recently when shmem started
> > supporting
>
> I agree direct insertion is a better solution. Do you think this is
> something we could reasonably get working and backport? I haven't done
> any research on direct insertion yet, hence why I'm asking.
Yes I think so. The problem would be to get it accepted. Looking into
that now, but hitting various kinds of subtle issues.
Thanks,
Thomas
>
> > huge page swapping, other GPU drivers basically also split pages at
> > swapout.
>
> I wonder if other drivers have the same issue? The deadly combo is
> allowing GPUs to subscribe all of system memory, allocating THP pages
> (or higher-order pages), and splitting them in the shrinker. Xe might
> be the only driver with the right combo to hit this, but I'm not 100%
> sure without a deep dive.
>
> >
> > Another idea for improving on the compaction loop, perhaps worth
> > trying, is this change, shamelessly stolen from i915:
> >
> > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads
> >
>
> I'd have to give this a try - I'm quickly running out of time before
> I leave for a month though.
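
For reference, the core of that change is just raising the shrinker's
batch size so vmscan makes fewer, larger ->scan_objects calls instead
of many tiny ones. A sketch against the current shrinker API; the
xe_shrinker_* names are hypothetical:

	struct shrinker *s;

	s = shrinker_alloc(0, "drm-xe_gem");
	if (s) {
		s->count_objects = xe_shrinker_count;	/* hypothetical */
		s->scan_objects = xe_shrinker_scan;	/* hypothetical */
		s->seeks = DEFAULT_SEEKS;
		s->batch = 4096;	/* scan in large batches, as i915 does */
		shrinker_register(s);
	}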
>
> Matt
>
> > /Thomas
> >
> >
> > > +{
> > > +	struct page *page = folio_page(folio, 0);
> > > +	int shrunken = 0, npages = 1UL << order, ret = 0, i;
> > > +	bool folio_has_been_split = false;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		s64 shandle;
> > > +
> > > +try_again_after_split:
> > > +		if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > +		    should_fail(&backup_fault_inject, 1))
> > > +			shandle = -ENOMEM;
> > > +		else
> > > +			shandle = ttm_backup_backup_page(backup, page + i,
> > > +							 writeback, idx + i,
> > > +							 page_gfp, alloc_gfp);
> > > +
> > > +		if (shandle < 0 && !folio_has_been_split && order) {
> > > +			pgoff_t j;
> > > +
> > > +			/*
> > > +			 * True OOM: could not allocate a shmem folio
> > > +			 * for the next subpage. Fall back to splitting
> > > +			 * the source compound and backing up subpages
> > > +			 * individually. Release the already-backed-up
> > > +			 * subpages whose contents now live in shmem;
> > > +			 * any further failure terminates the loop with
> > > +			 * partial progress (handled by the caller).
> > > +			 */
> > > +			folio_has_been_split = true;
> > > +			ttm_pool_split_for_swap(pool, page);
> > > +
> > > +			for (j = 0; j < i; ++j) {
> > > +				__free_pages_gpu_account(page + j, 0, false);
> > > +				shrunken++;
> > > +			}
> > > +
> > > +			goto try_again_after_split;
> > > +		} else if (shandle < 0) {
> > > +			ret = shandle;
> > > +			goto out;
> > > +		} else if (folio_has_been_split) {
> > > +			__free_pages_gpu_account(page + i, 0, false);
> > > +			shrunken++;
> > > +		}
> > > +
> > > +		tt->pages[idx + i] = ttm_backup_handle_to_page_ptr(shandle);
> > > +	}
> > > +
> > > +	if (!folio_has_been_split) {
> > > +		/* Compound fully backed up; free at native order. */
> > > +		page->private = 0;
> > > +		__free_pages_gpu_account(page, order, false);
> > > +		shrunken += npages;
> > > +	}
> > > +
> > > +out:
> > > +	return shrunken ? shrunken : ret;
> > > +}
> > > +
> > >  /**
> > >   * ttm_pool_backup() - Back up or purge a struct ttm_tt
> > >   * @pool: The pool used when allocating the struct ttm_tt.
> > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > >  {
> > >  	struct file *backup = tt->backup;
> > >  	struct page *page;
> > > -	unsigned long handle;
> > >  	gfp_t alloc_gfp;
> > >  	gfp_t gfp;
> > >  	int ret = 0;
> > >  	pgoff_t shrunken = 0;
> > > -	pgoff_t i, num_pages;
> > > +	pgoff_t i, num_pages, npages;
> > >
> > >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> > >  		return -EINVAL;
> > >
> > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > >  		unsigned int order;
> > >
> > >  		page = tt->pages[i];
> > > -		if (unlikely(!page)) {
> > > +		if (unlikely(!page ||
> > > +			     ttm_backup_page_ptr_is_handle(page))) {
> > >  			num_pages = 1;
> > >  			continue;
> > >  		}
> > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > >  	    should_fail(&backup_fault_inject, 1))
> > >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> > >
> > > -	for (i = 0; i < num_pages; ++i) {
> > > -		s64 shandle;
> > > +	for (i = 0; i < num_pages; i += npages) {
> > > +		unsigned int order;
> > >
> > > +		npages = 1;
> > >  		page = tt->pages[i];
> > >  		if (unlikely(!page))
> > >  			continue;
> > >
> > > -		ttm_pool_split_for_swap(pool, page);
> > > +		/* Already-handled entry from a previous attempt. */
> > > +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > > +			continue;
> > >
> > > -		shandle = ttm_backup_backup_page(backup, page, flags->writeback, i,
> > > -						 gfp, alloc_gfp);
> > > -		if (shandle < 0) {
> > > -			/* We allow partially shrunken tts */
> > > -			ret = shandle;
> > > +		order = ttm_pool_page_order(pool, page);
> > > +		npages = 1UL << order;
> > > +
> > > +		/*
> > > +		 * Back up the compound atomically at its native order. If
> > > +		 * fault injection truncated num_pages mid-compound, skip
> > > +		 * the partial tail rather than splitting.
> > > +		 */
> > > +		if (unlikely(i + npages > num_pages))
> > > +			break;
> > > +
> > > +		ret = ttm_pool_backup_folio(pool, tt, backup, page_folio(page),
> > > +					    order, flags->writeback, i, gfp,
> > > +					    alloc_gfp);
> > > +		if (unlikely(ret < 0))
> > > +			break;
> > > +
> > > +		shrunken += ret;
> > > +
> > > +		/* partial backup */
> > > +		if (unlikely(ret != npages))
> > >  			break;
> > > -		}
> > > -		handle = shandle;
> > > -		tt->pages[i] = ttm_backup_handle_to_page_ptr(handle);
> > > -		__free_pages_gpu_account(page, 0, false);
> > > -		shrunken++;
> > >  	}
> > >
> > >  	return shrunken ? shrunken : ret;