* [PATCH v5 1/2] drm/ttm: Drop tt->restore after successful restore [not found] <20260505200443.3300962-1-matthew.brost@intel.com> @ 2026-05-05 20:04 ` Matthew Brost 2026-05-05 20:04 ` [PATCH v5 2/2] drm/ttm/pool: back up at native page order Matthew Brost 1 sibling, 0 replies; 7+ messages in thread From: Matthew Brost @ 2026-05-05 20:04 UTC (permalink / raw) To: intel-xe, dri-devel Cc: Thomas Hellström, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable ttm_pool_restore_and_alloc() can successfully complete the restore process via ttm_pool_restore_commit(), but tt->restore is not dropped afterward. As a result, subsequent backup/restore flows observe what appears to be a completed restore, while in reality shmem handles are still installed in tt->pages, leading to the stack trace below. Fix this by freeing and dropping tt->restore in ttm_pool_restore_and_alloc() upon successful completion of the restore. 20545 [ 309.784531] RIP: 0010:sg_alloc_append_table_from_pages+0x38c/0x490 20547 [ 309.809570] RSP: 0018:ffffc9000623b838 EFLAGS: 00010206 20548 [ 309.814827] RAX: 0000000000001000 RBX: ffff88816e42a160 RCX: 0000000000000000 20549 [ 309.821986] RDX: 0000000000002000 RSI: 0000000000000003 RDI: 0000000000001000 20550 [ 309.829147] RBP: ffff88816e42a168 R08: 0000000000000002 R09: 000000007ffff000 20551 [ 309.836310] R10: ffffc9000623b928 R11: 0000000000000000 R12: 000000007ffff000 20552 [ 309.843471] R13: ffff88815ba5a100 R14: 0000000000000000 R15: 0000000000000001 20553 [ 309.850634] FS: 00007f9ff305e700(0000) GS:ffff888276c94000(0000) knlGS:0000000000000000 20554 [ 309.858749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 20555 [ 309.864519] CR2: 00007f9fca701000 CR3: 00000001565e2005 CR4: 0000000008f70ef0 20556 [ 309.871678] PKRU: 55555558 20557 [ 309.874403] Call Trace: 20558 [ 309.876866] <TASK> 20559 [ 309.878988] sg_alloc_table_from_pages_segment+0x60/0x100 20560 [ 309.884415] ? ttm_resource_manager_usage+0x36/0x60 [ttm] 20561 [ 309.889845] ? xe_tt_map_sg+0x7d/0xd0 [xe] 20562 [ 309.894045] xe_tt_map_sg+0x7d/0xd0 [xe] 20563 [ 309.898037] xe_bo_move+0x927/0xaa0 [xe] 20564 [ 309.902029] ttm_bo_handle_move_mem+0xba/0x170 [ttm] 20565 [ 309.907022] ttm_bo_validate+0xbe/0x190 [ttm] 20566 [ 309.911405] xe_bo_validate+0x9a/0x120 [xe] 20567 [ 309.915663] xe_gpuvm_validate+0xd9/0x140 [xe] 20568 [ 309.920206] drm_gpuvm_validate+0x2f0/0x5b0 [drm_gpuvm] 20569 [ 309.925459] ? drm_exec_lock_obj+0x63/0x210 [drm_exec] 20570 [ 309.930627] xe_vm_validate_rebind+0x46/0xb0 [xe] 20571 [ 309.935428] xe_exec_fn+0x20/0x40 [xe] 20572 [ 309.939249] drm_gpuvm_exec_lock+0x78/0xc0 [drm_gpuvm] 20573 [ 309.944410] xe_validation_exec_lock+0x5a/0xa0 [xe] 20574 [ 309.949385] xe_exec_ioctl+0x806/0xc30 [xe] 20575 [ 309.953639] ? ttwu_queue_wakelist+0xd9/0xf0 20576 [ 309.957935] ? __pfx_xe_exec_fn+0x10/0x10 [xe] 20577 [ 309.962449] ? __wake_up_common+0x73/0xa0 20578 [ 309.966482] ? __pfx_xe_exec_ioctl+0x10/0x10 [xe] 20579 [ 309.971263] drm_ioctl_kernel+0xa3/0x100 20580 [ 309.975209] drm_ioctl+0x213/0x440 20581 [ 309.978637] ? __pfx_xe_exec_ioctl+0x10/0x10 [xe] 20582 [ 309.983415] xe_drm_ioctl+0x67/0xd0 [xe] 20583 [ 309.987408] __x64_sys_ioctl+0x7f/0xd0 Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: dri-devel@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> --- v3: - Call ttm_pool_apply_caching after freeing local restore (sashiko) - Save alloc in snapshot on restore failure (sashiko) v4: - Actual 'Save alloc in snapshot on restore failure (sashiko)' v5: - kfree retore after ttm_pool_apply_caching (sashiko) --- drivers/gpu/drm/ttm/ttm_pool.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index 278bbe7a11ad..d380a3c7fe40 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -902,6 +902,7 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt, { struct ttm_pool_tt_restore *restore = tt->restore; struct ttm_pool_alloc_state alloc; + int ret; if (WARN_ON(!ttm_tt_is_backed_up(tt))) return -EINVAL; @@ -925,14 +926,22 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt, } else { alloc = restore->snapshot_alloc; if (ttm_pool_restore_valid(restore)) { - int ret = ttm_pool_restore_commit(restore, tt->backup, - ctx, &alloc); + ret = ttm_pool_restore_commit(restore, tt->backup, + ctx, &alloc); if (ret) return ret; } - if (!alloc.remaining_pages) + if (!alloc.remaining_pages) { + ret = ttm_pool_apply_caching(&alloc); + if (ret) + return ret; + + kfree(tt->restore); + tt->restore = NULL; + return 0; + } } return __ttm_pool_alloc(pool, tt, ctx, &alloc, restore); -- 2.34.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v5 2/2] drm/ttm/pool: back up at native page order [not found] <20260505200443.3300962-1-matthew.brost@intel.com> 2026-05-05 20:04 ` [PATCH v5 1/2] drm/ttm: Drop tt->restore after successful restore Matthew Brost @ 2026-05-05 20:04 ` Matthew Brost 2026-05-06 14:23 ` Thomas Hellström 1 sibling, 1 reply; 7+ messages in thread From: Matthew Brost @ 2026-05-05 20:04 UTC (permalink / raw) To: intel-xe, dri-devel Cc: Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable, Thomas Hellström ttm_pool_split_for_swap() splits high-order pool pages into order-0 pages during backup so each 4K page can be released to the system as soon as it has been written to shmem. While this minimizes the allocator's working set during reclaim, it actively fragments memory: every TTM-backed compound page that the shrinker touches is shattered into order-0 pages, even when the rest of the system would prefer that the high-order block stay intact. Under sustained kswapd pressure this is enough to drive other parts of MM into recovery loops from which they cannot easily escape, because the memory TTM just freed is no longer contiguous. Stop unconditionally splitting on the backup path and back up each compound at its native order in ttm_pool_backup(): - For each non-handle slot, read the order from the head page and back up all 1<<order subpages to consecutive shmem indices, writing the resulting handles into tt->pages[] as we go. - On success, the compound is freed once at its native order. No split_page(), no per-4K refcount juggling, no fragmentation introduced from this path. - Slots that already hold a backup handle from a previous partial attempt are skipped. A compound that would extend past a fault-injection-truncated num_pages is skipped rather than split. A per-subpage backup failure cannot be made fully atomic: backing up a subpage allocates a shmem folio before the source page can be released, so under true OOM any subpage in a compound (not just the first) may fail to be backed up with the rest of the source compound still live and contiguous. To make forward progress in that case, fall back to splitting the source compound and backing up its remaining subpages individually: - On the first per-subpage failure for a compound (and only if order > 0), call ttm_pool_split_for_swap() to split the source compound, release the subpages whose contents already live in shmem (their handles in tt->pages stay valid), and retry the failing subpage at order 0. - Subsequent successful subpage backups in the now-split compound free their source page individually as soon as the handle is written. - A second failure after splitting terminates the loop with partial progress; the remaining order-0 subpages stay in tt->pages as plain page pointers and are cleaned up by the normal ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. This restores the original split-on-OOM fallback behavior while keeping the common, non-OOM case fragmentation-free. It also preserves the "partial backup is allowed" contract: shrunken is incremented per backed-up subpage so the caller still sees forward progress when a compound only partially succeeds. The restore-side leftover-page branch in ttm_pool_restore_commit() is left as-is for now: that path can still split a previously-retained compound, but in practice it is unreachable under realistic workloads (per profiling we have not been able to trigger it), so it is not worth complicating the restore state machine to avoid the split there. If it ever becomes a problem in practice it can be addressed independently. ttm_pool_split_for_swap() itself is retained both for the OOM fallback above and for the restore path's remaining caller. The DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*, and ttm_pool_unmap_and_free() already operate at native order and are unchanged. Cc: Christian Koenig <christian.koenig@amd.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: dri-devel@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to shrink pages") Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Assisted-by: Claude:claude-opus-4.6 Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- A follow-up should attempt writeback to shmem at folio order as well, but the API for doing so is unclear and may be incomplete. This patch is related to the pending series [1] and significantly reduces the likelihood of Xe entering a kswapd loop under fragmentation. The kswapd → shrinker → Xe shrinker → TTM backup path is still exercised; however, with this change the backup path no longer worsens fragmentation, which previously amplified reclaim pressure and reinforced the kswapd loop. Nonetheless, the pathological case that [1] aims to address still exists and requires a proper solution. Even with this patch, a kswapd loop due to severe fragmentation can still be triggered, although it is now substantially harder to reproduce. v2: - Split pages and free immediately if backup fails are higher order (Thomas) v3: - Skip handles in purge path (sashiko) v5: - Refactor into ttm_pool_backup_folio (Thomas) [1] https://patchwork.freedesktop.org/series/165330/ --- drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++----- 1 file changed, 94 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index d380a3c7fe40..78efc8524133 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt *tt) ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt->num_pages); } +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct ttm_tt *tt, + struct file *backup, struct folio *folio, + unsigned int order, bool writeback, + pgoff_t idx, gfp_t page_gfp, gfp_t alloc_gfp) +{ + struct page *page = folio_page(folio, 0); + int shrunken = 0, npages = 1UL << order, ret = 0, i; + bool folio_has_been_split = false; + + for (i = 0; i < npages; ++i) { + s64 shandle; + +try_again_after_split: + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && + should_fail(&backup_fault_inject, 1)) + shandle = -ENOMEM; + else + shandle = ttm_backup_backup_page(backup, page + i, + writeback, idx + i, + page_gfp, alloc_gfp); + + if (shandle < 0 && !folio_has_been_split && order) { + pgoff_t j; + + /* + * True OOM: could not allocate a shmem folio + * for the next subpage. Fall back to splitting + * the source compound and backing up subpages + * individually. Release the already-backed-up + * subpages whose contents now live in shmem; + * any further failure terminates the loop with + * partial progress (handled by the caller). + */ + folio_has_been_split = true; + ttm_pool_split_for_swap(pool, page); + + for (j = 0; j < i; ++j) { + __free_pages_gpu_account(page + j, 0, false); + shrunken++; + } + + goto try_again_after_split; + } else if (shandle < 0) { + ret = shandle; + goto out; + } else if (folio_has_been_split) { + __free_pages_gpu_account(page + i, 0, false); + shrunken++; + } + + tt->pages[idx + i] = ttm_backup_handle_to_page_ptr(shandle); + } + + if (!folio_has_been_split) { + /* Compound fully backed up; free at native order. */ + page->private = 0; + __free_pages_gpu_account(page, order, false); + shrunken += npages; + } + +out: + return shrunken ? shrunken : ret; +} + /** * ttm_pool_backup() - Back up or purge a struct ttm_tt * @pool: The pool used when allocating the struct ttm_tt. @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt, { struct file *backup = tt->backup; struct page *page; - unsigned long handle; gfp_t alloc_gfp; gfp_t gfp; int ret = 0; pgoff_t shrunken = 0; - pgoff_t i, num_pages; + pgoff_t i, num_pages, npages; if (WARN_ON(ttm_tt_is_backed_up(tt))) return -EINVAL; @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt, unsigned int order; page = tt->pages[i]; - if (unlikely(!page)) { + if (unlikely(!page || + ttm_backup_page_ptr_is_handle(page))) { num_pages = 1; continue; } @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt, if (IS_ENABLED(CONFIG_FAULT_INJECTION) && should_fail(&backup_fault_inject, 1)) num_pages = DIV_ROUND_UP(num_pages, 2); - for (i = 0; i < num_pages; ++i) { - s64 shandle; + for (i = 0; i < num_pages; i += npages) { + unsigned int order; + npages = 1; page = tt->pages[i]; if (unlikely(!page)) continue; - ttm_pool_split_for_swap(pool, page); + /* Already-handled entry from a previous attempt. */ + if (unlikely(ttm_backup_page_ptr_is_handle(page))) + continue; - shandle = ttm_backup_backup_page(backup, page, flags->writeback, i, - gfp, alloc_gfp); - if (shandle < 0) { - /* We allow partially shrunken tts */ - ret = shandle; + order = ttm_pool_page_order(pool, page); + npages = 1UL << order; + + /* + * Back up the compound atomically at its native order. If + * fault injection truncated num_pages mid-compound, skip + * the partial tail rather than splitting. + */ + if (unlikely(i + npages > num_pages)) + break; + + ret = ttm_pool_backup_folio(pool, tt, backup, page_folio(page), + order, flags->writeback, i, gfp, + alloc_gfp); + if (unlikely(ret < 0)) + break; + + shrunken += ret; + + /* partial backup */ + if (unlikely(ret != npages)) break; - } - handle = shandle; - tt->pages[i] = ttm_backup_handle_to_page_ptr(handle); - __free_pages_gpu_account(page, 0, false); - shrunken++; } return shrunken ? shrunken : ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order 2026-05-05 20:04 ` [PATCH v5 2/2] drm/ttm/pool: back up at native page order Matthew Brost @ 2026-05-06 14:23 ` Thomas Hellström 2026-05-06 16:14 ` Matthew Brost 0 siblings, 1 reply; 7+ messages in thread From: Thomas Hellström @ 2026-05-06 14:23 UTC (permalink / raw) To: Matthew Brost, intel-xe, dri-devel Cc: Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable Hi, Matt On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > ttm_pool_split_for_swap() splits high-order pool pages into order-0 > pages during backup so each 4K page can be released to the system as > soon as it has been written to shmem. While this minimizes the > allocator's working set during reclaim, it actively fragments memory: > every TTM-backed compound page that the shrinker touches is shattered > into order-0 pages, even when the rest of the system would prefer > that > the high-order block stay intact. Under sustained kswapd pressure > this > is enough to drive other parts of MM into recovery loops from which > they cannot easily escape, because the memory TTM just freed is no > longer contiguous. > > Stop unconditionally splitting on the backup path and back up each > compound at its native order in ttm_pool_backup(): > > - For each non-handle slot, read the order from the head page and > back up all 1<<order subpages to consecutive shmem indices, > writing the resulting handles into tt->pages[] as we go. > - On success, the compound is freed once at its native order. No > split_page(), no per-4K refcount juggling, no fragmentation > introduced from this path. > - Slots that already hold a backup handle from a previous partial > attempt are skipped. A compound that would extend past a > fault-injection-truncated num_pages is skipped rather than split. > > A per-subpage backup failure cannot be made fully atomic: backing up > a > subpage allocates a shmem folio before the source page can be > released, > so under true OOM any subpage in a compound (not just the first) may > fail to be backed up with the rest of the source compound still live > and contiguous. To make forward progress in that case, fall back to > splitting the source compound and backing up its remaining subpages > individually: > > - On the first per-subpage failure for a compound (and only if > order > 0), call ttm_pool_split_for_swap() to split the source > compound, release the subpages whose contents already live in > shmem (their handles in tt->pages stay valid), and retry the > failing subpage at order 0. > - Subsequent successful subpage backups in the now-split compound > free their source page individually as soon as the handle is > written. > - A second failure after splitting terminates the loop with partial > progress; the remaining order-0 subpages stay in tt->pages as > plain page pointers and are cleaned up by the normal > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > This restores the original split-on-OOM fallback behavior while > keeping the common, non-OOM case fragmentation-free. It also > preserves the "partial backup is allowed" contract: shrunken is > incremented per backed-up subpage so the caller still sees forward > progress when a compound only partially succeeds. > > The restore-side leftover-page branch in ttm_pool_restore_commit() is > left as-is for now: that path can still split a previously-retained > compound, but in practice it is unreachable under realistic workloads > (per profiling we have not been able to trigger it), so it is not > worth complicating the restore state machine to avoid the split > there. > If it ever becomes a problem in practice it can be addressed > independently. > > ttm_pool_split_for_swap() itself is retained both for the OOM > fallback above and for the restore path's remaining caller. The > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*, > and ttm_pool_unmap_and_free() already operate at native order and > are unchanged. > > Cc: Christian Koenig <christian.koenig@amd.com> > Cc: Huang Rui <ray.huang@amd.com> > Cc: Matthew Auld <matthew.auld@intel.com> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > Cc: Maxime Ripard <mripard@kernel.org> > Cc: Thomas Zimmermann <tzimmermann@suse.de> > Cc: David Airlie <airlied@gmail.com> > Cc: Simona Vetter <simona@ffwll.ch> > Cc: dri-devel@lists.freedesktop.org > Cc: linux-kernel@vger.kernel.org > Cc: stable@vger.kernel.org > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to > shrink pages") > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Assisted-by: Claude:claude-opus-4.6 > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > A follow-up should attempt writeback to shmem at folio order as well, > but the API for doing so is unclear and may be incomplete. > > This patch is related to the pending series [1] and significantly > reduces the likelihood of Xe entering a kswapd loop under > fragmentation. > The kswapd → shrinker → Xe shrinker → TTM backup path is still > exercised; however, with this change the backup path no longer > worsens > fragmentation, which previously amplified reclaim pressure and > reinforced the kswapd loop. > > Nonetheless, the pathological case that [1] aims to address still > exists > and requires a proper solution. Even with this patch, a kswapd loop > due > to severe fragmentation can still be triggered, although it is now > substantially harder to reproduce. > > v2: > - Split pages and free immediately if backup fails are higher order > (Thomas) > v3: > - Skip handles in purge path (sashiko) > v5: > - Refactor into ttm_pool_backup_folio (Thomas) > > [1] https://patchwork.freedesktop.org/series/165330/ > --- > drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++--- > -- > 1 file changed, 94 insertions(+), 16 deletions(-) > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > b/drivers/gpu/drm/ttm/ttm_pool.c > index d380a3c7fe40..78efc8524133 100644 > --- a/drivers/gpu/drm/ttm/ttm_pool.c > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > *tt) > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt- > >num_pages); > } > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > ttm_tt *tt, > + struct file *backup, struct folio > *folio, > + unsigned int order, bool writeback, > + pgoff_t idx, gfp_t page_gfp, gfp_t > alloc_gfp) I don't really understand why we can't end up with a ttm_backup_backup_folio(), which I believe is the proper layering, already at this point? Please see a suggestion at https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads Here the splitting logic is kept in the ttm_pool, but ttm_backup supports handing large folios to it. Although the cumulative diffstat becomes larger, the end code becomes smaller and IMO easier to read, and we don't need to introduce code that we immediately have to refactor. But I'm starting to question the general approach: Even if the *shrinker* can recover from a total kernel memory reserve depletion, it can't really be considered a reasonable practice, since if we frequently deplete the reserves, *other* important allocations in the system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail and people will have a hard time finding out why. So I actually don't think we can be avoiding the splitting without direct insertion. FWIW, up until recently when shmem started supporting huge page swapping, other GPU drivers basically also split pages at swapout. Another idea for improving on the compaction loop, perhaps worth trying is this change, shamelessly stolen from i915: https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads /Thomas > +{ > + struct page *page = folio_page(folio, 0); > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > + bool folio_has_been_split = false; > + > + for (i = 0; i < npages; ++i) { > + s64 shandle; > + > +try_again_after_split: > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > + should_fail(&backup_fault_inject, 1)) > + shandle = -ENOMEM; > + else > + shandle = ttm_backup_backup_page(backup, > page + i, > + writeback, > idx + i, > + page_gfp, > alloc_gfp); > + > + if (shandle < 0 && !folio_has_been_split && order) { > + pgoff_t j; > + > + /* > + * True OOM: could not allocate a shmem > folio > + * for the next subpage. Fall back to > splitting > + * the source compound and backing up > subpages > + * individually. Release the already-backed- > up > + * subpages whose contents now live in > shmem; > + * any further failure terminates the loop > with > + * partial progress (handled by the caller). > + */ > + folio_has_been_split = true; > + ttm_pool_split_for_swap(pool, page); > + > + for (j = 0; j < i; ++j) { > + __free_pages_gpu_account(page + j, > 0, false); > + shrunken++; > + } > + > + goto try_again_after_split; > + } else if (shandle < 0) { > + ret = shandle; > + goto out; > + } else if (folio_has_been_split) { > + __free_pages_gpu_account(page + i, 0, > false); > + shrunken++; > + } > + > + tt->pages[idx + i] = > ttm_backup_handle_to_page_ptr(shandle); > + } > + > + if (!folio_has_been_split) { > + /* Compound fully backed up; free at native order. > */ > + page->private = 0; > + __free_pages_gpu_account(page, order, false); > + shrunken += npages; > + } > + > +out: > + return shrunken ? shrunken : ret; > +} > + > /** > * ttm_pool_backup() - Back up or purge a struct ttm_tt > * @pool: The pool used when allocating the struct ttm_tt. > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, > struct ttm_tt *tt, > { > struct file *backup = tt->backup; > struct page *page; > - unsigned long handle; > gfp_t alloc_gfp; > gfp_t gfp; > int ret = 0; > pgoff_t shrunken = 0; > - pgoff_t i, num_pages; > + pgoff_t i, num_pages, npages; > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > return -EINVAL; > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > struct ttm_tt *tt, > unsigned int order; > > page = tt->pages[i]; > - if (unlikely(!page)) { > + if (unlikely(!page || > + > ttm_backup_page_ptr_is_handle(page))) { > num_pages = 1; > continue; > } > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, > struct ttm_tt *tt, > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > should_fail(&backup_fault_inject, 1)) > num_pages = DIV_ROUND_UP(num_pages, 2); > > - for (i = 0; i < num_pages; ++i) { > - s64 shandle; > + for (i = 0; i < num_pages; i += npages) { > + unsigned int order; > > + npages = 1; > page = tt->pages[i]; > if (unlikely(!page)) > continue; > > - ttm_pool_split_for_swap(pool, page); > + /* Already-handled entry from a previous attempt. */ > + if (unlikely(ttm_backup_page_ptr_is_handle(page))) > + continue; > > - shandle = ttm_backup_backup_page(backup, page, > flags->writeback, i, > - gfp, alloc_gfp); > - if (shandle < 0) { > - /* We allow partially shrunken tts */ > - ret = shandle; > + order = ttm_pool_page_order(pool, page); > + npages = 1UL << order; > + > + /* > + * Back up the compound atomically at its native > order. If > + * fault injection truncated num_pages mid-compound, > skip > + * the partial tail rather than splitting. > + */ > + if (unlikely(i + npages > num_pages)) > + break; > + > + ret = ttm_pool_backup_folio(pool, tt, backup, > page_folio(page), > + order, flags->writeback, > i, gfp, > + alloc_gfp); > + if (unlikely(ret < 0)) > + break; > + > + shrunken += ret; > + > + /* partial backup */ > + if (unlikely(ret != npages)) > break; > - } > - handle = shandle; > - tt->pages[i] = > ttm_backup_handle_to_page_ptr(handle); > - __free_pages_gpu_account(page, 0, false); > - shrunken++; > } > > return shrunken ? shrunken : ret; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order 2026-05-06 14:23 ` Thomas Hellström @ 2026-05-06 16:14 ` Matthew Brost 2026-05-06 16:16 ` Matthew Brost 2026-05-06 16:26 ` Thomas Hellström 0 siblings, 2 replies; 7+ messages in thread From: Matthew Brost @ 2026-05-06 16:14 UTC (permalink / raw) To: Thomas Hellström Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote: > Hi, Matt > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > > ttm_pool_split_for_swap() splits high-order pool pages into order-0 > > pages during backup so each 4K page can be released to the system as > > soon as it has been written to shmem. While this minimizes the > > allocator's working set during reclaim, it actively fragments memory: > > every TTM-backed compound page that the shrinker touches is shattered > > into order-0 pages, even when the rest of the system would prefer > > that > > the high-order block stay intact. Under sustained kswapd pressure > > this > > is enough to drive other parts of MM into recovery loops from which > > they cannot easily escape, because the memory TTM just freed is no > > longer contiguous. > > > > Stop unconditionally splitting on the backup path and back up each > > compound at its native order in ttm_pool_backup(): > > > > - For each non-handle slot, read the order from the head page and > > back up all 1<<order subpages to consecutive shmem indices, > > writing the resulting handles into tt->pages[] as we go. > > - On success, the compound is freed once at its native order. No > > split_page(), no per-4K refcount juggling, no fragmentation > > introduced from this path. > > - Slots that already hold a backup handle from a previous partial > > attempt are skipped. A compound that would extend past a > > fault-injection-truncated num_pages is skipped rather than split. > > > > A per-subpage backup failure cannot be made fully atomic: backing up > > a > > subpage allocates a shmem folio before the source page can be > > released, > > so under true OOM any subpage in a compound (not just the first) may > > fail to be backed up with the rest of the source compound still live > > and contiguous. To make forward progress in that case, fall back to > > splitting the source compound and backing up its remaining subpages > > individually: > > > > - On the first per-subpage failure for a compound (and only if > > order > 0), call ttm_pool_split_for_swap() to split the source > > compound, release the subpages whose contents already live in > > shmem (their handles in tt->pages stay valid), and retry the > > failing subpage at order 0. > > - Subsequent successful subpage backups in the now-split compound > > free their source page individually as soon as the handle is > > written. > > - A second failure after splitting terminates the loop with partial > > progress; the remaining order-0 subpages stay in tt->pages as > > plain page pointers and are cleaned up by the normal > > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > > > This restores the original split-on-OOM fallback behavior while > > keeping the common, non-OOM case fragmentation-free. It also > > preserves the "partial backup is allowed" contract: shrunken is > > incremented per backed-up subpage so the caller still sees forward > > progress when a compound only partially succeeds. > > > > The restore-side leftover-page branch in ttm_pool_restore_commit() is > > left as-is for now: that path can still split a previously-retained > > compound, but in practice it is unreachable under realistic workloads > > (per profiling we have not been able to trigger it), so it is not > > worth complicating the restore state machine to avoid the split > > there. > > If it ever becomes a problem in practice it can be addressed > > independently. > > > > ttm_pool_split_for_swap() itself is retained both for the OOM > > fallback above and for the restore path's remaining caller. The > > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*, > > and ttm_pool_unmap_and_free() already operate at native order and > > are unchanged. > > > > Cc: Christian Koenig <christian.koenig@amd.com> > > Cc: Huang Rui <ray.huang@amd.com> > > Cc: Matthew Auld <matthew.auld@intel.com> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > > Cc: Maxime Ripard <mripard@kernel.org> > > Cc: Thomas Zimmermann <tzimmermann@suse.de> > > Cc: David Airlie <airlied@gmail.com> > > Cc: Simona Vetter <simona@ffwll.ch> > > Cc: dri-devel@lists.freedesktop.org > > Cc: linux-kernel@vger.kernel.org > > Cc: stable@vger.kernel.org > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to > > shrink pages") > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > Assisted-by: Claude:claude-opus-4.6 > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > --- > > > > A follow-up should attempt writeback to shmem at folio order as well, > > but the API for doing so is unclear and may be incomplete. > > > > This patch is related to the pending series [1] and significantly > > reduces the likelihood of Xe entering a kswapd loop under > > fragmentation. > > The kswapd → shrinker → Xe shrinker → TTM backup path is still > > exercised; however, with this change the backup path no longer > > worsens > > fragmentation, which previously amplified reclaim pressure and > > reinforced the kswapd loop. > > > > Nonetheless, the pathological case that [1] aims to address still > > exists > > and requires a proper solution. Even with this patch, a kswapd loop > > due > > to severe fragmentation can still be triggered, although it is now > > substantially harder to reproduce. > > > > v2: > > - Split pages and free immediately if backup fails are higher order > > (Thomas) > > v3: > > - Skip handles in purge path (sashiko) > > v5: > > - Refactor into ttm_pool_backup_folio (Thomas) > > > > [1] https://patchwork.freedesktop.org/series/165330/ > > --- > > drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++--- > > -- > > 1 file changed, 94 insertions(+), 16 deletions(-) > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > b/drivers/gpu/drm/ttm/ttm_pool.c > > index d380a3c7fe40..78efc8524133 100644 > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > > *tt) > > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt- > > >num_pages); > > } > > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > > ttm_tt *tt, > > + struct file *backup, struct folio > > *folio, > > + unsigned int order, bool writeback, > > + pgoff_t idx, gfp_t page_gfp, gfp_t > > alloc_gfp) > > I don't really understand why we can't end up with a > ttm_backup_backup_folio(), which I believe is the proper layering, > already at this point? Please see a suggestion at > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads > > Here the splitting logic is kept in the ttm_pool, but ttm_backup > supports handing large folios to it. > > Although the cumulative diffstat becomes larger, the end code becomes > smaller and IMO easier to read, and we don't need to introduce code > that we immediately have to refactor. That version looks fine too. If that is preference no issue. My goal with this series is get something than can reasonably be backported to LTS kernels so the desktop doesn't frequently enter kswapd because of fragmentation. We now have at least 3 reports of this being an issue. This is larger fix [1] which works in tandem but seemly unlikely to backportable given it add new concepts to the core MM [1]. [1] https://patchwork.freedesktop.org/series/165329/ > > But I'm starting to question the general approach: Even if the > *shrinker* can recover from a total kernel memory reserve depletion, it > can't really be considered a reasonable practice, since if we > frequently deplete the reserves, *other* important allocations in the > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail and > people will have a hard time finding out why. > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and eventually make progress—i.e., take the split path if needed? I’m not 100% sure, but my initial reaction is that this concern may not be valid; however, MM is hard to reason about. Again, FWIW, I’ve tried a lot of things to trigger OOM—for example, running WebGL tabs and then kicking off various very memory-intensive workloads from the CLI—and I still haven’t hit OOM or seen memory allocation failures or warnings. > So I actually don't think we can be avoiding the splitting without > direct insertion. FWIW, up until recently when shmem started supporting I agree direct insertion is better solution. Do you think this something we could reasonably get working and backport? I haven't done any research on direct insertion yet, thus why I'm asking. > huge page swapping, other GPU drivers basically also split pages at > swapout. I wonder if other drivers have the same issue? The deadly combo is allow GPUs to subscribe all of system memory, allocate THP pages (or higher order pages), and split them in the shrinker. Xe might be the only driver with right combo to hit this but not 100% sure without a deep dive. > > Another idea for improving on the compaction loop, perhaps worth trying > is this change, shamelessly stolen from i915: > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads > I'd have to give this a try - I'm quickly running out of time before I leave for month though. Matt > /Thomas > > > > +{ > > + struct page *page = folio_page(folio, 0); > > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > > + bool folio_has_been_split = false; > > + > > + for (i = 0; i < npages; ++i) { > > + s64 shandle; > > + > > +try_again_after_split: > > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > + should_fail(&backup_fault_inject, 1)) > > + shandle = -ENOMEM; > > + else > > + shandle = ttm_backup_backup_page(backup, > > page + i, > > + writeback, > > idx + i, > > + page_gfp, > > alloc_gfp); > > + > > + if (shandle < 0 && !folio_has_been_split && order) { > > + pgoff_t j; > > + > > + /* > > + * True OOM: could not allocate a shmem > > folio > > + * for the next subpage. Fall back to > > splitting > > + * the source compound and backing up > > subpages > > + * individually. Release the already-backed- > > up > > + * subpages whose contents now live in > > shmem; > > + * any further failure terminates the loop > > with > > + * partial progress (handled by the caller). > > + */ > > + folio_has_been_split = true; > > + ttm_pool_split_for_swap(pool, page); > > + > > + for (j = 0; j < i; ++j) { > > + __free_pages_gpu_account(page + j, > > 0, false); > > + shrunken++; > > + } > > + > > + goto try_again_after_split; > > + } else if (shandle < 0) { > > + ret = shandle; > > + goto out; > > + } else if (folio_has_been_split) { > > + __free_pages_gpu_account(page + i, 0, > > false); > > + shrunken++; > > + } > > + > > + tt->pages[idx + i] = > > ttm_backup_handle_to_page_ptr(shandle); > > + } > > + > > + if (!folio_has_been_split) { > > + /* Compound fully backed up; free at native order. > > */ > > + page->private = 0; > > + __free_pages_gpu_account(page, order, false); > > + shrunken += npages; > > + } > > + > > +out: > > + return shrunken ? shrunken : ret; > > +} > > + > > /** > > * ttm_pool_backup() - Back up or purge a struct ttm_tt > > * @pool: The pool used when allocating the struct ttm_tt. > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, > > struct ttm_tt *tt, > > { > > struct file *backup = tt->backup; > > struct page *page; > > - unsigned long handle; > > gfp_t alloc_gfp; > > gfp_t gfp; > > int ret = 0; > > pgoff_t shrunken = 0; > > - pgoff_t i, num_pages; > > + pgoff_t i, num_pages, npages; > > > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > > return -EINVAL; > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > > struct ttm_tt *tt, > > unsigned int order; > > > > page = tt->pages[i]; > > - if (unlikely(!page)) { > > + if (unlikely(!page || > > + > > ttm_backup_page_ptr_is_handle(page))) { > > num_pages = 1; > > continue; > > } > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, > > struct ttm_tt *tt, > > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > should_fail(&backup_fault_inject, 1)) > > num_pages = DIV_ROUND_UP(num_pages, 2); > > > > - for (i = 0; i < num_pages; ++i) { > > - s64 shandle; > > + for (i = 0; i < num_pages; i += npages) { > > + unsigned int order; > > > > + npages = 1; > > page = tt->pages[i]; > > if (unlikely(!page)) > > continue; > > > > - ttm_pool_split_for_swap(pool, page); > > + /* Already-handled entry from a previous attempt. */ > > + if (unlikely(ttm_backup_page_ptr_is_handle(page))) > > + continue; > > > > - shandle = ttm_backup_backup_page(backup, page, > > flags->writeback, i, > > - gfp, alloc_gfp); > > - if (shandle < 0) { > > - /* We allow partially shrunken tts */ > > - ret = shandle; > > + order = ttm_pool_page_order(pool, page); > > + npages = 1UL << order; > > + > > + /* > > + * Back up the compound atomically at its native > > order. If > > + * fault injection truncated num_pages mid-compound, > > skip > > + * the partial tail rather than splitting. > > + */ > > + if (unlikely(i + npages > num_pages)) > > + break; > > + > > + ret = ttm_pool_backup_folio(pool, tt, backup, > > page_folio(page), > > + order, flags->writeback, > > i, gfp, > > + alloc_gfp); > > + if (unlikely(ret < 0)) > > + break; > > + > > + shrunken += ret; > > + > > + /* partial backup */ > > + if (unlikely(ret != npages)) > > break; > > - } > > - handle = shandle; > > - tt->pages[i] = > > ttm_backup_handle_to_page_ptr(handle); > > - __free_pages_gpu_account(page, 0, false); > > - shrunken++; > > } > > > > return shrunken ? shrunken : ret; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order 2026-05-06 16:14 ` Matthew Brost @ 2026-05-06 16:16 ` Matthew Brost 2026-05-06 16:26 ` Thomas Hellström 1 sibling, 0 replies; 7+ messages in thread From: Matthew Brost @ 2026-05-06 16:16 UTC (permalink / raw) To: Thomas Hellström Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable On Wed, May 06, 2026 at 09:14:13AM -0700, Matthew Brost wrote: > On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote: > > Hi, Matt > > > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > > > ttm_pool_split_for_swap() splits high-order pool pages into order-0 > > > pages during backup so each 4K page can be released to the system as > > > soon as it has been written to shmem. While this minimizes the > > > allocator's working set during reclaim, it actively fragments memory: > > > every TTM-backed compound page that the shrinker touches is shattered > > > into order-0 pages, even when the rest of the system would prefer > > > that > > > the high-order block stay intact. Under sustained kswapd pressure > > > this > > > is enough to drive other parts of MM into recovery loops from which > > > they cannot easily escape, because the memory TTM just freed is no > > > longer contiguous. > > > > > > Stop unconditionally splitting on the backup path and back up each > > > compound at its native order in ttm_pool_backup(): > > > > > > - For each non-handle slot, read the order from the head page and > > > back up all 1<<order subpages to consecutive shmem indices, > > > writing the resulting handles into tt->pages[] as we go. > > > - On success, the compound is freed once at its native order. No > > > split_page(), no per-4K refcount juggling, no fragmentation > > > introduced from this path. > > > - Slots that already hold a backup handle from a previous partial > > > attempt are skipped. A compound that would extend past a > > > fault-injection-truncated num_pages is skipped rather than split. > > > > > > A per-subpage backup failure cannot be made fully atomic: backing up > > > a > > > subpage allocates a shmem folio before the source page can be > > > released, > > > so under true OOM any subpage in a compound (not just the first) may > > > fail to be backed up with the rest of the source compound still live > > > and contiguous. To make forward progress in that case, fall back to > > > splitting the source compound and backing up its remaining subpages > > > individually: > > > > > > - On the first per-subpage failure for a compound (and only if > > > order > 0), call ttm_pool_split_for_swap() to split the source > > > compound, release the subpages whose contents already live in > > > shmem (their handles in tt->pages stay valid), and retry the > > > failing subpage at order 0. > > > - Subsequent successful subpage backups in the now-split compound > > > free their source page individually as soon as the handle is > > > written. > > > - A second failure after splitting terminates the loop with partial > > > progress; the remaining order-0 subpages stay in tt->pages as > > > plain page pointers and are cleaned up by the normal > > > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > > > > > This restores the original split-on-OOM fallback behavior while > > > keeping the common, non-OOM case fragmentation-free. It also > > > preserves the "partial backup is allowed" contract: shrunken is > > > incremented per backed-up subpage so the caller still sees forward > > > progress when a compound only partially succeeds. > > > > > > The restore-side leftover-page branch in ttm_pool_restore_commit() is > > > left as-is for now: that path can still split a previously-retained > > > compound, but in practice it is unreachable under realistic workloads > > > (per profiling we have not been able to trigger it), so it is not > > > worth complicating the restore state machine to avoid the split > > > there. > > > If it ever becomes a problem in practice it can be addressed > > > independently. > > > > > > ttm_pool_split_for_swap() itself is retained both for the OOM > > > fallback above and for the restore path's remaining caller. The > > > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*, > > > and ttm_pool_unmap_and_free() already operate at native order and > > > are unchanged. > > > > > > Cc: Christian Koenig <christian.koenig@amd.com> > > > Cc: Huang Rui <ray.huang@amd.com> > > > Cc: Matthew Auld <matthew.auld@intel.com> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > > > Cc: Maxime Ripard <mripard@kernel.org> > > > Cc: Thomas Zimmermann <tzimmermann@suse.de> > > > Cc: David Airlie <airlied@gmail.com> > > > Cc: Simona Vetter <simona@ffwll.ch> > > > Cc: dri-devel@lists.freedesktop.org > > > Cc: linux-kernel@vger.kernel.org > > > Cc: stable@vger.kernel.org > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to > > > shrink pages") > > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Assisted-by: Claude:claude-opus-4.6 > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > --- > > > > > > A follow-up should attempt writeback to shmem at folio order as well, > > > but the API for doing so is unclear and may be incomplete. > > > > > > This patch is related to the pending series [1] and significantly > > > reduces the likelihood of Xe entering a kswapd loop under > > > fragmentation. > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still > > > exercised; however, with this change the backup path no longer > > > worsens > > > fragmentation, which previously amplified reclaim pressure and > > > reinforced the kswapd loop. > > > > > > Nonetheless, the pathological case that [1] aims to address still > > > exists > > > and requires a proper solution. Even with this patch, a kswapd loop > > > due > > > to severe fragmentation can still be triggered, although it is now > > > substantially harder to reproduce. > > > > > > v2: > > > - Split pages and free immediately if backup fails are higher order > > > (Thomas) > > > v3: > > > - Skip handles in purge path (sashiko) > > > v5: > > > - Refactor into ttm_pool_backup_folio (Thomas) > > > > > > [1] https://patchwork.freedesktop.org/series/165330/ > > > --- > > > drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++--- > > > -- > > > 1 file changed, 94 insertions(+), 16 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > index d380a3c7fe40..78efc8524133 100644 > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > > > *tt) > > > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt- > > > >num_pages); > > > } > > > > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > > > ttm_tt *tt, > > > + struct file *backup, struct folio > > > *folio, > > > + unsigned int order, bool writeback, > > > + pgoff_t idx, gfp_t page_gfp, gfp_t > > > alloc_gfp) > > > > I don't really understand why we can't end up with a > > ttm_backup_backup_folio(), which I believe is the proper layering, > > already at this point? Please see a suggestion at > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads > > > > Here the splitting logic is kept in the ttm_pool, but ttm_backup > > supports handing large folios to it. > > > > Although the cumulative diffstat becomes larger, the end code becomes > > smaller and IMO easier to read, and we don't need to introduce code > > that we immediately have to refactor. > > That version looks fine too. If that is preference no issue. > > My goal with this series is get something than can reasonably be > backported to LTS kernels so the desktop doesn't frequently enter kswapd > because of fragmentation. We now have at least 3 reports of this being > an issue. > > This is larger fix [1] which works in tandem but seemly unlikely to > backportable given it add new concepts to the core MM [1]. > > [1] https://patchwork.freedesktop.org/series/165329/ > > > > > But I'm starting to question the general approach: Even if the > > *shrinker* can recover from a total kernel memory reserve depletion, it > > can't really be considered a reasonable practice, since if we > > frequently deplete the reserves, *other* important allocations in the > > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail and > > people will have a hard time finding out why. > > > > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and > eventually make progress—i.e., take the split path if needed? I’m not > 100% sure, but my initial reaction is that this concern may not be > valid; however, MM is hard to reason about. > > Again, FWIW, I’ve tried a lot of things to trigger OOM—for example, > running WebGL tabs and then kicking off various very memory-intensive > workloads from the CLI—and I still haven’t hit OOM or seen memory > allocation failures or warnings. > > > So I actually don't think we can be avoiding the splitting without > > direct insertion. FWIW, up until recently when shmem started supporting > > I agree direct insertion is better solution. Do you think this something > we could reasonably get working and backport? I haven't done any > research on direct insertion yet, thus why I'm asking. > > > huge page swapping, other GPU drivers basically also split pages at > > swapout. > > I wonder if other drivers have the same issue? The deadly combo is allow > GPUs to subscribe all of system memory, allocate THP pages (or higher > order pages), and split them in the shrinker. Xe might be the only > driver with right combo to hit this but not 100% sure without a deep > dive. > + For completeness the THP allocation must have GFP flags to enter reclaim. Matt > > > > Another idea for improving on the compaction loop, perhaps worth trying > > is this change, shamelessly stolen from i915: > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads > > > > I'd have to give this a try - I'm quickly running out of time before I > leave for month though. > > Matt > > > /Thomas > > > > > > > +{ > > > + struct page *page = folio_page(folio, 0); > > > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > > > + bool folio_has_been_split = false; > > > + > > > + for (i = 0; i < npages; ++i) { > > > + s64 shandle; > > > + > > > +try_again_after_split: > > > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > + should_fail(&backup_fault_inject, 1)) > > > + shandle = -ENOMEM; > > > + else > > > + shandle = ttm_backup_backup_page(backup, > > > page + i, > > > + writeback, > > > idx + i, > > > + page_gfp, > > > alloc_gfp); > > > + > > > + if (shandle < 0 && !folio_has_been_split && order) { > > > + pgoff_t j; > > > + > > > + /* > > > + * True OOM: could not allocate a shmem > > > folio > > > + * for the next subpage. Fall back to > > > splitting > > > + * the source compound and backing up > > > subpages > > > + * individually. Release the already-backed- > > > up > > > + * subpages whose contents now live in > > > shmem; > > > + * any further failure terminates the loop > > > with > > > + * partial progress (handled by the caller). > > > + */ > > > + folio_has_been_split = true; > > > + ttm_pool_split_for_swap(pool, page); > > > + > > > + for (j = 0; j < i; ++j) { > > > + __free_pages_gpu_account(page + j, > > > 0, false); > > > + shrunken++; > > > + } > > > + > > > + goto try_again_after_split; > > > + } else if (shandle < 0) { > > > + ret = shandle; > > > + goto out; > > > + } else if (folio_has_been_split) { > > > + __free_pages_gpu_account(page + i, 0, > > > false); > > > + shrunken++; > > > + } > > > + > > > + tt->pages[idx + i] = > > > ttm_backup_handle_to_page_ptr(shandle); > > > + } > > > + > > > + if (!folio_has_been_split) { > > > + /* Compound fully backed up; free at native order. > > > */ > > > + page->private = 0; > > > + __free_pages_gpu_account(page, order, false); > > > + shrunken += npages; > > > + } > > > + > > > +out: > > > + return shrunken ? shrunken : ret; > > > +} > > > + > > > /** > > > * ttm_pool_backup() - Back up or purge a struct ttm_tt > > > * @pool: The pool used when allocating the struct ttm_tt. > > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > { > > > struct file *backup = tt->backup; > > > struct page *page; > > > - unsigned long handle; > > > gfp_t alloc_gfp; > > > gfp_t gfp; > > > int ret = 0; > > > pgoff_t shrunken = 0; > > > - pgoff_t i, num_pages; > > > + pgoff_t i, num_pages, npages; > > > > > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > > > return -EINVAL; > > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > unsigned int order; > > > > > > page = tt->pages[i]; > > > - if (unlikely(!page)) { > > > + if (unlikely(!page || > > > + > > > ttm_backup_page_ptr_is_handle(page))) { > > > num_pages = 1; > > > continue; > > > } > > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > should_fail(&backup_fault_inject, 1)) > > > num_pages = DIV_ROUND_UP(num_pages, 2); > > > > > > - for (i = 0; i < num_pages; ++i) { > > > - s64 shandle; > > > + for (i = 0; i < num_pages; i += npages) { > > > + unsigned int order; > > > > > > + npages = 1; > > > page = tt->pages[i]; > > > if (unlikely(!page)) > > > continue; > > > > > > - ttm_pool_split_for_swap(pool, page); > > > + /* Already-handled entry from a previous attempt. */ > > > + if (unlikely(ttm_backup_page_ptr_is_handle(page))) > > > + continue; > > > > > > - shandle = ttm_backup_backup_page(backup, page, > > > flags->writeback, i, > > > - gfp, alloc_gfp); > > > - if (shandle < 0) { > > > - /* We allow partially shrunken tts */ > > > - ret = shandle; > > > + order = ttm_pool_page_order(pool, page); > > > + npages = 1UL << order; > > > + > > > + /* > > > + * Back up the compound atomically at its native > > > order. If > > > + * fault injection truncated num_pages mid-compound, > > > skip > > > + * the partial tail rather than splitting. > > > + */ > > > + if (unlikely(i + npages > num_pages)) > > > + break; > > > + > > > + ret = ttm_pool_backup_folio(pool, tt, backup, > > > page_folio(page), > > > + order, flags->writeback, > > > i, gfp, > > > + alloc_gfp); > > > + if (unlikely(ret < 0)) > > > + break; > > > + > > > + shrunken += ret; > > > + > > > + /* partial backup */ > > > + if (unlikely(ret != npages)) > > > break; > > > - } > > > - handle = shandle; > > > - tt->pages[i] = > > > ttm_backup_handle_to_page_ptr(handle); > > > - __free_pages_gpu_account(page, 0, false); > > > - shrunken++; > > > } > > > > > > return shrunken ? shrunken : ret; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order 2026-05-06 16:14 ` Matthew Brost 2026-05-06 16:16 ` Matthew Brost @ 2026-05-06 16:26 ` Thomas Hellström 2026-05-06 18:05 ` Matthew Brost 1 sibling, 1 reply; 7+ messages in thread From: Thomas Hellström @ 2026-05-06 16:26 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable On Wed, 2026-05-06 at 09:14 -0700, Matthew Brost wrote: > On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote: > > Hi, Matt > > > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > > > ttm_pool_split_for_swap() splits high-order pool pages into > > > order-0 > > > pages during backup so each 4K page can be released to the system > > > as > > > soon as it has been written to shmem. While this minimizes the > > > allocator's working set during reclaim, it actively fragments > > > memory: > > > every TTM-backed compound page that the shrinker touches is > > > shattered > > > into order-0 pages, even when the rest of the system would prefer > > > that > > > the high-order block stay intact. Under sustained kswapd pressure > > > this > > > is enough to drive other parts of MM into recovery loops from > > > which > > > they cannot easily escape, because the memory TTM just freed is > > > no > > > longer contiguous. > > > > > > Stop unconditionally splitting on the backup path and back up > > > each > > > compound at its native order in ttm_pool_backup(): > > > > > > - For each non-handle slot, read the order from the head page > > > and > > > back up all 1<<order subpages to consecutive shmem indices, > > > writing the resulting handles into tt->pages[] as we go. > > > - On success, the compound is freed once at its native order. > > > No > > > split_page(), no per-4K refcount juggling, no fragmentation > > > introduced from this path. > > > - Slots that already hold a backup handle from a previous > > > partial > > > attempt are skipped. A compound that would extend past a > > > fault-injection-truncated num_pages is skipped rather than > > > split. > > > > > > A per-subpage backup failure cannot be made fully atomic: backing > > > up > > > a > > > subpage allocates a shmem folio before the source page can be > > > released, > > > so under true OOM any subpage in a compound (not just the first) > > > may > > > fail to be backed up with the rest of the source compound still > > > live > > > and contiguous. To make forward progress in that case, fall back > > > to > > > splitting the source compound and backing up its remaining > > > subpages > > > individually: > > > > > > - On the first per-subpage failure for a compound (and only if > > > order > 0), call ttm_pool_split_for_swap() to split the > > > source > > > compound, release the subpages whose contents already live in > > > shmem (their handles in tt->pages stay valid), and retry the > > > failing subpage at order 0. > > > - Subsequent successful subpage backups in the now-split > > > compound > > > free their source page individually as soon as the handle is > > > written. > > > - A second failure after splitting terminates the loop with > > > partial > > > progress; the remaining order-0 subpages stay in tt->pages as > > > plain page pointers and are cleaned up by the normal > > > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > > > > > This restores the original split-on-OOM fallback behavior while > > > keeping the common, non-OOM case fragmentation-free. It also > > > preserves the "partial backup is allowed" contract: shrunken is > > > incremented per backed-up subpage so the caller still sees > > > forward > > > progress when a compound only partially succeeds. > > > > > > The restore-side leftover-page branch in > > > ttm_pool_restore_commit() is > > > left as-is for now: that path can still split a previously- > > > retained > > > compound, but in practice it is unreachable under realistic > > > workloads > > > (per profiling we have not been able to trigger it), so it is not > > > worth complicating the restore state machine to avoid the split > > > there. > > > If it ever becomes a problem in practice it can be addressed > > > independently. > > > > > > ttm_pool_split_for_swap() itself is retained both for the OOM > > > fallback above and for the restore path's remaining caller. The > > > DMA-mapped pre-backup unmap loop, the purge path, > > > ttm_pool_free_*, > > > and ttm_pool_unmap_and_free() already operate at native order and > > > are unchanged. > > > > > > Cc: Christian Koenig <christian.koenig@amd.com> > > > Cc: Huang Rui <ray.huang@amd.com> > > > Cc: Matthew Auld <matthew.auld@intel.com> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > > > Cc: Maxime Ripard <mripard@kernel.org> > > > Cc: Thomas Zimmermann <tzimmermann@suse.de> > > > Cc: David Airlie <airlied@gmail.com> > > > Cc: Simona Vetter <simona@ffwll.ch> > > > Cc: dri-devel@lists.freedesktop.org > > > Cc: linux-kernel@vger.kernel.org > > > Cc: stable@vger.kernel.org > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper > > > to > > > shrink pages") > > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Assisted-by: Claude:claude-opus-4.6 > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > --- > > > > > > A follow-up should attempt writeback to shmem at folio order as > > > well, > > > but the API for doing so is unclear and may be incomplete. > > > > > > This patch is related to the pending series [1] and significantly > > > reduces the likelihood of Xe entering a kswapd loop under > > > fragmentation. > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still > > > exercised; however, with this change the backup path no longer > > > worsens > > > fragmentation, which previously amplified reclaim pressure and > > > reinforced the kswapd loop. > > > > > > Nonetheless, the pathological case that [1] aims to address still > > > exists > > > and requires a proper solution. Even with this patch, a kswapd > > > loop > > > due > > > to severe fragmentation can still be triggered, although it is > > > now > > > substantially harder to reproduce. > > > > > > v2: > > > - Split pages and free immediately if backup fails are higher > > > order > > > (Thomas) > > > v3: > > > - Skip handles in purge path (sashiko) > > > v5: > > > - Refactor into ttm_pool_backup_folio (Thomas) > > > > > > [1] https://patchwork.freedesktop.org/series/165330/ > > > --- > > > drivers/gpu/drm/ttm/ttm_pool.c | 110 > > > ++++++++++++++++++++++++++++--- > > > -- > > > 1 file changed, 94 insertions(+), 16 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > index d380a3c7fe40..78efc8524133 100644 > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > > > *tt) > > > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, > > > tt- > > > > num_pages); > > > } > > > > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > > > ttm_tt *tt, > > > + struct file *backup, struct > > > folio > > > *folio, > > > + unsigned int order, bool > > > writeback, > > > + pgoff_t idx, gfp_t page_gfp, > > > gfp_t > > > alloc_gfp) > > > > I don't really understand why we can't end up with a > > ttm_backup_backup_folio(), which I believe is the proper layering, > > already at this point? Please see a suggestion at > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads > > > > Here the splitting logic is kept in the ttm_pool, but ttm_backup > > supports handing large folios to it. > > > > Although the cumulative diffstat becomes larger, the end code > > becomes > > smaller and IMO easier to read, and we don't need to introduce code > > that we immediately have to refactor. > > That version looks fine too. If that is preference no issue. Cool. Note that there is a bug in that we don't pass the folio order into ttm_backup_backup_folio(). I'm force-pushing a fix for that. > > My goal with this series is get something than can reasonably be > backported to LTS kernels so the desktop doesn't frequently enter > kswapd > because of fragmentation. We now have at least 3 reports of this > being > an issue. > > This is larger fix [1] which works in tandem but seemly unlikely to > backportable given it add new concepts to the core MM [1]. > > [1] https://patchwork.freedesktop.org/series/165329/ > > > > > But I'm starting to question the general approach: Even if the > > *shrinker* can recover from a total kernel memory reserve > > depletion, it > > can't really be considered a reasonable practice, since if we > > frequently deplete the reserves, *other* important allocations in > > the > > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail > > and > > people will have a hard time finding out why. > > > > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and > eventually make progress—i.e., take the split path if needed? I’m not > 100% sure, but my initial reaction is that this concern may not be > valid; however, MM is hard to reason about. No, GFP_ATOMIC just uses what's available without any reclaim at all. It's more aggressive than GFP_NOWAIT in that it allows dipping into the kernel reserves. > > Again, FWIW, I’ve tried a lot of things to trigger OOM—for example, > running WebGL tabs and then kicking off various very memory-intensive > workloads from the CLI—and I still haven’t hit OOM or seen memory > allocation failures or warnings. > > > So I actually don't think we can be avoiding the splitting without > > direct insertion. FWIW, up until recently when shmem started > > supporting > > I agree direct insertion is better solution. Do you think this > something > we could reasonably get working and backport? I haven't done any > research on direct insertion yet, thus why I'm asking. Yes I think so. The problem would be to get it accepted. Looking into that now, but hitting various kinds of subtle issues. Thanks, Thomas > > > huge page swapping, other GPU drivers basically also split pages at > > swapout. > > I wonder if other drivers have the same issue? The deadly combo is > allow > GPUs to subscribe all of system memory, allocate THP pages (or higher > order pages), and split them in the shrinker. Xe might be the only > driver with right combo to hit this but not 100% sure without a deep > dive. > > > > > Another idea for improving on the compaction loop, perhaps worth > > trying > > is this change, shamelessly stolen from i915: > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads > > > > I'd have to give this a try - I'm quickly running out of time before > I > leave for month though. > > Matt > > > /Thomas > > > > > > > +{ > > > + struct page *page = folio_page(folio, 0); > > > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > > > + bool folio_has_been_split = false; > > > + > > > + for (i = 0; i < npages; ++i) { > > > + s64 shandle; > > > + > > > +try_again_after_split: > > > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > + should_fail(&backup_fault_inject, 1)) > > > + shandle = -ENOMEM; > > > + else > > > + shandle = ttm_backup_backup_page(backup, > > > page + i, > > > + > > > writeback, > > > idx + i, > > > + > > > page_gfp, > > > alloc_gfp); > > > + > > > + if (shandle < 0 && !folio_has_been_split && > > > order) { > > > + pgoff_t j; > > > + > > > + /* > > > + * True OOM: could not allocate a shmem > > > folio > > > + * for the next subpage. Fall back to > > > splitting > > > + * the source compound and backing up > > > subpages > > > + * individually. Release the already- > > > backed- > > > up > > > + * subpages whose contents now live in > > > shmem; > > > + * any further failure terminates the > > > loop > > > with > > > + * partial progress (handled by the > > > caller). > > > + */ > > > + folio_has_been_split = true; > > > + ttm_pool_split_for_swap(pool, page); > > > + > > > + for (j = 0; j < i; ++j) { > > > + __free_pages_gpu_account(page + > > > j, > > > 0, false); > > > + shrunken++; > > > + } > > > + > > > + goto try_again_after_split; > > > + } else if (shandle < 0) { > > > + ret = shandle; > > > + goto out; > > > + } else if (folio_has_been_split) { > > > + __free_pages_gpu_account(page + i, 0, > > > false); > > > + shrunken++; > > > + } > > > + > > > + tt->pages[idx + i] = > > > ttm_backup_handle_to_page_ptr(shandle); > > > + } > > > + > > > + if (!folio_has_been_split) { > > > + /* Compound fully backed up; free at native > > > order. > > > */ > > > + page->private = 0; > > > + __free_pages_gpu_account(page, order, false); > > > + shrunken += npages; > > > + } > > > + > > > +out: > > > + return shrunken ? shrunken : ret; > > > +} > > > + > > > /** > > > * ttm_pool_backup() - Back up or purge a struct ttm_tt > > > * @pool: The pool used when allocating the struct ttm_tt. > > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool > > > *pool, > > > struct ttm_tt *tt, > > > { > > > struct file *backup = tt->backup; > > > struct page *page; > > > - unsigned long handle; > > > gfp_t alloc_gfp; > > > gfp_t gfp; > > > int ret = 0; > > > pgoff_t shrunken = 0; > > > - pgoff_t i, num_pages; > > > + pgoff_t i, num_pages, npages; > > > > > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > > > return -EINVAL; > > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > unsigned int order; > > > > > > page = tt->pages[i]; > > > - if (unlikely(!page)) { > > > + if (unlikely(!page || > > > + > > > ttm_backup_page_ptr_is_handle(page))) { > > > num_pages = 1; > > > continue; > > > } > > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool > > > *pool, > > > struct ttm_tt *tt, > > > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > should_fail(&backup_fault_inject, 1)) > > > num_pages = DIV_ROUND_UP(num_pages, 2); > > > > > > - for (i = 0; i < num_pages; ++i) { > > > - s64 shandle; > > > + for (i = 0; i < num_pages; i += npages) { > > > + unsigned int order; > > > > > > + npages = 1; > > > page = tt->pages[i]; > > > if (unlikely(!page)) > > > continue; > > > > > > - ttm_pool_split_for_swap(pool, page); > > > + /* Already-handled entry from a previous > > > attempt. */ > > > + if > > > (unlikely(ttm_backup_page_ptr_is_handle(page))) > > > + continue; > > > > > > - shandle = ttm_backup_backup_page(backup, page, > > > flags->writeback, i, > > > - gfp, > > > alloc_gfp); > > > - if (shandle < 0) { > > > - /* We allow partially shrunken tts */ > > > - ret = shandle; > > > + order = ttm_pool_page_order(pool, page); > > > + npages = 1UL << order; > > > + > > > + /* > > > + * Back up the compound atomically at its native > > > order. If > > > + * fault injection truncated num_pages mid- > > > compound, > > > skip > > > + * the partial tail rather than splitting. > > > + */ > > > + if (unlikely(i + npages > num_pages)) > > > + break; > > > + > > > + ret = ttm_pool_backup_folio(pool, tt, backup, > > > page_folio(page), > > > + order, flags- > > > >writeback, > > > i, gfp, > > > + alloc_gfp); > > > + if (unlikely(ret < 0)) > > > + break; > > > + > > > + shrunken += ret; > > > + > > > + /* partial backup */ > > > + if (unlikely(ret != npages)) > > > break; > > > - } > > > - handle = shandle; > > > - tt->pages[i] = > > > ttm_backup_handle_to_page_ptr(handle); > > > - __free_pages_gpu_account(page, 0, false); > > > - shrunken++; > > > } > > > > > > return shrunken ? shrunken : ret; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order 2026-05-06 16:26 ` Thomas Hellström @ 2026-05-06 18:05 ` Matthew Brost 0 siblings, 0 replies; 7+ messages in thread From: Matthew Brost @ 2026-05-06 18:05 UTC (permalink / raw) To: Thomas Hellström Cc: intel-xe, dri-devel, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, linux-kernel, stable On Wed, May 06, 2026 at 06:26:43PM +0200, Thomas Hellström wrote: > On Wed, 2026-05-06 at 09:14 -0700, Matthew Brost wrote: > > On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote: > > > Hi, Matt > > > > > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > > > > ttm_pool_split_for_swap() splits high-order pool pages into > > > > order-0 > > > > pages during backup so each 4K page can be released to the system > > > > as > > > > soon as it has been written to shmem. While this minimizes the > > > > allocator's working set during reclaim, it actively fragments > > > > memory: > > > > every TTM-backed compound page that the shrinker touches is > > > > shattered > > > > into order-0 pages, even when the rest of the system would prefer > > > > that > > > > the high-order block stay intact. Under sustained kswapd pressure > > > > this > > > > is enough to drive other parts of MM into recovery loops from > > > > which > > > > they cannot easily escape, because the memory TTM just freed is > > > > no > > > > longer contiguous. > > > > > > > > Stop unconditionally splitting on the backup path and back up > > > > each > > > > compound at its native order in ttm_pool_backup(): > > > > > > > > - For each non-handle slot, read the order from the head page > > > > and > > > > back up all 1<<order subpages to consecutive shmem indices, > > > > writing the resulting handles into tt->pages[] as we go. > > > > - On success, the compound is freed once at its native order. > > > > No > > > > split_page(), no per-4K refcount juggling, no fragmentation > > > > introduced from this path. > > > > - Slots that already hold a backup handle from a previous > > > > partial > > > > attempt are skipped. A compound that would extend past a > > > > fault-injection-truncated num_pages is skipped rather than > > > > split. > > > > > > > > A per-subpage backup failure cannot be made fully atomic: backing > > > > up > > > > a > > > > subpage allocates a shmem folio before the source page can be > > > > released, > > > > so under true OOM any subpage in a compound (not just the first) > > > > may > > > > fail to be backed up with the rest of the source compound still > > > > live > > > > and contiguous. To make forward progress in that case, fall back > > > > to > > > > splitting the source compound and backing up its remaining > > > > subpages > > > > individually: > > > > > > > > - On the first per-subpage failure for a compound (and only if > > > > order > 0), call ttm_pool_split_for_swap() to split the > > > > source > > > > compound, release the subpages whose contents already live in > > > > shmem (their handles in tt->pages stay valid), and retry the > > > > failing subpage at order 0. > > > > - Subsequent successful subpage backups in the now-split > > > > compound > > > > free their source page individually as soon as the handle is > > > > written. > > > > - A second failure after splitting terminates the loop with > > > > partial > > > > progress; the remaining order-0 subpages stay in tt->pages as > > > > plain page pointers and are cleaned up by the normal > > > > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > > > > > > > This restores the original split-on-OOM fallback behavior while > > > > keeping the common, non-OOM case fragmentation-free. It also > > > > preserves the "partial backup is allowed" contract: shrunken is > > > > incremented per backed-up subpage so the caller still sees > > > > forward > > > > progress when a compound only partially succeeds. > > > > > > > > The restore-side leftover-page branch in > > > > ttm_pool_restore_commit() is > > > > left as-is for now: that path can still split a previously- > > > > retained > > > > compound, but in practice it is unreachable under realistic > > > > workloads > > > > (per profiling we have not been able to trigger it), so it is not > > > > worth complicating the restore state machine to avoid the split > > > > there. > > > > If it ever becomes a problem in practice it can be addressed > > > > independently. > > > > > > > > ttm_pool_split_for_swap() itself is retained both for the OOM > > > > fallback above and for the restore path's remaining caller. The > > > > DMA-mapped pre-backup unmap loop, the purge path, > > > > ttm_pool_free_*, > > > > and ttm_pool_unmap_and_free() already operate at native order and > > > > are unchanged. > > > > > > > > Cc: Christian Koenig <christian.koenig@amd.com> > > > > Cc: Huang Rui <ray.huang@amd.com> > > > > Cc: Matthew Auld <matthew.auld@intel.com> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > > > > Cc: Maxime Ripard <mripard@kernel.org> > > > > Cc: Thomas Zimmermann <tzimmermann@suse.de> > > > > Cc: David Airlie <airlied@gmail.com> > > > > Cc: Simona Vetter <simona@ffwll.ch> > > > > Cc: dri-devel@lists.freedesktop.org > > > > Cc: linux-kernel@vger.kernel.org > > > > Cc: stable@vger.kernel.org > > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper > > > > to > > > > shrink pages") > > > > Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > > Assisted-by: Claude:claude-opus-4.6 > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > --- > > > > > > > > A follow-up should attempt writeback to shmem at folio order as > > > > well, > > > > but the API for doing so is unclear and may be incomplete. > > > > > > > > This patch is related to the pending series [1] and significantly > > > > reduces the likelihood of Xe entering a kswapd loop under > > > > fragmentation. > > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still > > > > exercised; however, with this change the backup path no longer > > > > worsens > > > > fragmentation, which previously amplified reclaim pressure and > > > > reinforced the kswapd loop. > > > > > > > > Nonetheless, the pathological case that [1] aims to address still > > > > exists > > > > and requires a proper solution. Even with this patch, a kswapd > > > > loop > > > > due > > > > to severe fragmentation can still be triggered, although it is > > > > now > > > > substantially harder to reproduce. > > > > > > > > v2: > > > > - Split pages and free immediately if backup fails are higher > > > > order > > > > (Thomas) > > > > v3: > > > > - Skip handles in purge path (sashiko) > > > > v5: > > > > - Refactor into ttm_pool_backup_folio (Thomas) > > > > > > > > [1] https://patchwork.freedesktop.org/series/165330/ > > > > --- > > > > drivers/gpu/drm/ttm/ttm_pool.c | 110 > > > > ++++++++++++++++++++++++++++--- > > > > -- > > > > 1 file changed, 94 insertions(+), 16 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > > index d380a3c7fe40..78efc8524133 100644 > > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > > > > *tt) > > > > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, > > > > tt- > > > > > num_pages); > > > > } > > > > > > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > > > > ttm_tt *tt, > > > > + struct file *backup, struct > > > > folio > > > > *folio, > > > > + unsigned int order, bool > > > > writeback, > > > > + pgoff_t idx, gfp_t page_gfp, > > > > gfp_t > > > > alloc_gfp) > > > > > > I don't really understand why we can't end up with a > > > ttm_backup_backup_folio(), which I believe is the proper layering, > > > already at this point? Please see a suggestion at > > > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads > > > > > > Here the splitting logic is kept in the ttm_pool, but ttm_backup > > > supports handing large folios to it. > > > > > > Although the cumulative diffstat becomes larger, the end code > > > becomes > > > smaller and IMO easier to read, and we don't need to introduce code > > > that we immediately have to refactor. > > > > That version looks fine too. If that is preference no issue. > > Cool. Note that there is a bug in that we don't pass the folio order > into ttm_backup_backup_folio(). I'm force-pushing a fix for that. > > > > > > My goal with this series is get something than can reasonably be > > backported to LTS kernels so the desktop doesn't frequently enter > > kswapd > > because of fragmentation. We now have at least 3 reports of this > > being > > an issue. > > > > This is larger fix [1] which works in tandem but seemly unlikely to > > backportable given it add new concepts to the core MM [1]. > > > > [1] https://patchwork.freedesktop.org/series/165329/ > > > > > > > > But I'm starting to question the general approach: Even if the > > > *shrinker* can recover from a total kernel memory reserve > > > depletion, it > > > can't really be considered a reasonable practice, since if we > > > frequently deplete the reserves, *other* important allocations in > > > the > > > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail > > > and > > > people will have a hard time finding out why. > > > > > > > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and > > eventually make progress—i.e., take the split path if needed? I’m not > > 100% sure, but my initial reaction is that this concern may not be > > valid; however, MM is hard to reason about. > > No, GFP_ATOMIC just uses what's available without any reclaim at all. > It's more aggressive than GFP_NOWAIT in that it allows dipping into the > kernel reserves. > Right - wrote this before I had my coffee. > > > > Again, FWIW, I’ve tried a lot of things to trigger OOM—for example, > > running WebGL tabs and then kicking off various very memory-intensive > > workloads from the CLI—and I still haven’t hit OOM or seen memory > > allocation failures or warnings. > > > > > So I actually don't think we can be avoiding the splitting without > > > direct insertion. FWIW, up until recently when shmem started > > > supporting > > > > I agree direct insertion is better solution. Do you think this > > something > > we could reasonably get working and backport? I haven't done any > > research on direct insertion yet, thus why I'm asking. > > Yes I think so. The problem would be to get it accepted. Looking into > that now, but hitting various kinds of subtle issues. > Ok, I'm pretty unlikely to get the shrinker work to the finish line before I go - fine with whatever lands in either part: - Shrinking THP should not make fragmentation worse (this patch), a version of this should get Xe reasonably stable, hopefully this fix can be backported. - Avoid evicting working sets under fragmentation ([1] above) Matt > Thanks, > Thomas > > > > > > > huge page swapping, other GPU drivers basically also split pages at > > > swapout. > > > > I wonder if other drivers have the same issue? The deadly combo is > > allow > > GPUs to subscribe all of system memory, allocate THP pages (or higher > > order pages), and split them in the shrinker. Xe might be the only > > driver with right combo to hit this but not 100% sure without a deep > > dive. > > > > > > > > Another idea for improving on the compaction loop, perhaps worth > > > trying > > > is this change, shamelessly stolen from i915: > > > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads > > > > > > > I'd have to give this a try - I'm quickly running out of time before > > I > > leave for month though. > > > > Matt > > > > > /Thomas > > > > > > > > > > +{ > > > > + struct page *page = folio_page(folio, 0); > > > > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > > > > + bool folio_has_been_split = false; > > > > + > > > > + for (i = 0; i < npages; ++i) { > > > > + s64 shandle; > > > > + > > > > +try_again_after_split: > > > > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > > + should_fail(&backup_fault_inject, 1)) > > > > + shandle = -ENOMEM; > > > > + else > > > > + shandle = ttm_backup_backup_page(backup, > > > > page + i, > > > > + > > > > writeback, > > > > idx + i, > > > > + > > > > page_gfp, > > > > alloc_gfp); > > > > + > > > > + if (shandle < 0 && !folio_has_been_split && > > > > order) { > > > > + pgoff_t j; > > > > + > > > > + /* > > > > + * True OOM: could not allocate a shmem > > > > folio > > > > + * for the next subpage. Fall back to > > > > splitting > > > > + * the source compound and backing up > > > > subpages > > > > + * individually. Release the already- > > > > backed- > > > > up > > > > + * subpages whose contents now live in > > > > shmem; > > > > + * any further failure terminates the > > > > loop > > > > with > > > > + * partial progress (handled by the > > > > caller). > > > > + */ > > > > + folio_has_been_split = true; > > > > + ttm_pool_split_for_swap(pool, page); > > > > + > > > > + for (j = 0; j < i; ++j) { > > > > + __free_pages_gpu_account(page + > > > > j, > > > > 0, false); > > > > + shrunken++; > > > > + } > > > > + > > > > + goto try_again_after_split; > > > > + } else if (shandle < 0) { > > > > + ret = shandle; > > > > + goto out; > > > > + } else if (folio_has_been_split) { > > > > + __free_pages_gpu_account(page + i, 0, > > > > false); > > > > + shrunken++; > > > > + } > > > > + > > > > + tt->pages[idx + i] = > > > > ttm_backup_handle_to_page_ptr(shandle); > > > > + } > > > > + > > > > + if (!folio_has_been_split) { > > > > + /* Compound fully backed up; free at native > > > > order. > > > > */ > > > > + page->private = 0; > > > > + __free_pages_gpu_account(page, order, false); > > > > + shrunken += npages; > > > > + } > > > > + > > > > +out: > > > > + return shrunken ? shrunken : ret; > > > > +} > > > > + > > > > /** > > > > * ttm_pool_backup() - Back up or purge a struct ttm_tt > > > > * @pool: The pool used when allocating the struct ttm_tt. > > > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool > > > > *pool, > > > > struct ttm_tt *tt, > > > > { > > > > struct file *backup = tt->backup; > > > > struct page *page; > > > > - unsigned long handle; > > > > gfp_t alloc_gfp; > > > > gfp_t gfp; > > > > int ret = 0; > > > > pgoff_t shrunken = 0; > > > > - pgoff_t i, num_pages; > > > > + pgoff_t i, num_pages, npages; > > > > > > > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > > > > return -EINVAL; > > > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > > struct ttm_tt *tt, > > > > unsigned int order; > > > > > > > > page = tt->pages[i]; > > > > - if (unlikely(!page)) { > > > > + if (unlikely(!page || > > > > + > > > > ttm_backup_page_ptr_is_handle(page))) { > > > > num_pages = 1; > > > > continue; > > > > } > > > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool > > > > *pool, > > > > struct ttm_tt *tt, > > > > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > > should_fail(&backup_fault_inject, 1)) > > > > num_pages = DIV_ROUND_UP(num_pages, 2); > > > > > > > > - for (i = 0; i < num_pages; ++i) { > > > > - s64 shandle; > > > > + for (i = 0; i < num_pages; i += npages) { > > > > + unsigned int order; > > > > > > > > + npages = 1; > > > > page = tt->pages[i]; > > > > if (unlikely(!page)) > > > > continue; > > > > > > > > - ttm_pool_split_for_swap(pool, page); > > > > + /* Already-handled entry from a previous > > > > attempt. */ > > > > + if > > > > (unlikely(ttm_backup_page_ptr_is_handle(page))) > > > > + continue; > > > > > > > > - shandle = ttm_backup_backup_page(backup, page, > > > > flags->writeback, i, > > > > - gfp, > > > > alloc_gfp); > > > > - if (shandle < 0) { > > > > - /* We allow partially shrunken tts */ > > > > - ret = shandle; > > > > + order = ttm_pool_page_order(pool, page); > > > > + npages = 1UL << order; > > > > + > > > > + /* > > > > + * Back up the compound atomically at its native > > > > order. If > > > > + * fault injection truncated num_pages mid- > > > > compound, > > > > skip > > > > + * the partial tail rather than splitting. > > > > + */ > > > > + if (unlikely(i + npages > num_pages)) > > > > + break; > > > > + > > > > + ret = ttm_pool_backup_folio(pool, tt, backup, > > > > page_folio(page), > > > > + order, flags- > > > > >writeback, > > > > i, gfp, > > > > + alloc_gfp); > > > > + if (unlikely(ret < 0)) > > > > + break; > > > > + > > > > + shrunken += ret; > > > > + > > > > + /* partial backup */ > > > > + if (unlikely(ret != npages)) > > > > break; > > > > - } > > > > - handle = shandle; > > > > - tt->pages[i] = > > > > ttm_backup_handle_to_page_ptr(handle); > > > > - __free_pages_gpu_account(page, 0, false); > > > > - shrunken++; > > > > } > > > > > > > > return shrunken ? shrunken : ret; ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-05-06 18:05 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260505200443.3300962-1-matthew.brost@intel.com>
2026-05-05 20:04 ` [PATCH v5 1/2] drm/ttm: Drop tt->restore after successful restore Matthew Brost
2026-05-05 20:04 ` [PATCH v5 2/2] drm/ttm/pool: back up at native page order Matthew Brost
2026-05-06 14:23 ` Thomas Hellström
2026-05-06 16:14 ` Matthew Brost
2026-05-06 16:16 ` Matthew Brost
2026-05-06 16:26 ` Thomas Hellström
2026-05-06 18:05 ` Matthew Brost
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox