* [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation @ 2026-06-17 3:22 Matthew Brost 2026-06-17 3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost 2026-06-17 3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost 0 siblings, 2 replies; 6+ messages in thread From: Matthew Brost @ 2026-06-17 3:22 UTC (permalink / raw) To: linux-mm, linux-kernel, intel-xe, dri-devel Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko Continuation of [1]. TTM allocations at higher orders can drive Xe into a pathological reclaim loop when memory is fragmented: kswapd → shrinker → eviction → rebind (exec ioctl) → repeat In this state, reclaim is triggered despite substantial free memory, but fails to produce contiguous higher-order pages. The Xe shrinker then evicts active buffer objects, increasing faulting and rebind activity and further feeding the loop. The result is high CPU overhead and poor GPU forward progress. This issue was first reported in [2] and independently observed internally and by Google. A simple reproducer is: - Boot an iGPU system with mem=8G - Launch 10 Chrome tabs running the WebGL aquarium demo - Configure each tab with ~5k fish Under this workload, ftrace shows a continuous loop of: xe_shrinker_scan (kswapd) xe_vma_rebind_exec Performance degrades significantly, with each tab dropping to ~2 FPS on PTL (Ubuntu 24.04). At the same time, /proc/buddyinfo shows substantial free memory but no higher-order availability. 
For example, the Normal zone:

Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0

This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.

This series addresses the issue in two layers:

MM: Introduce an opportunistic_compaction hint in shrink_control. kswapd
folds the gfp flags of its wakers into a per-pgdat tri-state (see enum
kswapd_opportunistic_compaction_type) and forwards it to shrinkers. The
hint is set when every waker for a kswapd run is a failable high-order
allocation (__GFP_NORETRY or __GFP_RETRY_MAYFAIL, without __GFP_NOFAIL),
i.e. callers that would rather see the allocation fail than have working
sets torn down to satisfy it. Any order-0 or non-failable waker clears
the hint for that run, so normal memory pressure is unaffected.
Similarly, direct reclaim sets the opportunistic_compaction hint based
on the caller's gfp_mask and order.

Xe: Consume shrink_control::opportunistic_compaction in the Xe shrinker.
When the hint is set for a high-order pass, the shrinker skips
advertising and performing TTM backup work (which operates at native
page order and would not help compaction) and avoids tearing down active
GPU working sets.

With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.

Buddyinfo after applying this series shows restored higher-order
availability:

Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1

In addition, various 3D benchmarks show significant improvement when
memory is fragmented.
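The arithmetic behind the two buddyinfo rows above can be checked with a small userspace sketch. It assumes 4 KiB base pages and 11 orders (0..10), matching the rows as quoted; the function and array names are made up for illustration and are not part of the series.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 11	/* orders 0..10, as in the quoted buddyinfo rows */
#define PAGE_KIB  4UL	/* assumption: 4 KiB base pages */

/* Normal-zone counts quoted in the cover letter. */
const unsigned long before_series[NR_ORDERS] = {
	4063, 4595, 3455, 3400, 3139, 2762, 2293, 1655, 643, 0, 0
};
const unsigned long after_series[NR_ORDERS] = {
	8526, 7067, 3092, 1959, 1292, 660, 194, 28, 20, 13, 1
};

/* Total free memory, in KiB, represented by one buddyinfo row. */
unsigned long buddy_free_kib(const unsigned long counts[NR_ORDERS])
{
	unsigned long pages = 0;

	/* A free block of a given order spans 2^order base pages. */
	for (size_t order = 0; order < NR_ORDERS; order++)
		pages += counts[order] << order;

	return pages * PAGE_KIB;
}

/* Number of free blocks whose order is at least min_order. */
unsigned long buddy_blocks_at_least(const unsigned long counts[NR_ORDERS],
				    size_t min_order)
{
	unsigned long blocks = 0;

	assert(min_order < NR_ORDERS);
	for (size_t order = min_order; order < NR_ORDERS; order++)
		blocks += counts[order];

	return blocks;
}
```

For the first row this gives 716081 free base pages, i.e. 2864324 KiB (about 2797 MiB), with zero blocks at order 9 or above; the second row has 14 such blocks.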
v2:
 - Layer with core MM / TTM helpers (Thomas)
v4:
 - Fix build (CI)
v5:
 - Use shrinker based heuristics (Dave Chinner, Thomas's GFP idea)
 - Rename lazy_compaction → opportunistic_compaction
v6:
 - Drop order in shrink_control, rely only on opportunistic_compaction
   hint (Testing)
 - Set opportunistic_compaction in direct reclaim (Testing)
 - Drop unrelated TTM changes, which merged independently

[1] https://patchwork.freedesktop.org/series/165329/
[2] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R.
Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Matthew Brost (2): mm: Introduce opportunistic_compaction concept to vmscan and shrinkers drm/xe: Make use of shrink_control::opportunistic_compaction hint drivers/gpu/drm/xe/xe_shrinker.c | 20 ++++++- include/linux/mmzone.h | 40 ++++++++++++++ include/linux/shrinker.h | 20 +++++++ mm/internal.h | 2 +- mm/shrinker.c | 13 +++-- mm/vmscan.c | 95 +++++++++++++++++++++++++++++--- 6 files changed, 174 insertions(+), 16 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers 2026-06-17 3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost @ 2026-06-17 3:22 ` Matthew Brost 2026-06-22 23:10 ` Dave Chinner 2026-06-17 3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost 1 sibling, 1 reply; 6+ messages in thread From: Matthew Brost @ 2026-06-17 3:22 UTC (permalink / raw) To: linux-mm, linux-kernel, intel-xe, dri-devel Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL are often opportunistic attempts to satisfy fragmentation-sensitive allocations rather than indications of severe memory pressure. In these cases, reclaim may invoke shrinkers that aggressively destroy working sets even though reclaim is unlikely to materially improve the allocation outcome. Some shrinkers manage expensive backing or migration operations where reclaim can result in substantial working set disruption despite the system having sufficient free memory overall. This is particularly visible in fragmentation-heavy workloads where reclaim repeatedly tears down active state while kswapd attempts to satisfy higher-order allocations. Introduce an opportunistic_compaction hint in shrink_control that allows kswapd to communicate when reclaim originates from a high-order allocation context that may be fragmentation driven rather than true memory pressure. Shrinkers may use this hint to avoid destructive working set reclaim while still participating normally during order-0 or stronger reclaim conditions. 
The hint is propagated through shrink_slab() and derived from high-order
kswapd wakeups, or from the direct reclaim gfp mask, associated with
failable allocation contexts.

No functional changes are introduced for existing shrinkers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

Dave Chinner, I’d appreciate feedback on this approach, as you NACKed an
earlier revision where a similar heuristic was implemented in the
drivers. It is now baked into the shrinker, as you suggested.
---
 include/linux/mmzone.h   | 40 +++++++++++++++++
 include/linux/shrinker.h | 20 +++++++++
 mm/internal.h            |  2 +-
 mm/shrinker.c            | 13 ++++--
 mm/vmscan.c              | 95 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 157 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..1afc51018355 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1461,6 +1461,39 @@ struct memory_failure_stats {
 };
 #endif
 
+/*
+ * Per-pgdat state machine for the kswapd "opportunistic compaction" hint.
+ * + * wakeup_kswapd() collapses the gfp flags of all wakers that arrive between + * two kswapd runs into a single tri-state, which kswapd then forwards to the + * shrinkers via shrink_control::opportunistic_compaction: + * + * KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION + * Initial state after kswapd consumes the previous value. No waker has + * been observed yet for the upcoming run. + * + * KSWAPD_NO_OPPORTUNISTIC_COMPACTION + * At least one waker is an order-0 allocation, or a high-order + * allocation that cannot tolerate failure (i.e., not eligible for + * opportunistic behaviour). Shrinkers must do their normal best-effort + * work; the hint is cleared. + * + * KSWAPD_OPPORTUNISTIC_COMPACTION + * All wakers seen so far are high-order allocations that may fail + * (__GFP_NORETRY or __GFP_RETRY_MAYFAIL, without __GFP_NOFAIL). Shrinkers + * may skip work that is unlikely to produce a contiguous high-order + * block (e.g., evicting working-set pages). + * + * The state is sticky in the "NO" direction within a single kswapd run: once + * any non-eligible waker is observed, subsequent eligible wakers cannot + * upgrade it back to KSWAPD_OPPORTUNISTIC_COMPACTION. + */ +enum kswapd_opportunistic_compaction_type { + KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION = 0, + KSWAPD_NO_OPPORTUNISTIC_COMPACTION, + KSWAPD_OPPORTUNISTIC_COMPACTION, +}; + /* * On NUMA machines, each NUMA node would have a pg_data_t to describe * it's memory layout. On UMA machines there is a single pglist_data which @@ -1525,6 +1558,13 @@ typedef struct pglist_data { #endif struct task_struct *kswapd; /* Protected by kswapd_lock */ int kswapd_order; + /* + * Aggregated opportunistic-compaction hint for the next kswapd run. + * Updated by wakeup_kswapd() based on the gfp flags / order of each + * waker, and consumed (and reset) by kswapd before balance_pgdat(). + * See enum kswapd_opportunistic_compaction_type for the state machine. 
+ */ + atomic_t kswapd_opportunistic_compaction; enum zone_type kswapd_highest_zoneidx; atomic_t kswapd_failures; /* Number of 'reclaimed == 0' runs */ diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 1a00be90d93a..5f3e8dc98129 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -37,6 +37,26 @@ struct shrink_control { /* current node being shrunk (for NUMA aware shrinkers) */ int nid; + /* + * Opportunistic compaction hint. + * + * Set by the reclaim path to tell shrinkers that this pass is + * driven by an order > 0 allocation that the caller is willing to + * have fail (e.g., __GFP_NORETRY / __GFP_RETRY_MAYFAIL without + * __GFP_NOFAIL). Such allocations only really benefit from + * shrinking when doing so frees up a contiguous, high-order block; + * thrashing working sets in the hope of producing one is typically + * counter-productive. + * + * Shrinkers that can produce naturally-aligned high-order folios + * (see shrink_control::order) should treat this as a hint to skip + * costly work that is unlikely to help compaction (for example, + * evicting hot/working-set pages just to free single pages). + * + * Only meaningful when @order > 0; ignored otherwise. + */ + bool opportunistic_compaction; + /* * How many objects scan_objects should scan and try to reclaim. 
* This is reset before every call, so it is safe for callees diff --git a/mm/internal.h b/mm/internal.h index 5a2ddcf68e0b..cc915f04cf1e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1760,7 +1760,7 @@ void __meminit __init_page_from_nid(unsigned long pfn, int nid); /* shrinker related functions */ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, - int priority); + int priority, bool opportunistic_compaction); int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping, diff --git a/mm/shrinker.c b/mm/shrinker.c index 76b3f750cf65..2cf8f3a157f9 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -467,7 +467,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, #ifdef CONFIG_MEMCG static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, - struct mem_cgroup *memcg, int priority) + struct mem_cgroup *memcg, int priority, bool opportunistic_compaction) { struct shrinker_info *info; unsigned long ret, freed = 0; @@ -529,6 +529,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, + .opportunistic_compaction = opportunistic_compaction, }; struct shrinker *shrinker; int shrinker_id = calc_shrinker_id(index, offset); @@ -588,7 +589,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, } #else /* !CONFIG_MEMCG */ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, - struct mem_cgroup *memcg, int priority) + struct mem_cgroup *memcg, int priority, + bool opportunistic_compaction) { return 0; } @@ -600,6 +602,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, * @nid: node whose slab caches to target * @memcg: memory cgroup whose slab caches to target * @priority: the reclaim priority + * @opportunistic_compaction: do compaction opportunistically (e.g., do not swap working sets) * * Call the shrink functions to age shrinkable caches. 
* @@ -615,7 +618,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, * Returns the number of reclaimed slab objects. */ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, - int priority) + int priority, bool opportunistic_compaction) { unsigned long ret, freed = 0; struct shrinker *shrinker; @@ -628,7 +631,8 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, * oom. */ if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) - return shrink_slab_memcg(gfp_mask, nid, memcg, priority); + return shrink_slab_memcg(gfp_mask, nid, memcg, priority, + opportunistic_compaction); /* * lockless algorithm of global shrink. @@ -657,6 +661,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, + .opportunistic_compaction = opportunistic_compaction, }; if (!shrinker_try_get(shrinker)) diff --git a/mm/vmscan.c b/mm/vmscan.c index bd1b1aa12581..c40bc3f9ddd4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -96,6 +96,14 @@ struct scan_control { /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ int *proactive_swappiness; + /* + * Opportunistic compaction hint snapshotted from the pgdat at the + * start of this reclaim pass. Forwarded to shrinkers through + * shrink_control::opportunistic_compaction so they can skip + * non-productive work for failable high-order allocations. + */ + enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction; + /* Can active folios be deactivated as part of reclaim? */ #define DEACTIVATE_ANON 1 #define DEACTIVATE_FILE 2 @@ -200,6 +208,29 @@ struct scan_control { */ int vm_swappiness = 60; +/* + * Is @gfp_flags a high-order allocation that is eligible for the + * "opportunistic compaction" treatment in kswapd / shrinkers? + * + * The caller must be willing to tolerate failure (__GFP_NORETRY or + * __GFP_RETRY_MAYFAIL) and must not have set __GFP_NOFAIL. 
For such + * allocations there is little value in burning working-set pages just to + * scrape together a single high-order block: if compaction can't easily + * succeed, the caller would rather see the allocation fail. + */ +static bool gfp_opportunistic_compaction(gfp_t gfp_flags) +{ + return (gfp_flags & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) && + !(gfp_flags & __GFP_NOFAIL); +} + +static bool sc_opportunistic_compaction(struct scan_control *sc) +{ + return sc->order && (sc->kswapd_opportunistic_compaction == + KSWAPD_OPPORTUNISTIC_COMPACTION || (!current_is_kswapd() && + gfp_opportunistic_compaction(sc->gfp_mask))); +} + #ifdef CONFIG_MEMCG /* Returns true for reclaim through cgroup limits or cgroup interfaces. */ @@ -412,7 +443,7 @@ static unsigned long drop_slab_node(int nid) memcg = mem_cgroup_iter(NULL, NULL, NULL); do { - freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); + freed += shrink_slab(GFP_KERNEL, nid, memcg, 0, false); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); return freed; @@ -5053,6 +5084,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) unsigned long reclaimed = sc->nr_reclaimed; struct mem_cgroup *memcg = lruvec_memcg(lruvec); struct pglist_data *pgdat = lruvec_pgdat(lruvec); + bool opportunistic_compaction = sc_opportunistic_compaction(sc); /* lru_gen_age_node() called mem_cgroup_calculate_protection() */ if (mem_cgroup_below_min(NULL, memcg)) @@ -5068,7 +5100,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) success = try_to_shrink_lruvec(lruvec, sc); - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority, + opportunistic_compaction); if (!sc->proactive) vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned, @@ -6134,6 +6167,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); unsigned long reclaimed; 
unsigned long scanned; + bool opportunistic_compaction = sc_opportunistic_compaction(sc); /* * This loop can become CPU-bound when target memcgs @@ -6171,7 +6205,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) shrink_lruvec(lruvec, sc); shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, - sc->priority); + sc->priority, opportunistic_compaction); /* Record the group's reclaim efficiency */ if (!sc->proactive) @@ -7104,8 +7138,14 @@ clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx) * found to have free_pages <= high_wmark_pages(zone), any page in that zone * or lower is eligible for reclaim until at least one usable zone is * balanced. + * + * @kswapd_opportunistic_compaction is the aggregated hint produced by + * wakeup_kswapd() for this run; it is propagated into scan_control so that + * shrinkers can skip costly work that is unlikely to help compaction when + * all wakers are failable high-order allocations. */ -static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) +static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx, + enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction) { int i; unsigned long nr_soft_reclaimed; @@ -7119,6 +7159,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) .gfp_mask = GFP_KERNEL, .order = order, .may_unmap = 1, + .kswapd_opportunistic_compaction = kswapd_opportunistic_compaction, }; set_task_reclaim_state(current, &sc.reclaim_state); @@ -7338,8 +7379,10 @@ static enum zone_type kswapd_highest_zoneidx(pg_data_t *pgdat, return curr_idx == MAX_NR_ZONES ? 
prev_highest_zoneidx : curr_idx; } -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order, - unsigned int highest_zoneidx) +static void +kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order, + unsigned int highest_zoneidx, + enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction) { long remaining = 0; DEFINE_WAIT(wait); @@ -7385,6 +7428,11 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o if (READ_ONCE(pgdat->kswapd_order) < reclaim_order) WRITE_ONCE(pgdat->kswapd_order, reclaim_order); + + if (kswapd_opportunistic_compaction == + KSWAPD_NO_OPPORTUNISTIC_COMPACTION) + atomic_set(&pgdat->kswapd_opportunistic_compaction, + KSWAPD_NO_OPPORTUNISTIC_COMPACTION); } finish_wait(&pgdat->kswapd_wait, &wait); @@ -7441,6 +7489,7 @@ static int kswapd(void *p) unsigned int highest_zoneidx = MAX_NR_ZONES - 1; pg_data_t *pgdat = (pg_data_t *)p; struct task_struct *tsk = current; + enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction; /* * Tell the memory management that we're a "memory allocator", @@ -7458,6 +7507,8 @@ static int kswapd(void *p) set_freezable(); WRITE_ONCE(pgdat->kswapd_order, 0); + atomic_set(&pgdat->kswapd_opportunistic_compaction, + KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION); WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES); atomic_set(&pgdat->nr_writeback_throttled, 0); for ( ; ; ) { @@ -7466,13 +7517,18 @@ static int kswapd(void *p) alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order); highest_zoneidx = kswapd_highest_zoneidx(pgdat, highest_zoneidx); + kswapd_opportunistic_compaction = + atomic_read(&pgdat->kswapd_opportunistic_compaction); kswapd_try_sleep: kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, - highest_zoneidx); + highest_zoneidx, kswapd_opportunistic_compaction); /* Read the new order and highest_zoneidx */ alloc_order = READ_ONCE(pgdat->kswapd_order); + kswapd_opportunistic_compaction = + 
atomic_xchg(&pgdat->kswapd_opportunistic_compaction, + KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION); highest_zoneidx = kswapd_highest_zoneidx(pgdat, highest_zoneidx); WRITE_ONCE(pgdat->kswapd_order, 0); @@ -7499,7 +7555,8 @@ static int kswapd(void *p) trace_mm_vmscan_kswapd_wake(pgdat->node_id, highest_zoneidx, alloc_order); reclaim_order = balance_pgdat(pgdat, alloc_order, - highest_zoneidx); + highest_zoneidx, + kswapd_opportunistic_compaction); if (reclaim_order < alloc_order) goto kswapd_try_sleep; } @@ -7537,6 +7594,28 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, if (READ_ONCE(pgdat->kswapd_order) < order) WRITE_ONCE(pgdat->kswapd_order, order); + /* + * Fold this waker into the per-pgdat opportunistic-compaction hint + * that kswapd will pick up at the start of its next run. + * + * The state is sticky in the "NO" direction: once any waker in this + * batch is order-0 or a non-failable high-order allocation, the hint + * stays cleared until kswapd consumes it. Only when every waker so + * far is a failable high-order allocation do we set + * KSWAPD_OPPORTUNISTIC_COMPACTION, asking shrinkers to skip work + * that won't realistically help compaction. + */ + if (atomic_read(&pgdat->kswapd_opportunistic_compaction) != + KSWAPD_NO_OPPORTUNISTIC_COMPACTION) { + if (!order || !gfp_opportunistic_compaction(gfp_flags)) + atomic_set(&pgdat->kswapd_opportunistic_compaction, + KSWAPD_NO_OPPORTUNISTIC_COMPACTION); + else if (order && gfp_opportunistic_compaction(gfp_flags)) + atomic_cmpxchg(&pgdat->kswapd_opportunistic_compaction, + KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION, + KSWAPD_OPPORTUNISTIC_COMPACTION); + } + if (!waitqueue_active(&pgdat->kswapd_wait)) return; -- 2.34.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
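The tri-state folding in wakeup_kswapd() and the decision in sc_opportunistic_compaction() can be exercised outside the kernel with a minimal single-threaded model. This is only a sketch: the MODEL_* names and flag bit values are invented for illustration (they are not the kernel's gfp bits), the atomics are replaced by plain values, and the re-posting of the "NO" state from kswapd_try_to_sleep() is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace mirror of enum kswapd_opportunistic_compaction_type. */
enum model_hint {
	MODEL_HINT_UNSET = 0,
	MODEL_HINT_NO,
	MODEL_HINT_OPPORTUNISTIC,
};

/* Illustrative stand-ins; these are NOT the kernel's gfp bit values. */
#define MODEL_GFP_NORETRY	(1u << 0)
#define MODEL_GFP_RETRY_MAYFAIL	(1u << 1)
#define MODEL_GFP_NOFAIL	(1u << 2)

/* Mirrors gfp_opportunistic_compaction(): may fail, and is not NOFAIL. */
bool model_gfp_opportunistic(unsigned int gfp)
{
	return (gfp & (MODEL_GFP_NORETRY | MODEL_GFP_RETRY_MAYFAIL)) &&
	       !(gfp & MODEL_GFP_NOFAIL);
}

/* Fold one waker into the pending hint, following wakeup_kswapd(). */
enum model_hint model_fold_waker(enum model_hint cur, unsigned int gfp,
				 int order)
{
	assert(order >= 0);

	if (cur == MODEL_HINT_NO)
		return MODEL_HINT_NO;	/* sticky until kswapd consumes it */
	if (order == 0 || !model_gfp_opportunistic(gfp))
		return MODEL_HINT_NO;	/* any ineligible waker clears the hint */

	return MODEL_HINT_OPPORTUNISTIC;	/* UNSET becomes set, set stays set */
}

/* Mirrors sc_opportunistic_compaction() for kswapd and direct reclaim. */
bool model_sc_opportunistic(int order, enum model_hint kswapd_hint,
			    bool is_kswapd, unsigned int gfp)
{
	return order > 0 &&
	       (kswapd_hint == MODEL_HINT_OPPORTUNISTIC ||
		(!is_kswapd && model_gfp_opportunistic(gfp)));
}
```

In this model a single order-0 waker moves the state to "NO" regardless of how many eligible high-order wakers preceded it, and later eligible wakers cannot move it back, which is the stickiness the patch comments describe.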
* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers 2026-06-17 3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost @ 2026-06-22 23:10 ` Dave Chinner 2026-06-23 0:09 ` Matthew Brost 0 siblings, 1 reply; 6+ messages in thread From: Dave Chinner @ 2026-06-22 23:10 UTC (permalink / raw) To: Matthew Brost Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote: > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL > are often opportunistic attempts to satisfy fragmentation-sensitive > allocations rather than indications of severe memory pressure. In these > cases, reclaim may invoke shrinkers that aggressively destroy working > sets even though reclaim is unlikely to materially improve the > allocation outcome. > > Some shrinkers manage expensive backing or migration operations where > reclaim can result in substantial working set disruption despite the > system having sufficient free memory overall. This is particularly > visible in fragmentation-heavy workloads where reclaim repeatedly tears > down active state while kswapd attempts to satisfy higher-order > allocations. > > Introduce an opportunistic_compaction hint in shrink_control that allows > kswapd to communicate when reclaim originates from a high-order > allocation context that may be fragmentation driven rather than true > memory pressure. Shrinkers may use this hint to avoid destructive > working set reclaim while still participating normally during order-0 > or stronger reclaim conditions. 
To be honest, this seems like another "push a hint through to the XE
shrinker" mechanism under a different name. You seem so focused on
fixing the XE reproducer that the -systemic problem- that -any-
high-order folio demand causes is not being acknowledged.

e.g. we use high-order folios extensively in the page cache these
days, and there are -many- cases where memory compaction driven by
high-order demand causes significant performance regressions for page
cache performance. To date, every single person who has wanted to
fix the problem they are seeing has effectively attempted to -turn
off compaction- via GFP flags.

I've even done that myself inside XFS to work around kvmalloc()
issues with a lack of GFP_NOFAIL support and doing costly high order
allocations that fail and trigger compaction before falling back to
vmalloc(). However, these issues have since been fixed in the
kvmalloc() code, such that it now does the right thing for most
calling contexts (i.e. tries high-order kmalloc() without triggering
compaction, then falls back to GFP_NOFAIL vmalloc()). This has made
kvmalloc() more performant and better behaved for -all users-, not
just XFS.

This is not sustainable - we need compaction to be robust and
performant in the face of high-order folio demands, regardless of
what subsystem is generating the demand.

So with that in mind, let me paraphrase the comment in the second
patch in the Xe shrinker implementation:

"Shrinker reclaim is based on implementation specific object sizes
so it is unlikely to ever achieve contiguous page reclaim in a
manner that will measurably improve compaction rates."

You also say:

> No functional changes are introduced for existing shrinkers.

Consider how many shrinkable caches the general statement above
applies to, and then think about the fundamental impedance mismatch
between the affected shrinkable caches and what this patch actually
fixes.
For example, what happens to slab-based caches if the XE cache is
being excessively reclaimed under high-order page demand? e.g. the
slab-based cache may have tens of objects per page and holds a
system-level performance critical working set of objects. How do
these caches handle the excessive reclaim demand being generated by
compaction thrashing?

Yup, they don't.

In the case of filesystem caches, the "reclaim and repopulate"
pattern you describe causing the XE perf problems causes internal
slab cache fragmentation. Not only does this not improve compaction
rates, it also results in more memory fragmentation because slab
pages get pinned by a small number of long lived objects and they
won't get freed until the cache is largely emptied.

IOWs, things get -even worse- from a memory fragmentation POV when
compaction thrashing causes the working set of a
high-object-count-per-page slab cache to thrash....

This isn't isolated to individual subsystem thrashing. If we run a
file-based workload that generates high-order folio demand and hence
compaction (e.g. copy tens of GB of files between two XFS, ext4 or
btrfs filesystems), that will -also- trash the Xe working set via
the shrinker being hammered by memory compaction trying to free up
contiguous pages for the page cache.

Similarly, if we run a Xe workload that generates sustained high
order folio demand, that will trash the working set in the dentry
and inode caches and any other shrinkable slab-based cache.

Hence the abstracted case of the problem we need to solve is this:
shrinker reclaim based on x-byte objects is extremely unlikely to
achieve contiguous page reclaim in a manner that will measurably
improve compaction rates.

This is a problem that has to be addressed at the high-level
infrastructure level, not worked around by individual shrinkers.

IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
specifically flagged as being able to release contiguous pages of
memory in short order.
I don't think there are very many shrinkable caches that even hold a
significant quantity of objects larger than a single page, so it's
clearly questionable as to whether compaction based reclaim should
run shrinker reclaim to begin with.

i.e. a subsystem that can track high order folios in a shrinkable
cache should probably have a "->compaction_scan()" method that is
run directly from compaction context to try to free high order
folios. This provides a direct opt-in mechanism for a subsystem, and
it allows subsystems that can track low- and high-order objects
independently to efficiently free objects in a way that will help
improve compaction rates without impacting the entire working set of
objects in the cache.

IOWs, this patch to inform kswapd about its trigger (doesn't it
already have a "reason" parameter, though?) is likely a necessary
part of the solution - we don't want kswapd running shrinkers if it
has been triggered to reclaim pages for compaction. This patch would
allow kswapd to elide normal shrinker passes when it has been woken
purely for compaction purposes. Given that the compaction code would
be running the high-order reclaim capable shrinkers itself, this
would avoid trashing the working set of most shrinkable caches -by
default- under high order allocation demand....

-Dave.
--
Dave Chinner
dgc@kernel.org

^ permalink raw reply [flat|nested] 6+ messages in thread
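The opt-in idea in the reply above could look roughly like the following userspace model. Everything here is hypothetical: no "->compaction_scan()" callback exists in the kernel's struct shrinker today, and the struct, the callbacks and the numbers are invented purely to illustrate the dispatch rule being proposed (only shrinkers that opted in are called from compaction context).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a shrinker with an opt-in hook. */
struct model_shrinker {
	const char *name;
	/* Ordinary reclaim path: scan up to nr_to_scan objects. */
	unsigned long (*scan_objects)(unsigned long nr_to_scan);
	/* Hypothetical opt-in hook; NULL means "cannot free contiguous memory". */
	unsigned long (*compaction_scan)(unsigned int order);
};

/* Counts how often the slab-style cache is asked to scan. */
unsigned long model_slab_scan_calls;

unsigned long model_slab_scan(unsigned long nr_to_scan)
{
	model_slab_scan_calls++;
	return nr_to_scan;
}

unsigned long model_pool_scan(unsigned long nr_to_scan)
{
	return nr_to_scan;
}

/* Pretend the pool holds four idle blocks, each of order 9 or smaller. */
unsigned long model_pool_compaction_scan(unsigned int order)
{
	return order <= 9 ? 4 : 0;
}

const struct model_shrinker model_shrinkers[] = {
	{ "slab-cache", model_slab_scan, NULL },
	{ "page-pool",  model_pool_scan, model_pool_compaction_scan },
};

/* Compaction-context reclaim: only opted-in shrinkers are invoked. */
unsigned long model_compaction_reclaim(const struct model_shrinker *shrinkers,
				       size_t nr, unsigned int order)
{
	unsigned long freed = 0;

	assert(shrinkers != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (!shrinkers[i].compaction_scan)
			continue;	/* working set of this cache is left alone */
		freed += shrinkers[i].compaction_scan(order);
	}

	return freed;
}
```

In this model the slab-style cache is never scanned on behalf of compaction, so its working set is preserved by default, while the pool that can hand back contiguous blocks still participates.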
* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-22 23:10   ` Dave Chinner
@ 2026-06-23  0:09     ` Matthew Brost
  2026-06-23  5:32       ` Dave Chinner
From: Matthew Brost @ 2026-06-23  0:09 UTC
To: Dave Chinner
Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton,
	Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

On Tue, Jun 23, 2026 at 09:10:43AM +1000, Dave Chinner wrote:
> On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> > are often opportunistic attempts to satisfy fragmentation-sensitive
> > allocations rather than indications of severe memory pressure. In these
> > cases, reclaim may invoke shrinkers that aggressively destroy working
> > sets even though reclaim is unlikely to materially improve the
> > allocation outcome.
> >
> > Some shrinkers manage expensive backing or migration operations where
> > reclaim can result in substantial working set disruption despite the
> > system having sufficient free memory overall. This is particularly
> > visible in fragmentation-heavy workloads where reclaim repeatedly tears
> > down active state while kswapd attempts to satisfy higher-order
> > allocations.
> >
> > Introduce an opportunistic_compaction hint in shrink_control that allows
> > kswapd to communicate when reclaim originates from a high-order
> > allocation context that may be fragmentation driven rather than true
> > memory pressure. Shrinkers may use this hint to avoid destructive
> > working set reclaim while still participating normally during order-0
> > or stronger reclaim conditions.

Thanks for the input - this is a tough problem.

>
> To be honest, this seems like another "push a hint through to the XE
> shrinker" mechanism under a different name. You seem so focused on
> fixing the XE reproducer that the -systemic problem- that -any-
> high-order folio demand causes is not being acknowledged.
>

I'm not exactly sure I agree here. Communicating via __GFP_NORETRY or
__GFP_RETRY_MAYFAIL with a higher order implies that the caller can
handle higher-order allocation failures, so the shrinker shouldn’t try
too hard to obtain a large page (e.g., evict a working set). I agree
that Xe is currently the only shrinker making use of this, but other
shrinkers could also hook into it. This information simply isn’t
available today.

> e.g. we use high-order folios extensively in the page cache these
> days, and there are -many- cases where memory compaction driven by
> high-order demand causes significant performance regressions for page
> cache performance. To date, every single person who has wanted to
> fix the problem they are seeing has effectively attempted to -turn
> off compaction- via GFP flags.

So does that mean they clear __GFP_RECLAIM?

That isn't really the case in DRM or Xe. In the former case we have
pools of lower-order pages in TTM that are not in use and can be shrunk,
potentially freeing multiple lower-order pages so that a higher-order
page can be formed. In the latter, there may be BOs (sets of pages) in
Xe marked as purgeable (not in the working set), which can also be
shrunk. Other DRM drivers have purging concepts too.

I’m not very familiar with what other shrinkers or subsystems want, but
presumably other shrinkers have pools or caches that aren’t currently in
use, where they can say, “OK, I’ll give these pages up for opportunistic
compaction, but I won’t give up my working set.” Of course, as mentioned
above, if someone else explicitly requests large pages by avoiding
__GFP_NORETRY and __GFP_RETRY_MAYFAIL, the shrinker should then give up
its working set.

>
> I've even done that myself inside XFS to work around kvmalloc()
> issues with a lack of GFP_NOFAIL support and doing costly high order
> allocations that fail and trigger compaction before falling back to
> vmalloc(). However, these issues have since been fixed in the
> kvmalloc() code, such that it now does the right thing for most
> calling contexts (i.e. tries high-order kmalloc() without triggering
> compaction, then falls back to GFP_NOFAIL vmalloc()). This has made
> kvmalloc() more performant and better behaved for -all users-, not
> just XFS.
>
> This is not sustainable - we need compaction to be robust and
> performant in the face of high-order folio demands, regardless of
> what subsystem is generating the demand.
>
> So with that in mind, let me paraphrase the comment in the second
> patch in the Xe shrinker implementation:
>
> "Shrinker reclaim is based on implementation specific object sizes
> so it is unlikely to ever achieve contiguous page reclaim in a
> manner that will measurably improve compaction rates."
>

This might be slightly misworded - what I really mean is that I don’t
want to give up my working set for higher-order allocations that are
allowed to fail, but I do want to give up my cache.

> You also say:
>
> > No functional changes are introduced for existing shrinkers.
>
> Consider how many shrinkable caches the general statement above
> applies to, and then think about the fundamental impedance mismatch
> between the affected shrinkable caches and what this patch actually
> fixes.
>

Yes, as mentioned above, I’m only addressing Xe here, and I agree that
this is likely an issue. Do you know of other shrinkers that have pools
or caches which can be shrunk under the conditions I’m introducing here,
but also have a working set they would prefer not to give up? If so, a
link on elixir.bootlin.com would be helpful so I can take a look. I’ll
also try to go through other shrinkers myself.

> For example, what happens to slab-based caches if the XE cache is being
> excessively reclaimed under high-order page demand? e.g. the slab-based
> cache may have tens of objects per page and holds a system-level
> performance critical working set of objects. How do these caches
> handle the excessive reclaim demand being generated by compaction
> thrashing?
>
> Yup, they don't.
>

Agree.

> In the case of filesystem caches, the "reclaim and repopulate"
> pattern you describe causing the XE perf problems causes internal
> slab cache fragmentation. Not only does this not improve compaction
> rates, it also results in more memory fragmentation because slab
> pages get pinned by a small number of long lived objects and they
> won't get freed until the cache is largely emptied. IOWs, things
> get -even worse- from a memory fragmentation POV when compaction
> thrashing causes the working set of a high-object-count-per-page
> slab cache to thrash....
>

Got a link to the code which you are referring to?

That seems like a problem similar to another issue in DRM/Xe. We found
that the process of shrinking actually drove fragmentation by splitting
folios down to order-0 and then backing pages up one at a time. I have a
separate fix in flight for that.

Could the filesystem detect these hints and avoid shrinking in a way
that causes fragmentation? Alternatively, could it perform shrinking in
a way that doesn’t shatter folios, or detect long-lived objects so it
understands that shrinking isn’t going to help reduce fragmentation?

> This isn't isolated to individual subsystem thrashing. If we run a
> file-based workload that generates high-order folio demand and hence

What GFP flags are typically used for file-based workloads?

> compaction (e.g. copy tens of GB of files between two XFS, ext4 or
> btrfs filesystems), that will -also- trash the Xe working set via
> the shrinker being hammered by memory compaction trying to free up
> contiguous pages for the page cache.
>

I could see this.

> Similarly, if we run a Xe workload that generates sustained high
> order folio demand, that will trash the working set in the dentry
> and inode caches and any other shrinkable slab-based cache.
>

I could also see this, but DRM / Xe will set __GFP_NORETRY or
__GFP_RETRY_MAYFAIL on higher orders, so those caches should be able to
avoid trashing their working sets if they looked for this hint.

> Hence the abstracted case of the problem we need to solve is this:
> shrinker reclaim based on x-byte objects is extremely unlikely to
> achieve contiguous page reclaim in a manner that will measurably
> improve compaction rates.
>
> This is a problem that has to be addressed at the high level
> infrastructure level, not worked around by individual shrinkers.
>
> IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
> specifically flagged as being able to release contiguous pages of
> memory in short order. I don't think there's very many shrinkable
> caches that even hold a significant quantity of objects larger than
> a single page, so it's clearly questionable as to whether compaction
> based reclaim should run shrinker reclaim to begin with.
>

Yes, we sort of do this in Xe by changing '->count_objects' based on the
hint.

> i.e. a subsystem that can track high order folios in a shrinkable
> cache should probably have a "->compaction_scan()" method that is
> run directly from compaction context to try to free high order

When you say “compaction context,” which parts of the code are you
referring to? I’d like to explore this option, but I need a bit more
context.

> folios. This provides a direct opt-in mechanism for a subsystem, and
> it allows subsystems that can track low- and high- order objects
> independently to efficiently free objects in a way that will help
> improve compaction rates without impacting the entire working set of
> objects in the cache.

Does this help if, for example, the cache is holding onto two order-8
folios that could be freed and merged, while the caller really wants an
order-9 folio? This seems like a possible scenario in caches and is
certainly true in TTM pools.

>
> IOWs, this patch to inform kswapd about its trigger (doesn't it
> already have a "reason" parameter, though?) is likely a necessary
> part of the solution - we don't want kswapd running shrinkers if it
> has been triggered to reclaim pages for compaction. This patch would
> allow kswapd to elide normal shrinker passes when it has been woken
> purely for compaction purposes. Given that the compaction code would
> be running the high-order reclaim capable shrinkers itself, this
> would avoid trashing the working set of most shrinkable caches -by
> default- under high order allocation demand....
>

I’m trying to parse this - are you suggesting that, one way or another, we
introduce a heuristic where shrinkers can act on a hint (whether it’s
what I have here or a new ->compaction_scan() vfunc), and then attempt
to fix all shrinkers in this series? I’m open to trying to fix other
shrinkers as well. Do you have any particular ones in mind? I count
around 45 shrinkers in Linux, so it’s unlikely I can fix every single
one, though or all shrinkers need to be fixed.

On a side note, I just noticed that struct shrinker has count_objects
and scan_objects as individual vfuncs rather than using a const struct
shrinker_ops *ops. Should we change that? The latter seems cleaner and
is typically how things are done in Linux.

Matt

> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
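On the order-8 question raised in the message above: whether two freed order-8 blocks can become one order-9 block depends on them being buddies, not merely on both being free. A minimal userspace sketch of that arithmetic follows. The helper names are made up for illustration and this is not kernel code; the only thing borrowed from the page allocator is the XOR rule for locating a block's buddy.

```c
#include <assert.h>
#include <stdbool.h>

/* PFN of the buddy of the order-`order` block that starts at `pfn`. */
static unsigned long toy_buddy_pfn(unsigned long pfn, unsigned int order)
{
	/* A block of a given order is always naturally aligned. */
	assert((pfn & ((1UL << order) - 1)) == 0);
	return pfn ^ (1UL << order);
}

/* True if two free order-`order` blocks can merge into one order+1 block. */
static bool toy_can_merge(unsigned long pfn_a, unsigned long pfn_b,
			  unsigned int order)
{
	return toy_buddy_pfn(pfn_a, order) == pfn_b;
}

/* Start PFN of the order+1 block produced by merging `pfn` with its buddy. */
static unsigned long toy_merged_pfn(unsigned long pfn, unsigned int order)
{
	return pfn & ~(1UL << order);
}
```

Under this model, order-8 blocks at pfn 0 and pfn 256 merge into an order-9 block at pfn 0, while blocks at pfn 256 and pfn 512 are adjacent but not buddies and stay as two order-8 blocks. A pool that frees "two order-8 folios" therefore only helps an order-9 request when the pair happens to be aligned this way.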
* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-23  0:09     ` Matthew Brost
@ 2026-06-23  5:32       ` Dave Chinner
From: Dave Chinner @ 2026-06-23  5:32 UTC
To: Matthew Brost
Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton,
	Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

On Mon, Jun 22, 2026 at 05:09:33PM -0700, Matthew Brost wrote:
> On Tue, Jun 23, 2026 at 09:10:43AM +1000, Dave Chinner wrote:
> > On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> > > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> > > are often opportunistic attempts to satisfy fragmentation-sensitive
> > > allocations rather than indications of severe memory pressure. In these
> > > cases, reclaim may invoke shrinkers that aggressively destroy working
> > > sets even though reclaim is unlikely to materially improve the
> > > allocation outcome.
> > >
> > > Some shrinkers manage expensive backing or migration operations where
> > > reclaim can result in substantial working set disruption despite the
> > > system having sufficient free memory overall. This is particularly
> > > visible in fragmentation-heavy workloads where reclaim repeatedly tears
> > > down active state while kswapd attempts to satisfy higher-order
> > > allocations.
> > >
> > > Introduce an opportunistic_compaction hint in shrink_control that allows
> > > kswapd to communicate when reclaim originates from a high-order
> > > allocation context that may be fragmentation driven rather than true
> > > memory pressure. Shrinkers may use this hint to avoid destructive
> > > working set reclaim while still participating normally during order-0
> > > or stronger reclaim conditions.
>
> Thanks for the input - this is a tough problem.

Yes, that it is.

> > To be honest, this seems like another "push a hint through to the XE
> > shrinker" mechanism under a different name. You seem so focused on
> > fixing the XE reproducer that the -systemic problem- that -any-
> > high-order folio demand causes is not being acknowledged.
> >
>
> I'm not exactly sure I agree here. Communicating via __GFP_NORETRY or
> __GFP_RETRY_MAYFAIL with a higher order implies that the caller can
> handle higher-order allocation failures, so the shrinker shouldn’t try
> too hard to obtain a large page (e.g., evict a working set). I agree
> that Xe is currently the only shrinker making use of this, but other
> shrinkers could also hook into it. This information simply isn’t
> available today.

Right, but "we are doing compaction" isn't information that tells
the subsystem shrinker what it needs to do. "memory compaction is
occurring" isn't a well defined action like "count reclaimable
objects" or "scan N objects and reclaim as many as you can without
blocking".

Directed high order object reclaim should be much more efficient
than trying to use general memory pressure to age out enough objects
to reform contiguous pages. We need to help memory compaction, and
we can't really do that by layering heuristics over reclaim
algorithms designed to maintain working sets efficiently.

> > e.g. we use high-order folios extensively in the page cache these
> > days, and there are -many- cases where memory compaction driven by
> > high-order demand causes significant performance regressions for page
> > cache performance. To date, every single person who has wanted to
> > fix the problem they are seeing has effectively attempted to -turn
> > off compaction- via GFP flags.
>
> So does that mean they clear __GFP_RECLAIM?

Usually __GFP_DIRECT_RECLAIM, as it's the overhead of direct
compaction that causes the performance problems.

> That isn't really the case in DRM or Xe. In the former case we have
> pools of lower-order pages in TTM that are not in use and can be shrunk,
> potentially freeing multiple lower-order pages so that a higher-order
> page can be formed. In the latter, there may be BOs (sets of pages) in
> Xe marked as purgeable (not in the working set), which can also be
> shrunk. Other DRM drivers have purging concepts too.
>
> I’m not very familiar with what other shrinkers or subsystems want, but
> presumably other shrinkers have pools or caches that aren’t currently in
> use, where they can say, “OK, I’ll give these pages up for opportunistic
> compaction, but I won’t give up my working set.” Of course, as mentioned
> above, if someone else explicitly requests large pages by avoiding
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL, the shrinker should then give up
> its working set.

Most caches are slab-based, so there can be 10s of objects with
different life cycles per page. There is almost no possibility that
shrinker reclaim will free pages without substantial amounts of the
cache being reclaimed.

> > I've even done that myself inside XFS to work around kvmalloc()
> > issues with a lack of GFP_NOFAIL support and doing costly high order
> > allocations that fail and trigger compaction before falling back to
> > vmalloc(). However, these issues have since been fixed in the
> > kvmalloc() code, such that it now does the right thing for most
> > calling contexts (i.e. tries high-order kmalloc() without triggering
> > compaction, then falls back to GFP_NOFAIL vmalloc()). This has made
> > kvmalloc() more performant and better behaved for -all users-, not
> > just XFS.
> >
> > This is not sustainable - we need compaction to be robust and
> > performant in the face of high-order folio demands, regardless of
> > what subsystem is generating the demand.
> >
> > So with that in mind, let me paraphrase the comment in the second
> > patch in the Xe shrinker implementation:
> >
> > "Shrinker reclaim is based on implementation specific object sizes
> > so it is unlikely to ever achieve contiguous page reclaim in a
> > manner that will measurably improve compaction rates."
> >
>
> This might be slightly misworded - what I really mean is that I don’t
> want to give up my working set for higher-order allocations that are
> allowed to fail, but I do want to give up my cache.

Right, that's the core of the problem - compaction is the high-order
reclaim trigger, the existing shrinker infrastructure reclaim is for
the working-set maintenance reclaim algorithm the subsystem uses.

> > You also say:
> >
> > > No functional changes are introduced for existing shrinkers.
> >
> > Consider how many shrinkable caches the general statement above
> > applies to, and then think about the fundamental impedance mismatch
> > between the affected shrinkable caches and what this patch actually
> > fixes.
> >
>
> Yes, as mentioned above, I’m only addressing Xe here, and I agree that
> this is likely an issue. Do you know of other shrinkers that have pools
> or caches which can be shrunk under the conditions I’m introducing here,
> but also have a working set they would prefer not to give up?

The first that comes to mind is the xfs_buf cache. This cache holds
metadata buffers of different sizes, each of which can contain up
to 64kB of contiguous pages. The allocation algorithm uses
optimistic large folio allocation, but if that fails it falls back
to vmalloc. The working set is maintained by a prioritised
multi-scan LRU so that more frequently accessed metadata is held
tighter by the cache than less frequently accessed (e.g. btree roots
have higher retention priority than the lowest leaves).

It does not currently track buffer objects by size, but if there was
a benefit to doing so then it could be implemented. I'd much prefer
to have such tracking separate to the working set maintenance,
especially as they will likely need some kind of balancing to
prevent high-order buffers in the working set from being thrashed by
compaction demand....

I know there are other caches that have variable sized objects, but
I'd have to go look at the code to refresh my memory of which ones
they are...

> If so, a
> link on elixir.bootlin.com would be helpful so I can take a look. I’ll
> also try to go through other shrinkers myself.

cscope is your friend. fs/xfs/xfs_buf.c contains the XFS buffer
cache and shrinker infrastructure, but looking at the code without
any understanding of the filesystem structures or how it interacts
with the other XFS shrinkable caches probably isn't as useful as you
might think it will be.

> > For example, what happens to slab-based caches if the XE cache is being
> > excessively reclaimed under high-order page demand? e.g. the slab-based
> > cache may have tens of objects per page and holds a system-level
> > performance critical working set of objects. How do these caches
> > handle the excessive reclaim demand being generated by compaction
> > thrashing?
> >
> > Yup, they don't.
> >
>
> Agree.
>
> > In the case of filesystem caches, the "reclaim and repopulate"
> > pattern you describe causing the XE perf problems causes internal
> > slab cache fragmentation. Not only does this not improve compaction
> > rates, it also results in more memory fragmentation because slab
> > pages get pinned by a small number of long lived objects and they
> > won't get freed until the cache is largely emptied. IOWs, things
> > get -even worse- from a memory fragmentation POV when compaction
> > thrashing causes the working set of a high-object-count-per-page
> > slab cache to thrash....
> >
>
> Got a link to the code which you are referring to?

Do a lore search for "dentry cache defragmentation". You should be
able to find discussions that go back to around 2006 about dentry
cache fragmentation and approaches like slab-page based object
reclaim to support internal defragmentation.

The fact that we don't have slab cache defragmentation despite many
years of people wanting such functionality should tell you how
complex the problem is.... :/

> That seems like a problem similar to another issue in DRM/Xe. We found
> that the process of shrinking actually drove fragmentation by splitting
> folios down to order-0 and then backing pages up one at a time. I have a
> separate fix in flight for that.

Possibly, though the life cycle differences I'm talking about can be
a few milliseconds (temporary file) vs weeks (long running database
instance holding its table files open the whole time it is
running).

> Could the filesystem detect these hints and avoid shrinking in a way
> that causes fragmentation?

Not really. The fragmentation problem is caused by physical object
placement in the slab pages at allocation time, not the act of
reclaiming the object.

i.e. we don't know what the expected cache life time of a dentry or
an inode will be when we allocate it, so it just gets allocated in
the next free slot in the current partial slab page.

When you get a mix of dentries that are pinned by open files in long
running applications and dentries for access-once files in the same
page, we end up with reclaim freeing all the object slots that
contained access-once files. However, the pages are still pinned by
the objects for the open files that are in active use.

IOWs, LRU-based reclaim can free >90% of the objects in a cache that
held millions of objects with mixed lifetimes and still not free any
memory at all. There's nothing reclaim can do about it because the
problem is created at allocation time when lifetime is a complete
unknown.
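To put rough numbers on the point about mixed lifetimes, here is a toy userspace model. It is not slab code: every name and both constants are invented for illustration, and it assumes the worst case in which each page holds at least one long-lived object. It fills a cache, reclaims everything that is not pinned, and then counts how many pages actually became empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES      1000	/* slab pages in the toy cache */
#define TOY_OBJS_PER_PAGE 20	/* "tens of objects per page" */

struct toy_slab_page {
	bool in_use[TOY_OBJS_PER_PAGE];
	bool pinned[TOY_OBJS_PER_PAGE];	/* referenced, so never reclaimed */
};

static struct toy_slab_page toy_cache[TOY_NR_PAGES];

/* Fill every slot; the first `pinned_per_page` slots of each page are pinned. */
static void toy_cache_fill(size_t pinned_per_page)
{
	assert(pinned_per_page <= TOY_OBJS_PER_PAGE);
	for (size_t p = 0; p < TOY_NR_PAGES; p++) {
		for (size_t s = 0; s < TOY_OBJS_PER_PAGE; s++) {
			toy_cache[p].in_use[s] = true;
			toy_cache[p].pinned[s] = s < pinned_per_page;
		}
	}
}

/* Reclaim every unpinned object; returns the number of objects freed. */
static unsigned long toy_cache_reclaim(void)
{
	unsigned long freed = 0;

	for (size_t p = 0; p < TOY_NR_PAGES; p++) {
		for (size_t s = 0; s < TOY_OBJS_PER_PAGE; s++) {
			if (toy_cache[p].in_use[s] && !toy_cache[p].pinned[s]) {
				toy_cache[p].in_use[s] = false;
				freed++;
			}
		}
	}
	return freed;
}

/* Count pages with no live objects, i.e. pages that could be given back. */
static unsigned long toy_cache_empty_pages(void)
{
	unsigned long empty = 0;

	for (size_t p = 0; p < TOY_NR_PAGES; p++) {
		bool any_live = false;

		for (size_t s = 0; s < TOY_OBJS_PER_PAGE; s++)
			any_live = any_live || toy_cache[p].in_use[s];
		if (!any_live)
			empty++;
	}
	return empty;
}
```

With `toy_cache_fill(1)`, reclaim frees 19000 of the 20000 objects (95%) and `toy_cache_empty_pages()` still returns 0. With `toy_cache_fill(0)` the same reclaim pass empties all 1000 pages. The difference comes entirely from where the long-lived objects were placed, which is the allocation-time problem described above.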
> Alternatively, could it perform shrinking in
> a way that doesn’t shatter folios, or detect long-lived objects so it
> understands that shrinking isn’t going to help reduce fragmentation?

Referenced filesystem objects are not on the LRUs, so the shrinkers
aren't even aware of such long lived objects. And, as per the
"dentry cache defrag" comment above, we can't ask the slab to
reclaim or move objects because we don't track the owners of
external references to the objects themselves.

>
> > This isn't isolated to individual subsystem thrashing. If we run a
> > file-based workload that generates high-order folio demand and hence
>
> What GFP flags are typically used for file-based workloads?

Mostly GFP_KERNEL, with a mix of GFP_NOFS. Non-blocking paths also
tend to add GFP_NOWAIT, and memory reclaim sensitive paths often use
__GFP_MEMALLOC to prevent reclaim recursion. Some filesystems also
make extensive use of GFP_NOFAIL (e.g. XFS).

> > compaction (e.g. copy tens of GB of files between two XFS, ext4 or
> > btrfs filesystems), that will -also- trash the Xe working set via
> > the shrinker being hammered by memory compaction trying to free up
> > contiguous pages for the page cache.
> >
>
> I could see this.
>
> > Similarly, if we run a Xe workload that generates sustained high
> > order folio demand, that will trash the working set in the dentry
> > and inode caches and any other shrinkable slab-based cache.
> >
>
> I could also see this, but DRM / Xe will set __GFP_NORETRY or
> __GFP_RETRY_MAYFAIL on higher orders, so those caches should be able to
> avoid trashing their working sets if they looked for this hint.

This relies on all the allocation code everywhere always doing
exactly the right thing so that memory reclaim "behaves". That is
what I've been saying is not a sustainable approach - all it takes
is one allocation or one shrinker not to do the right thing, and
we've got another mole to whack.

i.e. memory allocation should do the right/best thing for the system
with default parameters.

> > Hence the abstracted case of the problem we need to solve is this:
> > shrinker reclaim based on x-byte objects is extremely unlikely to
> > achieve contiguous page reclaim in a manner that will measurably
> > improve compaction rates.
> >
> > This is a problem that has to be addressed at the high level
> > infrastructure level, not worked around by individual shrinkers.
> >
> > IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
> > specifically flagged as being able to release contiguous pages of
> > memory in short order. I don't think there's very many shrinkable
> > caches that even hold a significant quantity of objects larger than
> > a single page, so it's clearly questionable as to whether compaction
> > based reclaim should run shrinker reclaim to begin with.
> >
>
> Yes, we sort of do this in Xe by changing '->count_objects' based on the
> hint.

I know. That's the problem - it's relying on the infrastructure
passing down a specific internal context hint in an existing
interface so a specific subsystem can work around a specific
problematic behaviour.

Indeed, for compaction we don't actually care about the count, what
we largely care about is whether the subsystem has any objects the
same size or larger than the current compaction demand. Efficient
object reclaim for compaction has a different control variable set
(e.g. find objects larger than, objects physically near to, etc),
and this can't really be properly fitted into the existing
count/scan shrinker reclaim algorithm. Hence I think it needs new
shrinker methods to implement effectively.

> > i.e. a subsystem that can track high order folios in a shrinkable
> > cache should probably have a "->compaction_scan()" method that is
> > run directly from compaction context to try to free high order
>
> When you say “compaction context,” which parts of the code are you
> referring to? I’d like to explore this option, but I need a bit more
> context.

kcompactd does background compaction, similar to how we have kswapd
to do background memory reclaim.

Direct compaction (part of direct reclaim) is done via
__alloc_pages_direct_compact(), which will be called before direct
memory reclaim in the case of a high-order allocation.

>
> > folios. This provides a direct opt-in mechanism for a subsystem, and
> > it allows subsystems that can track low- and high- order objects
> > independently to efficiently free objects in a way that will help
> > improve compaction rates without impacting the entire working set of
> > objects in the cache.
> >
> Does this help if, for example, the cache is holding onto two order-8
> folios that could be freed and merged, while the caller really wants an
> order-9 folio? This seems like a possible scenario in caches and is
> certainly true in TTM pools.

Depends on how the interface is implemented. IIUC, the direct
compaction code will return a right-sized page early if it creates
one via compact_zone(). Hence if that path can call into shrinkers
to do high-order scanning that results in two mergeable order-8
objects being freed and merged into an order-9 object that fulfils
the compaction requirements, then it will result in compaction
succeeding where it currently fails.

And I think that kcompactd will run until certain watermarks are
met, so again having a high-order shrinker that directly impacts the
high order page watermarks would be much more efficient than trying
to use general memory pressure to randomly shoot down enough objects
to reform contiguous pages.

> > IOWs, this patch to inform kswapd about its trigger (doesn't it
> > already have a "reason" parameter, though?) is likely a necessary
> > part of the solution - we don't want kswapd running shrinkers if it
> > has been triggered to reclaim pages for compaction. This patch would
> > allow kswapd to elide normal shrinker passes when it has been woken
> > purely for compaction purposes. Given that the compaction code would
> > be running the high-order reclaim capable shrinkers itself, this
> > would avoid trashing the working set of most shrinkable caches -by
> > default- under high order allocation demand....
> >
>
> I’m trying to parse this - are you suggesting that, one way or another, we
> introduce a heuristic where shrinkers can act on a hint (whether it’s
> what I have here or a new ->compaction_scan() vfunc), and then attempt
> to fix all shrinkers in this series?

I don't want existing shrinkers to be touched at all.

What I want is for memory reclaim (both direct and kswapd) to elide
the shrink_slab() calls into shrinkers when memory reclaim is being
driven by high order allocation failure. i.e. high-order allocation
failure should not generate shrinkable cache memory pressure because
shrinkable caches in general cannot return contiguous memory that
will allow compaction to make progress. The existing behaviour has
more negative effects on system performance than positive, so we
need a fix for "everyone".

I think we should provide a new opt-in ->compaction_scan() method
for compaction aware subsystem shrinkers that is run from
compact_zone() context. This allows subsystems that can manage high
order objects to optimise the return of high order objects to the
free space pool, thereby significantly improving the chance for
compaction to succeed without adversely impacting the rest of the
shrinkable caches in the system.

Further, we should not kick kswapd because of compaction failures
because kcompactd will already be running ->compaction_scan()
capable shrinkers from its callouts to compact_zone() in the
background that will do this work as efficiently as possible.

> I’m open to trying to fix other
> shrinkers as well. Do you have any particular ones in mind? I count
> around 45 shrinkers in Linux, so it’s unlikely I can fix every single
> one, though or all shrinkers need to be fixed.

They'd all need to be fixed, which is why I suggested a new method
to be added. Avoid calling the existing shrinkers in the adverse
situation, call the new one from the right context where it actually
benefits compaction and high-order memory allocation.

> On a side note, I just noticed that struct shrinker has count_objects
> and scan_objects as individual vfuncs rather than using a const struct
> shrinker_ops *ops. Should we change that? The latter seems cleaner and
> is typically how things are done in Linux.

We probably should - the current structure is largely historical and
there's only ever been two methods. If we are adding another method,
then it would probably make sense to add an external ops structure
to reduce the memory footprint a little.

-Dave.
--
Dave Chinner
dgc@kernel.org
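The shape of the proposal in the message above can be sketched in userspace as follows. This is only a model under stated assumptions: `struct toy_shrinker_ops`, `toy_reclaim()`, `toy_compaction_callout()` and the other `toy_*` names are hypothetical and exist neither in mainline nor in this series. The sketch shows just the two dispatch rules being discussed, namely that ordinary reclaim skips shrinkers when driven by a high-order failure, and that the compaction path calls only shrinkers that opted in with a `compaction_scan` method.

```c
#include <assert.h>
#include <stddef.h>

struct toy_shrinker;

/* Hypothetical external ops table with an optional third method. */
struct toy_shrinker_ops {
	unsigned long (*count_objects)(struct toy_shrinker *s);
	unsigned long (*scan_objects)(struct toy_shrinker *s,
				      unsigned long nr_to_scan);
	/* Opt-in: free objects of at least `order`; NULL if unsupported. */
	unsigned long (*compaction_scan)(struct toy_shrinker *s,
					 unsigned int order);
};

struct toy_shrinker {
	const struct toy_shrinker_ops *ops;
	unsigned long nr_small;		/* order-0 objects (working set) */
	unsigned long nr_large;		/* objects of 1 << large_order pages */
	unsigned int large_order;
};

static unsigned long toy_count(struct toy_shrinker *s)
{
	return s->nr_small;
}

/* Working-set reclaim: frees up to nr_to_scan order-0 objects. */
static unsigned long toy_scan(struct toy_shrinker *s, unsigned long nr_to_scan)
{
	unsigned long n = nr_to_scan < s->nr_small ? nr_to_scan : s->nr_small;

	s->nr_small -= n;
	return n;			/* pages freed */
}

/* Directed reclaim: only gives up objects big enough for the demand. */
static unsigned long toy_pool_compaction_scan(struct toy_shrinker *s,
					      unsigned int order)
{
	unsigned long pages;

	if (s->large_order < order)
		return 0;
	pages = s->nr_large << s->large_order;
	s->nr_large = 0;
	return pages;
}

/* A slab-like cache: no compaction_scan, so compaction never touches it. */
static const struct toy_shrinker_ops toy_slab_ops = {
	.count_objects = toy_count,
	.scan_objects = toy_scan,
};

/* A pool-like cache that tracks high-order objects separately. */
static const struct toy_shrinker_ops toy_pool_ops = {
	.count_objects = toy_count,
	.scan_objects = toy_scan,
	.compaction_scan = toy_pool_compaction_scan,
};

/* Reclaim path: elide all shrinker calls for high-order driven reclaim. */
static unsigned long toy_reclaim(struct toy_shrinker **v, size_t n,
				 unsigned int order, unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	if (order > 0)
		return 0;
	for (size_t i = 0; i < n; i++)
		if (v[i]->ops->count_objects(v[i]))
			freed += v[i]->ops->scan_objects(v[i], nr_to_scan);
	return freed;
}

/* Compaction path: call only the shrinkers that opted in. */
static unsigned long toy_compaction_callout(struct toy_shrinker **v, size_t n,
					    unsigned int order)
{
	unsigned long freed = 0;

	for (size_t i = 0; i < n; i++)
		if (v[i]->ops->compaction_scan)
			freed += v[i]->ops->compaction_scan(v[i], order);
	return freed;
}
```

In this model an order-9 demand leaves every working set untouched on the reclaim path, and the pool-like shrinker hands back only its high-order objects when called from the compaction path. Order-0 reclaim behaves as it does today. How a real `->compaction_scan()` would be parameterised (size, physical locality, and so on) is left open in the thread, so the single `order` argument here is a placeholder.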
* [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint
  2026-06-17  3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
  2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
@ 2026-06-17  3:22 ` Matthew Brost
From: Matthew Brost @ 2026-06-17  3:22 UTC
To: linux-mm, linux-kernel, intel-xe, dri-devel
Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Thomas Hellström

Xe/TTM backup reclaim can be extremely expensive under fragmentation
pressure as reclaim may migrate or destroy actively used GPU working
sets despite the system still having substantial free memory available.

Under high-order opportunistic reclaim, repeatedly backing up GPU memory
can lead to reclaim/rebind ping-pong behavior where active GPU working
sets are continuously torn down and reconstructed without materially
improving allocation success.

Use the new shrink_control::opportunistic_compaction hint to avoid Xe
backup reclaim during fragmentation-driven high-order reclaim attempts.
In this mode the shrinker skips advertising backup-backed reclaimable
memory and avoids initiating backup operations entirely.

Order-0 and non-opportunistic reclaim behavior remain unchanged, so Xe
backup reclaim still participates normally during genuine memory
pressure.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_shrinker.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_shrinker.c b/drivers/gpu/drm/xe/xe_shrinker.c
index 83374cd57660..198149f266c6 100644
--- a/drivers/gpu/drm/xe/xe_shrinker.c
+++ b/drivers/gpu/drm/xe/xe_shrinker.c
@@ -139,10 +139,17 @@ static unsigned long
 xe_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
-	unsigned long num_pages;
+	unsigned long num_pages = 0;
 	bool can_backup = !!(sc->gfp_mask & __GFP_FS);
 
-	num_pages = ttm_backup_bytes_avail() >> PAGE_SHIFT;
+	/*
+	 * Skip accounting backup-able pages when this is an opportunistic
+	 * high-order pass: TTM backup work shrinks at native page granularity
+	 * and is unlikely to produce the contiguous block the caller wants,
+	 * so don't advertise it as reclaimable for this hint.
+	 */
+	if (!sc->opportunistic_compaction)
+		num_pages = ttm_backup_bytes_avail() >> PAGE_SHIFT;
 	read_lock(&shrinker->lock);
 
 	if (can_backup)
@@ -233,7 +240,14 @@ static unsigned long xe_shrinker_scan(struct shrinker *shrink, struct shrink_con
 	}
 
 	sc->nr_scanned = nr_scanned;
-	if (nr_scanned >= nr_to_scan || !can_backup)
+	/*
+	 * Stop after the purge pass for opportunistic high-order reclaim:
+	 * the subsequent backup/writeback pass works at native page order
+	 * and is unlikely to free a contiguous high-order block, so doing
+	 * it here would just churn working sets for no compaction benefit.
+	 */
+	if (nr_scanned >= nr_to_scan || !can_backup ||
+	    sc->opportunistic_compaction)
 		goto out;
 
 	/* If we didn't wake before, try to do it now if needed. */
--
2.34.1
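For readers following the hint that the patch above consumes, the rule described in the cover letter (the hint is set only when every waker of a kswapd run is a failable high-order allocation, and any order-0 or non-failable waker clears it for that run) can be modelled in userspace roughly as below. Everything here is an assumption made for illustration: the `TOY_GFP_*` bits are not the kernel's gfp bit values, and `enum toy_oc_state` is a stand-in for the series' tri-state, whose real enumerators are not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag bits; these are NOT the kernel's gfp bit values. */
#define TOY_GFP_NORETRY		(1u << 0)
#define TOY_GFP_RETRY_MAYFAIL	(1u << 1)
#define TOY_GFP_NOFAIL		(1u << 2)

/* Hypothetical per-run tri-state folded from the wakers seen so far. */
enum toy_oc_state {
	TOY_OC_UNSET,	/* no waker seen for this run yet */
	TOY_OC_YES,	/* every waker so far was failable and high-order */
	TOY_OC_NO,	/* some waker was order-0 or not allowed to fail */
};

/* A waker is opportunistic if it is high-order and prefers to fail. */
static bool toy_waker_is_opportunistic(unsigned int gfp, unsigned int order)
{
	assert(order < 32);	/* keep the toy model's orders sane */
	if (order == 0 || (gfp & TOY_GFP_NOFAIL))
		return false;
	return (gfp & (TOY_GFP_NORETRY | TOY_GFP_RETRY_MAYFAIL)) != 0;
}

/* Fold one waker into the run state; TOY_OC_NO is sticky for the run. */
static enum toy_oc_state toy_fold_waker(enum toy_oc_state cur,
					unsigned int gfp, unsigned int order)
{
	if (cur == TOY_OC_NO || !toy_waker_is_opportunistic(gfp, order))
		return TOY_OC_NO;
	return TOY_OC_YES;
}

/* Value a shrinker would see in the hint field for this run. */
static bool toy_hint(enum toy_oc_state state)
{
	return state == TOY_OC_YES;
}
```

In this model, a run woken only by high-order `NORETRY` or `RETRY_MAYFAIL` allocations reports the hint, and a single order-0 waker clears it for the rest of the run, so a shrinker that checks the hint (as `xe_shrinker_count()` and `xe_shrinker_scan()` do in the diff) behaves normally under genuine memory pressure.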