[PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation


* [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-06-17  3:22 Matthew Brost
  2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
  2026-06-17  3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost
  0 siblings, 2 replies; 6+ messages in thread
From: Matthew Brost @ 2026-06-17  3:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel, intel-xe, dri-devel
  Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner,
	Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström,
	Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Daniel Colascione, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

Continuation of [1].

TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:

kswapd → shrinker → eviction → rebind (exec ioctl) → repeat

In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.

This issue was first reported in [2] and independently observed
internally and by Google.

A simple reproducer is:

- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish

Under this workload, ftrace shows a continuous loop of:

xe_shrinker_scan (kswapd)
xe_vma_rebind_exec

Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).

At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:

Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0

This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.

This series addresses the issue in two layers:

MM: Introduce an opportunistic_compaction hint in shrink_control.
kswapd folds the gfp flags of its wakers into a per-pgdat tri-state
(see enum kswapd_opportunistic_compaction_type) and forwards it to
shrinkers. The hint is set when every waker for a kswapd run is a
failable high-order allocation (__GFP_NORETRY or __GFP_RETRY_MAYFAIL,
without __GFP_NOFAIL) — i.e. callers that would rather see the
allocation fail than have working sets torn down to satisfy it. Any
order-0 or non-failable waker clears the hint for that run, so normal
memory pressure is unaffected. Similarly, direct reclaim sets the
opportunistic_compaction hint based on the caller's gfp_mask and order.

Xe: Consume shrink_control::opportunistic_compaction in the Xe
shrinker. When the hint is set for a high-order pass, the shrinker
skips advertising and performing TTM backup work — which operates at
native page order and would not help compaction — and avoids tearing
down active GPU working sets.

With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.

Buddyinfo after applying this series shows restored higher-order
availability:

Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1

In addition, various 3D benchmarks show significant improvement when
memory is fragmented.

v2:
 - Layer with core MM / TTM helpers (Thomas)
v4:
 - Fix build (CI)
v5:
 - Use shrinker based heuristics (Dave Chinner, Thomas's GFP idea)
 - Rename lazy_compaction → opportunistic_compaction
v6:
 - Drop order in shrink_control; rely only on the
   opportunistic_compaction hint (Testing)
 - Set opportunistic_compaction in direct reclaim (Testing)
 - Drop unrelated TTM which merged independently

[1] https://patchwork.freedesktop.org/series/165329/
[2] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Matthew Brost (2):
  mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  drm/xe: Make use of shrink_control::opportunistic_compaction hint

 drivers/gpu/drm/xe/xe_shrinker.c | 20 ++++++-
 include/linux/mmzone.h           | 40 ++++++++++++++
 include/linux/shrinker.h         | 20 +++++++
 mm/internal.h                    |  2 +-
 mm/shrinker.c                    | 13 +++--
 mm/vmscan.c                      | 95 +++++++++++++++++++++++++++++---
 6 files changed, 174 insertions(+), 16 deletions(-)

-- 
2.34.1


* [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-17  3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
@ 2026-06-17  3:22 ` Matthew Brost
  2026-06-22 23:10   ` Dave Chinner
  2026-06-17  3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost
  1 sibling, 1 reply; 6+ messages in thread
From: Matthew Brost @ 2026-06-17  3:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel, intel-xe, dri-devel
  Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
are often opportunistic attempts to satisfy fragmentation-sensitive
allocations rather than indications of severe memory pressure. In these
cases, reclaim may invoke shrinkers that aggressively destroy working
sets even though reclaim is unlikely to materially improve the
allocation outcome.

Some shrinkers manage expensive backing or migration operations where
reclaim can result in substantial working set disruption despite the
system having sufficient free memory overall. This is particularly
visible in fragmentation-heavy workloads where reclaim repeatedly tears
down active state while kswapd attempts to satisfy higher-order
allocations.

Introduce an opportunistic_compaction hint in shrink_control that allows
kswapd to communicate when reclaim originates from a high-order
allocation context that may be fragmentation driven rather than true
memory pressure. Shrinkers may use this hint to avoid destructive
working set reclaim while still participating normally during order-0
or stronger reclaim conditions.

The hint is propagated through shrink_slab() and derived from
high-order kswapd wakeups, or from the direct reclaim gfp mask, when
the allocation context is allowed to fail (__GFP_NORETRY or
__GFP_RETRY_MAYFAIL, without __GFP_NOFAIL).

No functional changes are introduced for existing shrinkers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---

Dave Chinner — I’d appreciate feedback on this approach, as you NACKed
an earlier revision where a similar heuristic was implemented in the
drivers. It is now baked into the shrinker, as you suggested.

---
 include/linux/mmzone.h   | 40 +++++++++++++++++
 include/linux/shrinker.h | 20 +++++++++
 mm/internal.h            |  2 +-
 mm/shrinker.c            | 13 ++++--
 mm/vmscan.c              | 95 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 157 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..1afc51018355 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1461,6 +1461,39 @@ struct memory_failure_stats {
 };
 #endif
 
+/*
+ * Per-pgdat state machine for the kswapd "opportunistic compaction" hint.
+ *
+ * wakeup_kswapd() collapses the gfp flags of all wakers that arrive between
+ * two kswapd runs into a single tri-state, which kswapd then forwards to the
+ * shrinkers via shrink_control::opportunistic_compaction:
+ *
+ *   KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION
+ *	Initial state after kswapd consumes the previous value. No waker has
+ *	been observed yet for the upcoming run.
+ *
+ *   KSWAPD_NO_OPPORTUNISTIC_COMPACTION
+ *	At least one waker is an order-0 allocation, or a high-order
+ *	allocation that cannot tolerate failure (i.e., not eligible for
+ *	opportunistic behaviour). Shrinkers must do their normal best-effort
+ *	work; the hint is cleared.
+ *
+ *   KSWAPD_OPPORTUNISTIC_COMPACTION
+ *	All wakers seen so far are high-order allocations that may fail
+ *	(__GFP_NORETRY or __GFP_RETRY_MAYFAIL, without __GFP_NOFAIL). Shrinkers
+ *	may skip work that is unlikely to produce a contiguous high-order
+ *	block (e.g., evicting working-set pages).
+ *
+ * The state is sticky in the "NO" direction within a single kswapd run: once
+ * any non-eligible waker is observed, subsequent eligible wakers cannot
+ * upgrade it back to KSWAPD_OPPORTUNISTIC_COMPACTION.
+ */
+enum kswapd_opportunistic_compaction_type {
+	KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION = 0,
+	KSWAPD_NO_OPPORTUNISTIC_COMPACTION,
+	KSWAPD_OPPORTUNISTIC_COMPACTION,
+};
+
 /*
  * On NUMA machines, each NUMA node would have a pg_data_t to describe
  * it's memory layout. On UMA machines there is a single pglist_data which
@@ -1525,6 +1558,13 @@ typedef struct pglist_data {
 #endif
 	struct task_struct *kswapd;	/* Protected by kswapd_lock */
 	int kswapd_order;
+	/*
+	 * Aggregated opportunistic-compaction hint for the next kswapd run.
+	 * Updated by wakeup_kswapd() based on the gfp flags / order of each
+	 * waker, and consumed (and reset) by kswapd before balance_pgdat().
+	 * See enum kswapd_opportunistic_compaction_type for the state machine.
+	 */
+	atomic_t kswapd_opportunistic_compaction;
 	enum zone_type kswapd_highest_zoneidx;
 
 	atomic_t kswapd_failures;	/* Number of 'reclaimed == 0' runs */
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..5f3e8dc98129 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -37,6 +37,26 @@ struct shrink_control {
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
 
+	/*
+	 * Opportunistic compaction hint.
+	 *
+	 * Set by the reclaim path to tell shrinkers that this pass is
+	 * driven by an order > 0 allocation that the caller is willing to
+	 * have fail (e.g., __GFP_NORETRY / __GFP_RETRY_MAYFAIL without
+	 * __GFP_NOFAIL). Such allocations only really benefit from
+	 * shrinking when doing so frees up a contiguous, high-order block;
+	 * thrashing working sets in the hope of producing one is typically
+	 * counter-productive.
+	 *
+	 * Shrinkers that can produce naturally-aligned high-order folios
+	 * (see shrink_control::order) should treat this as a hint to skip
+	 * costly work that is unlikely to help compaction (for example,
+	 * evicting hot/working-set pages just to free single pages).
+	 *
+	 * Only meaningful when @order > 0; ignored otherwise.
+	 */
+	bool opportunistic_compaction;
+
 	/*
 	 * How many objects scan_objects should scan and try to reclaim.
 	 * This is reset before every call, so it is safe for callees
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..cc915f04cf1e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1760,7 +1760,7 @@ void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 
 /* shrinker related functions */
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority);
+			  int priority, bool opportunistic_compaction);
 
 int shmem_add_to_page_cache(struct folio *folio,
 			    struct address_space *mapping,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 76b3f750cf65..2cf8f3a157f9 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -467,7 +467,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 #ifdef CONFIG_MEMCG
 static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
-			struct mem_cgroup *memcg, int priority)
+			struct mem_cgroup *memcg, int priority, bool opportunistic_compaction)
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
@@ -529,6 +529,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				.gfp_mask = gfp_mask,
 				.nid = nid,
 				.memcg = memcg,
+				.opportunistic_compaction = opportunistic_compaction,
 			};
 			struct shrinker *shrinker;
 			int shrinker_id = calc_shrinker_id(index, offset);
@@ -588,7 +589,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 }
 #else /* !CONFIG_MEMCG */
 static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
-			struct mem_cgroup *memcg, int priority)
+			struct mem_cgroup *memcg, int priority,
+			bool opportunistic_compaction)
 {
 	return 0;
 }
@@ -600,6 +602,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  * @nid: node whose slab caches to target
  * @memcg: memory cgroup whose slab caches to target
  * @priority: the reclaim priority
+ * @opportunistic_compaction: do compaction opportunistically (e.g., do not swap working sets)
  *
  * Call the shrink functions to age shrinkable caches.
  *
@@ -615,7 +618,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  * Returns the number of reclaimed slab objects.
  */
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority)
+			  int priority, bool opportunistic_compaction)
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
@@ -628,7 +631,8 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 	 * oom.
 	 */
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
-		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+		return shrink_slab_memcg(gfp_mask, nid, memcg, priority,
+					 opportunistic_compaction);
 
 	/*
 	 * lockless algorithm of global shrink.
@@ -657,6 +661,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 			.gfp_mask = gfp_mask,
 			.nid = nid,
 			.memcg = memcg,
+			.opportunistic_compaction = opportunistic_compaction,
 		};
 
 		if (!shrinker_try_get(shrinker))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..c40bc3f9ddd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -96,6 +96,14 @@ struct scan_control {
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
 
+	/*
+	 * Opportunistic compaction hint snapshotted from the pgdat at the
+	 * start of this reclaim pass. Forwarded to shrinkers through
+	 * shrink_control::opportunistic_compaction so they can skip
+	 * non-productive work for failable high-order allocations.
+	 */
+	enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction;
+
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -200,6 +208,29 @@ struct scan_control {
  */
 int vm_swappiness = 60;
 
+/*
+ * Is @gfp_flags a high-order allocation that is eligible for the
+ * "opportunistic compaction" treatment in kswapd / shrinkers?
+ *
+ * The caller must be willing to tolerate failure (__GFP_NORETRY or
+ * __GFP_RETRY_MAYFAIL) and must not have set __GFP_NOFAIL. For such
+ * allocations there is little value in burning working-set pages just to
+ * scrape together a single high-order block: if compaction can't easily
+ * succeed, the caller would rather see the allocation fail.
+ */
+static bool gfp_opportunistic_compaction(gfp_t gfp_flags)
+{
+	return (gfp_flags & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
+		!(gfp_flags & __GFP_NOFAIL);
+}
+
+static bool sc_opportunistic_compaction(struct scan_control *sc)
+{
+	return sc->order && (sc->kswapd_opportunistic_compaction ==
+		KSWAPD_OPPORTUNISTIC_COMPACTION || (!current_is_kswapd() &&
+		 gfp_opportunistic_compaction(sc->gfp_mask)));
+}
+
 #ifdef CONFIG_MEMCG
 
 /* Returns true for reclaim through cgroup limits or cgroup interfaces. */
@@ -412,7 +443,7 @@ static unsigned long drop_slab_node(int nid)
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0, false);
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
 	return freed;
@@ -5053,6 +5084,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	bool opportunistic_compaction = sc_opportunistic_compaction(sc);
 
 	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
 	if (mem_cgroup_below_min(NULL, memcg))
@@ -5068,7 +5100,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	success = try_to_shrink_lruvec(lruvec, sc);
 
-	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority,
+		    opportunistic_compaction);
 
 	if (!sc->proactive)
 		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
@@ -6134,6 +6167,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
 		unsigned long scanned;
+		bool opportunistic_compaction = sc_opportunistic_compaction(sc);
 
 		/*
 		 * This loop can become CPU-bound when target memcgs
@@ -6171,7 +6205,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_lruvec(lruvec, sc);
 
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
-			    sc->priority);
+			    sc->priority, opportunistic_compaction);
 
 		/* Record the group's reclaim efficiency */
 		if (!sc->proactive)
@@ -7104,8 +7138,14 @@ clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
  * found to have free_pages <= high_wmark_pages(zone), any page in that zone
  * or lower is eligible for reclaim until at least one usable zone is
  * balanced.
+ *
+ * @kswapd_opportunistic_compaction is the aggregated hint produced by
+ * wakeup_kswapd() for this run; it is propagated into scan_control so that
+ * shrinkers can skip costly work that is unlikely to help compaction when
+ * all wakers are failable high-order allocations.
  */
-static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx,
+			 enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction)
 {
 	int i;
 	unsigned long nr_soft_reclaimed;
@@ -7119,6 +7159,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
+		.kswapd_opportunistic_compaction = kswapd_opportunistic_compaction,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -7338,8 +7379,10 @@ static enum zone_type kswapd_highest_zoneidx(pg_data_t *pgdat,
 	return curr_idx == MAX_NR_ZONES ? prev_highest_zoneidx : curr_idx;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
-				unsigned int highest_zoneidx)
+static void
+kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+		    unsigned int highest_zoneidx,
+		    enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -7385,6 +7428,11 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 
 			if (READ_ONCE(pgdat->kswapd_order) < reclaim_order)
 				WRITE_ONCE(pgdat->kswapd_order, reclaim_order);
+
+			if (kswapd_opportunistic_compaction ==
+			    KSWAPD_NO_OPPORTUNISTIC_COMPACTION)
+				atomic_set(&pgdat->kswapd_opportunistic_compaction,
+					   KSWAPD_NO_OPPORTUNISTIC_COMPACTION);
 		}
 
 		finish_wait(&pgdat->kswapd_wait, &wait);
@@ -7441,6 +7489,7 @@ static int kswapd(void *p)
 	unsigned int highest_zoneidx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t *)p;
 	struct task_struct *tsk = current;
+	enum kswapd_opportunistic_compaction_type kswapd_opportunistic_compaction;
 
 	/*
 	 * Tell the memory management that we're a "memory allocator",
@@ -7458,6 +7507,8 @@ static int kswapd(void *p)
 	set_freezable();
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
+	atomic_set(&pgdat->kswapd_opportunistic_compaction,
+		   KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
 	atomic_set(&pgdat->nr_writeback_throttled, 0);
 	for ( ; ; ) {
@@ -7466,13 +7517,18 @@ static int kswapd(void *p)
 		alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
 		highest_zoneidx = kswapd_highest_zoneidx(pgdat,
 							highest_zoneidx);
+		kswapd_opportunistic_compaction =
+			atomic_read(&pgdat->kswapd_opportunistic_compaction);
 
 kswapd_try_sleep:
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
-					highest_zoneidx);
+				    highest_zoneidx, kswapd_opportunistic_compaction);
 
 		/* Read the new order and highest_zoneidx */
 		alloc_order = READ_ONCE(pgdat->kswapd_order);
+		kswapd_opportunistic_compaction =
+			atomic_xchg(&pgdat->kswapd_opportunistic_compaction,
+				    KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION);
 		highest_zoneidx = kswapd_highest_zoneidx(pgdat,
 							highest_zoneidx);
 		WRITE_ONCE(pgdat->kswapd_order, 0);
@@ -7499,7 +7555,8 @@ static int kswapd(void *p)
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, highest_zoneidx,
 						alloc_order);
 		reclaim_order = balance_pgdat(pgdat, alloc_order,
-						highest_zoneidx);
+						highest_zoneidx,
+						kswapd_opportunistic_compaction);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
@@ -7537,6 +7594,28 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 	if (READ_ONCE(pgdat->kswapd_order) < order)
 		WRITE_ONCE(pgdat->kswapd_order, order);
 
+	/*
+	 * Fold this waker into the per-pgdat opportunistic-compaction hint
+	 * that kswapd will pick up at the start of its next run.
+	 *
+	 * The state is sticky in the "NO" direction: once any waker in this
+	 * batch is order-0 or a non-failable high-order allocation, the hint
+	 * stays cleared until kswapd consumes it. Only when every waker so
+	 * far is a failable high-order allocation do we set
+	 * KSWAPD_OPPORTUNISTIC_COMPACTION, asking shrinkers to skip work
+	 * that won't realistically help compaction.
+	 */
+	if (atomic_read(&pgdat->kswapd_opportunistic_compaction) !=
+	    KSWAPD_NO_OPPORTUNISTIC_COMPACTION) {
+		if (!order || !gfp_opportunistic_compaction(gfp_flags))
+			atomic_set(&pgdat->kswapd_opportunistic_compaction,
+				   KSWAPD_NO_OPPORTUNISTIC_COMPACTION);
+		else if (order && gfp_opportunistic_compaction(gfp_flags))
+			atomic_cmpxchg(&pgdat->kswapd_opportunistic_compaction,
+				       KSWAPD_UNSET_OPPORTUNISTIC_COMPACTION,
+				       KSWAPD_OPPORTUNISTIC_COMPACTION);
+	}
+
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
-- 
2.34.1




* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
@ 2026-06-22 23:10   ` Dave Chinner
  2026-06-23  0:09     ` Matthew Brost
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2026-06-22 23:10 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton,
	Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> are often opportunistic attempts to satisfy fragmentation-sensitive
> allocations rather than indications of severe memory pressure. In these
> cases, reclaim may invoke shrinkers that aggressively destroy working
> sets even though reclaim is unlikely to materially improve the
> allocation outcome.
> 
> Some shrinkers manage expensive backing or migration operations where
> reclaim can result in substantial working set disruption despite the
> system having sufficient free memory overall. This is particularly
> visible in fragmentation-heavy workloads where reclaim repeatedly tears
> down active state while kswapd attempts to satisfy higher-order
> allocations.
> 
> Introduce an opportunistic_compaction hint in shrink_control that allows
> kswapd to communicate when reclaim originates from a high-order
> allocation context that may be fragmentation driven rather than true
> memory pressure. Shrinkers may use this hint to avoid destructive
> working set reclaim while still participating normally during order-0
> or stronger reclaim conditions.

To be honest, this seems like another "push a hint through to the XE
shrinker" mechanism under a different name. You seem so focused on
fixing the XE reproducer that the -systemic problem- that -any-
high-order folio demand causes is not being acknowledged.

e.g. we use high-order folios extensively in the page cache these
days, and there are -many- cases where memory compaction driven by
high-order demand causes significant performance regressions for page
cache performance. To date, every single person who has wanted to
fix the problem they are seeing has effectively attempted to -turn
off compaction- via GFP flags.

I've even done that myself inside XFS to work around kvmalloc()
issues with a lack of GFP_NOFAIL support and doing costly high order
allocations that fail and trigger compaction before falling back to
vmalloc().  However, these issues have since been fixed in the
kvmalloc() code, such that it now does the right thing for most
calling contexts (i.e. tries high-order kmalloc() without triggering
compaction, then falls back to GFP_NOFAIL vmalloc()). This has made
kvmalloc() more performant and better behaved for -all users-, not
just XFS.

This is not sustainable - we need compaction to be robust and
performant in the face of high-order folio demands, regardless of
what subsystem is generating the demand.

So with that in mind, let me paraphrase the comment in the second
patch in the Xe shrinker implementation:

"Shrinker reclaim is based on implementation specific object sizes
so it is unlikely to ever achieve contiguous page reclaim in a
manner that will measurably improve compaction rates."

You also say:

> No functional changes are introduced for existing shrinkers.

Consider how many shrinkable caches the general statement above
applies to, and then think about the fundamental impedance mismatch
between the affected shrinkable caches and what this patch actually
fixes.

For example, what happens to slab-based caches if the XE cache is being
excessively reclaimed under high-order page demand? e.g. the slab-based
cache may have tens of objects per page and holds a system-level
performance critical working set of objects. How do these caches
handle the excessive reclaim demand being generated by compaction
thrashing?

Yup, they don't.

In the case of filesystem caches, the "reclaim and repopulate"
pattern you describe causing the XE perf problems causes internal
slab cache fragmentation. Not only does this not improve compaction
rates, it also results in more memory fragmentation because slab
pages get pinned by a small number of long lived objects and they
won't get freed until the cache is largely emptied.  IOWs, things
get -even worse- from a memory fragmentation POV when compaction
thrashing causes the working set of a high-object-count-per-page
slab cache to thrash....

This isn't isolated to individual subsystem thrashing.  If we run a
file-based workload that generates high-order folio demand and hence
compaction (e.g. copy tens of GB of files between two XFS, ext4 or
btrfs filesystems), that will -also- trash the Xe working set via
the shrinker being hammered by memory compaction trying to free up
contiguous pages for the page cache.

Similarly, if we run a Xe workload that generates sustained high
order folio demand, that will trash the working set in the dentry
and inode caches and any other shrinkable slab-based cache.

Hence the abstracted case of the problem we need to solve is this:
shrinker reclaim based on x-byte objects is extremely unlikely to
achieve contiguous page reclaim in a manner that will measurably
improve compaction rates.

This is a problem that has to be addressed at the high level
infrastructure level, not worked around by individual shrinkers.

IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
specifically flagged as being able to release contiguous pages of
memory in short order. I don't think there's very many shrinkable
caches that even hold a significant quantity of objects larger than
a single page, so it's clearly questionable as to whether compaction
based reclaim should run shrinker reclaim to begin with.

i.e. a subsystem that can track high order folios in a shrinkable
cache should probably have a "->compaction_scan()" method that is
run directly from compaction context to try to free high order
folios. This provides a direct opt-in mechanism for a subsystem, and
it allows subsystems that can track low- and high- order objects
independently to efficiently free objects in a way that will help
improve compaction rates without impacting the entire working set of
objects in the cache.

IOWs, this patch to inform kswapd about its trigger (doesn't it
already have a "reason" parameter, though?) is likely a necessary
part of the solution - we don't want kswapd running shrinkers if it
has been triggered to reclaim pages for compaction. This patch would
allow kswapd to elide normal shrinker passes when it has been woken
purely for compaction purposes. Given that the compaction code would
be running the high-order reclaim capable shrinkers itself, this
would avoid trashing the working set of most shrinkable caches -by
default- under high order allocation demand....

-Dave.
-- 
Dave Chinner
dgc@kernel.org

* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-22 23:10   ` Dave Chinner
@ 2026-06-23  0:09     ` Matthew Brost
  2026-06-23  5:32       ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Brost @ 2026-06-23  0:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton,
	Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

On Tue, Jun 23, 2026 at 09:10:43AM +1000, Dave Chinner wrote:
> On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> > are often opportunistic attempts to satisfy fragmentation-sensitive
> > allocations rather than indications of severe memory pressure. In these
> > cases, reclaim may invoke shrinkers that aggressively destroy working
> > sets even though reclaim is unlikely to materially improve the
> > allocation outcome.
> > 
> > Some shrinkers manage expensive backing or migration operations where
> > reclaim can result in substantial working set disruption despite the
> > system having sufficient free memory overall. This is particularly
> > visible in fragmentation-heavy workloads where reclaim repeatedly tears
> > down active state while kswapd attempts to satisfy higher-order
> > allocations.
> > 
> > Introduce an opportunistic_compaction hint in shrink_control that allows
> > kswapd to communicate when reclaim originates from a high-order
> > allocation context that may be fragmentation driven rather than true
> > memory pressure. Shrinkers may use this hint to avoid destructive
> > working set reclaim while still participating normally during order-0
> > or stronger reclaim conditions.

Thanks for the input - this is a tough problem.

> 
> To be honest, this seems like another "push a hint through to the XE
> shrinker" mechanism under a different name. You seem so focused on
> fixing the XE reproducer that the -systemic problem- that -any-
> high-order folio demand causes is not being acknowleged.
> 

I'm not exactly sure I agree here. Communicating via __GFP_NORETRY or
__GFP_RETRY_MAYFAIL with a higher order implies that the caller can
handle higher-order allocation failures, so the shrinker shouldn’t try
too hard to obtain a large page (e.g., evict a working set). I agree
that Xe is currently the only shrinker making use of this, but other
shrinkers could also hook into it. This information simply isn’t
available today.

> e.g. we use high-order folios extensively in the page cache these
> days, and there are -many- cases where memory compaction driven by
> high-order demand cause significant performance regressions for page
> cache performance. To date, every single person who has wanted to
> fix the problem they are seeing has effectively attempted to -turn
> off compaction- via GFP flags.

So does that mean they clear __GFP_RECLAIM?

That isn't really what we want in DRM or Xe. In the former case we have
pools of lower-order pages in TTM that are not in use and can be shrunk,
potentially freeing multiple lower-order pages so that a higher-order
page can be formed. In the latter we may have BOs (sets of pages) in Xe
marked as purgeable (not in the working set) which can also be shrunk.
Other DRM drivers have purging concepts too.

I’m not very familiar with what other shrinkers or subsystems want, but
presumably other shrinkers have pools or caches that aren’t currently in
use, where they can say, “OK, I’ll give these pages up for opportunistic
compaction, but I won’t give up my working set.” Of course, as mentioned
above, if someone else explicitly requests large pages by avoiding
__GFP_NORETRY and __GFP_RETRY_MAYFAIL, the shrinker should then give up
its working set.

> 
> I've even done that myself inside XFS to work around kvmalloc()
> issues with a lack of GFP_NOFAIL support and doing costly high order
> allocations that fail and trigger compaction before falling back to
> vmalloc().  However, these issues have since been fixed in the
> kvmalloc() code, such that it now does the right thing for most
> calling contexts (i.e. tries high-order kmalloc() without triggering
> compaction, then fall back to GFP_NOFAIL vmalloc()). This has made
> kvmalloc() more performant and better behaved for -all users-, not
> just XFS.
> 
> This is not sustainable - we need compaction to be robust and
> performant in the face of high-order folio demands, regardless of
> what subsystem is generating the demand.
> 
> So with that in mind, let me paraphrase the comment in the second
> patch in the Xe shrinker implementation:
> 
> "Shrinker reclaim is based on implementation specific object sizes
> so it is unlikely to ever acheive contiguous page reclaim in a
> manner that will measurably improve compaction rates."
> 

This might be slightly misworded—what I really mean is that I don’t want
to give up my working set for higher-order allocations that are allowed
to fail, but I do want to give up my cache.

> You also say:
> 
> > No functional changes are introduced for existing shrinkers.
> 
> Consider how many shrinkable caches the general statement above
> applies to, and then think about the fundamental impedence mismatch
> between the affected shrinkable caches and what this patch actually
> fixes.
> 

Yes, as mentioned above, I’m only addressing Xe here, and I agree that
this is likely an issue. Do you know of other shrinkers that have pools
or caches which can be shrunk under the conditions I’m introducing here,
but also have a working set they would prefer not to give up? If so, a
link on elixir.bootlin.com would be helpful so I can take a look. I’ll
also try to go through other shrinkers myself.

> For example, what happens to slab-based caches if the XE cache is being
> excessively reclaimed under high-order page demand? e.g. the slab-based
> cache may have tens of objects per page and holds a system-level
> performance critical working set of objects. How do these caches
> handle the excessive reclaim demand being generated by compaction
> thrashing?
> 
> Yup, they don't.
> 

Agree.

> In the case of filesystem caches, the "reclaim and repopulate"
> pattern you describe causing the XE perf problems causes internal
> slab cache fragmentation. Not only does this not improve compaction
> rates, it also results in more memory fragmentation because slab
> pages get pinned by a small number of long lived objects and they
> won't get freed until the cache is largely emptied.  IOWs, things
> get -even worse- from a memory fragmentation POV when compaction
> thrashing causes the working set of a high-object-count-per-page
> slab cache to thrash....
> 

Got a link to the code which you are referring to?

That seems like a problem similar to another issue in DRM/Xe. We found
that the process of shrinking actually drove fragmentation by splitting
folios down to order-0 and then backing pages up one at a time. I have a
separate fix in flight for that.

Could the filesystem detect these hints and avoid shrinking in a way
that causes fragmentation? Alternatively, could it perform shrinking in
a way that doesn’t shatter folios, or detect long-lived objects so it
understands that shrinking isn’t going to help reduce fragmentation?

> This isn't isolated to individual subsystem thrashing.  If we run a
> file-based workload that generates high-order folio demand and hence

What GFP flags are typically used for file-based workloads?

> compaction (e.g copy tens of GB of files between two XFS, ext4 or
> btrfs filesystems), that will -also- trash the Xe working set via
> the shrinker being hammered by memory compaction try to free up
> contiguous pages for the page cache.
>

I could see this.

> Similarly, if we run a Xe workload that generates sustained high
> order folio demand, that will trash the working set in the dentry
> and inode caches and any other shrinkable slab-based cache.
> 

I could also see this, but DRM / Xe will set __GFP_NORETRY or
__GFP_RETRY_MAYFAIL on higher orders, so those caches should be able to
avoid trashing their working sets if they looked for this hint.

> Hence the abstracted case of the problem we need to solve is this:
> shrinker reclaim is based on x-byte objects is extremely unlikely to
> acheive contiguous page reclaim in a manner that will measurably
> improve compaction rates.
> 
> This is a problem that has to be addresses by the high level
> infrastructure level, not worked around by individual shrinkers.
> 
> IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
> specifically flagged as being able to release contiguous pages of
> memory in short order. I don't think there's very many shrinkable
> caches that even hold a significant quantity of objects larger than
> a single page, so it's clearly questionable as to whether compaction
> based reclaim should run shrinker reclaim to begin with.
> 

Yes, we sort of do this in Xe by changing '->count_objects' based on the
hint.

> i.e. a subsystem that can track high order folios in a shrinkable
> cache should probably have a "->compaction_scan()" method that is
> run directly from compaction context to try to free high order

When you say “compaction context,” which parts of the code are you
referring to? I’d like to explore this option, but I need a bit more
context.

> folios. This provides a direct opt-in mechanism for a subsystem, and
> it allows subsystems that can track low- and high- order objects
> independently to efficiently free objects in a way that will help
> improve compaction rates without impacting the entire working set of
> objects in the cache.

Does this help if, for example, the cache is holding onto two order-8
folios that could be freed and merged, while the caller really wants an
order-9 folio? This seems like a possible scenario in caches and is
certainly true in TTM pools.

> 
> IOWs, this patch to inform kswapd about it's trigger (doesn't it
> already have a "reason" parameter, though?) is likely a necessary
> part of the solution - we don't want kswapd running shrinkers if it
> has been triggered to reclaim pages for compaction. This patch would
> allows kswapd to elide normal shrinker passes when it has been woken
> purely for compaction purposes. Given that the compaction code would
> be running the high-order reclaim capable shrinkers itself, this
> would avoid trashing the working set of most shrinkable caches -by
> default- under high order allocation demand....
>

I’m trying to parse this—are you suggesting that, one way or another, we
introduce a heuristic where shrinkers can act on a hint (whether it’s
what I have here or a new ->compaction_scan() vfunc), and then attempt
to fix all shrinkers in this series? I’m open to trying to fix other
shrinkers as well. Do you have any particular ones in mind? I count
around 45 shrinkers in Linux, so it’s unlikely that I can fix every
single one, or that all shrinkers need to be fixed.

On a side note, I just noticed that struct shrinker has count_objects
and scan_objects as individual vfuncs rather than using a const struct
shrinker_ops *ops. Should we change that? The latter seems cleaner and
is typically how things are done in Linux.

Matt

> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

* Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
  2026-06-23  0:09     ` Matthew Brost
@ 2026-06-23  5:32       ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2026-06-23  5:32 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-mm, linux-kernel, intel-xe, dri-devel, Andrew Morton,
	Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu

On Mon, Jun 22, 2026 at 05:09:33PM -0700, Matthew Brost wrote:
> On Tue, Jun 23, 2026 at 09:10:43AM +1000, Dave Chinner wrote:
> > On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> > > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> > > are often opportunistic attempts to satisfy fragmentation-sensitive
> > > allocations rather than indications of severe memory pressure. In these
> > > cases, reclaim may invoke shrinkers that aggressively destroy working
> > > sets even though reclaim is unlikely to materially improve the
> > > allocation outcome.
> > > 
> > > Some shrinkers manage expensive backing or migration operations where
> > > reclaim can result in substantial working set disruption despite the
> > > system having sufficient free memory overall. This is particularly
> > > visible in fragmentation-heavy workloads where reclaim repeatedly tears
> > > down active state while kswapd attempts to satisfy higher-order
> > > allocations.
> > > 
> > > Introduce an opportunistic_compaction hint in shrink_control that allows
> > > kswapd to communicate when reclaim originates from a high-order
> > > allocation context that may be fragmentation driven rather than true
> > > memory pressure. Shrinkers may use this hint to avoid destructive
> > > working set reclaim while still participating normally during order-0
> > > or stronger reclaim conditions.
> 
> Thanks for the input - this is a tough problem.

Yes, that it is.

> > To be honest, this seems like another "push a hint through to the XE
> > shrinker" mechanism under a different name. You seem so focused on
> > fixing the XE reproducer that the -systemic problem- that -any-
> > high-order folio demand causes is not being acknowleged.
> > 
> 
> I'm not exactly sure I agree here. Communicating via __GFP_NORETRY or
> __GFP_RETRY_MAYFAIL with a higher order implies that the caller can
> handle higher-order allocation failures, so the shrinker shouldn’t try
> too hard to obtain a large page (e.g., evict a working set). I agree
> that Xe is currently the only shrinker making use of this, but other
> shrinkers could also hook into it. This information simply isn’t
> available today.

Right, but "we are doing compaction" isn't information that tells
the subsystem shrinker what it needs to do. "memory compaction is
occurring" isn't a well defined action like "count reclaimable
objects" or "scan N objects and reclaim as many as you can without
blocking".

Directed high order object reclaim should be much more efficient than
trying to use general memory pressure to age out enough objects to
reform contiguous pages. We need to help memory compaction, and we
can't really do that by layering heuristics over reclaim algorithms
designed to maintain working sets efficiently.

> > e.g. we use high-order folios extensively in the page cache these
> > days, and there are -many- cases where memory compaction driven by
> > high-order demand cause significant performance regressions for page
> > cache performance. To date, every single person who has wanted to
> > fix the problem they are seeing has effectively attempted to -turn
> > off compaction- via GFP flags.
> 
> So does that mean they clear __GFP_RECLAIM?

Usually __GFP_DIRECT_RECLAIM, as it's the overhead of direct
compaction that causes the performance problems.

> That isn't really what in DRM or Xe. In former case we have pools of
> lower order pages in TTM not in use that can be shrunk, potentially
> freeing multiple lower orders pages so a higher order page formed, and
> the later possible BOs (sets of pages) in Xe marked as purgable (not is
> in working set) which can also be shrunk. Other DRM drivers have purging
> concepts too.
> 
> I’m not very familiar with what other shrinkers or subsystems want, but
> presumably other shrinkers have pools or caches that aren’t currently in
> use, where they can say, “OK, I’ll give these pages up for opportunistic
> compaction, but I won’t give up my working set.” Of course, as mentioned
> above, if someone else explicitly requests large pages by avoiding
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL, the shrinker should then give up
> its working set.

Most caches are slab-based, so there can be 10s of objects with
different life cycles per page. There is almost no possibility that
shrinker reclaim will free pages without substantial amounts of the
cache being reclaimed.

> > I've even done that myself inside XFS to work around kvmalloc()
> > issues with a lack of GFP_NOFAIL support and doing costly high order
> > allocations that fail and trigger compaction before falling back to
> > vmalloc().  However, these issues have since been fixed in the
> > kvmalloc() code, such that it now does the right thing for most
> > calling contexts (i.e. tries high-order kmalloc() without triggering
> > compaction, then fall back to GFP_NOFAIL vmalloc()). This has made
> > kvmalloc() more performant and better behaved for -all users-, not
> > just XFS.
> > 
> > This is not sustainable - we need compaction to be robust and
> > performant in the face of high-order folio demands, regardless of
> > what subsystem is generating the demand.
> > 
> > So with that in mind, let me paraphrase the comment in the second
> > patch in the Xe shrinker implementation:
> > 
> > "Shrinker reclaim is based on implementation specific object sizes
> > so it is unlikely to ever acheive contiguous page reclaim in a
> > manner that will measurably improve compaction rates."
> > 
> 
> This might be slightly misworded—what I really mean is that I don’t want
> to give up my working set for higher-order allocations that are allowed
> to fail, but I do want to give up my cache.

Right, that's the core of the problem - compaction is the high-order
reclaim trigger, but the existing shrinker infrastructure reclaim is for
the working-set maintenance reclaim algorithm the subsystem uses.

> > You also say:
> > 
> > > No functional changes are introduced for existing shrinkers.
> > 
> > Consider how many shrinkable caches the general statement above
> > applies to, and then think about the fundamental impedence mismatch
> > between the affected shrinkable caches and what this patch actually
> > fixes.
> > 
> 
> Yes, as mentioned above, I’m only addressing Xe here, and I agree that
> this is likely an issue. Do you know of other shrinkers that have pools
> or caches which can be shrunk under the conditions I’m introducing here,
> but also have a working set they would prefer not to give up?

The first that comes to mind is the xfs_buf cache. This cache holds
cached metadata buffers that have different sizes and can each contain
up to 64kB of contiguous pages. The allocation algorithm uses
optimistic large folio allocation, but if that fails it falls back
to vmalloc. The working set is maintained by a prioritised
multi-scan LRU so that more frequently accessed metadata is held
tighter by the cache than less frequently accessed (e.g. btree roots
have higher retention priority than the lowest leaves).

It does not currently track buffer objects by size, but if there was
a benefit to doing so then it could be implemented. I'd much prefer
to have such tracking separate to the working set maintenance,
especially as they will likely need some kind of balancing to
prevent high-order buffers in the working set from being thrashed by
compaction demand....

I know there are other caches that have variable sized objects, but
I'd have to go look at the code to refresh my memory of which ones
they are...

> If so, a
> link on elixir.bootlin.com would be helpful so I can take a look. I’ll
> also try to go through other shrinkers myself.

cscope is your friend.

fs/xfs/xfs_buf.c contains the XFS buffer cache and shrinker
infrastructure, but looking at the code without any understanding of
the filesystem structures or how it interacts with the other XFS
shrinkable caches probably isn't as useful as you might think it
will be. 

> > For example, what happens to slab-based caches if the XE cache is being
> > excessively reclaimed under high-order page demand? e.g. the slab-based
> > cache may have tens of objects per page and holds a system-level
> > performance critical working set of objects. How do these caches
> > handle the excessive reclaim demand being generated by compaction
> > thrashing?
> > 
> > Yup, they don't.
> > 
> 
> Agree.
> 
> > In the case of filesystem caches, the "reclaim and repopulate"
> > pattern you describe causing the XE perf problems causes internal
> > slab cache fragmentation. Not only does this not improve compaction
> > rates, it also results in more memory fragmentation because slab
> > pages get pinned by a small number of long lived objects and they
> > won't get freed until the cache is largely emptied.  IOWs, things
> > get -even worse- from a memory fragmentation POV when compaction
> > thrashing causes the working set of a high-object-count-per-page
> > slab cache to thrash....
> > 
> 
> Got a link to the code which you are referring to?

Do a lore search for "dentry cache defragmentation". You should be
able to find discussions going back to around 2006 on dentry cache
fragmentation and approaches like slab-page based object reclaim to
support internal defragmentation.

The fact that we don't have slab cache defragmentation despite many
years of people wanting such functionality should tell you how
complex the problem is.... :/

> That seems like a problem similar to another issue in DRM/Xe. We found
> that the process of shrinking actually drove fragmentation by splitting
> folios down to order-0 and then backing pages up one at a time. I have a
> separate fix in flight for that.

Possibly, though the life cycle differences I'm talking about can be
a few milliseconds (temporary file) vs weeks (long running database
instance holding its table files open the whole time it is
running).

> Could the filesystem detect these hints and avoid shrinking in a way
> that causes fragmentation?

Not really. The fragmentation problem is caused by physical object
placement in the slab pages at allocation time, not the act of
reclaiming the object.

i.e. we don't know what the expected cache life time of a dentry or
an inode will be when we allocate it, so it just gets allocated in
the next free slot in the current partial slab page. When you get a
mix of dentries that are pinned by open files in long running
applications and dentries for access-once files in the same page, we
end up with reclaim freeing all the object slots that contained
access-once files. However, the pages are still pinned by the
objects for the open files that are in active use.

IOWs, LRU-based reclaim can free >90% of the objects in a cache that
held millions of objects with mixed lifetimes and still not free any
memory at all. There's nothing reclaim can do about it because the
problem is created at allocation time when lifetime is a complete
unknown.

> Alternatively, could it perform shrinking in
> a way that doesn’t shatter folios, or detect long-lived objects so it
> understands that shrinking isn’t going to help reduce fragmentation?

Referenced filesystem objects are not on the LRUs, so the shrinkers
aren't even aware of such long lived objects. And, as per the
"dentry cache defrag" comment above, we can't ask the slab to
reclaim or move objects because we don't track the owners of
external references to the objects themselves.

> 
> > This isn't isolated to individual subsystem thrashing.  If we run a
> > file-based workload that generates high-order folio demand and hence
> 
> What GFP flags are typical used for file-based workloads?

Mostly GFP_KERNEL, with a mix of GFP_NOFS. Non-blocking paths also
tend to add GFP_NOWAIT, and memory reclaim sensitive paths often
use __GFP_MEMALLOC to prevent reclaim recursion. Some filesystems
also make extensive use of GFP_NOFAIL (e.g. XFS).

> > compaction (e.g copy tens of GB of files between two XFS, ext4 or
> > btrfs filesystems), that will -also- trash the Xe working set via
> > the shrinker being hammered by memory compaction try to free up
> > contiguous pages for the page cache.
> >
> 
> I could see this.
> 
> > Similarly, if we run a Xe workload that generates sustained high
> > order folio demand, that will trash the working set in the dentry
> > and inode caches and any other shrinkable slab-based cache.
> > 
> 
> I could also see this but DRM / Xe will set __GFP_NORETRY or
> __GFP_RETRY_MAYFAIL on higer-orders so those caches should be able to
> not trash its working set if looked for this hint. 

This relies on all the allocation code everywhere always doing
exactly the right thing so that memory reclaim "behaves". That is
what I've been saying is not a sustainable approach - all it takes
is one allocation or one shrinker not to do the right thing, and
we've got another mole to whack.  i.e. memory allocation should do
the right/best thing for the system with default parameters.

> > Hence the abstracted case of the problem we need to solve is this:
> > shrinker reclaim is based on x-byte objects is extremely unlikely to
> > acheive contiguous page reclaim in a manner that will measurably
> > improve compaction rates.
> > 
> > This is a problem that has to be addresses by the high level
> > infrastructure level, not worked around by individual shrinkers.
> > 
> > IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
> > specifically flagged as being able to release contiguous pages of
> > memory in short order. I don't think there's very many shrinkable
> > caches that even hold a significant quantity of objects larger than
> > a single page, so it's clearly questionable as to whether compaction
> > based reclaim should run shrinker reclaim to begin with.
> > 
> 
> Yes, sort of do this in Xe by changing '->count_objects' based on the
> hint.

I know. That's the problem - it's relying on the infrastructure
passing down a specific internal context hint in an existing
interface so a specific subsystem can work around a specific
problematic behaviour.

Indeed, for compaction we don't actually care about the count, what
we largely care about is whether the subsystem has any objects the
same size or larger than the current compaction demand. Efficient
object reclaim for compaction has a different control variable set
(e.g. find objects larger than, objects physically near to, etc),
and this can't really be properly fitted into the existing
count/scan shrinker reclaim algorithm.

Hence I think it needs new shrinker methods to implement
effectively.

> > i.e. a subsystem that can track high order folios in a shrinkable
> > cache should probably have a "->compaction_scan()" method that is
> > run directly from compaction context to try to free high order
> 
> When you say “compaction context,” which parts of the code are you
> referring to? I’d like to explore this option, but I need a bit more
> context.

kcompactd does background compaction, similar to how we have kswapd
to do background memory reclaim.

Direct compaction (part of direct reclaim) is done via
__alloc_pages_direct_compact(), which is called before direct
memory reclaim in the case of a high-order allocation.

> 
> > folios. This provides a direct opt-in mechanism for a subsystem, and
> > it allows subsystems that can track low- and high- order objects
> > independently to efficiently free objects in a way that will help
> > improve compaction rates without impacting the entire working set of
> > objects in the cache.
> 
> 
> Does this help if, for example, the cache is holding onto two order-8
> folios that could be freed and merged, while the caller really wants an
> order-9 folio? This seems like a possible scenario in caches and is
> certainly true in TTM pools.

Depends on how the interface is implemented.

IIUC, the direct compaction code will return a right-sized page
early if it creates one via compact_zone(). Hence if that path can
call into shrinkers to do high-order scanning that results in two
mergeable order-8 objects being freed and merged into an order-9
object that fulfils the compaction requirements, then it will result
in compaction succeeding where it currently fails.

And I think that kcompactd will run until certain watermarks are
met, so again having a high-order shrinker that directly impacts the
high order page watermarks would be much more efficient than trying
to use general memory pressure to randomly shoot down enough objects
to reform contiguous pages.

> > IOWs, this patch to inform kswapd about it's trigger (doesn't it
> > already have a "reason" parameter, though?) is likely a necessary
> > part of the solution - we don't want kswapd running shrinkers if it
> > has been triggered to reclaim pages for compaction. This patch would
> > allows kswapd to elide normal shrinker passes when it has been woken
> > purely for compaction purposes. Given that the compaction code would
> > be running the high-order reclaim capable shrinkers itself, this
> > would avoid trashing the working set of most shrinkable caches -by
> > default- under high order allocation demand....
> >
> 
> I’m trying to parse this—are you suggesting that, one way or another, we
> introduce a heuristic where shrinkers can act on a hint (whether it’s
> what I have here or a new ->compaction_scan() vfunc), and then attempt
> to fix all shrinkers in this series?

I don't want existing shrinkers to be touched at all.

What I want is for memory reclaim (both direct and kswapd) to elide
the shrink_slab() calls into shrinkers when memory reclaim is being
driven by high order allocation failure.

i.e. high-order allocation failure should not generate shrinkable
cache memory pressure because shrinkable caches in general cannot
return contiguous memory that will allow compaction to make
progress. The existing behaviour has more negative effects on system
performance than positive, so we need a fix for "everyone".

I think we should provide a new opt-in ->compaction_scan() method for
compaction aware subsystem shrinkers that is run from compact_zone()
context. This allows subsystems that can manage high order objects
to optimise the return of high order objects to the free space pool,
thereby significantly improving the chance for compaction to
succeed without adversely impacting the rest of the shrinkable
caches in the system.

Further, we should not kick kswapd because of compaction failures
because kcompactd will already be running ->compaction_scan()
capable shrinkers from its callouts to compact_zone() in the
background that will do this work as efficiently as possible.

> I’m open to trying to fix other
> shrinkers as well. Do you have any particular ones in mind? I count
> around 45 shrinkers in Linux, so it’s unlikely I can fix every single
> one, though or all shrinkers need to be fixed.

They'd all need to be fixed, which is why I suggested a new method
to be added. Avoid calling the existing shrinkers in the adverse
situation, call the new one from the right context where it actually
benefits compaction and high-order memory allocation.

> On a side note, I just noticed that struct shrinker has count_objects
> and scan_objects as individual vfuncs rather than using a const struct
> shrinker_ops *ops. Should we change that? The latter seems cleaner and
> is typically how things are done in Linux.

We probably should - the current structure is largely historical and
there have only ever been two methods. If we are adding another method,
then it would probably make sense to add an external ops structure
to reduce the memory footprint a little.
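
The footprint argument can be illustrated with a simplified userspace
comparison. The struct layouts below are hypothetical stand-ins, not the
kernel's struct shrinker: only the count_objects and scan_objects names
come from the discussion, compaction_scan is the proposed third method,
and the single "batch" field stands in for the rest of the per-instance
state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical method signature, simplified for the comparison. */
typedef unsigned long (*demo_method)(void *shrinker, void *sc);

/* Layout A: every instance carries all method pointers inline. */
struct demo_shrinker_inline {
	demo_method count_objects;
	demo_method scan_objects;
	demo_method compaction_scan;
	long batch;
};

/* Layout B: methods live in one shared, read-only table. */
struct demo_shrinker_ops {
	demo_method count_objects;
	demo_method scan_objects;
	demo_method compaction_scan;
};

struct demo_shrinker_with_ops {
	const struct demo_shrinker_ops *ops;
	long batch;
};

/* Bytes saved per shrinker instance by moving to the shared table. */
size_t demo_saving_per_instance(void)
{
	assert(sizeof(struct demo_shrinker_inline) >=
	       sizeof(struct demo_shrinker_with_ops));
	return sizeof(struct demo_shrinker_inline) -
	       sizeof(struct demo_shrinker_with_ops);
}
```

The saving applies per instance; the ops table itself is paid for once
per shrinker type.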

-Dave.
-- 
Dave Chinner
dgc@kernel.org


* [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint
  2026-06-17  3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
  2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
@ 2026-06-17  3:22 ` Matthew Brost
  1 sibling, 0 replies; 6+ messages in thread
From: Matthew Brost @ 2026-06-17  3:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel, intel-xe, dri-devel
  Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin,
	Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Thomas Hellström

Xe/TTM backup reclaim can be extremely expensive under fragmentation
pressure as reclaim may migrate or destroy actively used GPU working
sets despite the system still having substantial free memory available.

Under high-order opportunistic reclaim, repeatedly backing up GPU
memory can lead to reclaim/rebind ping-pong behavior where active GPU
working sets are continuously torn down and reconstructed without
materially improving allocation success.

Use the new shrink_control::opportunistic_compaction hint to avoid Xe
backup reclaim during fragmentation-driven high-order reclaim attempts.
In this mode the shrinker skips advertising backup-backed reclaimable
memory and avoids initiating backup operations entirely.

Order-0 and non-opportunistic reclaim behavior remain unchanged, so
Xe backup reclaim still participates normally during genuine memory
pressure.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_shrinker.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_shrinker.c b/drivers/gpu/drm/xe/xe_shrinker.c
index 83374cd57660..198149f266c6 100644
--- a/drivers/gpu/drm/xe/xe_shrinker.c
+++ b/drivers/gpu/drm/xe/xe_shrinker.c
@@ -139,10 +139,17 @@ static unsigned long
 xe_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
-	unsigned long num_pages;
+	unsigned long num_pages = 0;
 	bool can_backup = !!(sc->gfp_mask & __GFP_FS);
 
-	num_pages = ttm_backup_bytes_avail() >> PAGE_SHIFT;
+	/*
+	 * Skip accounting backup-able pages when this is an opportunistic
+	 * high-order pass: TTM backup work shrinks at native page granularity
+	 * and is unlikely to produce the contiguous block the caller wants,
+	 * so don't advertise it as reclaimable for this hint.
+	 */
+	if (!sc->opportunistic_compaction)
+		num_pages = ttm_backup_bytes_avail() >> PAGE_SHIFT;
 	read_lock(&shrinker->lock);
 
 	if (can_backup)
@@ -233,7 +240,14 @@ static unsigned long xe_shrinker_scan(struct shrinker *shrink, struct shrink_con
 	}
 
 	sc->nr_scanned = nr_scanned;
-	if (nr_scanned >= nr_to_scan || !can_backup)
+	/*
+	 * Stop after the purge pass for opportunistic high-order reclaim:
+	 * the subsequent backup/writeback pass works at native page order
+	 * and is unlikely to free a contiguous high-order block, so doing
+	 * it here would just churn working sets for no compaction benefit.
+	 */
+	if (nr_scanned >= nr_to_scan || !can_backup ||
+	    sc->opportunistic_compaction)
 		goto out;
 
 	/* If we didn't wake before, try to do it now if needed. */
-- 
2.34.1




Thread overview: 6+ messages
2026-06-17  3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
2026-06-22 23:10   ` Dave Chinner
2026-06-23  0:09     ` Matthew Brost
2026-06-23  5:32       ` Dave Chinner
2026-06-17  3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost
