* [PATCH v3 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-04-30 18:23 Matthew Brost
2026-04-30 18:23 ` [PATCH v3 1/6] mm: Wire up order in shrink_control Matthew Brost
2026-04-30 18:23 ` [PATCH v3 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
0 siblings, 2 replies; 3+ messages in thread
From: Matthew Brost @ 2026-04-30 18:23 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner,
Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström,
Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:
kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.
This issue was first reported in [1] and independently observed
internally and by Google.
A simple reproducer is:
- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish
Under this workload, ftrace shows a continuous loop of:
xe_shrinker_scan (kswapd)
xe_vma_rebind_exec
Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).
At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:
Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
This corresponds to ~2.8GB of free memory but no free blocks at
order 9 (2MB) or above, indicating severe fragmentation.
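For anyone who wants to check the arithmetic, here is a small userspace
sketch that totals a buddyinfo row. It assumes 4 KiB base pages; the
function names and the PAGE_SIZE_BYTES constant are made up for this
illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages. */
#define PAGE_SIZE_BYTES 4096ULL

/*
 * Total free base pages described by one /proc/buddyinfo row.
 * counts[i] is the number of free blocks of order i, and each
 * order-i block spans 2^i base pages.
 */
uint64_t buddy_free_pages(const uint64_t *counts, size_t n_orders)
{
    uint64_t pages = 0;

    assert(n_orders <= 32);
    for (size_t order = 0; order < n_orders; order++)
        pages += counts[order] << order;
    return pages;
}

/* Number of free blocks at min_order or above. */
uint64_t buddy_blocks_at_or_above(const uint64_t *counts, size_t n_orders,
                                  size_t min_order)
{
    uint64_t blocks = 0;

    for (size_t order = min_order; order < n_orders; order++)
        blocks += counts[order];
    return blocks;
}

/* The Normal-zone row quoted above, orders 0 through 10. */
const uint64_t normal_before[11] = {
    4063, 4595, 3455, 3400, 3139, 2762, 2293, 1655, 643, 0, 0
};
```

For the row above this gives 716081 free pages, roughly 2.7 GiB with
4 KiB pages, and zero blocks at order 9 or higher.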
This series addresses the issue in two ways:
- TTM: Restrict direct reclaim to beneficial_order. Larger allocations
  use __GFP_NORETRY to fail quickly rather than triggering reclaim.
- Xe: Introduce a heuristic in the shrinker to avoid eviction when
  running under kswapd and the system appears memory-rich but
  fragmented.
With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.
Buddyinfo after applying this series shows restored higher-order
availability:
Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1
Matt
v2:
- Layer with core MM / TTM helpers (Thomas)
[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Matthew Brost (6):
mm: Wire up order in shrink_control
mm: Introduce zone_maybe_fragmented_in_shrinker()
drm/ttm: Issue direct reclaim at beneficial_order
drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
drm/xe: Set TTM device beneficial_order to 9 (2M)
drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
drivers/gpu/drm/xe/xe_device.c | 3 ++-
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
include/drm/ttm/ttm_bo.h | 2 ++
include/linux/shrinker.h | 3 +++
include/linux/vmstat.h | 12 ++++++++++
mm/internal.h | 4 ++--
mm/shrinker.c | 11 +++++----
mm/vmscan.c | 7 +++---
10 files changed, 75 insertions(+), 12 deletions(-)
--
2.34.1
* [PATCH v3 1/6] mm: Wire up order in shrink_control
@ 2026-04-30 18:23 ` Matthew Brost
From: Matthew Brost @ 2026-04-30 18:23 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin,
Muchun Song, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm, linux-kernel,
Thomas Hellström
Pass the allocation order through shrink_control so shrinkers have
visibility into the order that triggered reclaim.
This allows shrinkers to implement better heuristics, such as detecting
high-order allocation pressure or fragmentation and avoiding eviction
of working sets when reclaim is invoked from kswapd.
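As a rough illustration of the kind of heuristic this enables, here is a
userspace sketch. Everything in it is hypothetical: struct
mock_shrink_control only mirrors the fields relevant here, and
mock_scan_budget() is an invented policy, not something this patch adds.
In the kernel the from_kswapd input would typically come from
current_is_kswapd().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct shrink_control fields used here. */
struct mock_shrink_control {
    int nid;
    int8_t order;               /* mirrors the new s8 order field */
    unsigned long nr_to_scan;
};

/*
 * Hypothetical policy: return how many objects the shrinker is willing
 * to scan. Order-0 reclaim and direct reclaim are honoured as requested.
 * High-order reclaim driven by kswapd is declined, on the assumption
 * that evicting a working set is unlikely to yield contiguous pages.
 */
unsigned long mock_scan_budget(const struct mock_shrink_control *sc,
                               bool from_kswapd)
{
    assert(sc->order >= 0);

    if (from_kswapd && sc->order > 0)
        return 0;
    return sc->nr_to_scan;
}
```

A real shrinker would likely combine the order with a free-memory check
rather than declining every high-order request from kswapd.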
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
include/linux/shrinker.h | 3 +++
mm/internal.h | 4 ++--
mm/shrinker.c | 11 +++++++----
mm/vmscan.c | 7 ++++---
4 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..7072f693b9be 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -37,6 +37,9 @@ struct shrink_control {
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
+ /* Allocation order we are currently trying to fulfil. */
+ s8 order;
+
/*
* How many objects scan_objects should scan and try to reclaim.
* This is reset before every call, so it is safe for callees
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..ff8671dccf7b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1759,8 +1759,8 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
void __meminit __init_page_from_nid(unsigned long pfn, int nid);
/* shrinker related functions */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
- int priority);
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+ struct mem_cgroup *memcg, int priority);
int shmem_add_to_page_cache(struct folio *folio,
struct address_space *mapping,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 76b3f750cf65..fb23a338fb22 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -466,7 +466,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
}
#ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
struct mem_cgroup *memcg, int priority)
{
struct shrinker_info *info;
@@ -528,6 +528,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .order = order,
.memcg = memcg,
};
struct shrinker *shrinker;
@@ -598,6 +599,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
* shrink_slab - shrink slab caches
* @gfp_mask: allocation context
* @nid: node whose slab caches to target
+ * @order: order of allocation
* @memcg: memory cgroup whose slab caches to target
* @priority: the reclaim priority
*
@@ -614,8 +616,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
*
* Returns the number of reclaimed slab objects.
*/
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
- int priority)
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+ struct mem_cgroup *memcg, int priority)
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
@@ -628,7 +630,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
* oom.
*/
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
- return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+ return shrink_slab_memcg(gfp_mask, nid, order, memcg, priority);
/*
* lockless algorithm of global shrink.
@@ -656,6 +658,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .order = order,
.memcg = memcg,
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..a54d14ecad25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -412,7 +412,7 @@ static unsigned long drop_slab_node(int nid)
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+ freed += shrink_slab(GFP_KERNEL, nid, 0, memcg, 0);
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
return freed;
@@ -5068,7 +5068,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
success = try_to_shrink_lruvec(lruvec, sc);
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+ shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
+ sc->priority);
if (!sc->proactive)
vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
@@ -6170,7 +6171,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
shrink_lruvec(lruvec, sc);
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+ shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
sc->priority);
/* Record the group's reclaim efficiency */
--
2.34.1
* [PATCH v3 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
@ 2026-04-30 18:23 ` Matthew Brost
From: Matthew Brost @ 2026-04-30 18:23 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
allow subsystems to make coarse decisions about reclaim behavior in the
presence of likely fragmentation.
The helper implements a simple heuristic: if the number of free pages
in a zone exceeds twice the high watermark, the zone is considered to
have ample free memory and allocation failures are more likely due to
fragmentation than overall memory pressure.
This is intentionally imprecise and is not meant to replace the core
MM compaction or fragmentation accounting logic. Instead, it provides
a cheap signal for callers (e.g., shrinkers) that wish to avoid
overly aggressive reclaim when sufficient free memory exists but
high-order allocations may still fail.
No functional changes; this is a preparatory helper for future users.
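To make the threshold concrete, here is a userspace model of the same
comparison. struct mock_zone and mock_zone_maybe_fragmented() are
hypothetical stand-ins for illustration; the two fields model what the
helper reads via zone_page_state(zone, NR_FREE_PAGES) and
high_wmark_pages(zone).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two per-zone values the helper reads. */
struct mock_zone {
    unsigned long nr_free_pages;    /* models NR_FREE_PAGES */
    unsigned long high_wmark;       /* models high_wmark_pages() */
};

/*
 * Same comparison as the helper: strictly more than twice the high
 * watermark of free pages is treated as "memory-rich, possibly
 * fragmented".
 */
bool mock_zone_maybe_fragmented(const struct mock_zone *zone)
{
    assert(zone != NULL);
    return zone->nr_free_pages > zone->high_wmark * 2;
}
```

Note that the comparison is strict, so a zone sitting at exactly twice
the high watermark is not reported as maybe-fragmented.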
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
Hildenbrand)
---
include/linux/vmstat.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..1ad48f70c9d9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
return vmstat_text[item];
}
+static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
+{
+ /*
+ * Simple heuristic: if the number of free pages is more than twice the
+ * high watermark, this may suggest that the zone is heavily fragmented.
+ * When called from a shrinker, aggressively evicting memory in this case
+ * may do more harm to overall system performance than good.
+ */
+ return zone_page_state(zone, NR_FREE_PAGES) >
+ high_wmark_pages(zone) * 2;
+}
+
#ifdef CONFIG_NUMA
static inline const char *numa_stat_name(enum numa_stat_item item)
{
--
2.34.1