* [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-04-30 19:18 Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner,
Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström,
Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:
kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.
This issue was first reported in [1] and independently observed
internally and by Google.
A simple reproducer is:
- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish
Under this workload, ftrace shows a continuous loop of:
xe_shrinker_scan (kswapd)
xe_vma_rebind_exec
Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).
At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:
Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.
This series addresses the issue in two ways:
TTM: Restrict direct reclaim to beneficial_order. Larger allocations
use __GFP_NORETRY to fail quickly rather than triggering reclaim.
Xe: Introduce a heuristic in the shrinker to avoid eviction when
running under kswapd and the system appears memory-rich but
fragmented.
With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.
Buddyinfo after applying this series shows restored higher-order
availability:
Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1
Matt
v2:
- Layer with core MM / TTM helpers (Thomas)
v4:
- Fix build (CI)
[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Matthew Brost (6):
mm: Wire up order in shrink_control
mm: Introduce zone_maybe_fragmented_in_shrinker()
drm/ttm: Issue direct reclaim at beneficial_order
drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
drm/xe: Set TTM device beneficial_order to 9 (2M)
drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
drivers/gpu/drm/xe/xe_device.c | 3 ++-
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
include/drm/ttm/ttm_bo.h | 2 ++
include/linux/shrinker.h | 3 +++
include/linux/vmstat.h | 12 ++++++++++
mm/internal.h | 4 ++--
mm/shrinker.c | 13 +++++++----
mm/vmscan.c | 7 +++---
10 files changed, 76 insertions(+), 13 deletions(-)
--
2.34.1
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 1/6] mm: Wire up order in shrink_control
  2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm, linux-kernel,
	Thomas Hellström

Pass the allocation order through shrink_control so shrinkers have
visibility into the order that triggered reclaim. This allows shrinkers
to implement better heuristics, such as detecting high-order allocation
pressure or fragmentation and avoiding eviction of working sets when
reclaim is invoked from kswapd.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

v4: Fix build without CONFIG_MEMCG (CI)
---
 include/linux/shrinker.h |  3 +++
 mm/internal.h            |  4 ++--
 mm/shrinker.c            | 13 ++++++++-----
 mm/vmscan.c              |  7 ++++---
 4 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..7072f693b9be 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -37,6 +37,9 @@ struct shrink_control {
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
 
+	/* Allocation order we are currently trying to fulfil. */
+	s8 order;
+
 	/*
 	 * How many objects scan_objects should scan and try to reclaim.
 	 * This is reset before every call, so it is safe for callees
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..ff8671dccf7b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1759,8 +1759,8 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 
 /* shrinker related functions */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority);
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+			  struct mem_cgroup *memcg, int priority);
 
 int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 76b3f750cf65..c83f3b3daa08 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -466,7 +466,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 }
 
 #ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
 				       struct mem_cgroup *memcg, int priority)
 {
 	struct shrinker_info *info;
@@ -528,6 +528,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.order = order,
 			.memcg = memcg,
 		};
 		struct shrinker *shrinker;
@@ -587,7 +588,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	return freed;
 }
 #else /* !CONFIG_MEMCG */
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
 				       struct mem_cgroup *memcg, int priority)
 {
 	return 0;
@@ -598,6 +599,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  * shrink_slab - shrink slab caches
  * @gfp_mask: allocation context
  * @nid: node whose slab caches to target
+ * @order: order of allocation
  * @memcg: memory cgroup whose slab caches to target
  * @priority: the reclaim priority
  *
@@ -614,8 +616,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  *
  * Returns the number of reclaimed slab objects.
  */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority)
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+			  struct mem_cgroup *memcg, int priority)
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
@@ -628,7 +630,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 	 * oom.
 	 */
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
-		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+		return shrink_slab_memcg(gfp_mask, nid, order, memcg, priority);
 
 	/*
 	 * lockless algorithm of global shrink.
@@ -656,6 +658,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.order = order,
 			.memcg = memcg,
 		};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..a54d14ecad25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -412,7 +412,7 @@ static unsigned long drop_slab_node(int nid)
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+		freed += shrink_slab(GFP_KERNEL, nid, 0, memcg, 0);
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
 	return freed;
@@ -5068,7 +5068,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	success = try_to_shrink_lruvec(lruvec, sc);
 
-	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+	shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
+		    sc->priority);
 
 	if (!sc->proactive)
 		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
@@ -6170,7 +6171,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		shrink_lruvec(lruvec, sc);
 
-		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+		shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
 			    sc->priority);
 
 		/* Record the group's reclaim efficiency */
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
  2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
  2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
  2026-05-01  0:50   ` Santa, Carlos
  2026-05-01 19:08   ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
  2026-04-30 23:01 ` [PATCH " Andrew Morton
  2026-05-01  1:42 ` Dave Chinner
  3 siblings, 2 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
allow subsystems to make coarse decisions about reclaim behavior in the
presence of likely fragmentation.

The helper implements a simple heuristic: if the number of free pages
in a zone exceeds twice the high watermark, the zone is considered to
have ample free memory and allocation failures are more likely due to
fragmentation than overall memory pressure.

This is intentionally imprecise and is not meant to replace the core
MM compaction or fragmentation accounting logic. Instead, it provides
a cheap signal for callers (e.g., shrinkers) that wish to avoid
overly aggressive reclaim when sufficient free memory exists but
high-order allocations may still fail.

No functional changes; this is a preparatory helper for future users.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
Hildenbrand)
---
 include/linux/vmstat.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..1ad48f70c9d9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
 	return vmstat_text[item];
 }
 
+static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
+{
+	/*
+	 * Simple heuristic: if the number of free pages is more than twice the
+	 * high watermark, this may suggest that the zone is heavily fragmented.
+	 * When called from a shrinker, aggressively evicting memory in this case
+	 * may do more harm to overall system performance than good.
+	 */
+	return zone_page_state(zone, NR_FREE_PAGES) >
+		high_wmark_pages(zone) * 2;
+}
+
 #ifdef CONFIG_NUMA
 static inline const char *numa_stat_name(enum numa_stat_item item)
 {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
@ 2026-05-01  0:50   ` Santa, Carlos
  2026-05-01 19:08   ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
  1 sibling, 0 replies; 14+ messages in thread
From: Santa, Carlos @ 2026-05-01  0:50 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost, Matthew,
	dri-devel@lists.freedesktop.org
  Cc: linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
	david@kernel.org, surenb@google.com, akpm@linux-foundation.org,
	thomas.hellstrom@linux.intel.com, ljs@kernel.org, vbabka@kernel.org,
	linux-mm@kvack.org, rppt@kernel.org, mhocko@suse.com

On Thu, 2026-04-30 at 12:18 -0700, Matthew Brost wrote:
> Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper
> to allow subsystems to make coarse decisions about reclaim behavior
> in the presence of likely fragmentation.
> 
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
> 
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
> 
> No functional changes; this is a preparatory helper for future users.
> 
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> ---
> 
> v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
> Hildenbrand)
> ---
>  include/linux/vmstat.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..1ad48f70c9d9 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
>  	return vmstat_text[item];
>  }
> 

On the below heuristic, I was thinking of the following case: a large
memory system (say 16G, 32G), heavily fragmented (for whatever reason)
but constrained by the IOMMU requiring large pages due to hw alignment.
If I am not mistaken, the below check will cause the shrinker to bail
out too 'early', since there's plenty of available memory but none of
it is contiguous. The end result would be handing back small pages,
which should reduce performance, right?

Below are some made-up numbers:

Metric          | 8GB               | 16GB
----------------|-------------------|-------------------
High Wmark      | ~45MB (11k pgs)   | ~90MB (23k pgs)
Bail Gate (2x)  | ~90MB (22k pgs)   | ~180MB (46k pgs)
Free RAM        | 120MB             | 7100MB (7.1GB)
Shrinker        | RUNS (Free<Gate)  | BAILS (Free>Gate)
Outcome         | Merges 2MB blocks | 4KB pages

In other words, replacing the check with numbers:

System  | Free RAM (Pages)  | Gate (Pages) | Free < Gate?   | Result
--------|-------------------|--------------|----------------|-------
8GB     | 20,480 (80MB)     | 22,946       | 20480 < 22946  | RUNS
16GB    | 1,832,740 (7.1G)  | 45,894       | 1.8M < 45k?    | BAILS

Carlos

> +static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
> +{
> +	/*
> +	 * Simple heuristic: if the number of free pages is more than twice the
> +	 * high watermark, this may suggest that the zone is heavily fragmented.
> +	 * When called from a shrinker, aggressively evicting memory in this case
> +	 * may do more harm to overall system performance than good.
> +	 */
> +	return zone_page_state(zone, NR_FREE_PAGES) >
> +		high_wmark_pages(zone) * 2;
> +}
> +
>  #ifdef CONFIG_NUMA
>  static inline const char *numa_stat_name(enum numa_stat_item item)
>  {

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
  2026-05-01  0:50 ` Santa, Carlos
@ 2026-05-01 19:08 ` Kenneth Crudup
  2026-05-01 20:00   ` Matthew Brost
  1 sibling, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2026-05-01 19:08 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, Kenneth C

On 4/30/26 12:18, Matthew Brost wrote:

> Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
> allow subsystems to make coarse decisions about reclaim behavior in the
> presence of likely fragmentation

I'm running Linus' master on my LunarLake (258v) laptop, and sometimes
after compiling a kernel (of all things) I'd see kswapd0 thrash despite
having quite a bit of free memory.

I finally traced it to the xe driver after seeing the "GPUActive" field
in /proc/meminfo suddenly start rising, eventually growing larger than
real memory by several times (see below).

This patchset fixes the issue, and I'm sure there'll be a fix going into
Linus' master soon, but what I'M wondering is how could building a
kernel (which is just in a KDE Konsole running on Wayland) make the
GPUActive grow from ~1.6G to > 30G (and continue to rise; RN I'm seeing
91839848 kB and still growing).

-Kenny

----
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 652640 kB
GPUReclaim: 403988 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 651180 kB
GPUReclaim: 406812 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 659004 kB
GPUReclaim: 399396 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 666996 kB
GPUReclaim: 392764 kB

<some hours later>
GPUActive: 91832468 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 488000 kB

GPUActive: 91832332 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 487988 kB

GPUActive: 91869376 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 486504 kB
----

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 19:08 ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
@ 2026-05-01 20:00   ` Matthew Brost
  2026-05-01 20:05     ` Kenneth Crudup
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 20:00 UTC (permalink / raw)
  To: Kenneth Crudup, airlied
  Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 12:08:48PM -0700, Kenneth Crudup wrote:
> 
> On 4/30/26 12:18, Matthew Brost wrote:
> 
> > Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
> > allow subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation
> 
> I'm running Linus' master on my LunarLake (258v) laptop, and sometimes after

+Dave

So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
and something looks off here. Thanks for pointing this out.

I'm grabbing a machine now to see if I can recreate this...

Matt

[1] git format-patch -1 2232ba9c7931d

> compiling a kernel (of all things) I'd see kswapd0 thrash despite having
> quite a bit of free memory.
> 
> I finally traced it to the xe driver after seeing the "GPUActive" field in
> /proc/meminfo suddenly start rising, eventually growing larger than real
> memory by several times (see below).
> 
> This patchset fixes the issue, and I'm sure there'll be a fix going into
> Linus' master soon, but what I'M wondering is how could building a kernel
> (which is just in a KDE Konsole running on Wayland) make the GPUActive grow
> from ~1.6G to > 30G (and continue to rise; RN I'm seeing 91839848 kB and
> still growing).
> 
> -Kenny
> 
> ----
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 652640 kB
> GPUReclaim: 403988 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 651180 kB
> GPUReclaim: 406812 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 659004 kB
> GPUReclaim: 399396 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 666996 kB
> GPUReclaim: 392764 kB
> 
> <some hours later>
> GPUActive: 91832468 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 488000 kB
> 
> GPUActive: 91832332 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 487988 kB
> 
> GPUActive: 91869376 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 486504 kB
> ----
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 20:00 ` Matthew Brost
@ 2026-05-01 20:05   ` Kenneth Crudup
  2026-05-01 21:10     ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2026-05-01 20:05 UTC (permalink / raw)
  To: Matthew Brost, airlied
  Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On 5/1/26 13:00, Matthew Brost wrote:

> So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> and something looks off here. Thanks for pointing this out.

Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).

Is this a "shoot the messenger" thing? IOW, is the reporting off, or is
the memory usage really that high?

(BTW, those are in 30-second intervals)

>> ----
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 652640 kB
>> GPUReclaim: 403988 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 651180 kB
>> GPUReclaim: 406812 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 659004 kB
>> GPUReclaim: 399396 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 666996 kB
>> GPUReclaim: 392764 kB
>> 
>> <some hours later>
>> GPUActive: 91832468 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 488000 kB
>> 
>> GPUActive: 91832332 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 487988 kB
>> 
>> GPUActive: 91869376 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 486504 kB
>> ----

-K

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 20:05 ` Kenneth Crudup
@ 2026-05-01 21:10   ` Matthew Brost
  2026-05-01 22:33     ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 21:10 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: airlied, intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
> 
> On 5/1/26 13:00, Matthew Brost wrote:
> 
> > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> > and something looks off here. Thanks for pointing this out.
> 
> Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
> 
> Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the

I don't think I'm firing any shots.

> memory usage really that high?

I've been able to recreate this. It looks like accounting is correct
until the Xe shrinker runs - every time it kicks in, GPUActive grows and
will not drop back below some new floor value. It looks like an
accounting bug in TTM or Xe (?).

Here is my output on an 8G PTL where I have intentionally triggered the
shrinker to evict at least 23875 BOs (most likely quite a few more, but
this is what I can easily see in dmesg) after closing everything on the
desktop.

cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
GPUActive: 13100036 kB
GPUReclaim: 152 kB
Node 0, zone      DMA      0      1      0      0      0      0      0      0      1      1      3
Node 0, zone    DMA32   2320   1882   1523   1238    980    740    482    275    114     88    205
Node 0, zone   Normal   9751   9343   6466   4237   2703   1162    805    420    191    145    289

Let me spend a bit of time here to see if I can figure out where the
accounting goes wrong.

Matt

> 
> (BTW, those are in 30-second intervals)
> 
> > > ----
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 652640 kB
> > > GPUReclaim: 403988 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 651180 kB
> > > GPUReclaim: 406812 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 659004 kB
> > > GPUReclaim: 399396 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 666996 kB
> > > GPUReclaim: 392764 kB
> > > 
> > > <some hours later>
> > > GPUActive: 91832468 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 488000 kB
> > > 
> > > GPUActive: 91832332 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 487988 kB
> > > 
> > > GPUActive: 91869376 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 486504 kB
> > > ----
> 
> -K
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 21:10 ` Matthew Brost
@ 2026-05-01 22:33   ` Matthew Brost
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 22:33 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: airlied, intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 02:10:07PM -0700, Matthew Brost wrote:
> On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
> > 
> > On 5/1/26 13:00, Matthew Brost wrote:
> > 
> > > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> > > and something looks off here. Thanks for pointing this out.
> > 
> > Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
> > 
> > Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the
> 
> I don't think I'm firing any shots.
> 
> > memory usage really that high?
> 
> I've been able to recreate this. It looks like accounting is correct
> until the Xe shrinker runs - every time it kicks in, GPUActive grows and
> will not drop back below some new floor value. It looks like an
> accounting bug in TTM or Xe (?).
> 
> Here is my output on an 8G PTL where I have intentionally triggered the
> shrinker to evict at least 23875 BOs (most likely quite a few more, but
> this is what I can easily see in dmesg) after closing everything on the
> desktop.
> 
> cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
> GPUActive: 13100036 kB
> GPUReclaim: 152 kB
> Node 0, zone      DMA      0      1      0      0      0      0      0      0      1      1      3
> Node 0, zone    DMA32   2320   1882   1523   1238    980    740    482    275    114     88    205
> Node 0, zone   Normal   9751   9343   6466   4237   2703   1162    805    420    191    145    289
> 
> Let me spend a bit of time here to see if I can figure out where the
> accounting goes wrong.
> 

Looks like a simple accounting error in the shrinking path. Here is a
fix [1] that seems to work for me. If you want to give it a try, that
would be helpful.

Matt

[1] https://patchwork.freedesktop.org/series/165862/

> Matt
> 
> > 
> > (BTW, those are in 30-second intervals)
> > 
> > > > ----
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 652640 kB
> > > > GPUReclaim: 403988 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 651180 kB
> > > > GPUReclaim: 406812 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 659004 kB
> > > > GPUReclaim: 399396 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 666996 kB
> > > > GPUReclaim: 392764 kB
> > > > 
> > > > <some hours later>
> > > > GPUActive: 91832468 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 488000 kB
> > > > 
> > > > GPUActive: 91832332 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 487988 kB
> > > > 
> > > > GPUActive: 91869376 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 486504 kB
> > > > ----
> > 
> > -K
> > 
> > -- 
> > Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange
> > County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost 2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost 2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost @ 2026-04-30 23:01 ` Andrew Morton 2026-05-01 6:28 ` Matthew Brost 2026-05-01 1:42 ` Dave Chinner 3 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2026-04-30 23:01 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > TTM allocations at higher orders can drive Xe into a pathological > reclaim loop when memory is fragmented: > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > In this state, reclaim is triggered despite substantial free memory, > but fails to produce contiguous higher-order pages. The Xe shrinker then > evicts active buffer objects, increasing faulting and rebind activity > and further feeding the loop. The result is high CPU overhead and poor > GPU forward progress. > > ... > > This series addresses the issue in two ways: > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > use __GFP_NORETRY to fail quickly rather than triggering reclaim. 
> > Xe: Introduce a heuristic in the shrinker to avoid eviction when > running under kswapd and the system appears memory-rich but > fragmented. Please cc everyone on all the patches? It's kind of annoying to have to hunt around to find out how these proposed changes will be used. Personal preference, anyway. AI review flagged a few possible issues: https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 23:01 ` [PATCH " Andrew Morton @ 2026-05-01 6:28 ` Matthew Brost 2026-05-01 12:51 ` Andrew Morton 0 siblings, 1 reply; 14+ messages in thread From: Matthew Brost @ 2026-05-01 6:28 UTC (permalink / raw) To: Andrew Morton Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, Apr 30, 2026 at 04:01:05PM -0700, Andrew Morton wrote: > On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > > > TTM allocations at higher orders can drive Xe into a pathological > > reclaim loop when memory is fragmented: > > > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > > > In this state, reclaim is triggered despite substantial free memory, > > but fails to produce contiguous higher-order pages. The Xe shrinker then > > evicts active buffer objects, increasing faulting and rebind activity > > and further feeding the loop. The result is high CPU overhead and poor > > GPU forward progress. > > > > ... > > > > This series addresses the issue in two ways: > > > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > > use __GFP_NORETRY to fail quickly rather than triggering reclaim. > > > > Xe: Introduce a heuristic in the shrinker to avoid eviction when > > running under kswapd and the system appears memory-rich but > > fragmented. > > Please cc everyone on all the patches? It's kind of annoying to have > to hunt around to find out how these proposed changes will be used. 
> Personal preference, anyway. > Will do - we discussed this in the past and I thought we had landed on Cc'ing everyone on the cover letter and then individual Cc lists on each patch, but I will blast everyone going forward. > AI review flagged a few possible issues: > https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com Idk who authors sashiko, but what would make it really nice is if you could reply to it to talk things out. Looking at replies... - 'Could this global counter drift significantly' this looks right for multi-CPU, which isn't really the target here, but I will adjust - 'Additionally, does NR_FREE_PAGES implicitly include CMA pages?' this looks right, will adjust - 'Can high_wmark_pages(zone) evaluate to zero during early boot' theoretically possible (?), but a non-issue IMO; certainly for a GPU shrinker, which is the current use case, this is impossible, but maybe add a warn_on if high_wmark_pages(zone) returns zero - 'Is this description accurate?' I inverted the TTM kernel doc vs the code, will fix Matt ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-05-01 6:28 ` Matthew Brost @ 2026-05-01 12:51 ` Andrew Morton 0 siblings, 0 replies; 14+ messages in thread From: Andrew Morton @ 2026-05-01 12:51 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, 30 Apr 2026 23:28:08 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > > AI review flagged a few possible issues: > > https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com > > Idk, who authors sashiko but what make it really nice if you could reply > to it to talk things out. It's a gemini 3 thing, based on prompts developed by Roman Gushchin and Chris Mason and others. Google is making this available to kernel developers at a non-trivial expense. And yes, it would be great if Sashiko were able to learn from our replies and to fine-tune its checking based on the human corrections. I've asked for this a few times but didn't really understand the reply ;) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost ` (2 preceding siblings ...) 2026-04-30 23:01 ` [PATCH " Andrew Morton @ 2026-05-01 1:42 ` Dave Chinner 2026-05-01 7:09 ` Matthew Brost 3 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2026-05-01 1:42 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote: > TTM allocations at higher orders can drive Xe into a pathological > reclaim loop when memory is fragmented: > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > In this state, reclaim is triggered despite substantial free memory, > but fails to produce contiguous higher-order pages. The Xe shrinker then > evicts active buffer objects, increasing faulting and rebind activity > and further feeding the loop. The result is high CPU overhead and poor > GPU forward progress. > > This issue was first reported in [1] and independently observed > internally and by Google. > > A simple reproducer is: > > - Boot an iGPU system with mem=8G > - Launch 10 Chrome tabs running the WebGL aquarium demo > - Configure each tab with ~5k fish > > Under this workload, ftrace shows a continuous loop of: > > xe_shrinker_scan (kswapd) > xe_vma_rebind_exec > > Performance degrades significantly, with each tab dropping to ~2 FPS on > PTL (Ubuntu 24.04). 
> > At the same time, /proc/buddyinfo shows substantial free memory but no > higher-order availability. For example, the Normal zone: > > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0 > > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks, > indicating severe fragmentation. > > This series addresses the issue in two ways: > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > use __GFP_NORETRY to fail quickly rather than triggering reclaim. NACK. As I have said to the people trying to hack around direct reclaim for high order allocations being costly for the page cache, fix the problem with direct reclaim. (e.g. https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/) We should not be hacking around a problem in the mm infrastructure by changing allocation context flags at every high order allocation call site that needs high order allocations. Understand and fix the infrastructure problem once and for all. > Xe: Introduce a heuristic in the shrinker to avoid eviction when > running under kswapd and the system appears memory-rich but > fragmented. NACK on architectural grounds. Custom heuristics in individual shrinkers to decide whether they should do what the mm subsystem has asked them to do have -always- been a mistake to allow. The mm subsystem makes the decision on how much cache shrinkage needs to occur, the shrinkers just do what they are told to do. If we have a problem where a workload causes excessive shrinker reclaim, then we need to address the problem in the infrastructure because excessive reclaim affects the performance of -all- subsystems with shrinkable caches, not just the TTM subsystem. As it is, I can't review what you've actually implemented because you only cc'd me on a single patch in the series. In future, please cc me on the whole patchset because shrinkers need to work as a coherent whole, not just in isolation.... -Dave. 
-- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-05-01 1:42 ` Dave Chinner @ 2026-05-01 7:09 ` Matthew Brost 0 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2026-05-01 7:09 UTC (permalink / raw) To: Dave Chinner Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Fri, May 01, 2026 at 11:42:19AM +1000, Dave Chinner wrote: Thanks for the feedback. I’m looking into this more, and it’s becoming clear that this is a hard problem—one that will likely require coordinated work between DRM and core MM to really sort out. That said, I do think what I have in place is a reasonable short-term fix. More below. > On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote: > > TTM allocations at higher orders can drive Xe into a pathological > > reclaim loop when memory is fragmented: > > > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > > > In this state, reclaim is triggered despite substantial free memory, > > but fails to produce contiguous higher-order pages. The Xe shrinker then > > evicts active buffer objects, increasing faulting and rebind activity > > and further feeding the loop. The result is high CPU overhead and poor > > GPU forward progress. > > > > This issue was first reported in [1] and independently observed > > internally and by Google. 
> > > > A simple reproducer is: > > > > - Boot an iGPU system with mem=8G > > - Launch 10 Chrome tabs running the WebGL aquarium demo > > - Configure each tab with ~5k fish > > > > Under this workload, ftrace shows a continuous loop of: > > > > xe_shrinker_scan (kswapd) > > xe_vma_rebind_exec > > > > Performance degrades significantly, with each tab dropping to ~2 FPS on > > PTL (Ubuntu 24.04). > > > > At the same time, /proc/buddyinfo shows substantial free memory but no > > higher-order availability. For example, the Normal zone: > > > > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0 > > > > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks, > > indicating severe fragmentation. > > > > This series addresses the issue in two ways: > > > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > > use __GFP_NORETRY to fail quickly rather than triggering reclaim. > > NACK. > > As I have said to the people trying to hack around direct reclaim > for high order allocations being costly for the page cache, fix the > problem with direct reclaim. (e.g. > https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/) > I read your response. Maybe it isn't clear what is going on here. At beneficial_order: gfp == __GFP_RECLAIM | __GFP_NORETRY At order zero: gfp == __GFP_RECLAIM This is roughly the existing behavior; the exact changes are here [1]. [1] https://patchwork.freedesktop.org/patch/722247/?series=165329&rev=3 If this is truly a NACK, then we can rethink it—likely by disabling reclaim at higher orders—but that has its own downsides for DRM and GPUs. Ideally, you want purgeable BOs to be evicted when a higher-order allocation fails; you really don’t want to end up in an insane kswap loop. > We should not be hacking around a problem in the mm infrastructure > by changing allocation context flags every high order allocation > call site that needs high order allocations. Understand and fix the > infrastructure problem once and for all. 
> Well, I agree that we should aim to fix this in core MM, but as the saying goes, Rome wasn’t built in a day. The fact is that these GFP flags do exist, and suddenly drawing a line and declaring them no longer valid feels a bit unfair. I’ll also note that Intel—and I personally—have an interest in fixing shrinking, so you can expect follow-up work here. > > Xe: Introduce a heuristic in the shrinker to avoid eviction when > > running under kswapd and the system appears memory-rich but > > fragmented. > > NACK on architectural grounds. > > Custom heuristics in individual shrinkers to decide whether the > should do what the mm subsystem has asked them to do has -always- > been a mistake to allow. The mm subsystem makes the decision on how I’m not going to disagree with using custom heuristics in individual shrinkers, but I’d wager that most shrinkers sadly already implement custom heuristics. > much cache shrinkage needs to occur, the shrinkers just do what they > are told to do. > > If we have a problem where a workload causes excessive shrinker > reclaim, then we need to address the problem in the infrastructure > because excessive reclaim affects the performance of -all- > subsystems with shrinkable caches, not just the TTM subsystem. > Yes, I agree, and I’ve thought about the implications of simply having TTM back off when a higher-order allocation fails, even when we actually have enough memory, and how that would affect everyone. This series at least fixes the “well, there goes my GUI” problem. I do have another patch locally that prevents TTM from accidentally fragmenting memory and triggering the kswap loop, but under enough pressure I can still get the GUI to lock up for periods of time. With this series, however, I can’t reproduce that issue. > As it is, I can't review what you've actually implemented because > you only cc'd me on a single patch in the series. 
In future, please > cc me on the whole patchset because shrinkers need to work as a > coherent whole, not just in isolation.... > Sorry about this - Andrew just said the same thing. Here is the PW link [2]. Or: b4 mbox 20260430191809.2142544-1-matthew.brost@intel.com [2] https://patchwork.freedesktop.org/series/165329/ If you have any ideas on how to fix this in the core, let’s discuss. I have a bunch of ideas in my head, but core MM isn’t my native domain. Matt > -Dave. > -- > Dave Chinner > dgc@kernel.org ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-01 22:33 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost 2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost 2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost 2026-05-01 0:50 ` Santa, Carlos 2026-05-01 19:08 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup 2026-05-01 20:00 ` Matthew Brost 2026-05-01 20:05 ` Kenneth Crudup 2026-05-01 21:10 ` Matthew Brost 2026-05-01 22:33 ` Matthew Brost 2026-04-30 23:01 ` [PATCH " Andrew Morton 2026-05-01 6:28 ` Matthew Brost 2026-05-01 12:51 ` Andrew Morton 2026-05-01 1:42 ` Dave Chinner 2026-05-01 7:09 ` Matthew Brost