* [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-04-30 19:18 Matthew Brost
2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner,
Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström,
Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Daniel Colascione, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:
kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.
This issue was first reported in [1] and independently observed
internally and by Google.
A simple reproducer is:
- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish
Under this workload, ftrace shows a continuous loop of:
xe_shrinker_scan (kswapd)
xe_vma_rebind_exec
Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).
At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:
Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.
This series addresses the issue in two ways:
TTM: Restrict direct reclaim to beneficial_order. Larger allocations
use __GFP_NORETRY to fail quickly rather than triggering reclaim.
Xe: Introduce a heuristic in the shrinker to avoid eviction when
running under kswapd and the system appears memory-rich but
fragmented.
With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.
Buddyinfo after applying this series shows restored higher-order
availability:
Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1
Matt
v2:
- Layer with core MM / TTM helpers (Thomas)
v4:
- Fix build (CI)
[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Matthew Brost (6):
mm: Wire up order in shrink_control
mm: Introduce zone_maybe_fragmented_in_shrinker()
drm/ttm: Issue direct reclaim at beneficial_order
drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
drm/xe: Set TTM device beneficial_order to 9 (2M)
drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
drivers/gpu/drm/xe/xe_device.c | 3 ++-
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
include/drm/ttm/ttm_bo.h | 2 ++
include/linux/shrinker.h | 3 +++
include/linux/vmstat.h | 12 ++++++++++
mm/internal.h | 4 ++--
mm/shrinker.c | 13 +++++++----
mm/vmscan.c | 7 +++---
10 files changed, 76 insertions(+), 13 deletions(-)
--
2.34.1
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 1/6] mm: Wire up order in shrink_control
  2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm, linux-kernel,
	Thomas Hellström

Pass the allocation order through shrink_control so shrinkers have
visibility into the order that triggered reclaim. This allows shrinkers
to implement better heuristics, such as detecting high-order allocation
pressure or fragmentation and avoiding eviction of working sets when
reclaim is invoked from kswapd.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

v4: Fix build without CONFIG_MEMCG (CI)
---
 include/linux/shrinker.h |  3 +++
 mm/internal.h            |  4 ++--
 mm/shrinker.c            | 13 ++++++++-----
 mm/vmscan.c              |  7 ++++---
 4 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..7072f693b9be 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -37,6 +37,9 @@ struct shrink_control {
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
 
+	/* Allocation order we are currently trying to fulfil. */
+	s8 order;
+
 	/*
 	 * How many objects scan_objects should scan and try to reclaim.
 	 * This is reset before every call, so it is safe for callees
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..ff8671dccf7b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1759,8 +1759,8 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 
 /* shrinker related functions */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority);
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+			  struct mem_cgroup *memcg, int priority);
 
 int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 76b3f750cf65..c83f3b3daa08 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -466,7 +466,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 }
 
 #ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
 				       struct mem_cgroup *memcg, int priority)
 {
 	struct shrinker_info *info;
@@ -528,6 +528,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.order = order,
 			.memcg = memcg,
 		};
 		struct shrinker *shrinker;
@@ -587,7 +588,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	return freed;
 }
 #else /* !CONFIG_MEMCG */
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, s8 order,
 				       struct mem_cgroup *memcg, int priority)
 {
 	return 0;
@@ -598,6 +599,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  * shrink_slab - shrink slab caches
  * @gfp_mask: allocation context
  * @nid: node whose slab caches to target
+ * @order: order of allocation
  * @memcg: memory cgroup whose slab caches to target
  * @priority: the reclaim priority
  *
@@ -614,8 +616,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  *
  * Returns the number of reclaimed slab objects.
  */
-unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
-			  int priority)
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, s8 order,
+			  struct mem_cgroup *memcg, int priority)
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
@@ -628,7 +630,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 	 * oom.
 	 */
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
-		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+		return shrink_slab_memcg(gfp_mask, nid, order, memcg, priority);
 
 	/*
 	 * lockless algorithm of global shrink.
@@ -656,6 +658,7 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
+			.order = order,
 			.memcg = memcg,
 		};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..a54d14ecad25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -412,7 +412,7 @@ static unsigned long drop_slab_node(int nid)
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+		freed += shrink_slab(GFP_KERNEL, nid, 0, memcg, 0);
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
 	return freed;
@@ -5068,7 +5068,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	success = try_to_shrink_lruvec(lruvec, sc);
 
-	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+	shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
+		    sc->priority);
 
 	if (!sc->proactive)
 		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
@@ -6170,7 +6171,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		shrink_lruvec(lruvec, sc);
 
-		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+		shrink_slab(sc->gfp_mask, pgdat->node_id, sc->order, memcg,
 			    sc->priority);
 
 		/* Record the group's reclaim efficiency */
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
  2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
  2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost
@ 2026-04-30 19:18 ` Matthew Brost
  2026-05-01  0:50   ` Santa, Carlos
  2026-05-01 19:08   ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
  2026-04-30 23:01 ` [PATCH " Andrew Morton
  2026-05-01  1:42 ` Dave Chinner
  3 siblings, 2 replies; 14+ messages in thread
From: Matthew Brost @ 2026-04-30 19:18 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel

Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
allow subsystems to make coarse decisions about reclaim behavior in the
presence of likely fragmentation.

The helper implements a simple heuristic: if the number of free pages
in a zone exceeds twice the high watermark, the zone is considered to
have ample free memory and allocation failures are more likely due to
fragmentation than overall memory pressure.

This is intentionally imprecise and is not meant to replace the core
MM compaction or fragmentation accounting logic. Instead, it provides
a cheap signal for callers (e.g., shrinkers) that wish to avoid
overly aggressive reclaim when sufficient free memory exists but
high-order allocations may still fail.

No functional changes; this is a preparatory helper for future users.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>

---

v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
Hildenbrand)
---
 include/linux/vmstat.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..1ad48f70c9d9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
 	return vmstat_text[item];
 }
 
+static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
+{
+	/*
+	 * Simple heuristic: if the number of free pages is more than twice the
+	 * high watermark, this may suggest that the zone is heavily fragmented.
+	 * When called from a shrinker, aggressively evicting memory in this case
+	 * may do more harm to overall system performance than good.
+	 */
+	return zone_page_state(zone, NR_FREE_PAGES) >
+		high_wmark_pages(zone) * 2;
+}
+
 #ifdef CONFIG_NUMA
 static inline const char *numa_stat_name(enum numa_stat_item item)
 {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker()
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
@ 2026-05-01  0:50   ` Santa, Carlos
  2026-05-01 19:08   ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
  1 sibling, 0 replies; 14+ messages in thread
From: Santa, Carlos @ 2026-05-01  0:50 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost, Matthew,
	dri-devel@lists.freedesktop.org
  Cc: linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
	david@kernel.org, surenb@google.com, akpm@linux-foundation.org,
	thomas.hellstrom@linux.intel.com, ljs@kernel.org, vbabka@kernel.org,
	linux-mm@kvack.org, rppt@kernel.org, mhocko@suse.com

On Thu, 2026-04-30 at 12:18 -0700, Matthew Brost wrote:
> Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper
> to allow subsystems to make coarse decisions about reclaim behavior
> in the presence of likely fragmentation.
> 
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
> 
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
> 
> No functional changes; this is a preparatory helper for future users.
> 
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> ---
> 
> v3: s/zone_appear_fragmented/zone_maybe_fragmented_in_shrinker (David
> Hildenbrand)
> ---
>  include/linux/vmstat.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..1ad48f70c9d9 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,18 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
>  	return vmstat_text[item];
>  }
> 

On the below heuristic, I was thinking of the following case: a large
memory system (say 16G, 32G), heavily fragmented (for whatever reason)
but constrained by the IOMMU requiring large pages due to hw alignment.
If I am not mistaken, the below check will cause the shrinker to bail
out too 'early', since there's plenty of available memory but none of
it is contiguous. The end result would be handing back small pages,
which should reduce performance, right?

Below are some made-up numbers:

Metric          | 8GB               | 16GB
----------------|-------------------|-------------------
High Wmark      | ~45MB (11k pgs)   | ~90MB (23k pgs)
Bail Gate (2x)  | ~90MB (22k pgs)   | ~180MB (46k pgs)
Free RAM        | 120MB             | 7100MB (7.1GB)
Shrinker        | RUNS (Free<Gate)  | BAILS (Free>Gate)
Outcome         | Merges 2MB blocks | 4KB pages

In other words, replacing the check with numbers:

System  | Free RAM (Pages)  | Gate (Pages) | Free < Gate?   | Result
--------|-------------------|--------------|----------------|-------
8GB     | 20,480 (80MB)     | 22,946       | 20480 < 22946  | RUNS
16GB    | 1,832,740 (7.1G)  | 45,894       | 1.8M < 45k?    | BAILS

Carlos

> +static inline bool zone_maybe_fragmented_in_shrinker(struct zone *zone)
> +{
> +	/*
> +	 * Simple heuristic: if the number of free pages is more than twice the
> +	 * high watermark, this may suggest that the zone is heavily fragmented.
> +	 * When called from a shrinker, aggressively evicting memory in this case
> +	 * may do more harm to overall system performance than good.
> +	 */
> +	return zone_page_state(zone, NR_FREE_PAGES) >
> +		high_wmark_pages(zone) * 2;
> +}
> +
>  #ifdef CONFIG_NUMA
>  static inline const char *numa_stat_name(enum numa_stat_item item)
>  {

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost
  2026-05-01  0:50 ` Santa, Carlos
@ 2026-05-01 19:08 ` Kenneth Crudup
  2026-05-01 20:00   ` Matthew Brost
  1 sibling, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2026-05-01 19:08 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, Kenneth C

On 4/30/26 12:18, Matthew Brost wrote:

> Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
> allow subsystems to make coarse decisions about reclaim behavior in the
> presence of likely fragmentation

I'm running Linus' master on my LunarLake (258v) laptop, and sometimes
after compiling a kernel (of all things) I'd see kswapd0 thrash despite
having quite a bit of free memory.

I finally traced it to the xe driver after seeing the "GPUActive" field
in /proc/meminfo suddenly start rising, eventually growing larger than
real memory by several times (see below).

This patchset fixes the issue, and I'm sure there'll be a fix going into
Linus' master soon, but what I'M wondering is how could building a
kernel (which is just in a KDE Konsole running on Wayland) make the
GPUActive grow from ~1.6G to > 30G (and continue to rise; RN I'm seeing
91839848 kB and still growing).

-Kenny

----
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 652640 kB
GPUReclaim: 403988 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 651180 kB
GPUReclaim: 406812 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 659004 kB
GPUReclaim: 399396 kB

SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUActive: 666996 kB
GPUReclaim: 392764 kB

<some hours later>
GPUActive: 91832468 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 488000 kB

GPUActive: 91832332 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 487988 kB

GPUActive: 91869376 kB
SwapTotal: 33554428 kB
MemTotal: 32345672 kB
GPUReclaim: 486504 kB
----

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 19:08 ` PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup
@ 2026-05-01 20:00   ` Matthew Brost
  2026-05-01 20:05     ` Kenneth Crudup
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 20:00 UTC (permalink / raw)
  To: Kenneth Crudup, airlied
  Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 12:08:48PM -0700, Kenneth Crudup wrote:
> 
> On 4/30/26 12:18, Matthew Brost wrote:
> 
> > Introduce zone_maybe_fragmented_in_shrinker() as a lightweight helper to
> > allow subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation
> 
> I'm running Linus' master on my LunarLake (258v) laptop, and sometimes after

+Dave

So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
and something looks off here. Thanks for pointing this out.

I'm grabbing a machine now to see if I can recreate this...

Matt

[1] git format-patch -1 2232ba9c7931d

> compiling a kernel (of all things) I'd see kswapd0 thrash despite having
> quite a bit of free memory.
> 
> I finally traced it to the xe driver after seeing the "GPUActive" field in
> /proc/meminfo suddenly start rising, eventually growing larger than real
> memory by several times (see below).
> 
> This patchset fixes the issue, and I'm sure there'll be a fix going into
> Linus' master soon, but what I'M wondering is how could building a kernel
> (which is just in a KDE Konsole running on Wayland) make the GPUActive grow
> from ~1.6G to > 30G (and continue to rise; RN I'm seeing 91839848 kB and
> still growing).
> 
> -Kenny
> 
> ----
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 652640 kB
> GPUReclaim: 403988 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 651180 kB
> GPUReclaim: 406812 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 659004 kB
> GPUReclaim: 399396 kB
> 
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUActive: 666996 kB
> GPUReclaim: 392764 kB
> 
> <some hours later>
> GPUActive: 91832468 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 488000 kB
> 
> GPUActive: 91832332 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 487988 kB
> 
> GPUActive: 91869376 kB
> SwapTotal: 33554428 kB
> MemTotal: 32345672 kB
> GPUReclaim: 486504 kB
> ----
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 20:00 ` Matthew Brost
@ 2026-05-01 20:05   ` Kenneth Crudup
  2026-05-01 21:10     ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2026-05-01 20:05 UTC (permalink / raw)
  To: Matthew Brost, airlied
  Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On 5/1/26 13:00, Matthew Brost wrote:

> So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> and something looks off here. Thanks for pointing this out.

Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).

Is this a "shoot the messenger" thing? IOW, is the reporting off, or is
the memory usage really that high?

(BTW, those are in 30-second intervals)

>> ----
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 652640 kB
>> GPUReclaim: 403988 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 651180 kB
>> GPUReclaim: 406812 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 659004 kB
>> GPUReclaim: 399396 kB
>> 
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUActive: 666996 kB
>> GPUReclaim: 392764 kB
>> 
>> <some hours later>
>> GPUActive: 91832468 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 488000 kB
>> 
>> GPUActive: 91832332 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 487988 kB
>> 
>> GPUActive: 91869376 kB
>> SwapTotal: 33554428 kB
>> MemTotal: 32345672 kB
>> GPUReclaim: 486504 kB
>> ----

-K

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 20:05 ` Kenneth Crudup
@ 2026-05-01 21:10   ` Matthew Brost
  2026-05-01 22:33     ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 21:10 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: airlied, intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
> 
> On 5/1/26 13:00, Matthew Brost wrote:
> 
> > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> > and something looks off here. Thanks for pointing this out.
> 
> Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
> 
> Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the

I don't think I'm firing any shots.

> memory usage really that high?

I've been able to recreate this. It looks like accounting is correct
until the Xe shrinker runs - every time it kicks in, GPUActive grows and
will not drop back below some new floor value. It looks like an
accounting bug in TTM or Xe (?).

Here is my output on an 8G PTL where I have intentionally triggered the
shrinker to evict at least 23875 BOs (most likely quite a few more, but
this is what I can easily see in dmesg) after closing everything on the
desktop.

cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
GPUActive: 13100036 kB
GPUReclaim: 152 kB
Node 0, zone      DMA      0      1      0      0      0      0      0      0      1      1      3
Node 0, zone    DMA32   2320   1882   1523   1238    980    740    482    275    114     88    205
Node 0, zone   Normal   9751   9343   6466   4237   2703   1162    805    420    191    145    289

Let me spend a bit of time here to see if I can figure out where the
accounting goes wrong.

Matt

> 
> (BTW, those are in 30-second intervals)
> 
> > > ----
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 652640 kB
> > > GPUReclaim: 403988 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 651180 kB
> > > GPUReclaim: 406812 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 659004 kB
> > > GPUReclaim: 399396 kB
> > > 
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUActive: 666996 kB
> > > GPUReclaim: 392764 kB
> > > 
> > > <some hours later>
> > > GPUActive: 91832468 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 488000 kB
> > > 
> > > GPUActive: 91832332 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 487988 kB
> > > 
> > > GPUActive: 91869376 kB
> > > SwapTotal: 33554428 kB
> > > MemTotal: 32345672 kB
> > > GPUReclaim: 486504 kB
> > > ----
> 
> -K
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
  2026-05-01 21:10 ` Matthew Brost
@ 2026-05-01 22:33   ` Matthew Brost
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2026-05-01 22:33 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: airlied, intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel

On Fri, May 01, 2026 at 02:10:07PM -0700, Matthew Brost wrote:
> On Fri, May 01, 2026 at 01:05:57PM -0700, Kenneth Crudup wrote:
> > 
> > On 5/1/26 13:00, Matthew Brost wrote:
> > 
> > > So is this 7.1-rc1? It looks like a new feature added to 7.1 by Dave [1],
> > > and something looks off here. Thanks for pointing this out.
> > 
> > Yeah. I grab his master branch daily (as of 6fe0be6dc7fa RN).
> > 
> > Is this a "shoot the messenger" thing? IOW, is the reporting off, or is the
> 
> I don't think I'm firing any shots.
> 
> > memory usage really that high?
> 
> I've been able to recreate this. It looks like accounting is correct
> until the Xe shrinker runs - every time it kicks in, GPUActive grows and
> will not drop back below some new floor value. It looks like an
> accounting bug in TTM or Xe (?).
> 
> Here is my output on an 8G PTL where I have intentionally triggered the
> shrinker to evict at least 23875 BOs (most likely quite a few more, but
> this is what I can easily see in dmesg) after closing everything on the
> desktop.
> 
> cat /proc/meminfo | grep GPU; cat /proc/buddyinfo;
> GPUActive: 13100036 kB
> GPUReclaim: 152 kB
> Node 0, zone      DMA      0      1      0      0      0      0      0      0      1      1      3
> Node 0, zone    DMA32   2320   1882   1523   1238    980    740    482    275    114     88    205
> Node 0, zone   Normal   9751   9343   6466   4237   2703   1162    805    420    191    145    289
> 
> Let me spend a bit of time here to see if I can figure out where the
> accounting goes wrong.
> 

Looks like a simple accounting error in the shrinking path. Here is a
fix [1] that seems to work for me. If you want to give it a try, that
would be helpful.

Matt

[1] https://patchwork.freedesktop.org/series/165862/

> Matt
> 
> > 
> > (BTW, those are in 30-second intervals)
> > 
> > > > ----
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 652640 kB
> > > > GPUReclaim: 403988 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 651180 kB
> > > > GPUReclaim: 406812 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 659004 kB
> > > > GPUReclaim: 399396 kB
> > > > 
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUActive: 666996 kB
> > > > GPUReclaim: 392764 kB
> > > > 
> > > > <some hours later>
> > > > GPUActive: 91832468 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 488000 kB
> > > > 
> > > > GPUActive: 91832332 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 487988 kB
> > > > 
> > > > GPUActive: 91869376 kB
> > > > SwapTotal: 33554428 kB
> > > > MemTotal: 32345672 kB
> > > > GPUReclaim: 486504 kB
> > > > ----
> > 
> > -K
> > 
> > -- 
> > Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange
> > County CA

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost 2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost 2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost @ 2026-04-30 23:01 ` Andrew Morton 2026-05-01 6:28 ` Matthew Brost 2026-05-01 1:42 ` Dave Chinner 3 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2026-04-30 23:01 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > TTM allocations at higher orders can drive Xe into a pathological > reclaim loop when memory is fragmented: > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > In this state, reclaim is triggered despite substantial free memory, > but fails to produce contiguous higher-order pages. The Xe shrinker then > evicts active buffer objects, increasing faulting and rebind activity > and further feeding the loop. The result is high CPU overhead and poor > GPU forward progress. > > ... > > This series addresses the issue in two ways: > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > use __GFP_NORETRY to fail quickly rather than triggering reclaim. 
> > Xe: Introduce a heuristic in the shrinker to avoid eviction when > running under kswapd and the system appears memory-rich but > fragmented. Please cc everyone on all the patches? It's kind of annoying to have to hunt around to find out how these proposed changes will be used. Personal preference, anyway. AI review flagged a few possible issues: https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 23:01 ` [PATCH " Andrew Morton @ 2026-05-01 6:28 ` Matthew Brost 2026-05-01 12:51 ` Andrew Morton 0 siblings, 1 reply; 14+ messages in thread From: Matthew Brost @ 2026-05-01 6:28 UTC (permalink / raw) To: Andrew Morton Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, Apr 30, 2026 at 04:01:05PM -0700, Andrew Morton wrote: > On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > > > TTM allocations at higher orders can drive Xe into a pathological > > reclaim loop when memory is fragmented: > > > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > > > In this state, reclaim is triggered despite substantial free memory, > > but fails to produce contiguous higher-order pages. The Xe shrinker then > > evicts active buffer objects, increasing faulting and rebind activity > > and further feeding the loop. The result is high CPU overhead and poor > > GPU forward progress. > > > > ... > > > > This series addresses the issue in two ways: > > > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > > use __GFP_NORETRY to fail quickly rather than triggering reclaim. > > > > Xe: Introduce a heuristic in the shrinker to avoid eviction when > > running under kswapd and the system appears memory-rich but > > fragmented. > > Please cc everyone on all the patches? It's kind of annoying to have > to hunt around to find out how these proposed changes will be used. 
> Personal preference, anyway. > Will do - we discussed this in the past and I thought we had landed on Cc'ing everyone on the cover letter and then individual Cc lists on each patch, but I will blast everyone going forward. > AI review flagged a few possible issues: > https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com Idk who authors sashiko, but what would make it really nice is if you could reply to it to talk things out. Looking at replies... - 'Could this global counter drift significantly' this looks right for multi-CPU, which isn't really the target here, but I will adjust - 'Additionally, does NR_FREE_PAGES implicitly include CMA pages?' this looks right, will adjust - 'Can high_wmark_pages(zone) evaluate to zero during early boot' theoretically possible (?), but a non-issue IMO; certainly for a GPU shrinker, which is the current use case, this is impossible, but maybe add a warn_on if high_wmark_pages(zone) returns zero - 'Is this description accurate?' I inverted the TTM kernel doc vs the code, will fix Matt ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-05-01 6:28 ` Matthew Brost @ 2026-05-01 12:51 ` Andrew Morton 0 siblings, 0 replies; 14+ messages in thread From: Andrew Morton @ 2026-05-01 12:51 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, 30 Apr 2026 23:28:08 -0700 Matthew Brost <matthew.brost@intel.com> wrote: > > AI review flagged a few possible issues: > > https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com > > Idk, who authors sashiko but what make it really nice if you could reply > to it to talk things out. It's a gemini 3 thing, based on prompts developed by Roman Gushchin and Chris Mason and others. Google is making this available to kernel developers at a non-trivial expense. And yes, it would be great if Sashiko were able to learn from our replies and to fine-tune its checking based on the human corrections. I've asked for this a few times but didn't really understand the reply ;) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost ` (2 preceding siblings ...) 2026-04-30 23:01 ` [PATCH " Andrew Morton @ 2026-05-01 1:42 ` Dave Chinner 2026-05-01 7:09 ` Matthew Brost 3 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2026-05-01 1:42 UTC (permalink / raw) To: Matthew Brost Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote: > TTM allocations at higher orders can drive Xe into a pathological > reclaim loop when memory is fragmented: > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > In this state, reclaim is triggered despite substantial free memory, > but fails to produce contiguous higher-order pages. The Xe shrinker then > evicts active buffer objects, increasing faulting and rebind activity > and further feeding the loop. The result is high CPU overhead and poor > GPU forward progress. > > This issue was first reported in [1] and independently observed > internally and by Google. > > A simple reproducer is: > > - Boot an iGPU system with mem=8G > - Launch 10 Chrome tabs running the WebGL aquarium demo > - Configure each tab with ~5k fish > > Under this workload, ftrace shows a continuous loop of: > > xe_shrinker_scan (kswapd) > xe_vma_rebind_exec > > Performance degrades significantly, with each tab dropping to ~2 FPS on > PTL (Ubuntu 24.04). 
> > At the same time, /proc/buddyinfo shows substantial free memory but no > higher-order availability. For example, the Normal zone: > > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0 > > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks, > indicating severe fragmentation. > > This series addresses the issue in two ways: > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > use __GFP_NORETRY to fail quickly rather than triggering reclaim. NACK. As I have said to the people trying to hack around direct reclaim for high order allocations being costly for the page cache, fix the problem with direct reclaim. (e.g. https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/) We should not be hacking around a problem in the mm infrastructure by changing allocation context flags at every high order allocation call site that needs high order allocations. Understand and fix the infrastructure problem once and for all. > Xe: Introduce a heuristic in the shrinker to avoid eviction when > running under kswapd and the system appears memory-rich but > fragmented. NACK on architectural grounds. Custom heuristics in individual shrinkers to decide whether they should do what the mm subsystem has asked them to do have -always- been a mistake to allow. The mm subsystem makes the decision on how much cache shrinkage needs to occur, the shrinkers just do what they are told to do. If we have a problem where a workload causes excessive shrinker reclaim, then we need to address the problem in the infrastructure because excessive reclaim affects the performance of -all- subsystems with shrinkable caches, not just the TTM subsystem. As it is, I can't review what you've actually implemented because you only cc'd me on a single patch in the series. In future, please cc me on the whole patchset because shrinkers need to work as a coherent whole, not just in isolation.... -Dave. 
-- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation 2026-05-01 1:42 ` Dave Chinner @ 2026-05-01 7:09 ` Matthew Brost 0 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2026-05-01 7:09 UTC (permalink / raw) To: Dave Chinner Cc: intel-xe, dri-devel, Dave Chinner, Qi Zheng, Roman Gushchin, Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Tvrtko Ursulin, Thomas Hellström, Carlos Santa, Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Daniel Colascione, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel On Fri, May 01, 2026 at 11:42:19AM +1000, Dave Chinner wrote: Thanks for the feedback. I’m looking into this more, and it’s becoming clear that this is a hard problem—one that will likely require coordinated work between DRM and core MM to really sort out. That said, I do think what I have in place is a reasonable short-term fix. More below. > On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote: > > TTM allocations at higher orders can drive Xe into a pathological > > reclaim loop when memory is fragmented: > > > > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat > > > > In this state, reclaim is triggered despite substantial free memory, > > but fails to produce contiguous higher-order pages. The Xe shrinker then > > evicts active buffer objects, increasing faulting and rebind activity > > and further feeding the loop. The result is high CPU overhead and poor > > GPU forward progress. > > > > This issue was first reported in [1] and independently observed > > internally and by Google. 
> > > > A simple reproducer is: > > > > - Boot an iGPU system with mem=8G > > - Launch 10 Chrome tabs running the WebGL aquarium demo > > - Configure each tab with ~5k fish > > > > Under this workload, ftrace shows a continuous loop of: > > > > xe_shrinker_scan (kswapd) > > xe_vma_rebind_exec > > > > Performance degrades significantly, with each tab dropping to ~2 FPS on > > PTL (Ubuntu 24.04). > > > > At the same time, /proc/buddyinfo shows substantial free memory but no > > higher-order availability. For example, the Normal zone: > > > > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0 > > > > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks, > > indicating severe fragmentation. > > > > This series addresses the issue in two ways: > > > > TTM: Restrict direct reclaim to beneficial_order. Larger allocations > > use __GFP_NORETRY to fail quickly rather than triggering reclaim. > > NACK. > > As I have said to the people trying to hack around direct reclaim > for high order allocations being costly for the page cache, fix the > problem with direct reclaim. (e.g. > https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/) > I read your response. Maybe it isn't clear what is going on here. At beneficial_order: gfp == __GFP_RECLAIM | __GFP_NORETRY At order zero: gfp == __GFP_RECLAIM This is roughly the existing behavior; the exact changes are here [1]. [1] https://patchwork.freedesktop.org/patch/722247/?series=165329&rev=3 If this is truly a NACK, then we can rethink it—likely by disabling reclaim at higher orders—but that has its own downsides for DRM and GPUs. Ideally, you want purgeable BOs to be evicted when a higher-order allocation fails; you really don’t want to end up in an insane kswap loop. > We should not be hacking around a problem in the mm infrastructure > by changing allocation context flags every high order allocation > call site that needs high order allocations. Understand and fix the > infrastructure problem once and for all. 
> Well, I agree that we should aim to fix this in core MM, but as the saying goes, Rome wasn’t built in a day. The fact is that these GFP flags do exist, and suddenly drawing a line and declaring them no longer valid feels a bit unfair. I’ll also note that Intel—and I personally—have an interest in fixing shrinking, so you can expect follow-up work here. > > Xe: Introduce a heuristic in the shrinker to avoid eviction when > > running under kswapd and the system appears memory-rich but > > fragmented. > > NACK on architectural grounds. > > Custom heuristics in individual shrinkers to decide whether the > should do what the mm subsystem has asked them to do has -always- > been a mistake to allow. The mm subsystem makes the decision on how I’m not going to disagree with using custom heuristics in individual shrinkers, but I’d wager that most shrinkers sadly already implement custom heuristics. > much cache shrinkage needs to occur, the shrinkers just do what they > are told to do. > > If we have a problem where a workload causes excessive shrinker > reclaim, then we need to address the problem in the infrastructure > because excessive reclaim affects the performance of -all- > subsystems with shrinkable caches, not just the TTM subsystem. > Yes, I agree, and I’ve thought about the implications of simply having TTM back off when a higher-order allocation fails, even when we actually have enough memory, and how that would affect everyone. This series at least fixes the “well, there goes my GUI” problem. I do have another patch locally that prevents TTM from accidentally fragmenting memory and triggering the kswap loop, but under enough pressure I can still get the GUI to lock up for periods of time. With this series, however, I can’t reproduce that issue. > As it is, I can't review what you've actually implemented because > you only cc'd me on a single patch in the series. 
In future, please > cc me on the whole patchset because shrinkers need to work as a > coherent whole, not just in isolation.... > Sorry about this - Andrew just said the same thing. Here is the PW link [2]. Or: b4 mbox 20260430191809.2142544-1-matthew.brost@intel.com [2] https://patchwork.freedesktop.org/series/165329/ If you have any ideas on how to fix this in the core, let’s discuss. I have a bunch of ideas in my head, but core MM isn’t my native domain. Matt > -Dave. > -- > Dave Chinner > dgc@kernel.org ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-01 22:33 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-04-30 19:18 [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost 2026-04-30 19:18 ` [PATCH v4 1/6] mm: Wire up order in shrink_control Matthew Brost 2026-04-30 19:18 ` [PATCH v4 2/6] mm: Introduce zone_maybe_fragmented_in_shrinker() Matthew Brost 2026-05-01 0:50 ` Santa, Carlos 2026-05-01 19:08 ` [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Kenneth Crudup 2026-05-01 20:00 ` Matthew Brost 2026-05-01 20:05 ` Kenneth Crudup 2026-05-01 21:10 ` Matthew Brost 2026-05-01 22:33 ` Matthew Brost 2026-04-30 23:01 ` [PATCH " Andrew Morton 2026-05-01 6:28 ` Matthew Brost 2026-05-01 12:51 ` Andrew Morton 2026-05-01 1:42 ` Dave Chinner 2026-05-01 7:09 ` Matthew Brost