* [PATCH v2 0/5] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation
@ 2026-04-23 5:56 Matthew Brost
2026-04-23 5:56 ` [PATCH v2 1/5] mm: Introduce zone_appears_fragmented() Matthew Brost
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 5:56 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Tvrtko Ursulin, Thomas Hellström, Carlos Santa,
Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Daniel Colascione, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:
kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.
This issue was first reported in [1] and independently observed
internally and by Google.
A simple reproducer is:
- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish
Under this workload, ftrace shows a continuous loop of:
xe_shrinker_scan (kswapd)
xe_vma_rebind_exec
Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).
At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:
Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.
This series addresses the issue in two ways:
- TTM: Restrict direct reclaim to beneficial_order. Larger allocations
  use __GFP_NORETRY to fail quickly rather than triggering reclaim.
- Xe: Introduce a heuristic in the shrinker to avoid eviction when
  running under kswapd and the system appears memory-rich but
  fragmented.
With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.
Buddyinfo after applying this series shows restored higher-order
availability:
Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1
Matt
v2:
- Layer with core MM / TTM helpers (Thomas)
[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
CC: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Matthew Brost (5):
mm: Introduce zone_appears_fragmented()
drm/ttm: Issue direct reclaim at beneficial_order
drm/ttm: Introduce ttm_bo_shrink_kswap_fragmented()
drm/xe: Set TTM device beneficial_order to 9 (2M)
drm/xe: Avoid shrinker reclaim from kswapd under fragmentation
drivers/gpu/drm/ttm/ttm_bo_util.c | 34 +++++++++++++++++++++++++++++++
drivers/gpu/drm/ttm/ttm_pool.c | 4 ++--
drivers/gpu/drm/xe/xe_device.c | 3 ++-
drivers/gpu/drm/xe/xe_shrinker.c | 3 +++
include/drm/ttm/ttm_bo.h | 2 ++
include/linux/vmstat.h | 13 ++++++++++++
6 files changed, 56 insertions(+), 3 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 5:56 [PATCH v2 0/5] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
@ 2026-04-23 5:56 ` Matthew Brost
2026-04-23 6:04 ` Balbir Singh
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 5:56 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
Introduce zone_appears_fragmented() as a lightweight helper to allow
subsystems to make coarse decisions about reclaim behavior in the
presence of likely fragmentation.
The helper implements a simple heuristic: if the number of free pages
in a zone exceeds twice the high watermark, the zone is considered to
have ample free memory and allocation failures are more likely due to
fragmentation than overall memory pressure.
This is intentionally imprecise and is not meant to replace the core
MM compaction or fragmentation accounting logic. Instead, it provides
a cheap signal for callers (e.g., shrinkers) that wish to avoid
overly aggressive reclaim when sufficient free memory exists but
high-order allocations may still fail.
No functional changes; this is a preparatory helper for future users.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
include/linux/vmstat.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..568d9f4f1a1f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
return vmstat_text[item];
}
+static inline bool zone_appears_fragmented(struct zone *zone)
+{
+ /*
+ * Simple heuristic: if the number of free pages is more than twice the
+ * high watermark, this strongly suggests that the zone is heavily
+ * fragmented when called from a shrinker.
+ */
+ if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_NUMA
static inline const char *numa_stat_name(enum numa_stat_item item)
{
--
2.34.1
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 5:56 ` [PATCH v2 1/5] mm: Introduce zone_appears_fragmented() Matthew Brost
@ 2026-04-23 6:04 ` Balbir Singh
2026-04-23 6:16 ` Matthew Brost
2026-04-23 10:27 ` David Hildenbrand (Arm)
2026-04-28 9:51 ` Andi Shyti
2 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2026-04-23 6:04 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: Thomas Hellström, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel
On 4/23/26 15:56, Matthew Brost wrote:
> Introduce zone_appears_fragmented() as a lightweight helper to allow
> subsystems to make coarse decisions about reclaim behavior in the
> presence of likely fragmentation.
>
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
>
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
>
> No functional changes; this is a preparatory helper for future users.
>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> include/linux/vmstat.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..568d9f4f1a1f 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> return vmstat_text[item];
> }
>
> +static inline bool zone_appears_fragmented(struct zone *zone)
> +{
> + /*
> + * Simple heuristic: if the number of free pages is more than twice the
> + * high watermark, this strongly suggests that the zone is heavily
> + * fragmented when called from a shrinker.
> + */
> + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> + return true;
> +
> + return false;
> +}
> +
> #ifdef CONFIG_NUMA
> static inline const char *numa_stat_name(enum numa_stat_item item)
> {
Without any usage/users, this is hard to review. I don't understand the heuristic
or its logic as applied to fragmentation either.
Balbir
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 6:04 ` Balbir Singh
@ 2026-04-23 6:16 ` Matthew Brost
2026-04-23 6:27 ` Matthew Brost
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 6:16 UTC (permalink / raw)
To: Balbir Singh
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Thu, Apr 23, 2026 at 04:04:32PM +1000, Balbir Singh wrote:
> On 4/23/26 15:56, Matthew Brost wrote:
> > Introduce zone_appears_fragmented() as a lightweight helper to allow
> > subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation.
> >
> > The helper implements a simple heuristic: if the number of free pages
> > in a zone exceeds twice the high watermark, the zone is considered to
> > have ample free memory and allocation failures are more likely due to
> > fragmentation than overall memory pressure.
> >
> > This is intentionally imprecise and is not meant to replace the core
> > MM compaction or fragmentation accounting logic. Instead, it provides
> > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > overly aggressive reclaim when sufficient free memory exists but
> > high-order allocations may still fail.
> >
> > No functional changes; this is a preparatory helper for future users.
> >
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@kernel.org>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > include/linux/vmstat.h | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 3c9c266cf782..568d9f4f1a1f 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> > return vmstat_text[item];
> > }
> >
> > +static inline bool zone_appears_fragmented(struct zone *zone)
> > +{
> > + /*
> > + * Simple heuristic: if the number of free pages is more than twice the
> > + * high watermark, this strongly suggests that the zone is heavily
> > + * fragmented when called from a shrinker.
> > + */
> > + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > #ifdef CONFIG_NUMA
> > static inline const char *numa_stat_name(enum numa_stat_item item)
> > {
>
>
> Without any usage/users, this is hard to review. I don't understand the heuristic
> or its logic as applied to fragmentation either.
>
Sorry—it’s always confusing who to CC on cross-subsystem series. Last
time this occurred, we agreed to CC everyone listed in the cover letter,
which I did. Anyway, let me provide the Patchwork links...
Cover letter: https://patchwork.freedesktop.org/series/165329/
TTM patch which uses this: https://patchwork.freedesktop.org/patch/720036/?series=165329&rev=1
Xe side which uses the TTM helper: https://patchwork.freedesktop.org/patch/720031/?series=165329&rev=1
Matt
> Balbir
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 6:16 ` Matthew Brost
@ 2026-04-23 6:27 ` Matthew Brost
0 siblings, 0 replies; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 6:27 UTC (permalink / raw)
To: Balbir Singh
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Wed, Apr 22, 2026 at 11:16:37PM -0700, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 04:04:32PM +1000, Balbir Singh wrote:
> > On 4/23/26 15:56, Matthew Brost wrote:
> > > Introduce zone_appears_fragmented() as a lightweight helper to allow
> > > subsystems to make coarse decisions about reclaim behavior in the
> > > presence of likely fragmentation.
> > >
> > > The helper implements a simple heuristic: if the number of free pages
> > > in a zone exceeds twice the high watermark, the zone is considered to
> > > have ample free memory and allocation failures are more likely due to
> > > fragmentation than overall memory pressure.
> > >
> > > This is intentionally imprecise and is not meant to replace the core
> > > MM compaction or fragmentation accounting logic. Instead, it provides
> > > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > > overly aggressive reclaim when sufficient free memory exists but
> > > high-order allocations may still fail.
> > >
> > > No functional changes; this is a preparatory helper for future users.
> > >
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@kernel.org>
> > > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Vlastimil Babka <vbabka@kernel.org>
> > > Cc: Mike Rapoport <rppt@kernel.org>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > include/linux/vmstat.h | 13 +++++++++++++
> > > 1 file changed, 13 insertions(+)
> > >
> > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> > > return vmstat_text[item];
> > > }
> > >
> > > +static inline bool zone_appears_fragmented(struct zone *zone)
> > > +{
> > > + /*
> > > + * Simple heuristic: if the number of free pages is more than twice the
> > > + * high watermark, this strongly suggests that the zone is heavily
> > > + * fragmented when called from a shrinker.
> > > + */
> > > + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +
> > > #ifdef CONFIG_NUMA
> > > static inline const char *numa_stat_name(enum numa_stat_item item)
> > > {
> >
> >
> > Without any usage/users, this is hard to review. I don't understand the heuristic
> > or its logic as applied to fragmentation either.
> >
>
> Sorry—it’s always confusing who to CC on cross-subsystem series. Last
> time this occurred, we agreed to CC everyone listed in the cover letter,
> which I did. Anyway, let me provide the Patchwork links...
>
> Cover letter: https://patchwork.freedesktop.org/series/165329/
> TTM patch which uses this: https://patchwork.freedesktop.org/patch/720036/?series=165329&rev=1
> Xe side which uses the TTM helper: https://patchwork.freedesktop.org/patch/720031/?series=165329&rev=1
>
Also, if you want to grab the whole series locally, here is what I do
when I'm missed on a Cc:
b4 mbox <msg-id>
mutt -f <msg-id>.mbx
So here, with the msg-id from the cover letter:
b4 mbox 20260423055656.1696379-1-matthew.brost@intel.com
mutt -f ./20260423055656.1696379-1-matthew.brost@intel.com.mbx
Matt
> Matt
>
> > Balbir
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 5:56 ` [PATCH v2 1/5] mm: Introduce zone_appears_fragmented() Matthew Brost
2026-04-23 6:04 ` Balbir Singh
@ 2026-04-23 10:27 ` David Hildenbrand (Arm)
2026-04-23 11:27 ` Thomas Hellström
2026-04-28 9:51 ` Andi Shyti
2 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-23 10:27 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: Thomas Hellström, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner, Michal Hocko
On 4/23/26 07:56, Matthew Brost wrote:
> Introduce zone_appears_fragmented() as a lightweight helper to allow
> subsystems to make coarse decisions about reclaim behavior in the
> presence of likely fragmentation.
>
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
>
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
>
> No functional changes; this is a preparatory helper for future users.
>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> include/linux/vmstat.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..568d9f4f1a1f 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> return vmstat_text[item];
> }
>
> +static inline bool zone_appears_fragmented(struct zone *zone)
> +{
"zone_likely_fragmented" or "zone_maybe_fragmented" might be clearer, depending
on the actual semantics.
> + /*
> + * Simple heuristic: if the number of free pages is more than twice the
> + * high watermark, this strongly suggests that the zone is heavily
> + * fragmented when called from a shrinker.
> + */
I'll cc some more people. But the "when called from a shrinker" bit is
concerning. Are there additional semantics that should be expressed in the
function name, for example?
Something that implies that this function only gives you a reasonable answer in
a certain context.
--
Cheers,
David
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 10:27 ` David Hildenbrand (Arm)
@ 2026-04-23 11:27 ` Thomas Hellström
2026-04-23 19:08 ` Matthew Brost
0 siblings, 1 reply; 20+ messages in thread
From: Thomas Hellström @ 2026-04-23 11:27 UTC (permalink / raw)
To: David Hildenbrand (Arm), Matthew Brost, intel-xe, dri-devel
Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm,
linux-kernel, Johannes Weiner
On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> On 4/23/26 07:56, Matthew Brost wrote:
> > Introduce zone_appears_fragmented() as a lightweight helper to
> > allow
> > subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation.
> >
> > The helper implements a simple heuristic: if the number of free
> > pages
> > in a zone exceeds twice the high watermark, the zone is considered
> > to
> > have ample free memory and allocation failures are more likely due
> > to
> > fragmentation than overall memory pressure.
> >
> > This is intentionally imprecise and is not meant to replace the
> > core
> > MM compaction or fragmentation accounting logic. Instead, it
> > provides
> > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > overly aggressive reclaim when sufficient free memory exists but
> > high-order allocations may still fail.
> >
> > No functional changes; this is a preparatory helper for future
> > users.
> >
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@kernel.org>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > include/linux/vmstat.h | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 3c9c266cf782..568d9f4f1a1f 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum
> > zone_stat_item item)
> > return vmstat_text[item];
> > }
> >
> > +static inline bool zone_appears_fragmented(struct zone *zone)
> > +{
>
> "zone_likely_fragmented" or "zone_maybe_fragmented" might be clearer,
> depending
> on the actual semantics.
>
> > + /*
> > + * Simple heuristic: if the number of free pages is more
> > than twice the
> > + * high watermark, this strongly suggests that the zone is
> > heavily
> > + * fragmented when called from a shrinker.
> > + */
>
> I'll cc some more people. But the "when called from a shrinker" bit
> is
> concerning. Are there additional semantics that should be expressed
> in the
> function name, for example?
>
> Something that implies that this function only gives you a reasonable
> answer in
> a certain context.
I think that test would not be relevant for cgroup-aware shrinking.
What about trying to pass something in the struct shrink_control? Like
if we pass the struct scan_control's order field also in struct
shrink_control, really expensive shrinkers could duck reclaim attempts
from higher-order allocations that may fail anyway:
if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
(sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
!(sc->gfp_mask & __GFP_NOFAIL))
return SHRINK_STOP;
Possibly exposed as an inline helper in the shrinker interface?
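A minimal sketch of what such an inline helper might look like. This is
not buildable against the kernel tree as-is: the flag values are
illustrative stand-ins, and the `order` field in `struct shrink_control`
is the hypothetical addition being discussed, not current kernel API:

```c
#include <stdbool.h>

/* Stand-ins for kernel definitions; the values are illustrative only. */
typedef unsigned int gfp_t;
#define __GFP_RETRY_MAYFAIL     ((gfp_t)0x4000u)
#define __GFP_NOFAIL            ((gfp_t)0x8000u)
#define __GFP_NORETRY           ((gfp_t)0x10000u)
#define PAGE_ALLOC_COSTLY_ORDER 3

struct shrink_control {
    gfp_t gfp_mask;
    int order;      /* hypothetical: copied from struct scan_control */
};

/*
 * True when the triggering allocation is a costly-order request that is
 * already allowed to fail, in which case a really expensive shrinker
 * could return SHRINK_STOP instead of evicting.
 */
static inline bool shrink_may_skip_costly(const struct shrink_control *sc)
{
    return sc->order > PAGE_ALLOC_COSTLY_ORDER &&
           (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
           !(sc->gfp_mask & __GFP_NOFAIL);
}
```

The helper only fires when the caller opted into fail-fast semantics,
so plain GFP_KERNEL reclaim from kswapd would still shrink as before.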
/Thomas
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 11:27 ` Thomas Hellström
@ 2026-04-23 19:08 ` Matthew Brost
2026-04-23 22:21 ` Matthew Brost
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 19:08 UTC (permalink / raw)
To: Thomas Hellström
Cc: David Hildenbrand (Arm), intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Thu, Apr 23, 2026 at 01:27:11PM +0200, Thomas Hellström wrote:
> On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> > On 4/23/26 07:56, Matthew Brost wrote:
> > > Introduce zone_appears_fragmented() as a lightweight helper to
> > > allow
> > > subsystems to make coarse decisions about reclaim behavior in the
> > > presence of likely fragmentation.
> > >
> > > The helper implements a simple heuristic: if the number of free
> > > pages
> > > in a zone exceeds twice the high watermark, the zone is considered
> > > to
> > > have ample free memory and allocation failures are more likely due
> > > to
> > > fragmentation than overall memory pressure.
> > >
> > > This is intentionally imprecise and is not meant to replace the
> > > core
> > > MM compaction or fragmentation accounting logic. Instead, it
> > > provides
> > > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > > overly aggressive reclaim when sufficient free memory exists but
> > > high-order allocations may still fail.
> > >
> > > No functional changes; this is a preparatory helper for future
> > > users.
> > >
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@kernel.org>
> > > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Vlastimil Babka <vbabka@kernel.org>
> > > Cc: Mike Rapoport <rppt@kernel.org>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > include/linux/vmstat.h | 13 +++++++++++++
> > > 1 file changed, 13 insertions(+)
> > >
> > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum
> > > zone_stat_item item)
> > > return vmstat_text[item];
> > > }
> > >
> > > +static inline bool zone_appears_fragmented(struct zone *zone)
> > > +{
> >
> > "zone_likely_fragmented" or "zone_maybe_fragmented" might be clearer,
> > depending
> > on the actual semantics.
> >
> > > + /*
> > > + * Simple heuristic: if the number of free pages is more
> > > than twice the
> > > + * high watermark, this strongly suggests that the zone is
> > > heavily
> > > + * fragmented when called from a shrinker.
> > > + */
> >
> > I'll cc some more people. But the "when called from a shrinker" bit
> > is
> > concerning. Are there additional semantics that should be expressed
> > in the
> > function name, for example?
> >
> > Something that implies that this function only gives you a reasonable
> > answer in
> > a certain context.
>
> I think that test would not be relevant for cgroup-aware shrinking.
>
> What about trying to pass something in the struct shrink_control? Like
> if we pass the struct scan_control's order field also in struct
If the order were included in shrink_control, there is about a 95%
chance that this change would allow TTM / Xe to break the problematic
kswapd feedback loop. This may also better express the intent of the
problem we are trying to fix here.
For reference, the cover letter [1] details the problem.
Any guidance from the core MM folks would be appreciated—would adding
the order to shrink_control be an acceptable solution?
Matt
[1] https://patchwork.freedesktop.org/series/165330/
> shrink_control, really expensive shrinkers could duck reclaim attempts
> from higher-order allocations that may fail anyway:
>
> if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
> !(sc->gfp_mask & __GFP_NOFAIL))
> return SHRINK_STOP;
>
> Possibly exposed as an inline helper in the shrinker interface?
>
> /Thomas
>
>
>
>
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 19:08 ` Matthew Brost
@ 2026-04-23 22:21 ` Matthew Brost
2026-04-24 7:05 ` Thomas Hellström
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-23 22:21 UTC (permalink / raw)
To: Thomas Hellström
Cc: David Hildenbrand (Arm), intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 01:27:11PM +0200, Thomas Hellström wrote:
> > On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> > > On 4/23/26 07:56, Matthew Brost wrote:
> > > > Introduce zone_appears_fragmented() as a lightweight helper to
> > > > allow
> > > > subsystems to make coarse decisions about reclaim behavior in the
> > > > presence of likely fragmentation.
> > > >
> > > > The helper implements a simple heuristic: if the number of free
> > > > pages
> > > > in a zone exceeds twice the high watermark, the zone is considered
> > > > to
> > > > have ample free memory and allocation failures are more likely due
> > > > to
> > > > fragmentation than overall memory pressure.
> > > >
> > > > This is intentionally imprecise and is not meant to replace the
> > > > core
> > > > MM compaction or fragmentation accounting logic. Instead, it
> > > > provides
> > > > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > > > overly aggressive reclaim when sufficient free memory exists but
> > > > high-order allocations may still fail.
> > > >
> > > > No functional changes; this is a preparatory helper for future
> > > > users.
> > > >
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: David Hildenbrand <david@kernel.org>
> > > > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > > Cc: Vlastimil Babka <vbabka@kernel.org>
> > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > > include/linux/vmstat.h | 13 +++++++++++++
> > > > 1 file changed, 13 insertions(+)
> > > >
> > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > > --- a/include/linux/vmstat.h
> > > > +++ b/include/linux/vmstat.h
> > > > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum
> > > > zone_stat_item item)
> > > > return vmstat_text[item];
> > > > }
> > > >
> > > > +static inline bool zone_appears_fragmented(struct zone *zone)
> > > > +{
> > >
> > > "zone_likely_fragmented" or "zone_maybe_fragmented" might be clearer,
> > > depending
> > > on the actual semantics.
> > >
> > > > + /*
> > > > + * Simple heuristic: if the number of free pages is more
> > > > than twice the
> > > > + * high watermark, this strongly suggests that the zone is
> > > > heavily
> > > > + * fragmented when called from a shrinker.
> > > > + */
> > >
> > > I'll cc some more people. But the "when called from a shrinker" bit
> > > is
> > > concerning. Are there additional semantics that should be expressed
> > > in the
> > > function name, for example?
> > >
> > > Something that implies that this function only gives you a reasonable
> > > answer in
> > > a certain context.
> >
> > I think that test would not be relevant for cgroup-aware shrinking.
> >
> > What about trying to pass something in the struct shrink_control? Like
> > if we pass the struct scan_control's order field also in struct
>
> If the order were included in shrink_control, there is about a 95%
> chance that this change would allow TTM / Xe to break the problematic
> kswapd feedback loop. This may also better express the intent of the
> problem we are trying to fix here.
>
> For reference, the cover letter [1] details the problem.
>
> Any guidance from the core MM folks would be appreciated—would adding
> the order to shrink_control be an acceptable solution?
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/165330/
>
> > shrink_control, really expensive shrinkers could duck reclaim attempts
> > from higher-order allocations that may fail anyway:
> >
> > if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> > (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
> > !(sc->gfp_mask & __GFP_NOFAIL))
It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or __GFP_NOFAIL
make it from the caller into the sc->gfp_mask flags once we are in the
kswapd loop...
[ 394.049058] xe_shrinker_scan: no skip order=9, gfp=0x0000000000000cc0
[ 394.049061] CPU: 2 UID: 0 PID: 110 Comm: kswapd0 Not tainted 7.0.0-xe+ #355 PREEMPT(full)
[ 394.049062] Hardware name: Intel Corporation Panther Lake Client Platform/PTL-UH LP5 T3 RVP1, BIOS PTLPFWI1.R00.3332.D05.2509011438 09/01/2025
[ 394.049063] Call Trace:
[ 394.049065] <TASK>
[ 394.049066] dump_stack_lvl+0x55/0x70
[ 394.049073] xe_shrinker_scan+0x274/0x280 [xe]
[ 394.049181] do_shrink_slab+0x132/0x360
[ 394.049184] shrink_slab+0xf0/0x3e0
[ 394.049186] shrink_node+0x2bd/0x800
[ 394.049188] balance_pgdat+0x323/0x760
[ 394.049189] kswapd+0x1c3/0x340
[ 394.049190] ? __pfx_autoremove_wake_function+0x10/0x10
[ 394.049193] ? __pfx_kswapd+0x10/0x10
[ 394.049194] kthread+0xdf/0x120
[ 394.049196] ? __pfx_kthread+0x10/0x10
[ 394.049197] ret_from_fork+0x1d0/0x220
[ 394.049200] ? __pfx_kthread+0x10/0x10
[ 394.049200] ret_from_fork_asm+0x1a/0x30
[ 394.049202] </TASK>
Will look into whether this is fixable, but again any core MM guidance would
be helpful.
Matt
> > return SHRINK_STOP;
> >
> > Possibly exposed as an inline helper in the shrinker interface?
> >
> > /Thomas
> >
> >
> >
> >
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 22:21 ` Matthew Brost
@ 2026-04-24 7:05 ` Thomas Hellström
2026-04-24 7:26 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 20+ messages in thread
From: Thomas Hellström @ 2026-04-24 7:05 UTC (permalink / raw)
To: Matthew Brost
Cc: David Hildenbrand (Arm), intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> > On Thu, Apr 23, 2026 at 01:27:11PM +0200, Thomas Hellström wrote:
> > > On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> > > > On 4/23/26 07:56, Matthew Brost wrote:
> > > > > Introduce zone_appears_fragmented() as a lightweight helper
> > > > > to
> > > > > allow
> > > > > subsystems to make coarse decisions about reclaim behavior in
> > > > > the
> > > > > presence of likely fragmentation.
> > > > >
> > > > > The helper implements a simple heuristic: if the number of
> > > > > free
> > > > > pages
> > > > > in a zone exceeds twice the high watermark, the zone is
> > > > > considered
> > > > > to
> > > > > have ample free memory and allocation failures are more
> > > > > likely due
> > > > > to
> > > > > fragmentation than overall memory pressure.
> > > > >
> > > > > This is intentionally imprecise and is not meant to replace
> > > > > the
> > > > > core
> > > > > MM compaction or fragmentation accounting logic. Instead, it
> > > > > provides
> > > > > a cheap signal for callers (e.g., shrinkers) that wish to
> > > > > avoid
> > > > > overly aggressive reclaim when sufficient free memory exists
> > > > > but
> > > > > high-order allocations may still fail.
> > > > >
> > > > > No functional changes; this is a preparatory helper for
> > > > > future
> > > > > users.
> > > > >
> > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > > > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > > > Cc: Vlastimil Babka <vbabka@kernel.org>
> > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > Cc: linux-mm@kvack.org
> > > > > Cc: linux-kernel@vger.kernel.org
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > > include/linux/vmstat.h | 13 +++++++++++++
> > > > > 1 file changed, 13 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > > > --- a/include/linux/vmstat.h
> > > > > +++ b/include/linux/vmstat.h
> > > > > @@ -483,6 +483,19 @@ static inline const char
> > > > > *zone_stat_name(enum
> > > > > zone_stat_item item)
> > > > > return vmstat_text[item];
> > > > > }
> > > > >
> > > > > +static inline bool zone_appears_fragmented(struct zone
> > > > > *zone)
> > > > > +{
> > > >
> > > > "zone_likely_fragmented" or "zone_maybe_fragmented" might be
> > > > clearer,
> > > > depending
> > > > on the actual semantics.
> > > >
> > > > > + /*
> > > > > + * Simple heuristic: if the number of free pages is
> > > > > more
> > > > > than twice the
> > > > > + * high watermark, this strongly suggests that the
> > > > > zone is
> > > > > heavily
> > > > > + * fragmented when called from a shrinker.
> > > > > + */
> > > >
> > > > I'll cc some more people. But the "when called from a shrinker"
> > > > bit
> > > > is
> > > > concerning. Are there additional semantics that should be
> > > > expressed
> > > > in the
> > > > function name, for example?
> > > >
> > > > Something that implies that this function only gives you a
> > > > reasonable
> > > > answer in
> > > > a certain context.
> > >
> > > I think that test would not be relevant for cgroup-aware
> > > shrinking.
> > >
> > > What about trying to pass something in the struct shrink_control?
> > > Like
> > > if we pass the struct scan_control's order field also in struct
> >
> > If the order were included in shrink_control, there is about a 95%
> > chance that this change would allow TTM / Xe to break the
> > problematic
> > kswapd feedback loop. This may also better express the intent of
> > the
> > problem we are trying to fix here.
> >
> > For reference, the cover letter [1] details the problem.
> >
> > Any guidance from the core MM folks would be appreciated—would
> > adding
> > the order to shrink_control be an acceptable solution?
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/series/165330/
> >
> > > shrink_control, really expensive shrinkers could duck reclaim
> > > attempts
> > > from higher-order allocations that may fail anyway:
> > >
> > > if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> > > (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL))
> > > &&
> > > !(sc->gfp_mask & __GFP_NOFAIL))
>
> It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL
> make it to the sc->gfp_mask flags from the caller and get into kswapd
> loop...
Perhaps that's because they mostly (only?) make sense from direct
reclaim? Looks like the trace is from kswapd.
Another metric to weigh in is perhaps the scan_control::priority field.
From my understanding it is progressively decreased towards 0 with 0
indicating most urgent shrinking.
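To make that concrete, here is a rough user-space sketch of a priority-aware
bail-out; the shrink_control fields and the DEF_PRIORITY / 2 cutoff are
assumptions for illustration, not existing kernel behavior:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define DEF_PRIORITY 12	/* reclaim starts here and counts down to 0 */

/* Mock: both fields are hypothetical additions to struct shrink_control. */
struct shrink_control {
	unsigned int order;
	int priority;
};

/*
 * Defer expensive shrinking for costly orders unless reclaim has become
 * urgent (priority approaching 0). The DEF_PRIORITY / 2 cutoff is an
 * illustrative assumption, not established kernel policy.
 */
static bool shrinker_should_defer(const struct shrink_control *sc)
{
	return sc->order > PAGE_ALLOC_COSTLY_ORDER &&
	       sc->priority > DEF_PRIORITY / 2;
}
```

So an early, low-urgency pass at order 9 would be skipped, while the same
order with priority near 0 would still shrink.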
Thanks,
Thomas
>
> [ 394.049058] xe_shrinker_scan: no skip order=9, gfp=0x0000000000000cc0
> [ 394.049061] CPU: 2 UID: 0 PID: 110 Comm: kswapd0 Not tainted 7.0.0-xe+ #355 PREEMPT(full)
> [ 394.049062] Hardware name: Intel Corporation Panther Lake Client Platform/PTL-UH LP5 T3 RVP1, BIOS PTLPFWI1.R00.3332.D05.2509011438 09/01/2025
> [ 394.049063] Call Trace:
> [ 394.049065] <TASK>
> [ 394.049066] dump_stack_lvl+0x55/0x70
> [ 394.049073] xe_shrinker_scan+0x274/0x280 [xe]
> [ 394.049181] do_shrink_slab+0x132/0x360
> [ 394.049184] shrink_slab+0xf0/0x3e0
> [ 394.049186] shrink_node+0x2bd/0x800
> [ 394.049188] balance_pgdat+0x323/0x760
> [ 394.049189] kswapd+0x1c3/0x340
> [ 394.049190] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 394.049193] ? __pfx_kswapd+0x10/0x10
> [ 394.049194] kthread+0xdf/0x120
> [ 394.049196] ? __pfx_kthread+0x10/0x10
> [ 394.049197] ret_from_fork+0x1d0/0x220
> [ 394.049200] ? __pfx_kthread+0x10/0x10
> [ 394.049200] ret_from_fork_asm+0x1a/0x30
> [ 394.049202] </TASK>
>
> Will look into whether this is fixable, but again any core MM guidance
> would be helpful.
>
> Matt
>
> > > return SHRINK_STOP;
> > >
> > > Possibly exposed as an inline helper in the shrinker interface?
> > >
> > > /Thomas
> > >
> > >
> > >
> > >
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-24 7:05 ` Thomas Hellström
@ 2026-04-24 7:26 ` David Hildenbrand (Arm)
2026-04-30 2:47 ` Matthew Brost
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-24 7:26 UTC (permalink / raw)
To: Thomas Hellström, Matthew Brost
Cc: intel-xe, dri-devel, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On 4/24/26 09:05, Thomas Hellström wrote:
> On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
>> On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
>>>
>>> If the order were included in shrink_control, there is about a 95%
>>> certain that this change would allow TTM / Xe to break the
>>> problematic
>>> kswapd feedback loop. This may also better express the intent of
>>> the
>>> problem we are trying to fix here.
>>>
>>> For reference, the cover letter [1] details the problem.
>>>
>>> Any guidance from the core MM folks would be appreciated—would
>>> adding
>>> the order to shrink_control be an acceptable solution?
>>>
>>> Matt
>>>
>>> [1] https://patchwork.freedesktop.org/series/165330/
>>>
>>
>> It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL
>> make it to the sc->gfp_mask flags from the caller and get into kswapd
>> loop...
>
> Perhaps that's because they mostly (only?) make sense from direct
> reclaim? Looks like the trace is from kswapd.
kswapd obtains the desired order through pgdat->kswapd_order, as a hint from
allocation code (wakeup_kswapd). The order can easily be merged (just use the max).
We do have the gfp_flags there, but merging them from different wakeups is a bit
more tricky (and when to reset?).
Assume we have one urgent request for order-0 and one non-urgent
(noretry,nofail, ...) request for order-9, we'd have to figure out a way how to
represent that. Gets more complicated for more orders.
Of course, we could have some kind of array, and try to store some "priority"
per order. But I assume plumbing that into the rest of kswapd might not be that
easy.
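A toy sketch of that per-order bookkeeping; all names and the priority
encoding (lower = more urgent, -1 = nothing pending) are illustrative
assumptions, and none of the plumbing into kswapd is shown:

```c
#include <assert.h>

#define MAX_PAGE_ORDER 10

/* Most urgent priority seen per order since the last kswapd pass. */
static int kswapd_order_priority[MAX_PAGE_ORDER + 1];

static void order_hints_reset(void)
{
	for (int i = 0; i <= MAX_PAGE_ORDER; i++)
		kswapd_order_priority[i] = -1;	/* nothing pending */
}

/* Called on wakeup: merge a new request's urgency for its order,
 * keeping the most urgent (lowest) priority value. */
static void order_hint_merge(unsigned int order, int priority)
{
	if (kswapd_order_priority[order] < 0 ||
	    priority < kswapd_order_priority[order])
		kswapd_order_priority[order] = priority;
}
```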
--
Cheers,
David
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-23 5:56 ` [PATCH v2 1/5] mm: Introduce zone_appears_fragmented() Matthew Brost
2026-04-23 6:04 ` Balbir Singh
2026-04-23 10:27 ` David Hildenbrand (Arm)
@ 2026-04-28 9:51 ` Andi Shyti
2026-04-28 10:05 ` Andi Shyti
2026-04-30 2:37 ` Matthew Brost
2 siblings, 2 replies; 20+ messages in thread
From: Andi Shyti @ 2026-04-28 9:51 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
Hi Matt,
On Wed, Apr 22, 2026 at 10:56:52PM -0700, Matthew Brost wrote:
> Introduce zone_appears_fragmented() as a lightweight helper to allow
> subsystems to make coarse decisions about reclaim behavior in the
> presence of likely fragmentation.
>
> The helper implements a simple heuristic: if the number of free pages
> in a zone exceeds twice the high watermark, the zone is considered to
> have ample free memory and allocation failures are more likely due to
> fragmentation than overall memory pressure.
>
> This is intentionally imprecise and is not meant to replace the core
> MM compaction or fragmentation accounting logic. Instead, it provides
> a cheap signal for callers (e.g., shrinkers) that wish to avoid
> overly aggressive reclaim when sufficient free memory exists but
> high-order allocations may still fail.
>
> No functional changes; this is a preparatory helper for future users.
>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> include/linux/vmstat.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 3c9c266cf782..568d9f4f1a1f 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> return vmstat_text[item];
> }
>
> +static inline bool zone_appears_fragmented(struct zone *zone)
this is a bit of a strong statement and the function name might
be misleading. You received some suggestions from David and I
would rename this function to something like
"zone_maybe_fragmented()".
> +{
> + /*
> + * Simple heuristic: if the number of free pages is more than twice the
> + * high watermark, this strongly suggests that the zone is heavily
> + * fragmented when called from a shrinker.
> + */
The commit log explains it a bit better. The heuristic statement
here is too strong and it still sounds stronger than it should.
Andi
> + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> + return true;
> +
> + return false;
> +}
> +
> #ifdef CONFIG_NUMA
> static inline const char *numa_stat_name(enum numa_stat_item item)
> {
> --
> 2.34.1
>
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-28 9:51 ` Andi Shyti
@ 2026-04-28 10:05 ` Andi Shyti
2026-04-30 2:34 ` Matthew Brost
2026-04-30 2:37 ` Matthew Brost
1 sibling, 1 reply; 20+ messages in thread
From: Andi Shyti @ 2026-04-28 10:05 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
> > + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> > + return true;
> > +
> > + return false;
P.S.: If you want, this can also be written:
return (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2);
> > +}
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-28 10:05 ` Andi Shyti
@ 2026-04-30 2:34 ` Matthew Brost
0 siblings, 0 replies; 20+ messages in thread
From: Matthew Brost @ 2026-04-30 2:34 UTC (permalink / raw)
To: Andi Shyti
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Tue, Apr 28, 2026 at 12:05:13PM +0200, Andi Shyti wrote:
> > > + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> > > + return true;
> > > +
> > > + return false;
>
> P.S.: If you want, this can also be written:
>
> return (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2);
>
Indeed.
Matt
> > > +}
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-28 9:51 ` Andi Shyti
2026-04-28 10:05 ` Andi Shyti
@ 2026-04-30 2:37 ` Matthew Brost
1 sibling, 0 replies; 20+ messages in thread
From: Matthew Brost @ 2026-04-30 2:37 UTC (permalink / raw)
To: Andi Shyti
Cc: intel-xe, dri-devel, Thomas Hellström, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel
On Tue, Apr 28, 2026 at 11:51:34AM +0200, Andi Shyti wrote:
> Hi Matt,
>
> On Wed, Apr 22, 2026 at 10:56:52PM -0700, Matthew Brost wrote:
> > Introduce zone_appears_fragmented() as a lightweight helper to allow
> > subsystems to make coarse decisions about reclaim behavior in the
> > presence of likely fragmentation.
> >
> > The helper implements a simple heuristic: if the number of free pages
> > in a zone exceeds twice the high watermark, the zone is considered to
> > have ample free memory and allocation failures are more likely due to
> > fragmentation than overall memory pressure.
> >
> > This is intentionally imprecise and is not meant to replace the core
> > MM compaction or fragmentation accounting logic. Instead, it provides
> > a cheap signal for callers (e.g., shrinkers) that wish to avoid
> > overly aggressive reclaim when sufficient free memory exists but
> > high-order allocations may still fail.
> >
> > No functional changes; this is a preparatory helper for future users.
> >
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@kernel.org>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > include/linux/vmstat.h | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 3c9c266cf782..568d9f4f1a1f 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> > return vmstat_text[item];
> > }
> >
> > +static inline bool zone_appears_fragmented(struct zone *zone)
>
> this is a bit of a strong statement and the function name might
> be misleading. You received some suggestions from David and I
> would rename this function to something like
> "zone_maybe_fragmented()".
>
+1
> > +{
> > + /*
> > + * Simple heuristic: if the number of free pages is more than twice the
> > + * high watermark, this strongly suggests that the zone is heavily
> > + * fragmented when called from a shrinker.
> > + */
>
> The commit log explains it a bit better. The heuristic statement
> here is too strong and it still sounds stronger than it should.
>
Can reword.
Matt
> Andi
>
> > + if (zone_page_state(zone, NR_FREE_PAGES) > high_wmark_pages(zone) * 2)
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > #ifdef CONFIG_NUMA
> > static inline const char *numa_stat_name(enum numa_stat_item item)
> > {
> > --
> > 2.34.1
> >
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-24 7:26 ` David Hildenbrand (Arm)
@ 2026-04-30 2:47 ` Matthew Brost
2026-04-30 7:47 ` Thomas Hellström
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-30 2:47 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Thomas Hellström, intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Fri, Apr 24, 2026 at 09:26:18AM +0200, David Hildenbrand (Arm) wrote:
> On 4/24/26 09:05, Thomas Hellström wrote:
> > On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> >> On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> >>>
> >>> If the order were included in shrink_control, there is about a 95%
> >>> chance that this change would allow TTM / Xe to break the
> >>> problematic
> >>> kswapd feedback loop. This may also better express the intent of
> >>> the
> >>> problem we are trying to fix here.
> >>>
> >>> For reference, the cover letter [1] details the problem.
> >>>
> >>> Any guidance from the core MM folks would be appreciated—would
> >>> adding
> >>> the order to shrink_control be an acceptable solution?
> >>>
> >>> Matt
> >>>
> >>> [1] https://patchwork.freedesktop.org/series/165330/
> >>>
> >>
> >> It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL
> >> make it to the sc->gfp_mask flags from the caller and get into kswapd
> >> loop...
> >
> > Perhaps that's because they mostly (only?) make sense from direct
> > reclaim? Looks like the trace is from kswapd.
>
> kswapd obtains the desired order through pgdat->kswapd_order, as a hint from
> allocation code (wakeup_kswapd). The order can be easily merged (just use the max)
>
Yes.
My current thinking is wire the order into shrink_control as that is
quite straight forward + only call this helper + short circuit shrinker
on higher orders.
> We do have the gfp_flags there, but merging them from different wakeups is a bit
> more tricky (and when to reset?).
>
> Assume we have one urgent request for order-0 and one non-urgent
> (noretry,nofail, ...) request for order-9, we'd have to figure out a way how to
> represent that. Gets more complicated for more orders.
>
> Of course, we could have some kind of array, and try to store some "priority"
> per order. But I assume plumbing that into the rest of kswapd might not be that
> easy.
Yes, this seems non-trivial. I was also on a call with Google today
discussing what Android (client Linux) would like from shrinking, and my
initial feeling is that we will need to do some surgery to the shrinker
core and GPU shrinkers to make all of this work well over the next year
or so.
So again, I think starting with wiring order into shrink_control and
this helper is a good place to start, as it fixes an immediate issue.
Let me know if that seems like a reasonable direction.
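As a rough illustration of the direction, here is a user-space mock of the
short-circuit; the order argument stands in for a hypothetical field added
to shrink_control, and the heuristic mirrors this patch's
zone_appears_fragmented():

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Mock zone state; in the kernel this comes from zone_page_state()
 * and high_wmark_pages(). */
struct zone {
	unsigned long nr_free_pages;
	unsigned long high_wmark_pages;
};

/* Heuristic from this patch: ample free memory while high-order
 * allocations still fail suggests fragmentation, not memory pressure. */
static bool zone_appears_fragmented(const struct zone *zone)
{
	return zone->nr_free_pages > zone->high_wmark_pages * 2;
}

/* Hypothetical short-circuit once the order reaches the shrinker:
 * don't evict for costly orders that fragmentation will defeat anyway. */
static bool shrinker_should_skip(unsigned int order, const struct zone *zone)
{
	return order > PAGE_ALLOC_COSTLY_ORDER &&
	       zone_appears_fragmented(zone);
}
```

This would let the Xe shrinker return SHRINK_STOP for the order-9 kswapd
wakeups in the trace above instead of evicting active buffer objects.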
Matt
>
>
> --
> Cheers,
>
> David
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-30 2:47 ` Matthew Brost
@ 2026-04-30 7:47 ` Thomas Hellström
2026-04-30 16:34 ` Matthew Brost
2026-04-30 17:06 ` Vlastimil Babka (SUSE)
0 siblings, 2 replies; 20+ messages in thread
From: Thomas Hellström @ 2026-04-30 7:47 UTC (permalink / raw)
To: Matthew Brost, David Hildenbrand (Arm)
Cc: intel-xe, dri-devel, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Wed, 2026-04-29 at 19:47 -0700, Matthew Brost wrote:
> On Fri, Apr 24, 2026 at 09:26:18AM +0200, David Hildenbrand (Arm)
> wrote:
> > On 4/24/26 09:05, Thomas Hellström wrote:
> > > On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> > > > On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> > > > >
> > > > > If the order were included in shrink_control, there is about
> > > > > a 95%
> > > > > chance that this change would allow TTM / Xe to break the
> > > > > problematic
> > > > > kswapd feedback loop. This may also better express the intent
> > > > > of
> > > > > the
> > > > > problem we are trying to fix here.
> > > > >
> > > > > For reference, the cover letter [1] details the problem.
> > > > >
> > > > > Any guidance from the core MM folks would be
> > > > > appreciated—would
> > > > > adding
> > > > > the order to shrink_control be an acceptable solution?
> > > > >
> > > > > Matt
> > > > >
> > > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > >
> > > >
> > > > It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL,
> > > > __GFP_NOFAIL
> > > > make it to the sc->gfp_mask flags from the caller and get into
> > > > kswapd
> > > > loop...
> > >
> > > Perhaps that's because they mostly (only?) make sense from direct
> > > reclaim? Looks like the trace is from kswapd.
> >
> > kswapd obtains the desired order through pgdat->kswapd_order, as a
> > hint from
> > allocation code (wakeup_kswapd). The order can be easily merged
> > (just use the max)
> >
>
> Yes.
>
> My current thinking is wire the order into shrink_control as that is
> quite straight forward + only call this helper + short circuit
> shrinker
> on higher orders.
>
> > We do have the gfp_flags there, but merging them from different
> > wakeups is a bit
> > more tricky (and when to reset?).
> >
> > Assume we have one urgent request for order-0 and one non-urgent
> > (noretry,nofail, ...) request for order-9, we'd have to figure out
> > a way how to
> > represent that. Gets more complicated for more orders.
> >
> > Of course, we could have some kind of array, and try to store some
> > "priority"
> > per order. But I assume plumbing that into the rest of kswapd might
> > not be that
> > easy.
>
> Yes, this seems non-trivial. I was also on a call with Google today
> discussing what Android (client Linux) would like from shrinking, and
> my
> initial feeling is that we will need to do some surgery to the
> shrinker
> core and GPU shrinkers to make all of this work well over the next
> year
> or so.
>
> So again, I think starting with wiring order into shrink_control and
> this helper is a good place to start, as it fixes an immediate issue.
>
> Let me know if that seems like a reasonable direction.
+1 for wiring order into shrink_control, and possibly also the priority
as mentioned in an earlier email.
However, for cgroup-aware shrinkers, the amount of free memory in a
zone might not be an indication of fragmentation-triggered reclaim at
all; it could be the result of the cgroup hitting its memory limits.
So I think if we can solve this with a combination of GFP flags,
plumbed-through order and plumbed-through priority, that would be
ideal.
Thanks,
Thomas
>
> Matt
>
> >
> >
> > --
> > Cheers,
> >
> > David
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-30 7:47 ` Thomas Hellström
@ 2026-04-30 16:34 ` Matthew Brost
2026-04-30 19:59 ` Thomas Hellström
2026-04-30 17:06 ` Vlastimil Babka (SUSE)
1 sibling, 1 reply; 20+ messages in thread
From: Matthew Brost @ 2026-04-30 16:34 UTC (permalink / raw)
To: Thomas Hellström
Cc: David Hildenbrand (Arm), intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Thu, Apr 30, 2026 at 09:47:37AM +0200, Thomas Hellström wrote:
> On Wed, 2026-04-29 at 19:47 -0700, Matthew Brost wrote:
> > On Fri, Apr 24, 2026 at 09:26:18AM +0200, David Hildenbrand (Arm)
> > wrote:
> > > On 4/24/26 09:05, Thomas Hellström wrote:
> > > > On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> > > > > On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> > > > > >
> > > > > > If the order were included in shrink_control, there is about
> > > > > > a 95%
> > > > > > chance that this change would allow TTM / Xe to break the
> > > > > > problematic
> > > > > > kswapd feedback loop. This may also better express the intent
> > > > > > of
> > > > > > the
> > > > > > problem we are trying to fix here.
> > > > > >
> > > > > > For reference, the cover letter [1] details the problem.
> > > > > >
> > > > > > Any guidance from the core MM folks would be
> > > > > > appreciated—would
> > > > > > adding
> > > > > > the order to shrink_control be an acceptable solution?
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > > >
> > > > >
> > > > > It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL,
> > > > > __GFP_NOFAIL
> > > > > make it to the sc->gfp_mask flags from the caller and get into
> > > > > kswapd
> > > > > loop...
> > > >
> > > > Perhaps that's because they mostly (only?) make sense from direct
> > > > reclaim? Looks like the trace is from kswapd.
> > >
> > > kswapd obtains the desired order through pgdat->kswapd_order, as a
> > > hint from
> > > allocation code (wakeup_kswapd). The order can be easily merged
> > > (just use the max)
> > >
> >
> > Yes.
> >
> > My current thinking is wire the order into shrink_control as that is
> > quite straight forward + only call this helper + short circuit
> > shrinker
> > on higher orders.
> >
> > > We do have the gfp_flags there, but merging them from different
> > > wakeups is a bit
> > > more tricky (and when to reset?).
> > >
> > > Assume we have one urgent request for order-0 and one non-urgent
> > > (noretry,nofail, ...) request for order-9, we'd have to figure out
> > > a way how to
> > > represent that. Gets more complicated for more orders.
> > >
> > > Of course, we could have some kind of array, and try to store some
> > > "priority"
> > > per order. But I assume plumbing that into the rest of kswapd might
> > > not be that
> > > easy.
> >
> > Yes, this seems non-trivial. I was also on a call with Google today
> > discussing what Android (client Linux) would like from shrinking, and
> > my
> > initial feeling is that we will need to do some surgery to the
> > shrinker
> > core and GPU shrinkers to make all of this work well over the next
> > year
> > or so.
> >
> > So again, I think starting with wiring order into shrink_control and
> > this helper is a good place to start, as it fixes an immediate issue.
> >
> > Let me know if that seems like a reasonable direction.
>
> +1 for wiring order into shrink_control, and possibly also the priority
> as mentioned in an earlier email.
>
Let me look at how the priority field is used as well.
> However, for cgroup-aware shrinkers, the amount of free memory in a
> zone might not be an indication of fragmentation-triggered reclaim at
> all; it could be the result of the cgroup hitting its memory limits.
>
I agree that for cgroups what is in place here is not sufficient, and,
based on Google's feedback that every user-space process in Android is
assigned a cgroup, we will quickly need a cgroup story.
> So I think if we can solve this with a combination of GFP flags,
> plumbed-through order and plumbed-through priority, that would be
> ideal.
That is an idea. The other thing that came up is that the TTM LRU doesn't
understand the relevance of hotness compared to other shrinkers' LRUs (e.g.,
core pages), so our TTM shrinker may be evicting hot GPU pages when cold
non-GPU pages could be evicted instead, which would create less stress on
the system. Perhaps priority / GFP flags will help here?
Matt
>
> Thanks,
> Thomas
>
> >
> > Matt
> >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > David
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-30 7:47 ` Thomas Hellström
2026-04-30 16:34 ` Matthew Brost
@ 2026-04-30 17:06 ` Vlastimil Babka (SUSE)
1 sibling, 0 replies; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-04-30 17:06 UTC (permalink / raw)
To: Thomas Hellström, Matthew Brost, David Hildenbrand (Arm)
Cc: intel-xe, dri-devel, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, Johannes Weiner
On 4/30/26 09:47, Thomas Hellström wrote:
> On Wed, 2026-04-29 at 19:47 -0700, Matthew Brost wrote:
>>
>> So again, I think starting with wiring order into shrink_control and
>> this helper is a good place to start, as it fixes an immediate issue.
>>
>> Let me know if that seems like a reasonable direction.
>
> +1 for wiring order into shrink_control, and possibly also the priority
> as mentioned in an earlier email.
>
> However, for cgroups-aware shrinkers, the amount of free memory in a
> zone might not be an indication of fragmentation-triggered reclaim at
> all; it could be the result of the cgroup hitting its memory limits.
I'm not sure I understand your concern wrt cgroups, but some hopefully
relevant (and hopefully not wrong) points:
- fragmentation is a zone-related property, not a cgroup one
- hitting a cgroup limit doesn't wake up kswapd nor go through the usual
reclaim/compaction paths; it's a form of direct reclaim only
So I believe it should be easy to recognize when your shrinker is called for
memcg shrinking rather than from kswapd; in that case the reclaim can't be
happening due to zone fragmentation but must be due to memcg limits, and
then you probably don't need to check zone_appears_fragmented() at all.
> So I think if we can solve this with a combination of GFP flags,
> plumbed-through order and plumbed-through priority, that would be
> ideal.
>
> Thanks,
> Thomas
>
>>
>> Matt
>>
>> >
>> >
>> > --
>> > Cheers,
>> >
>> > David
* Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
2026-04-30 16:34 ` Matthew Brost
@ 2026-04-30 19:59 ` Thomas Hellström
0 siblings, 0 replies; 20+ messages in thread
From: Thomas Hellström @ 2026-04-30 19:59 UTC (permalink / raw)
To: Matthew Brost
Cc: David Hildenbrand (Arm), intel-xe, dri-devel, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
Johannes Weiner
On Thu, 2026-04-30 at 09:34 -0700, Matthew Brost wrote:
> On Thu, Apr 30, 2026 at 09:47:37AM +0200, Thomas Hellström wrote:
> > On Wed, 2026-04-29 at 19:47 -0700, Matthew Brost wrote:
> > > On Fri, Apr 24, 2026 at 09:26:18AM +0200, David Hildenbrand (Arm)
> > > wrote:
> > > > On 4/24/26 09:05, Thomas Hellström wrote:
> > > > > On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> > > > > > On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost
> > > > > > wrote:
> > > > > > >
> > > > > > > If the order were included in shrink_control, there is
> > > > > > > about
> > > > > > > a 95%
> > > > > > > certain that this change would allow TTM / Xe to break
> > > > > > > the
> > > > > > > problematic
> > > > > > > kswapd feedback loop. This may also better express the
> > > > > > > intent
> > > > > > > of
> > > > > > > the
> > > > > > > problem we are trying to fix here.
> > > > > > >
> > > > > > > For reference, the cover letter [1] details the problem.
> > > > > > >
> > > > > > > Any guidance from the core MM folks would be
> > > > > > > appreciated—would
> > > > > > > adding
> > > > > > > the order to shrink_control be an acceptable solution?
> > > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > > > >
> > > > > >
> > > > > > It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL,
> > > > > > __GFP_NOFAIL
> > > > > > make it to the sc->gfp_mask flags from the caller and get
> > > > > > into
> > > > > > kswapd
> > > > > > loop...
> > > > >
> > > > > Perhaps that's because they mostly (only?) make sense from
> > > > > direct
> > > > > reclaim? Looks like the trace is from kswapd.
> > > >
> > > > kswap obtains the desired order through pgdat->kswapd_order, as
> > > > a
> > > > hint from
> > > > allocation code (wakeup_kswapd). The order can be easily merged
> > > > (just use the max)
> > > >
> > >
> > > Yes.
> > >
> > > My current thinking is wire the order into shrink_control as that
> > > is
> > > quite straight forward + only call this helper + short circuit
> > > shrinker
> > > on higher orders.
> > >
> > > > We do have the gfp_flags there, but merging them from different
> > > > wakeups is a bit
> > > > more tricky (and when to reset?).
> > > >
> > > > Assume we have one urgent request for order-0 and one non-
> > > > urgent
> > > > (noretry,nofail, ...) request for order-9, we'd have to figure
> > > > out
> > > > a way how to
> > > > represent that. Gets more complicated for more orders.
> > > >
> > > > Of course, we could have some kind of array, and try to store
> > > > some
> > > > "priority"
> > > > per order. But I assume plumbing that into the rest of kswapd
> > > > might
> > > > not be that
> > > > easy.
> > >
> > > Yes, this seems non-trivial. I was also on a call with Google
> > > today
> > > discussing what Android (client Linux) would like from shrinking,
> > > and
> > > my
> > > initial feeling is that we will need to do some surgery to the
> > > shrinker
> > > core and GPU shrinkers to make all of this work well over the
> > > next
> > > year
> > > or so.
> > >
> > > So again, I think starting with wiring order into shrink_control
> > > and
> > > this helper is a good place to start, as it fixes an immediate
> > > issue.
> > >
> > > Let me know if that seems like a reasonable direction.
> >
> > +1 for wiring order into shrink_control, and possibly also the
> > priority
> > as mentioned in an earlier email.
> >
>
> Let me look at how the priority field is used as well.
>
> > However, for cgroups-aware shrinkers, the amount of free memory in
> > a zone might not be an indication of fragmentation-triggered
> > reclaim at all; it could be the result of the cgroup hitting its
> > memory limits.
> >
>
> I agree that for cgroups what is in place here is not sufficient, and
> based on Google's feedback that every user-space process in Android
> is assigned a cgroup, we will quickly need a cgroup story.
>
> > So I think if we can solve this with a combination of GFP flags,
> > plumbed-through order and plumbed-through priority, that would be
> > ideal.
>
> That is an idea. The other thing that came up is that the TTM LRU
> doesn't understand the relative hotness of its pages compared to
> other shrinkers' LRUs (e.g., core pages), so our TTM shrinker may be
> evicting hot GPU pages when cold non-GPU pages could be evicted
> instead, which would put less stress on the system. Perhaps priority
> / GFP flags will help here?
IIRC, priority is used to calculate how many pages we are requested to
shrink compared to the number we say we have available, but that needs
to be double-checked. FWIW, i915 has a check that xe's shrinker lacks:
shrinking is not attempted unless the number of pages requested is >=
the average buffer object size.
/Thomas
>
> Matt
>
> >
> > Thanks,
> > Thomas
> >
> > >
> > > Matt
> > >
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David
Thread overview: 20+ messages
2026-04-23 5:56 [PATCH v2 0/5] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-04-23 5:56 ` [PATCH v2 1/5] mm: Introduce zone_appears_fragmented() Matthew Brost
2026-04-23 6:04 ` Balbir Singh
2026-04-23 6:16 ` Matthew Brost
2026-04-23 6:27 ` Matthew Brost
2026-04-23 10:27 ` David Hildenbrand (Arm)
2026-04-23 11:27 ` Thomas Hellström
2026-04-23 19:08 ` Matthew Brost
2026-04-23 22:21 ` Matthew Brost
2026-04-24 7:05 ` Thomas Hellström
2026-04-24 7:26 ` David Hildenbrand (Arm)
2026-04-30 2:47 ` Matthew Brost
2026-04-30 7:47 ` Thomas Hellström
2026-04-30 16:34 ` Matthew Brost
2026-04-30 19:59 ` Thomas Hellström
2026-04-30 17:06 ` Vlastimil Babka (SUSE)
2026-04-28 9:51 ` Andi Shyti
2026-04-28 10:05 ` Andi Shyti
2026-04-30 2:34 ` Matthew Brost
2026-04-30 2:37 ` Matthew Brost