Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers

From: Matthew Brost <matthew.brost@intel.com>
To: Dave Chinner <dgc@kernel.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	Dave Chinner <david@fromorbit.com>,
	"Qi Zheng" <zhengqi.arch@bytedance.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	"David Hildenbrand" <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	"Michal Hocko" <mhocko@suse.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Kairui Song <kasong@tencent.com>, Barry Song <baohua@kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>
Subject: Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers
Date: Mon, 22 Jun 2026 17:09:33 -0700
Message-ID: <ajnOvdbT5/J6q3IN@gsse-cloud1.jf.intel.com>
In-Reply-To: <ajnA83DxAFqeJ-sv@dread>

On Tue, Jun 23, 2026 at 09:10:43AM +1000, Dave Chinner wrote:
> On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> > High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> > are often opportunistic attempts to satisfy fragmentation-sensitive
> > allocations rather than indications of severe memory pressure. In these
> > cases, reclaim may invoke shrinkers that aggressively destroy working
> > sets even though reclaim is unlikely to materially improve the
> > allocation outcome.
> > 
> > Some shrinkers manage expensive backing or migration operations where
> > reclaim can result in substantial working set disruption despite the
> > system having sufficient free memory overall. This is particularly
> > visible in fragmentation-heavy workloads where reclaim repeatedly tears
> > down active state while kswapd attempts to satisfy higher-order
> > allocations.
> > 
> > Introduce an opportunistic_compaction hint in shrink_control that allows
> > kswapd to communicate when reclaim originates from a high-order
> > allocation context that may be fragmentation driven rather than true
> > memory pressure. Shrinkers may use this hint to avoid destructive
> > working set reclaim while still participating normally during order-0
> > or stronger reclaim conditions.

Thanks for the input - this is a tough problem.

> 
> To be honest, this seems like another "push a hint through to the XE
> shrinker" mechanism under a different name. You seem so focused on
> fixing the XE reproducer that the -systemic problem- that -any-
> high-order folio demand causes is not being acknowledged.
> 

I'm not exactly sure I agree here. Communicating via __GFP_NORETRY or
__GFP_RETRY_MAYFAIL with a higher order implies that the caller can
handle higher-order allocation failures, so the shrinker shouldn’t try
too hard to obtain a large page (e.g., evict a working set). I agree
that Xe is currently the only shrinker making use of this, but other
shrinkers could also hook into it. This information simply isn’t
available today.
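
To make the condition concrete, here is a minimal userspace sketch (not
kernel code) of how a request could be classified as opportunistic. The
SKETCH_* flag bits and the helper name are hypothetical stand-ins, and
the "order > 0" threshold is an assumption based on the commit message
rather than the exact check in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in flag bits for this userspace sketch only. The real
 * __GFP_NORETRY and __GFP_RETRY_MAYFAIL values live in the kernel's
 * gfp headers and are not reproduced here.
 */
#define SKETCH_GFP_NORETRY        (1u << 0)
#define SKETCH_GFP_RETRY_MAYFAIL  (1u << 1)

/*
 * Hypothetical helper: a request counts as "opportunistic" when it is
 * high-order and the caller has said it can tolerate failure.
 */
bool sketch_is_opportunistic(unsigned int gfp, unsigned int order)
{
	/* Sanity bound chosen for this sketch, not a kernel limit. */
	assert(order < 32);

	return order > 0 &&
	       (gfp & (SKETCH_GFP_NORETRY | SKETCH_GFP_RETRY_MAYFAIL)) != 0;
}
```

With this model, an order-9 request carrying SKETCH_GFP_NORETRY is
opportunistic, while an order-0 request or a high-order request with
neither flag is treated as normal reclaim.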

> e.g. we use high-order folios extensively in the page cache these
> days, and there are -many- cases where memory compaction driven by
> high-order demand causes significant performance regressions for page
> cache performance. To date, every single person who has wanted to
> fix the problem they are seeing has effectively attempted to -turn
> off compaction- via GFP flags.

So does that mean they clear __GFP_RECLAIM?

That isn't really what we want in DRM or Xe. In the former case, TTM has
pools of lower-order pages that are not in use and can be shrunk,
potentially freeing multiple lower-order pages so that a higher-order
page can be formed. In the latter case, Xe may have BOs (sets of pages)
marked as purgeable (not in the working set), which can also be shrunk.
Other DRM drivers have purging concepts too.

I’m not very familiar with what other shrinkers or subsystems want, but
presumably other shrinkers have pools or caches that aren’t currently in
use, where they can say, “OK, I’ll give these pages up for opportunistic
compaction, but I won’t give up my working set.” Of course, as mentioned
above, if someone else explicitly requests large pages by avoiding
__GFP_NORETRY and __GFP_RETRY_MAYFAIL, the shrinker should then give up
its working set.

> 
> I've even done that myself inside XFS to work around kvmalloc()
> issues with a lack of GFP_NOFAIL support and doing costly high order
> allocations that fail and trigger compaction before falling back to
> vmalloc().  However, these issues have since been fixed in the
> kvmalloc() code, such that it now does the right thing for most
> calling contexts (i.e. tries high-order kmalloc() without triggering
> compaction, then fall back to GFP_NOFAIL vmalloc()). This has made
> kvmalloc() more performant and better behaved for -all users-, not
> just XFS.
> 
> This is not sustainable - we need compaction to be robust and
> performant in the face of high-order folio demands, regardless of
> what subsystem is generating the demand.
> 
> So with that in mind, let me paraphrase the comment in the second
> patch in the Xe shrinker implementation:
> 
> "Shrinker reclaim is based on implementation specific object sizes
> so it is unlikely to ever achieve contiguous page reclaim in a
> manner that will measurably improve compaction rates."
> 

This might be slightly misworded—what I really mean is that I don’t want
to give up my working set for higher-order allocations that are allowed
to fail, but I do want to give up my cache.

> You also say:
> 
> > No functional changes are introduced for existing shrinkers.
> 
> Consider how many shrinkable caches the general statement above
> applies to, and then think about the fundamental impedance mismatch
> between the affected shrinkable caches and what this patch actually
> fixes.
> 

Yes, as mentioned above, I’m only addressing Xe here, and I agree that
this is likely an issue. Do you know of other shrinkers that have pools
or caches which can be shrunk under the conditions I’m introducing here,
but also have a working set they would prefer not to give up? If so, a
link on elixir.bootlin.com would be helpful so I can take a look. I’ll
also try to go through other shrinkers myself.

> For example, what happens to slab-based caches if the XE cache is being
> excessively reclaimed under high-order page demand? e.g. the slab-based
> cache may have tens of objects per page and holds a system-level
> performance critical working set of objects. How do these caches
> handle the excessive reclaim demand being generated by compaction
> thrashing?
> 
> Yup, they don't.
> 

Agree.

> In the case of filesystem caches, the "reclaim and repopulate"
> pattern you describe causing the XE perf problems causes internal
> slab cache fragmentation. Not only does this not improve compaction
> rates, it also results in more memory fragmentation because slab
> pages get pinned by a small number of long lived objects and they
> won't get freed until the cache is largely emptied.  IOWs, things
> get -even worse- from a memory fragmentation POV when compaction
> thrashing causes the working set of a high-object-count-per-page
> slab cache to thrash....
> 

Got a link to the code which you are referring to?

That seems like a problem similar to another issue in DRM/Xe. We found
that the process of shrinking actually drove fragmentation by splitting
folios down to order-0 and then backing pages up one at a time. I have a
separate fix in flight for that.

Could the filesystem detect these hints and avoid shrinking in a way
that causes fragmentation? Alternatively, could it perform shrinking in
a way that doesn’t shatter folios, or detect long-lived objects so it
understands that shrinking isn’t going to help reduce fragmentation?

> This isn't isolated to individual subsystem thrashing.  If we run a
> file-based workload that generates high-order folio demand and hence

What GFP flags are typically used for file-based workloads?

> compaction (e.g copy tens of GB of files between two XFS, ext4 or
> btrfs filesystems), that will -also- trash the Xe working set via
> the shrinker being hammered by memory compaction trying to free up
> contiguous pages for the page cache.
>

I could see this.

> Similarly, if we run a Xe workload that generates sustained high
> order folio demand, that will trash the working set in the dentry
> and inode caches and any other shrinkable slab-based cache.
> 

I could also see this, but DRM / Xe will set __GFP_NORETRY or
__GFP_RETRY_MAYFAIL on higher-order allocations, so those caches should
be able to avoid trashing their working sets if they looked for this
hint.

> Hence the abstracted case of the problem we need to solve is this:
> shrinker reclaim based on x-byte objects is extremely unlikely to
> achieve contiguous page reclaim in a manner that will measurably
> improve compaction rates.
> 
> This is a problem that has to be addressed at the high-level
> infrastructure level, not worked around by individual shrinkers.
> 
> IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
> specifically flagged as being able to release contiguous pages of
> memory in short order. I don't think there's very many shrinkable
> caches that even hold a significant quantity of objects larger than
> a single page, so it's clearly questionable as to whether compaction
> based reclaim should run shrinker reclaim to begin with.
> 

Yes, we sort of do this in Xe by changing the '->count_objects' result
based on the hint.
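
As a rough illustration of that idea, here is a minimal userspace
sketch (not the actual Xe code). Only the opportunistic_compaction
field name comes from the patch; the sketch_* structs, the page
counters and the function name are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in; only the hint field name mirrors the patch. */
struct sketch_shrink_control {
	bool opportunistic_compaction;
};

/* Hypothetical per-driver accounting, in pages. */
struct sketch_cache {
	unsigned long idle_pool_pages;   /* pooled, not currently in use */
	unsigned long purgeable_pages;   /* marked purgeable, not working set */
	unsigned long working_set_pages; /* actively used */
};

unsigned long sketch_count_objects(const struct sketch_cache *c,
				   const struct sketch_shrink_control *sc)
{
	unsigned long count;

	assert(c && sc);

	/* Idle pools and purgeable objects are always offered up. */
	count = c->idle_pool_pages + c->purgeable_pages;

	/* Only expose the working set when reclaim is not opportunistic. */
	if (!sc->opportunistic_compaction)
		count += c->working_set_pages;

	return count;
}
```

With 64 idle, 32 purgeable and 1024 working-set pages, this model
reports 96 reclaimable pages when the hint is set and 1120 otherwise.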

> i.e. a subsystem that can track high order folios in a shrinkable
> cache should probably have a "->compaction_scan()" method that is
> run directly from compaction context to try to free high order

When you say “compaction context,” which parts of the code are you
referring to? I’d like to explore this option, but I need a bit more
context.

> folios. This provides a direct opt-in mechanism for a subsystem, and
> it allows subsystems that can track low- and high- order objects
> independently to efficiently free objects in a way that will help
> improve compaction rates without impacting the entire working set of
> objects in the cache.

Does this help if, for example, the cache is holding onto two order-8
folios that could be freed and merged, while the caller really wants an
order-9 folio? This seems like a possible scenario in caches and is
certainly true in TTM pools.
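
Whether two freed order-8 blocks can actually form an order-9 block
depends on their alignment. A minimal userspace sketch, assuming the
usual buddy rule that the buddy of a block is found by flipping bit
'order' of its PFN; the sketch_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* PFN of the buddy of the block starting at pfn with the given order. */
unsigned long sketch_buddy_pfn(unsigned long pfn, unsigned int order)
{
	/* Keep the shift well-defined for this sketch. */
	assert(order < sizeof(unsigned long) * 8 - 1);

	return pfn ^ (1UL << order);
}

/*
 * Two free blocks of the same order can merge into one block of
 * order + 1 only when they are each other's buddy.
 */
bool sketch_can_merge(unsigned long pfn_a, unsigned long pfn_b,
		      unsigned int order)
{
	unsigned long mask;

	assert(order < sizeof(unsigned long) * 8 - 1);
	mask = (1UL << order) - 1;

	/* Both blocks must be naturally aligned to their order. */
	if ((pfn_a & mask) || (pfn_b & mask))
		return false;

	return sketch_buddy_pfn(pfn_a, order) == pfn_b;
}
```

For order 8 (256 pages), blocks at PFN 0 and 256 are buddies and can
merge, but blocks at PFN 256 and 512 are adjacent without being
buddies, so freeing both does not by itself produce an order-9 block.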

> 
> IOWs, this patch to inform kswapd about its trigger (doesn't it
> already have a "reason" parameter, though?) is likely a necessary
> part of the solution - we don't want kswapd running shrinkers if it
> has been triggered to reclaim pages for compaction. This patch would
> allow kswapd to elide normal shrinker passes when it has been woken
> purely for compaction purposes. Given that the compaction code would
> be running the high-order reclaim capable shrinkers itself, this
> would avoid trashing the working set of most shrinkable caches -by
> default- under high order allocation demand....
>

I’m trying to parse this. Are you suggesting that, one way or another, we
introduce a heuristic where shrinkers can act on a hint (whether it’s
what I have here or a new ->compaction_scan() vfunc), and then attempt
to fix all shrinkers in this series? I’m open to trying to fix other
shrinkers as well. Do you have any particular ones in mind? I count
around 45 shrinkers in Linux, so it’s unlikely that I can fix every
single one, or that all shrinkers even need to be fixed.

On a side note, I just noticed that struct shrinker has count_objects
and scan_objects as individual vfuncs rather than using a const struct
shrinker_ops *ops. Should we change that? The latter seems cleaner and
is typically how things are done in Linux.
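
To show the shape I have in mind, here is a minimal userspace sketch
of an ops-table layout. Nothing here exists in the kernel today: the
sketch_* types are hypothetical and only model the idea of moving the
two callbacks behind a const ops pointer:

```c
#include <assert.h>

struct sketch_shrinker;

struct sketch_shrink_control {
	unsigned long nr_to_scan;
};

/* Hypothetical ops table holding the two callbacks. */
struct sketch_shrinker_ops {
	unsigned long (*count_objects)(struct sketch_shrinker *s,
				       struct sketch_shrink_control *sc);
	unsigned long (*scan_objects)(struct sketch_shrinker *s,
				      struct sketch_shrink_control *sc);
};

struct sketch_shrinker {
	const struct sketch_shrinker_ops *ops; /* shared, read-only */
	unsigned long nr_cached;               /* demo state */
};

unsigned long sketch_demo_count(struct sketch_shrinker *s,
				struct sketch_shrink_control *sc)
{
	(void)sc;
	assert(s);
	return s->nr_cached;
}

unsigned long sketch_demo_scan(struct sketch_shrinker *s,
			       struct sketch_shrink_control *sc)
{
	unsigned long freed;

	assert(s && sc);

	/* Free at most what was requested and at most what is cached. */
	freed = sc->nr_to_scan < s->nr_cached ? sc->nr_to_scan : s->nr_cached;
	s->nr_cached -= freed;
	return freed;
}

/* One const table can be shared by every instance of this shrinker. */
const struct sketch_shrinker_ops sketch_demo_ops = {
	.count_objects = sketch_demo_count,
	.scan_objects  = sketch_demo_scan,
};
```

A caller would then go through s->ops->count_objects(s, sc) and
s->ops->scan_objects(s, sc) rather than per-instance function pointers.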

Matt

> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

Thread overview: 5+ messages
2026-06-17  3:22 [PATCH v6 0/2] mm, drm/xe: Avoid reclaim/eviction loops under fragmentation Matthew Brost
2026-06-17  3:22 ` [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept to vmscan and shrinkers Matthew Brost
2026-06-22 23:10   ` Dave Chinner
2026-06-23  0:09     ` Matthew Brost [this message]
2026-06-17  3:22 ` [PATCH v6 2/2] drm/xe: Make use of shrink_control::opportunistic_compaction hint Matthew Brost
