Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback

From: SeongJae Park <sj@kernel.org>
To: wang lian <lianux.mm@gmail.com>
Cc: SeongJae Park <sj@kernel.org>,
	Gutierrez Asier <gutierrez.asier@huawei-partners.com>,
	akpm@linux-foundation.org, npache@redhat.com,
	daichaobing@sangfor.com.cn, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kunwu.chan@gmail.com
Subject: Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
Date: Thu, 18 Jun 2026 18:52:39 -0700
Message-ID: <20260619015241.9432-1-sj@kernel.org>
In-Reply-To: <459C0876-AC37-4A52-BF11-6436FF33CA90@gmail.com>

On Thu, 18 Jun 2026 21:13:07 +0800 wang lian <lianux.mm@gmail.com> wrote:

> 
> 
> > On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> > 
> > Hi Wang,
> > 
> > On 6/18/2026 12:48 PM, Wang Lian wrote:
> >> Received an off-list report that DAMON significantly overestimates
> >> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> >> running Oracle workloads.
> >> 
> >> The root cause is structural: a PMD entry covers 512 4KB subpages with
> >> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> >> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
> >> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> >> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> >> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> >> subsequent accesses. x86 is not subject to this specific blindness under similar
> >> conditions.
> > 
> > Have you tried setting the minimum region size to 2MB?
> > 
> >> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> >> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> >> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> >> reports only a few hundred MB -- a 512x overestimate relative to the actual
> >> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> >> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> >> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > THP always will just collapse the entire PID into huge pages anyway. This
> > is outside DAMON's control.
> > 
> > Have you tried setting THP to never and running DAMON with DAMON_COLLAPSE
> > action?
> > 
> >> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> >> mTHP-aware via a new target_order field, and introduces a new
> >> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> >> into smaller mTHPs when most subpages are probed as cold, and collapse them
> >> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> >> path can incorporate fine-grained hardware feedback from ARM SPE.
> >> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> >> signal filter: it first identifies the peak chunk access count, and then marks
> >> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> >> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> >> split decision: only folios with a hot fraction below this threshold are
> >> eligible for splitting. When no SPE data is available, the infrastructure
> >> gracefully falls back to explicit PTE-level scanning via folio_walk.
> >> 
> >> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> >> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> >> 
> >> Collapse path (patches 1-3):
> >>  DAMON scheme action=COLLAPSE, target_order=N
> >>  -> damos_va_collapse() -> damon_collapse_folio_range()
> >>  -> collapse_huge_page()
> >> 
> >> Split path (patches 4-5):
> >>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> >>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> >>  -> split_folio_to_order()
> >> 
> >> SPE feedback infrastructure (patch 6):
> >>  perf script -> spe_hist -> debugfs spe_feed
> >>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> >>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> >> 
> >> The userspace helper tools (including the spe_hist histogram builder and
> >> validation scripts) are archived at:
> >>  https://github.com/lianux-mm/damon_spe
> >> 
> >> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> >> 7.1.0-rc5+):
> >> 
> >>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> >>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> >>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> >>     DAMON to function normally.
> >> 
> >>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> >>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
> >>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> >>     between the two modes was ~33x.
> >> 
> >>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> >>     behaved normally. We could not reproduce THP inflation with RocksDB.
> >>     The workloads fundamentally vulnerable to this structural issue remain KVM
> >>     guests, JVM large heaps, and PostgreSQL shared_buffers.
> >> 
> >>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> >>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> >>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> >> 
> >>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> >>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> >>     concentrated across only 3 out of 512 subpages.
> > The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> > we should keep in the user space and let the kernel provide only the API to add
> > different metrics, including PMU and SPE.
> 
> Hi Asier,
> 
> Thanks for your prompt and constructive reply. I really appreciate your 
> detailed analysis of the mTHP and SPE interaction.

Indeed, very helpful comments.  Thank you, Asier!

> 
> Your point regarding the design boundary—whether this fits better in 
> user space or aligned with DAMON-X—is highly valuable. 

Actually, Asier is talking about the perf event-based monitoring extension [1].
DAMON-X [2] is a separate project.

> 
> Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
> then. I look forward to cooperating with both of you to refine this 
> design and find the best architectural fit for the subsystem.

As I said in my other reply, I'd prefer this to be aligned with the perf
event-based extension roadmap.

[1] https://lore.kernel.org/all/20260525225208.1179-1-sj@kernel.org/
[2] https://lwn.net/Articles/1071256/


Thanks,
SJ

[...]