Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback

From: SeongJae Park <sj@kernel.org>
To: SeongJae Park <sj@kernel.org>
Cc: Wang Lian <lianux.mm@gmail.com>,
	akpm@linux-foundation.org, npache@redhat.com,
	gutierrez.asier@huawei-partners.com, daichaobing@sangfor.com.cn,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kunwu.chan@gmail.com
Subject: Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
Date: Thu, 18 Jun 2026 18:54:10 -0700
Message-ID: <20260619015411.9554-1-sj@kernel.org>
In-Reply-To: <20260619014707.9297-1-sj@kernel.org>

On Thu, 18 Jun 2026 18:47:16 -0700 SeongJae Park <sj@kernel.org> wrote:

> Hello Lian,
> 
> On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:
> 
> > Received an off-list report that DAMON significantly overestimates
> > hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> > running Oracle workloads.
> > 
> > The root cause is structural: a PMD entry covers 512 4KB subpages with
> > a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> > 2MB region appears "hot" to DAMON. On ARM64,
> 
> This makes sense to me.  I also agree this could have caused the reported
> problem.  This is a known limitation of DAMON.  My suggestion for a
> straightforward workaround is to use the 'age' information of DAMON for better
> identification of the hot memory.
> 
> That is, I don't expect that really hot data in real production systems will
> be evenly scattered.  Even if it is, I don't expect it will all be accessed
> equally frequently.  Only a small part of it would be accessed frequently for
> long.  Even within that part, some data would stay frequently accessed for
> longer than the rest.  You could show the distribution of the pattern and
> treat the hottest X % of memory as hot.
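
The "hottest X %" idea above can be sketched in user space as follows.  This
is only an illustration, not DAMON code: the Region fields mirror what a DAMON
snapshot reports per region (address range, nr_accesses, age), while the
ranking key (frequency first, then age), the hottest() helper and the sample
snapshot are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int        # start address in bytes
    end: int          # end address in bytes (exclusive)
    nr_accesses: int  # access frequency observed for the region
    age: int          # how long the access pattern has been stable


def hottest(regions, percent):
    """Return the hottest regions covering at most `percent` of total bytes."""
    total = sum(r.end - r.start for r in regions)
    budget = total * percent / 100
    # Assumed hotness order: higher frequency first, then longer-lived.
    ranked = sorted(regions, key=lambda r: (r.nr_accesses, r.age), reverse=True)
    picked, used = [], 0
    for r in ranked:
        size = r.end - r.start
        if used + size > budget:
            break
        picked.append(r)
        used += size
    return picked


MB = 1 << 20
# Hypothetical snapshot: three regions look equally "hot" by frequency alone,
# and only 'age' tells them apart.
snapshot = [
    Region(0 * MB, 2 * MB, nr_accesses=20, age=3),
    Region(2 * MB, 4 * MB, nr_accesses=20, age=150),
    Region(4 * MB, 6 * MB, nr_accesses=20, age=40),
    Region(6 * MB, 8 * MB, nr_accesses=0, age=200),
]
hot = hottest(snapshot, percent=25)
print([(r.start // MB, r.age) for r in hot])  # prints [(2, 150)]
```

With a 25 % budget only the region that stayed frequently accessed for the
longest time is kept, even though two other regions report the same frequency.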
> 
> We invented idle time percentiles [1] for a similar purpose, though they focus
> more on finding cold memory.
> 
> I understand this patch series is trying to build a more fundamental and better
> solution for hardware that can do better.  Makes sense to me.
> 
> > this is compounded by the
> > hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> > working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> > running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> > subsequent accesses.
> 
> This makes sense to me.  However, I don't get how this is contributing to the
> problem.  Could you please elaborate?
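
For reference, the TLB-reach arithmetic behind the quoted claim can be
sketched as below.  It assumes each L2 TLB entry maps exactly one page of the
given size and ignores associativity and any split between page sizes; only
the 2048-entry figure and the 16MB working set come from the cover letter.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_size


def entries_needed(wss, page_size):
    """TLB entries needed to map a working set (ceiling division)."""
    return -(-wss // page_size)


L2_TLB_ENTRIES = 2048  # Kunpeng 920 figure quoted in the cover letter

reach_thp = tlb_reach(L2_TLB_ENTRIES, 2 * MB)
reach_4k = tlb_reach(L2_TLB_ENTRIES, 4 * KB)

print(reach_thp // GB)                     # prints 4 (GiB, 2MB entries)
print(reach_4k // MB)                      # prints 8 (MiB, 4KB entries)
print(entries_needed(16 * MB, 2 * MB))     # prints 8 (PMD entries)
```

Under these assumptions the 16MB THP working set needs 8 of the 2048 entries,
so once the translations are cached, no further page-table walks happen for
it.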
> 
> > x86 is not subject to this specific blindness under similar
> > conditions.
> 
> To my understanding, the same issue exists on x86.  If the TLB hits, the
> Accessed bit is not set, and DAMON shows the memory as unaccessed.  Am I
> missing something?
> 
> > 
> > We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> > workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> > THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> > reports only a few hundred MB -- a 512x overestimate relative to the actual
> > 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> > independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> > over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> 
> I don't think real-world production systems have this very artificial access
> pattern.  I believe (or hope) the use of 'age' can work around the issue to a
> reasonable level in many cases.  I understand this setup is only for a PoC, and
> I think this is a well-designed test for that purpose.  Thank you for sharing
> this.
> 
> > 
> > To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> > mTHP-aware via a new target_order field,
> 
> Makes sense, and sounds nice.  Definitely no one size fits all!
> 
> > and introduces a new
> > DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> > into smaller mTHPs
> 
> Nice!  Asier was planning to do similar work in the future.  I think you could
> collaborate to avoid unnecessary duplication!
> 
> I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE, though.
> Say, DAMOS_SPLIT?
> 
> > when most subpages are probed as cold, and collapse them
> > back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> > path can incorporate fine-grained hardware feedback from ARM SPE.
> > 
> > The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> > signal filter: it first identifies the peak chunk access count, and then marks
> > sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> > SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> > split decision: only folios with a hot fraction below this threshold are
> > eligible for splitting. When no SPE data is available, the infrastructure
> > gracefully falls back to explicit PTE-level scanning via folio_walk.
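
The two-pass filter and the split decision described above can be sketched in
user space as follows.  This is not the code from the series: the function
names, the integer comparison and the sample counts are assumptions.  Only the
1/10-of-peak signal threshold, the 30% default hot_threshold, the 512 counters
per THP and the fallback when no SPE data exists come from the cover letter.

```python
def hot_bitmap(counts, signal_ratio=10):
    """Pass 1 finds the peak count; pass 2 marks chunks >= peak / signal_ratio."""
    peak = max(counts, default=0)
    if peak == 0:
        return [False] * len(counts)
    # Same as c >= peak / signal_ratio, kept in integers to avoid rounding.
    return [c * signal_ratio >= peak for c in counts]


def hot_fraction(counts):
    """Fraction of chunks marked hot by the filter."""
    bitmap = hot_bitmap(counts)
    return sum(bitmap) / len(bitmap) if bitmap else 0.0


def should_split(counts, hot_threshold=0.30):
    """True: eligible for split.  False: keep the THP.  None: no SPE samples."""
    if not any(counts):
        return None  # the caller would fall back to PTE-level scanning
    return hot_fraction(counts) < hot_threshold


# Hypothetical per-subpage sample counts for one 2MB THP (512 x 4KB).
sparse = [0] * 512
sparse[7], sparse[8], sparse[300] = 9000, 7000, 1200  # three hot subpages
sparse[40] = 5                                        # sampling noise
dense = [1000] * 480 + [0] * 32                       # mostly hot THP

print(sum(hot_bitmap(sparse)), should_split(sparse))  # prints 3 True
print(should_split(dense))                            # prints False
print(should_split([0] * 512))                        # prints None
```

The noise sample at index 40 stays below one tenth of the peak and is
ignored, so the sparse THP is judged by its three hot subpages only.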
> > 
> > Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> > through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> 
> So you implemented a debugfs interface?  That must be a nice approach for a
> PoC, but it may be difficult to upstream as is.
> 
> You could build a control plane that decides the exact address ranges to split,
> and feed them directly to DAMOS using the DAMOS address filter.
> max_nr_snapshots can also be useful for making this kind of user-space control
> more deterministic.
> 
> For simpler user-space control, using the user_input DAMOS quota goal [2]
> is another option.
> 
> We are also planning [3] to extend DAMON for perf events.  On top of that, we
> might be able to extend it further so that DAMON itself uses ARM SPE, and do
> all this with DAMOS alone, without user-space help.
> 
> Based on the 'limitations' section below, I understand this is only for a PoC
> at the moment, and you plan to explore the perf event based approach.  I'd
> also recommend that.
> 
> > 
> > Collapse path (patches 1-3):
> >   DAMON scheme action=COLLAPSE, target_order=N
> >   -> damos_va_collapse() -> damon_collapse_folio_range()
> >   -> collapse_huge_page()
> > 
> > Split path (patches 4-5):
> >   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> >   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> >   -> split_folio_to_order()
> > 
> > SPE feedback infrastructure (patch 6):
> >   perf script -> spe_hist -> debugfs spe_feed
> >   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> >   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> > 
> > The userspace helper tools (including the spe_hist histogram builder and
> > validation scripts) are archived at:
> >   https://github.com/lianux-mm/damon_spe
> 
> Thank you for making all this great code open!
> 
> > 
> > Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> > 7.1.0-rc5+):
> > 
> >   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> >      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> >      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> >      DAMON to function normally.
> > 
> >   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> >      THP=always: DAMON reported 8GB hot (512x vs ground truth);
> >      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> >      between the two modes was ~33x.
> > 
> >   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> >      behaved normally. We could not reproduce THP inflation with RocksDB.
> >      The workloads fundamentally vulnerable to this structural issue remain KVM
> >      guests, JVM large heaps, and PostgreSQL shared_buffers.
> > 
> >   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> >      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> >      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> > 
> >   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> >      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> >      concentrated across only 3 out of 512 subpages.
> > 
> >   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
> >      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
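
The ratios and folio counts quoted in the test results above can be
re-derived with a few lines of arithmetic.  The inputs are the figures from
the cover letter ("~245MB" is taken as exactly 245MB here, which is an
assumption); the variable names are made up for the example.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

hot_truth = 16 * MB    # actually hot memory in T2
thp_always = 8 * GB    # reported hot with THP=always
thp_never = 245 * MB   # reported hot with THP=never (approximate in the source)

overestimate_thp = thp_always // hot_truth
overestimate_4k = round(thp_never / hot_truth)
mode_gap = round(thp_always / thp_never)
folios_256mb = 256 * MB // (16 * KB)
folios_per_thp = 2 * MB // (16 * KB)

print(overestimate_thp)                   # prints 512
print(overestimate_4k)                    # prints 15
print(mode_gap)                           # prints 33
print(folios_256mb)                       # prints 16384 (T4)
print(folios_per_thp)                     # prints 128 (end-to-end test)
print(f"{hot_truth / thp_always:.2%}")    # prints 0.20%
```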
> > 
> > Known limitations:
> > - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
> >   While individual component verification is complete, full integration testing
> >   is planned in collaboration with Sangfor.
> > - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
> >   coordination/back-off mechanism is required to avoid ping-pong effects.
> 
> Do you really need to run khugepaged together, when you already have
> DAMOS_COLLAPSE and are running DAMON for hugepage splits anyway?
> 
> > - SPE data is currently funneled via a userspace daemon and debugfs. Direct
> >   kernel-side perf_event sampling integration is planned as a follow-up.
> 
> Nice, I think this will align our projects and reduce unnecessary
> duplication.  I'd encourage you to try this path.
> 
> > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
> >   defaults subject to further tuning.
> 
> I don't fully understand this part.  Could you please elaborate?
> 
> > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
> >   characteristic, not introduced by this series. Setting nr_accesses/min=0
> >   serves as an effective workaround for the split path.
> 
> I don't fully understand this either.  Could you please elaborate and
> enlighten me?
> 
> > 
> > Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> > Cc: SeongJae Park <sj@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Nico Pache <npache@redhat.com>
> > Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> > 
> > Wang Lian (6):
> >   mm/damon: add target_order field for DAMOS_COLLAPSE
> >   mm/khugepaged: add damon_collapse_folio_range() for external callers
> >   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
> >   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
> >   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
> >   mm/damon: add SPE feedback for sub-THP split decisions
> > 
> >  include/linux/damon.h      |  18 ++
> >  include/linux/khugepaged.h |   3 +
> >  mm/damon/Kconfig           |  12 +
> >  mm/damon/Makefile          |   1 +
> >  mm/damon/core.c            |   3 +
> >  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
> >  mm/damon/spe.h             |  62 +++++
> >  mm/damon/sysfs-schemes.c   |  96 +++++++
> >  mm/damon/vaddr.c           | 118 +++++++++
> >  mm/khugepaged.c            |  39 +++
> >  10 files changed, 857 insertions(+)
> >  create mode 100644 mm/damon/spe.c
> >  create mode 100644 mm/damon/spe.h
> 
> Because this is an RFC and we found a high-level TODO (trying a perf event
> based approach instead of debugfs), I will skip reviewing the details.  If
> there are specific parts for which you want my detailed review, let me know.
> 
> Also, the perf event based monitoring is a long-term project.  The ETA is
> LSFMMBPF'27.  If you cannot wait until then, trying the alternative approaches
> (using the address filter or the user_input quota goal) and upstreaming the
> dependent parts (the DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first
> could also be a nice approach, in my opinion.
> 
> [1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
> [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> [3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/

The above link ([3]) is wrong, sorry.  Please use the one below.

[3] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/


Thanks,
SJ

[...]
