Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback


From: wang lian <lianux.mm@gmail.com>
To: Gutierrez Asier <gutierrez.asier@huawei-partners.com>
Cc: sj@kernel.org, akpm@linux-foundation.org, npache@redhat.com,
	daichaobing@sangfor.com.cn, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kunwu.chan@gmail.com
Subject: Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
Date: Thu, 18 Jun 2026 21:13:07 +0800
Message-ID: <459C0876-AC37-4A52-BF11-6436FF33CA90@gmail.com>
In-Reply-To: <552e1d3a-3c60-4110-8601-00cd9bc7998a@huawei-partners.com>

> On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> 
> Hi Wang,
> 
> On 6/18/2026 12:48 PM, Wang Lian wrote:
>> Received an off-list report that DAMON significantly overestimates
>> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
>> running Oracle workloads.
>> 
>> The root cause is structural: a PMD entry covers 512 4KB subpages with
>> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
>> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
>> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
>> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
>> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
>> subsequent accesses. x86 is not subject to this specific blindness under similar
>> conditions.
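
The TLB-reach argument above can be checked with a few lines of arithmetic. This is only a sketch: it assumes every L2 TLB entry can hold one translation of the given page size, which simplifies real TLB organization, and the helper names are made up for illustration.

```python
# Illustrative arithmetic only. Assumes each L2 TLB entry maps exactly one
# page of the given size; real TLBs are more complicated than this.
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

L2_TLB_ENTRIES = 2048  # Kunpeng 920 figure quoted in the cover letter


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space the TLB can map without taking a miss."""
    return entries * page_size


def fits_in_tlb(working_set: int, page_size: int,
                entries: int = L2_TLB_ENTRIES) -> bool:
    """True if all pages of the working set can stay resident in the TLB."""
    pages_needed = -(-working_set // page_size)  # ceiling division
    return pages_needed <= entries


if __name__ == "__main__":
    print("reach with 2MB THP (GB):", tlb_reach(L2_TLB_ENTRIES, 2 * MB) // GB)
    print("16MB @ 2MB fits:", fits_in_tlb(16 * MB, 2 * MB))    # 8 PMD entries
    print("512MB @ 4KB fits:", fits_in_tlb(512 * MB, 4 * KB))
    print("16GB @ 2MB fits:", fits_in_tlb(16 * GB, 2 * MB))
```

Under that assumption, only the 16MB THP case stays fully resident, which matches the T1 observation further down.
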
> 
> Have you tried setting the minimum region size to 2MB?
> 
>> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
>> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
>> THP=always causes DAMON to report the entire 8GB as hot (a 512x overestimate
>> of the actual 16MB hotspot), while THP=never reports only ~245MB, a ~33x gap
>> between the two THP modes. ARM SPE hardware profiling
>> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
>> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
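
For reference, the ratios quoted above follow directly from the reported sizes. A small sketch, using the ~245MB THP=never figure from T2 below (the variable names are mine, not from the series):

```python
MB = 1 << 20
GB = 1 << 30

MMAP_SIZE = 8 * GB             # size of the synthetic mapping
ACTUAL_HOT = 16 * MB           # ground-truth hotspot
REPORTED_THP_ALWAYS = 8 * GB   # DAMON result with THP=always
REPORTED_THP_NEVER = 245 * MB  # approximate DAMON result with THP=never (T2)


def ratio(numerator: int, denominator: int) -> float:
    """Plain ratio of two sizes in bytes."""
    return numerator / denominator


if __name__ == "__main__":
    print(f"hot share of mmap: {100 * ratio(ACTUAL_HOT, MMAP_SIZE):.1f}%")
    print(f"THP=always vs truth: {ratio(REPORTED_THP_ALWAYS, ACTUAL_HOT):.0f}x")
    print(f"THP=never vs truth:  {ratio(REPORTED_THP_NEVER, ACTUAL_HOT):.0f}x")
    print(f"always vs never:     {ratio(REPORTED_THP_ALWAYS, REPORTED_THP_NEVER):.0f}x")
```
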
> 
> With THP=always, the whole process gets collapsed into huge pages anyway.
> This is outside DAMON's control.
> 
> Have you tried setting THP to never and running DAMON with the DAMOS_COLLAPSE
> action?
> 
>> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
>> mTHP-aware via a new target_order field, and introduces a new
>> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
>> into smaller mTHPs when most subpages are probed as cold, and collapse them
>> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
>> path can incorporate fine-grained hardware feedback from ARM SPE.
>> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
>> signal filter: it first identifies the peak chunk access count, and then marks
>> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
>> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
>> split decision: only folios with a hot fraction below this threshold are
>> eligible for splitting. When no SPE data is available, the infrastructure
>> gracefully falls back to explicit PTE-level scanning via folio_walk.
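
To make the two-pass filter and the split decision concrete, here is a minimal userspace sketch of the logic as described above. It is not the code from patch 6: the function names are hypothetical, each counter is treated as one chunk, and returning None stands in for "no SPE data, fall back to PTE scanning".

```python
from typing import List, Optional, Sequence


def hot_bitmap(counts: Sequence[int], signal_divisor: int = 10) -> List[bool]:
    """Two passes: find the peak count, then mark chunks >= peak/divisor."""
    peak = max(counts, default=0)            # pass 1: peak access count
    if peak == 0:
        return [False] * len(counts)
    # pass 2: c >= peak / divisor, written with integers as c * divisor >= peak
    return [c * signal_divisor >= peak for c in counts]


def hot_fraction(counts: Sequence[int]) -> Optional[float]:
    """Fraction of chunks marked hot, or None when there are no samples."""
    if not counts or max(counts) == 0:
        return None                          # caller would fall back to PTE scan
    bitmap = hot_bitmap(counts)
    return sum(bitmap) / len(bitmap)


def should_split(counts: Sequence[int],
                 hot_threshold: float = 0.30) -> Optional[bool]:
    """Split only when the hot fraction is below the threshold."""
    fraction = hot_fraction(counts)
    if fraction is None:
        return None
    return fraction < hot_threshold


if __name__ == "__main__":
    sparse = [0] * 512
    sparse[0], sparse[1], sparse[2] = 20000, 19000, 794  # 794 is below peak/10
    print(hot_fraction(sparse), should_split(sparse))
    mostly_hot = [100] * 461 + [0] * 51                  # about 90% hot
    print(should_split(mostly_hot))
```

The integer comparison avoids float rounding at the 1/10 boundary, which is the kind of detail that would matter more in kernel code than here.
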
>> 
>> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
>> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
>> 
>> Collapse path (patches 1-3):
>>  DAMON scheme action=COLLAPSE, target_order=N
>>  -> damos_va_collapse() -> damon_collapse_folio_range()
>>  -> collapse_huge_page()
>> 
>> Split path (patches 4-5):
>>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>>  -> split_folio_to_order()
>> 
>> SPE feedback infrastructure (patch 6):
>>  perf script -> spe_hist -> debugfs spe_feed
>>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
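
The per-folio bookkeeping in the diagram above can be sketched as follows. Assumptions: a plain dict stands in for the kernel rbtree, 4KB base pages give 512 subpages per PMD THP, and the class and method names are hypothetical. The 30s entry TTL is left out.

```python
from typing import Dict, List, Optional

SUBPAGES_PER_THP = 512  # 2MB THP / 4KB base page (assumed)


class SpeHistogram:
    """Maps a THP-aligned PFN to per-subpage access counts."""

    def __init__(self) -> None:
        # dict used here in place of the rbtree mentioned in the cover letter
        self._folios: Dict[int, List[int]] = {}

    def record(self, pfn: int, count: int = 1) -> None:
        """Account `count` sampled accesses to the page frame `pfn`."""
        base = pfn & ~(SUBPAGES_PER_THP - 1)   # THP-aligned PFN (the key)
        index = pfn & (SUBPAGES_PER_THP - 1)   # subpage slot within the THP
        counts = self._folios.setdefault(base, [0] * SUBPAGES_PER_THP)
        counts[index] += count

    def counts(self, base_pfn: int) -> Optional[List[int]]:
        """Return the access_count array for a THP, or None if never sampled."""
        return self._folios.get(base_pfn)


if __name__ == "__main__":
    hist = SpeHistogram()
    hist.record(0x820DB800 + 3, 5)
    hist.record(0x820DB800 + 3)
    hist.record(0x820DB9FF)                    # last subpage of the same THP
    per_subpage = hist.counts(0x820DB800)
    print(per_subpage[3], per_subpage[511])
```
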
>> 
>> The userspace helper tools (including the spe_hist histogram builder and
>> validation scripts) are archived at:
>>  https://github.com/lianux-mm/damon_spe
>> 
>> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
>> 7.1.0-rc5+):
>> 
>>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>>     DAMON to function normally.
>> 
>>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
>>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>>     between the two modes was ~33x.
>> 
>>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>>     behaved normally. We could not reproduce THP inflation with RocksDB.
>>     The workloads fundamentally vulnerable to this structural issue remain KVM
>>     guests, JVM large heaps, and PostgreSQL shared_buffers.
>> 
>>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
>> 
>>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>>     concentrated across only 3 out of 512 subpages.
> 
> The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> we should keep in user space, and let the kernel provide only the API for adding
> different metrics, including PMU and SPE.

Hi Asier,

Thanks for your prompt and constructive reply. I really appreciate your
detailed analysis of the mTHP and SPE interaction.

Your point about the design boundary (whether this fits better in user
space or should be aligned with DAMON-X) is highly valuable.

Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
then. I look forward to cooperating with both of you to refine this 
design and find the best architectural fit for the subsystem.

Thanks,
Wang Lian
>>  End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>>     hot THP intact, while successfully splitting a 25%-hot (mostly cold) THP into 128x16KB folios.
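
The folio counts in T4 and in the end-to-end test follow from the target order. A quick check, assuming 4KB base pages so that order 2 is 16KB and order 9 is a 2MB PMD THP (helper names are illustrative only):

```python
PAGE_SIZE = 4096  # assumes 4KB base pages
MB = 1 << 20


def folio_bytes(order: int) -> int:
    """Size in bytes of a folio of the given order."""
    return PAGE_SIZE << order


def folios_after_split(region_bytes: int, target_order: int) -> int:
    """Number of target-order folios a fully split region turns into."""
    return region_bytes // folio_bytes(target_order)


if __name__ == "__main__":
    print(folio_bytes(2) // 1024, "KB per order-2 folio")
    print(folios_after_split(2 * MB, 2), "folios from one PMD THP")
    print(folios_after_split(256 * MB, 2), "folios from the 256MB region in T4")
```
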
>> 
>> Known limitations:
>> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>>  While individual component verification is complete, full integration testing
>>  is planned in collaboration with Sangfor.
>> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>>  coordination/back-off mechanism is required to avoid ping-pong effects.
>> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>>  kernel-side perf_event sampling integration is planned as a follow-up.
>> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>>  defaults subject to further tuning.
>> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>>  characteristic, not introduced by this series. Setting nr_accesses/min=0
>>  serves as an effective workaround for the split path.
>> 
>> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
>> 
>> Wang Lian (6):
>>  mm/damon: add target_order field for DAMOS_COLLAPSE
>>  mm/khugepaged: add damon_collapse_folio_range() for external callers
>>  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>>  mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>>  mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>>  mm/damon: add SPE feedback for sub-THP split decisions
>> 
>> include/linux/damon.h      |  18 ++
>> include/linux/khugepaged.h |   3 +
>> mm/damon/Kconfig           |  12 +
>> mm/damon/Makefile          |   1 +
>> mm/damon/core.c            |   3 +
>> mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>> mm/damon/spe.h             |  62 +++++
>> mm/damon/sysfs-schemes.c   |  96 +++++++
>> mm/damon/vaddr.c           | 118 +++++++++
>> mm/khugepaged.c            |  39 +++
>> 10 files changed, 857 insertions(+)
>> create mode 100644 mm/damon/spe.c
>> create mode 100644 mm/damon/spe.h
>> 
>> --
>> 2.50.1 (Apple Git-155)
>> 
> 
> -- 
> Asier Gutierrez
> Huawei

