Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback

* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
       [not found] <20260619015411.9554-1-sj@kernel.org>
@ 2026-06-19  1:59 ` SeongJae Park
  0 siblings, 0 replies; 4+ messages in thread
From: SeongJae Park @ 2026-06-19  1:59 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Wang Lian, akpm, npache, gutierrez.asier, daichaobing, linux-mm,
	linux-kernel, kunwu.chan, damon

+ damon@lists.linux.dev

Please Cc damon@lists.linux.dev from the next revision, and for all DAMON
patches in the future.


Thanks,
SJ

On Thu, 18 Jun 2026 18:54:23 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Thu, 18 Jun 2026 18:47:16 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > Hello Lian,
> > 
> > On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:
> > 
> > > Received an off-list report that DAMON significantly overestimates
> > > hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> > > running Oracle workloads.
> > > 
> > > The root cause is structural: a PMD entry covers 512 4KB subpages with
> > > a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> > > 2MB region appears "hot" to DAMON. On ARM64,
> > 
> > This makes sense to me.  I also agree this could have caused the reported
> > problem.  And this is a known limitation of DAMON.  My suggestion for a
> > straightforward workaround is to use the 'age' information of DAMON for
> > better identification of the hot memory.
> > 
> > That is, I don't expect really hot data in real production systems to be
> > evenly scattered.  Even if it is, I don't expect all of it to be accessed
> > equally frequently.  Only a few parts would be accessed frequently for long.
> > Even then, some data would stay frequently accessed for longer than the
> > rest.  You could look at the distribution of the pattern and treat the
> > hottest X % of memory as hot.
> > 
> > We invented idle time percentiles [1] for a similar purpose, though it
> > focuses more on finding cold memory.
> > 
> > I understand this patch series is trying to build a more fundamental and
> > better solution for hardware that can do better.  Makes sense to me.
> > 
> > > this is compounded by the
> > > hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> > > working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> > > running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> > > subsequent accesses.
> > 
> > This makes sense to me.  However, I don't get how this is contributing to the
> > problem.  Could you please elaborate?
> > 
> > > x86 is not subject to this specific blindness under similar
> > > conditions.
> > 
> > To my understanding, the same issue exists on x86.  If the TLB hits, the
> > Accessed bit is not set, and DAMON shows it as unaccessed.  Am I missing
> > something?
> > 
> > > 
> > > We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> > > workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> > > THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> > > reports only a few hundred MB -- a 512x overestimate relative to the actual
> > > 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> > > independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> > > over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > I don't think real-world production systems have this very artificial
> > access pattern.  I believe (or, hope) use of 'age' can work around the issue
> > to a reasonable level in many cases.  I understand this setup is only for PoC,
> > and I think this is a well-designed test for the purpose.  Thank you for
> > sharing this.
> > 
> > > 
> > > To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> > > mTHP-aware via a new target_order field,
> > 
> > Makes sense, and sounds nice.  Definitely no one size fits all!
> > 
> > > and introduces a new
> > > DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> > > into smaller mTHPs
> > 
> > Nice!  Asier was planning to do similar work in the future.  I think you could
> > collaborate to reduce unnecessary duplication!
> > 
> > I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE, though.
> > Say, DAMOS_SPLIT ?
> > 
> > > when most subpages are probed as cold, and collapse them
> > > back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> > > path can incorporate fine-grained hardware feedback from ARM SPE.
> > > 
> > > The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> > > signal filter: it first identifies the peak chunk access count, and then marks
> > > sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> > > SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> > > split decision: only folios with a hot fraction below this threshold are
> > > eligible for splitting. When no SPE data is available, the infrastructure
> > > gracefully falls back to explicit PTE-level scanning via folio_walk.
> > > 
> > > Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> > > through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> > 
> > So you implemented a debugfs interface?  That must be a nice approach for PoC.
> > But it may be difficult to upstream as is.
> > 
> > You could build a control plane that decides the exact address ranges to split,
> > and directly feed them to DAMOS using a DAMOS address filter.  max_nr_snapshots
> > can also be useful for making this kind of user-space control more
> > deterministic.
> > 
> > For simpler user-space control, using the user_input DAMOS quota goal [2] is
> > another option.
> > 
> > We are also planning [3] to extend DAMON for perf events.  On top of it, we
> > might be able to extend it further so that DAMON itself utilizes ARM SPE, and
> > do all this with DAMOS alone, without user-space help.
> > 
> > Based on the 'limitations' section below, I understand this is only for PoC at
> > the moment, and you plan to explore the perf event based approach.  I'd also
> > recommend that.
> > 
> > > 
> > > Collapse path (patches 1-3):
> > >   DAMON scheme action=COLLAPSE, target_order=N
> > >   -> damos_va_collapse() -> damon_collapse_folio_range()
> > >   -> collapse_huge_page()
> > > 
> > > Split path (patches 4-5):
> > >   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> > >   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> > >   -> split_folio_to_order()
> > > 
> > > SPE feedback infrastructure (patch 6):
> > >   perf script -> spe_hist -> debugfs spe_feed
> > >   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> > >   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> > > 
> > > The userspace helper tools (including the spe_hist histogram builder and
> > > validation scripts) are archived at:
> > >   https://github.com/lianux-mm/damon_spe
> > 
> > Thank you for kindly making all the code open!
> > 
> > > 
> > > Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> > > 7.1.0-rc5+):
> > > 
> > >   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> > >      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> > >      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> > >      DAMON to function normally.
> > > 
> > >   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> > >      THP=always: DAMON reported 8GB hot (512x vs ground truth);
> > >      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> > >      between the two modes was ~33x.
> > > 
> > >   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> > >      behaved normally. We could not reproduce THP inflation with RocksDB.
> > >      The workloads fundamentally vulnerable to this structural issue remain KVM
> > >      guests, JVM large heaps, and PostgreSQL shared_buffers.
> > > 
> > >   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> > >      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> > >      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> > > 
> > >   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> > >      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> > >      concentrated across only 3 out of 512 subpages.
> > > 
> > >   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
> > >      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> > > 
> > > Known limitations:
> > > - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
> > >   While individual component verification is complete, full integration testing
> > >   is planned in collaboration with Sangfor.
> > > - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
> > >   coordination/back-off mechanism is required to avoid ping-pong effects.
> > 
> > Do you really need to run khugepaged together, when you already have
> > DAMOS_COLLAPSE and are running DAMON for hugepage splits anyway?
> > 
> > > - SPE data is currently funneled via a userspace daemon and debugfs. Direct
> > >   kernel-side perf_event sampling integration is planned as a follow-up.
> > 
> > Nice, I think this will align our projects and reduce unnecessary
> > duplication.  I'd encourage you to try this path.
> > 
> > > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
> > >   defaults subject to further tuning.
> > 
> > I don't fully understand this part.  Could you please elaborate?
> > 
> > > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
> > >   characteristic, not introduced by this series. Setting nr_accesses/min=0
> > >   serves as an effective workaround for the split path.
> > 
> > I don't fully understand this, either.  Could you please elaborate and
> > enlighten me?
> > 
> > > 
> > > Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> > > Cc: SeongJae Park <sj@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Nico Pache <npache@redhat.com>
> > > Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> > > 
> > > Wang Lian (6):
> > >   mm/damon: add target_order field for DAMOS_COLLAPSE
> > >   mm/khugepaged: add damon_collapse_folio_range() for external callers
> > >   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
> > >   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
> > >   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
> > >   mm/damon: add SPE feedback for sub-THP split decisions
> > > 
> > >  include/linux/damon.h      |  18 ++
> > >  include/linux/khugepaged.h |   3 +
> > >  mm/damon/Kconfig           |  12 +
> > >  mm/damon/Makefile          |   1 +
> > >  mm/damon/core.c            |   3 +
> > >  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
> > >  mm/damon/spe.h             |  62 +++++
> > >  mm/damon/sysfs-schemes.c   |  96 +++++++
> > >  mm/damon/vaddr.c           | 118 +++++++++
> > >  mm/khugepaged.c            |  39 +++
> > >  10 files changed, 857 insertions(+)
> > >  create mode 100644 mm/damon/spe.c
> > >  create mode 100644 mm/damon/spe.h
> > 
> > Because this is an RFC and we found a high-level TODO (trying a perf event
> > based approach instead of debugfs), I will skip reviewing the details.  If
> > there are specific parts for which you want my detailed review, let me know.
> > 
> > Also, the perf event based monitoring is a long-term project.  The ETA is
> > LSFMMBPF'27.  If you cannot wait until then, trying the alternative approaches
> > (using address filter or user_input quota goal) and upstreaming the dependent
> > parts (DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first could also be
> > a nice approach, in my opinion.
> > 
> > [1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
> > [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> > [3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/
> 
> The above link ([3]) is wrong, sorry.  Please use below.
> 
> [3] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/
> 
> 
> Thanks,
> SJ
> 
> [...]
> 

Sent using hkml (https://github.com/sjp38/hackermail)

* [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
@ 2026-06-18  9:48 Wang Lian
  2026-06-18 11:03 ` Gutierrez Asier
  0 siblings, 1 reply; 4+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

Received an off-list report that DAMON significantly overestimates
hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
running Oracle workloads.

The root cause is structural: a PMD entry covers 512 4KB subpages with
a single Access Flag (AF) bit. When any one subpage is accessed, the entire
2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
subsequent accesses. x86 is not subject to this specific blindness under similar
conditions.
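
As a rough sanity check of the TLB-reach figures above, the arithmetic can be
sketched as below. This is only a back-of-the-envelope model: it assumes that
every L2 TLB entry can hold either a 4KB or a 2MB mapping, and that a working
set inside the reach never misses after warm-up. The function names are made up
for this illustration and do not exist in the series.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

L2_TLB_ENTRIES = 2048   # Kunpeng 920 figure quoted above
THP_SIZE = 2 * MIB      # PMD-mapped THP with 4KB base pages


def tlb_reach(entries, page_size):
    """Bytes of address space the TLB can map at once (assumed model)."""
    return entries * page_size


def fits_in_tlb(working_set, page_size=THP_SIZE, entries=L2_TLB_ENTRIES):
    """True when the whole working set can stay TLB-resident."""
    return working_set <= tlb_reach(entries, page_size)


reach_with_thp = tlb_reach(L2_TLB_ENTRIES, THP_SIZE)
pmds_for_16mib = (16 * MIB) // THP_SIZE
```

Under these assumptions the reach with 2MB mappings is 4GB, so the 16MB
working set (8 PMD entries) stays resident, while 512MB mapped with 4KB pages
or 16GB mapped with THP does not.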

We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
THP=always causes DAMON to report the entire 8GB as hot, while THP=never
reports only a few hundred MB -- a 512x overestimate relative to the actual
16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
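
For reference, the ratios quoted above follow from simple division. The sketch
below only restates that arithmetic; the ~245MB figure for THP=never is the T2
result reported further down in this letter, and the variable names are
illustrative.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

ground_truth = 16 * MIB     # memory that is actually hot
mmap_size = 8 * GIB         # total mapped region
hot_thp_always = 8 * GIB    # DAMON report with THP=always
hot_thp_never = 245 * MIB   # approximate DAMON report with THP=never (T2)

hot_share = ground_truth / mmap_size          # about 0.2% of the mapping
over_always = hot_thp_always / ground_truth   # overestimate with THP=always
over_never = hot_thp_never / ground_truth     # overestimate with THP=never
mode_gap = hot_thp_always / hot_thp_never     # gap between the two THP modes
```

This gives 512x for THP=always, roughly 15x for THP=never, and roughly 33x
between the two modes.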

To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
mTHP-aware via a new target_order field, and introduces a new
DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
into smaller mTHPs when most subpages are probed as cold, and collapse them
back when beneficial. To resolve the sub-PMD monitoring blindness, the split
path can incorporate fine-grained hardware feedback from ARM SPE.

The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
signal filter: it first identifies the peak chunk access count, and then marks
sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
SPE sampling noise. A configurable hot_threshold (default 30%) controls the
split decision: only folios with a hot fraction below this threshold are
eligible for splitting. When no SPE data is available, the infrastructure
gracefully falls back to explicit PTE-level scanning via folio_walk.
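
The two-pass filter and the split decision described above can be sketched in
userspace Python as follows. This is a minimal illustration of the idea, not
the kernel code in mm/damon/spe.c: the function names, the integer rounding,
and the use of None to signal "no SPE data, fall back to PTE scanning" are all
assumptions made for this sketch.

```python
def hot_bitmap(access_counts, peak_divisor=10):
    """Pass 1 finds the peak count; pass 2 marks chunks at >= peak/divisor."""
    peak = max(access_counts, default=0)
    if peak == 0:
        return [False] * len(access_counts)
    # Compare in integers: c >= peak / divisor  <=>  c * divisor >= peak.
    return [c * peak_divisor >= peak for c in access_counts]


def hot_fraction_percent(access_counts):
    """Percentage of chunks marked hot, rounded down."""
    bitmap = hot_bitmap(access_counts)
    if not bitmap:
        return 0
    return 100 * sum(bitmap) // len(bitmap)


def should_split(access_counts, hot_threshold=30):
    """True/False for the split decision, None when there is no SPE data."""
    if access_counts is None:
        return None  # hypothetical signal: caller falls back to PTE scanning
    return hot_fraction_percent(access_counts) < hot_threshold


# Example: accesses concentrated in 3 of 512 subpages plus one noisy sample.
counts = [0] * 512
counts[5], counts[6], counts[7] = 20000, 15000, 4794
counts[100] = 3  # below 1/10 of the peak, so treated as noise
```

With this input only the three busy subpages are marked hot, the hot fraction
rounds down to 0%, and the folio is eligible for splitting under the default
30% threshold.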

Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
through a histogram builder into /sys/kernel/debug/damon/spe_feed).

Collapse path (patches 1-3):
  DAMON scheme action=COLLAPSE, target_order=N
  -> damos_va_collapse() -> damon_collapse_folio_range()
  -> collapse_huge_page()

Split path (patches 4-5):
  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
  -> split_folio_to_order()

SPE feedback infrastructure (patch 6):
  perf script -> spe_hist -> debugfs spe_feed
  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
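
The per-folio histogram in the last diagram can be illustrated with a small
userspace sketch. A plain dict stands in for the kernel rbtree, the 30s entry
TTL is not modeled, and the function names are hypothetical; the sketch only
shows how a sampled PFN maps to a THP-aligned key and a subpage index,
assuming 512 subpages per THP.

```python
from collections import defaultdict

SUBPAGES_PER_THP = 512  # 2MB THP / 4KB base page


def thp_key(pfn):
    """Round a PFN down to the first PFN of its 2MB-aligned THP."""
    return pfn & ~(SUBPAGES_PER_THP - 1)


def build_histogram(sample_pfns):
    """Map THP-aligned PFN -> list of 512 per-subpage access counts."""
    hist = defaultdict(lambda: [0] * SUBPAGES_PER_THP)
    for pfn in sample_pfns:
        hist[thp_key(pfn)][pfn % SUBPAGES_PER_THP] += 1
    return hist


# Made-up samples around the PFN mentioned in T5 below.
samples = [0x820db800, 0x820db801, 0x820db801, 0x820db9ff, 0x820dba00]
hist = build_histogram(samples)
```

Here the first four samples land in the THP keyed by 0x820db800, and the last
one starts a new entry keyed by 0x820dba00.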

The userspace helper tools (including the spe_hist histogram builder and
validation scripts) are archived at:
  https://github.com/lianux-mm/damon_spe

Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
7.1.0-rc5+):

  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
     DAMON to function normally.

  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
     THP=always: DAMON reported 8GB hot (512x vs ground truth);
     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
     between the two modes was ~33x.

  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
     behaved normally. We could not reproduce THP inflation with RocksDB.
     The workloads fundamentally vulnerable to this structural issue remain KVM
     guests, JVM large heaps, and PostgreSQL shared_buffers.

  T4 min=0 deadlock break: A 256MB THP workload induced the DAMON blind spot.
     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
     shattered the region into 16384x16KB folios, allowing DAMON to fully recover.

  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
     concentrated across only 3 out of 512 subpages.

  End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
     hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
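
The folio counts in T4 and in the end-to-end test follow from the target folio
size. The sketch below restates that arithmetic, assuming 4KB base pages so
that a 16KB folio corresponds to order 2; the helper names are illustrative
and are not part of the series.

```python
KIB = 1024
MIB = 1024 * KIB
BASE_PAGE = 4 * KIB  # assumed base page size


def folio_order(folio_size, base_page=BASE_PAGE):
    """Return n such that folio_size == base_page << n."""
    pages, rem = divmod(folio_size, base_page)
    if rem or pages <= 0 or pages & (pages - 1):
        raise ValueError("folio size must be a power-of-two number of pages")
    return pages.bit_length() - 1


def folios_after_split(region_size, target_order, base_page=BASE_PAGE):
    """Number of folios when region_size is fully split to target_order."""
    return region_size // (base_page << target_order)


order_16k = folio_order(16 * KIB)
t4_folios = folios_after_split(256 * MIB, order_16k)
per_thp_folios = folios_after_split(2 * MIB, order_16k)
```

Splitting 256MB to order 2 yields 16384 folios, and a single 2MB THP yields
128, matching the figures above.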

Known limitations:
- The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
  While individual component verification is complete, full integration testing
  is planned in collaboration with Sangfor.
- khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
  coordination/back-off mechanism is required to avoid ping-pong effects.
- SPE data is currently funneled via a userspace daemon and debugfs. Direct
  kernel-side perf_event sampling integration is planned as a follow-up.
- The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
  defaults subject to further tuning.
- The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
  characteristic, not introduced by this series. Setting nr_accesses/min=0
  serves as an effective workaround for the split path.

Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
Cc: SeongJae Park <sj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Wang Lian <lianux.mm@gmail.com>

Wang Lian (6):
  mm/damon: add target_order field for DAMOS_COLLAPSE
  mm/khugepaged: add damon_collapse_folio_range() for external callers
  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
  mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
  mm/damon: add SPE feedback for sub-THP split decisions

 include/linux/damon.h      |  18 ++
 include/linux/khugepaged.h |   3 +
 mm/damon/Kconfig           |  12 +
 mm/damon/Makefile          |   1 +
 mm/damon/core.c            |   3 +
 mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
 mm/damon/spe.h             |  62 +++++
 mm/damon/sysfs-schemes.c   |  96 +++++++
 mm/damon/vaddr.c           | 118 +++++++++
 mm/khugepaged.c            |  39 +++
 10 files changed, 857 insertions(+)
 create mode 100644 mm/damon/spe.c
 create mode 100644 mm/damon/spe.h

--
2.50.1 (Apple Git-155)

* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18  9:48 Wang Lian
@ 2026-06-18 11:03 ` Gutierrez Asier
  2026-06-18 13:13   ` wang lian
  0 siblings, 1 reply; 4+ messages in thread
From: Gutierrez Asier @ 2026-06-18 11:03 UTC (permalink / raw)
  To: Wang Lian, sj, akpm
  Cc: npache, daichaobing, linux-mm, linux-kernel, kunwu.chan

Hi Wang,

On 6/18/2026 12:48 PM, Wang Lian wrote:
> Received an off-list report that DAMON significantly overestimates
> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> running Oracle workloads.
> 
> The root cause is structural: a PMD entry covers 512 4KB subpages with
> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> subsequent accesses. x86 is not subject to this specific blindness under similar
> conditions.

Have you tried setting the minimum region size to 2MB?

> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> reports only a few hundred MB -- a 512x overestimate relative to the actual
> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.

THP=always will just collapse the entire process into huge pages anyway. This
is outside DAMON's control.

Have you tried setting THP to never and running DAMON with the DAMOS_COLLAPSE
action?

> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> mTHP-aware via a new target_order field, and introduces a new
> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> into smaller mTHPs when most subpages are probed as cold, and collapse them
> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> path can incorporate fine-grained hardware feedback from ARM SPE.
> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> signal filter: it first identifies the peak chunk access count, and then marks
> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> split decision: only folios with a hot fraction below this threshold are
> eligible for splitting. When no SPE data is available, the infrastructure
> gracefully falls back to explicit PTE-level scanning via folio_walk.
> 
> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> 
> Collapse path (patches 1-3):
>   DAMON scheme action=COLLAPSE, target_order=N
>   -> damos_va_collapse() -> damon_collapse_folio_range()
>   -> collapse_huge_page()
> 
> Split path (patches 4-5):
>   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>   -> split_folio_to_order()
> 
> SPE feedback infrastructure (patch 6):
>   perf script -> spe_hist -> debugfs spe_feed
>   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> 
> The userspace helper tools (including the spe_hist histogram builder and
> validation scripts) are archived at:
>   https://github.com/lianux-mm/damon_spe
> 
> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> 7.1.0-rc5+):
> 
>   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>      DAMON to function normally.
> 
>   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>      THP=always: DAMON reported 8GB hot (512x vs ground truth);
>      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>      between the two modes was ~33x.
> 
>   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>      behaved normally. We could not reproduce THP inflation with RocksDB.
>      The workloads fundamentally vulnerable to this structural issue remain KVM
>      guests, JVM large heaps, and PostgreSQL shared_buffers.
> 
>   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> 
>   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>      concentrated across only 3 out of 512 subpages.

The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
we should keep in the user space and let the kernel provide only the API to add
different metrics, including PMU and SPE.

>   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> 
> Known limitations:
> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>   While individual component verification is complete, full integration testing
>   is planned in collaboration with Sangfor.
> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>   coordination/back-off mechanism is required to avoid ping-pong effects.
> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>   kernel-side perf_event sampling integration is planned as a follow-up.
> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>   defaults subject to further tuning.
> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>   characteristic, not introduced by this series. Setting nr_accesses/min=0
>   serves as an effective workaround for the split path.
> 
> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> 
> Wang Lian (6):
>   mm/damon: add target_order field for DAMOS_COLLAPSE
>   mm/khugepaged: add damon_collapse_folio_range() for external callers
>   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>   mm/damon: add SPE feedback for sub-THP split decisions
> 
>  include/linux/damon.h      |  18 ++
>  include/linux/khugepaged.h |   3 +
>  mm/damon/Kconfig           |  12 +
>  mm/damon/Makefile          |   1 +
>  mm/damon/core.c            |   3 +
>  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>  mm/damon/spe.h             |  62 +++++
>  mm/damon/sysfs-schemes.c   |  96 +++++++
>  mm/damon/vaddr.c           | 118 +++++++++
>  mm/khugepaged.c            |  39 +++
>  10 files changed, 857 insertions(+)
>  create mode 100644 mm/damon/spe.c
>  create mode 100644 mm/damon/spe.h
> 
> --
> 2.50.1 (Apple Git-155)
> 

-- 
Asier Gutierrez
Huawei


* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18 11:03 ` Gutierrez Asier
@ 2026-06-18 13:13   ` wang lian
  0 siblings, 0 replies; 4+ messages in thread
From: wang lian @ 2026-06-18 13:13 UTC (permalink / raw)
  To: Gutierrez Asier
  Cc: sj, akpm, npache, daichaobing, linux-mm, linux-kernel, kunwu.chan

> On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> 
> Hi Wang,
> 
> On 6/18/2026 12:48 PM, Wang Lian wrote:
>> Received an off-list report that DAMON significantly overestimates
>> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
>> running Oracle workloads.
>> 
>> The root cause is structural: a PMD entry covers 512 4KB subpages with
>> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
>> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
>> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
>> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
>> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
>> subsequent accesses. x86 is not subject to this specific blindness under similar
>> conditions.
> 
> Have you tried setting the minimum region size to 2MB?
> 
>> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
>> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
>> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
>> reports only a few hundred MB -- a 512x overestimate relative to the actual
>> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
>> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
>> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> 
> THP=always will just collapse the entire process into huge pages anyway. This
> is outside DAMON's control.
> 
> Have you tried setting THP to never and running DAMON with the DAMOS_COLLAPSE
> action?
> 
>> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
>> mTHP-aware via a new target_order field, and introduces a new
>> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
>> into smaller mTHPs when most subpages are probed as cold, and collapse them
>> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
>> path can incorporate fine-grained hardware feedback from ARM SPE.
>> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
>> signal filter: it first identifies the peak chunk access count, and then marks
>> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
>> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
>> split decision: only folios with a hot fraction below this threshold are
>> eligible for splitting. When no SPE data is available, the infrastructure
>> gracefully falls back to explicit PTE-level scanning via folio_walk.
>> 
>> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
>> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
>> 
>> Collapse path (patches 1-3):
>>  DAMON scheme action=COLLAPSE, target_order=N
>>  -> damos_va_collapse() -> damon_collapse_folio_range()
>>  -> collapse_huge_page()
>> 
>> Split path (patches 4-5):
>>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>>  -> split_folio_to_order()
>> 
>> SPE feedback infrastructure (patch 6):
>>  perf script -> spe_hist -> debugfs spe_feed
>>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
>> 
>> The userspace helper tools (including the spe_hist histogram builder and
>> validation scripts) are archived at:
>>  https://github.com/lianux-mm/damon_spe
>> 
>> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
>> 7.1.0-rc5+):
>> 
>>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>>     DAMON to function normally.
>> 
>>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
>>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>>     between the two modes was ~33x.
>> 
>>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>>     behaved normally. We could not reproduce THP inflation with RocksDB.
>>     The workloads fundamentally vulnerable to this structural issue remain KVM
>>     guests, JVM large heaps, and PostgreSQL shared_buffers.
>> 
>>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
>> 
>>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>>     concentrated across only 3 out of 512 subpages.
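
The ratios quoted in T2 and the folio count in T4 follow directly from the reported sizes. The short check below only reuses the numbers from the list above, with the variable names chosen here for illustration.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

hot_truth = 16 * MB      # actually hot memory in T2
thp_always = 8 * GB      # DAMON-reported hot size with THP=always
thp_never = 245 * MB     # DAMON-reported hot size with THP=never (approximate)

hot_percent = round(100 * hot_truth / thp_always, 1)   # 0.2
always_ratio = thp_always // hot_truth                 # 512
never_ratio = round(thp_never / hot_truth, 1)          # 15.3
mode_gap = round(thp_always / thp_never, 1)            # 33.4

# T4: a 256MB region split into 16KB folios.
t4_folios = (256 * MB) // (16 * KB)                    # 16384
```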
> The SPE part fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> we should keep in user space, letting the kernel provide only the API for adding
> different metrics, including PMU and SPE.

Hi Asier,

Thanks for your prompt and constructive reply. I really appreciate your 
detailed analysis of the mTHP and SPE interaction.

Your point regarding the design boundary, that is, whether this belongs in
user space or should be aligned with DAMON-X, is highly valuable.

Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
then. I look forward to cooperating with both of you to refine this 
design and find the best architectural fit for the subsystem.

Thanks,
Wang Lian
>>  End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>>     hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
>> 
>> Known limitations:
>> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>>  While individual component verification is complete, full integration testing
>>  is planned in collaboration with Sangfor.
>> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>>  coordination/back-off mechanism is required to avoid ping-pong effects.
>> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>>  kernel-side perf_event sampling integration is planned as a follow-up.
>> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>>  defaults subject to further tuning.
>> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>>  characteristic, not introduced by this series. Setting nr_accesses/min=0
>>  serves as an effective workaround for the split path.
>> 
>> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
>> 
>> Wang Lian (6):
>>  mm/damon: add target_order field for DAMOS_COLLAPSE
>>  mm/khugepaged: add damon_collapse_folio_range() for external callers
>>  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>>  mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>>  mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>>  mm/damon: add SPE feedback for sub-THP split decisions
>> 
>> include/linux/damon.h      |  18 ++
>> include/linux/khugepaged.h |   3 +
>> mm/damon/Kconfig           |  12 +
>> mm/damon/Makefile          |   1 +
>> mm/damon/core.c            |   3 +
>> mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>> mm/damon/spe.h             |  62 +++++
>> mm/damon/sysfs-schemes.c   |  96 +++++++
>> mm/damon/vaddr.c           | 118 +++++++++
>> mm/khugepaged.c            |  39 +++
>> 10 files changed, 857 insertions(+)
>> create mode 100644 mm/damon/spe.c
>> create mode 100644 mm/damon/spe.h
>> 
>> --
>> 2.50.1 (Apple Git-155)
>> 
> 
> -- 
> Asier Gutierrez
> Huawei




Thread overview: 4+ messages
     [not found] <20260619015411.9554-1-sj@kernel.org>
2026-06-19  1:59 ` [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback SeongJae Park
2026-06-18  9:48 Wang Lian
2026-06-18 11:03 ` Gutierrez Asier
2026-06-18 13:13   ` wang lian
