[RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback

* [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
@ 2026-06-18  9:48 Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 1/6] mm/damon: add target_order field for DAMOS_COLLAPSE Wang Lian
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

We received an off-list report that DAMON significantly overestimates
hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
running Oracle workloads.

The root cause is structural: a PMD entry covers 512 4KB subpages with
a single Access Flag (AF) bit. When any one subpage is accessed, the entire
2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
subsequent accesses. x86 is not subject to this specific blindness under similar
conditions.
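
The TLB-reach argument above can be checked with a few lines of arithmetic.
This is a minimal userspace sketch, not code from the series; the helper
names are hypothetical, and it assumes every one of the 2048 L2 TLB entries
can hold a 2MB mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of address space a TLB can map without taking a miss. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
	assert(entries > 0 && page_bytes > 0);
	return entries * page_bytes;
}

/* True when the whole working set can stay TLB-resident. */
bool wss_fits_in_tlb(uint64_t wss_bytes, uint64_t entries,
		     uint64_t page_bytes)
{
	return wss_bytes <= tlb_reach_bytes(entries, page_bytes);
}
```

Under that assumption, 2048 entries of 2MB give a 4GB reach, so a 16MB
working set stays resident and the AF is never set again, while a 16GB THP
layout or a 512MB layout with 4KB pages (8MB reach) keeps missing.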

We reproduced this inflation on a Kunpeng 920 platform using a synthetic
workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot).
With THP=always, DAMON reports the entire 8GB as hot, a 512x overestimate
relative to the 16MB hotspot. With THP=never, it reports about 245MB, a
~15x overestimate, leaving a ~33x gap between the two THP modes. ARM SPE
hardware profiling independently confirms this asymmetry: out of 2,005 THPs
sampled system-wide over 10 seconds, 97% had fewer than 10% of their 4KB
subpages actually accessed.
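
For reference, the quoted ratios follow from the reported sizes (8GB and
~245MB reported, 16MB actually hot). The helper below is a hypothetical
illustration only; it rounds to the nearest integer, so the results are
approximate in the same way the "~" figures are.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1ULL << 20)

/* Ratio reported/actual, rounded to the nearest integer. */
uint64_t inflation_factor(uint64_t reported_bytes, uint64_t actual_bytes)
{
	assert(actual_bytes > 0);
	return (reported_bytes + actual_bytes / 2) / actual_bytes;
}
```

With these inputs, 8192MB vs 16MB gives 512, 245MB vs 16MB gives 15, and
8192MB vs 245MB gives 33.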

To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
mTHP-aware via a new target_order field, and introduces a new
DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
into smaller mTHPs when most subpages are probed as cold, and collapse them
back when beneficial. To resolve the sub-PMD monitoring blindness, the split
path can incorporate fine-grained hardware feedback from ARM SPE.

The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
signal filter: it first identifies the peak per-chunk access count, and then
marks chunks with >= 1/10 of the peak count as hot, filtering out SPE
sampling noise. A configurable hot_threshold (default 30%) controls the
split decision: only folios with a hot fraction below this threshold are
eligible for splitting. When no SPE data is available, the infrastructure
falls back to explicit PTE-level scanning via folio_walk.
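
The two-pass filter and the threshold comparison can be sketched in plain
userspace C as follows. This is an illustration of the description above,
not the code in mm/damon/spe.c; the function names and the -1 "no samples"
return value are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Pass 1 finds the peak per-chunk count; pass 2 counts the chunks whose
 * count is at least 1/10 of that peak.  Returns the hot percentage
 * (0-100), or -1 when no samples were recorded at all.
 */
int hot_fraction_percent(const unsigned int *counts, size_t nr)
{
	unsigned int peak = 0;
	size_t i, hot = 0;

	assert(counts && nr > 0);
	for (i = 0; i < nr; i++)
		if (counts[i] > peak)
			peak = counts[i];
	if (!peak)
		return -1;

	for (i = 0; i < nr; i++)
		/* counts[i] >= peak / 10, written without truncation */
		if ((unsigned long long)counts[i] * 10 >= peak)
			hot++;

	return (int)(hot * 100 / nr);
}

/* Split only when the hot fraction is below the threshold. */
bool should_split(int hot_percent, unsigned int hot_threshold)
{
	assert(hot_percent >= 0 && hot_percent <= 100);
	return (unsigned int)hot_percent < hot_threshold;
}
```

Comparing `count * 10 >= peak` instead of `count >= peak / 10` avoids
treating every non-zero chunk as hot when the peak is below 10.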

Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
through a histogram builder into /sys/kernel/debug/damon/spe_feed).

Collapse path (patches 1-3):
  DAMON scheme action=COLLAPSE, target_order=N
  -> damos_va_collapse() -> damon_collapse_folio_range()
  -> collapse_huge_page()

Split path (patches 4-5):
  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
  -> split_folio_to_order()

SPE feedback infrastructure (patch 6):
  perf script -> spe_hist -> debugfs spe_feed
  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision

The userspace helper tools (including the spe_hist histogram builder and
validation scripts) are archived at:
  https://github.com/lianux-mm/damon_spe

Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
7.1.0-rc5+):

  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
     DAMON to function normally.

  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
     THP=always: DAMON reported 8GB hot (512x vs ground truth);
     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
     between the two modes was ~33x.

  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
     behaved normally. We could not reproduce THP inflation with RocksDB.
     The workloads fundamentally vulnerable to this structural issue remain KVM
     guests, JVM large heaps, and PostgreSQL shared_buffers.

  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.

  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
     concentrated across only 3 out of 512 subpages.

  End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
     hot THP intact, while successfully splitting a 25% hot THP (below the 30%
     hot_threshold) into 128x16KB folios.
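
The folio counts quoted in T4 and the end-to-end test follow directly from
the split order. A minimal sketch, assuming 4KB base pages; the helper name
is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12	/* assumption: 4KB base pages */

/* Number of order-@target_order folios that @bytes of memory splits into. */
uint64_t nr_folios_after_split(uint64_t bytes, unsigned int target_order)
{
	uint64_t folio_bytes = 1ULL << (PAGE_SHIFT_4K + target_order);

	assert(bytes % folio_bytes == 0);
	return bytes / folio_bytes;
}
```

For order 2 (16KB), 256MB yields 16384 folios and a single 2MB THP yields
128.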

Known limitations:
- The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
  While individual component verification is complete, full integration testing
  is planned in collaboration with Sangfor.
- khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
  coordination/back-off mechanism is required to avoid ping-pong effects.
- SPE data is currently funneled via a userspace daemon and debugfs. Direct
  kernel-side perf_event sampling integration is planned as a follow-up.
- The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
  defaults subject to further tuning.
- The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
  characteristic, not introduced by this series. Setting nr_accesses/min=0
  serves as an effective workaround for the split path.

Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
Cc: SeongJae Park <sj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Wang Lian <lianux.mm@gmail.com>

Wang Lian (6):
  mm/damon: add target_order field for DAMOS_COLLAPSE
  mm/khugepaged: add damon_collapse_folio_range() for external callers
  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
  mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
  mm/damon: add SPE feedback for sub-THP split decisions

 include/linux/damon.h      |  18 ++
 include/linux/khugepaged.h |   3 +
 mm/damon/Kconfig           |  12 +
 mm/damon/Makefile          |   1 +
 mm/damon/core.c            |   3 +
 mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
 mm/damon/spe.h             |  62 +++++
 mm/damon/sysfs-schemes.c   |  96 +++++++
 mm/damon/vaddr.c           | 118 +++++++++
 mm/khugepaged.c            |  39 +++
 10 files changed, 857 insertions(+)
 create mode 100644 mm/damon/spe.c
 create mode 100644 mm/damon/spe.h

--
2.50.1 (Apple Git-155)

* [RFC PATCH 1/6] mm/damon: add target_order field for DAMOS_COLLAPSE
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 2/6] mm/khugepaged: add damon_collapse_folio_range() for external callers Wang Lian
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

DAMOS_COLLAPSE currently collapses into PMD-sized THPs only.  Add a
target_order field to express per-order mTHP collapse intent.  Zero
means system default (PMD order, same as current behavior).  Valid
values are 0 and 2..HPAGE_PMD_ORDER.

Wire up the sysfs interface: a per-scheme rw file "target_order".
Validate at store time that the value is in range, and warn at scheme
creation time if DAMOS_COLLAPSE is used with an unsupported non-PMD
order, resetting to 0.

The actual mTHP application via the khugepaged wrapper will be added
in subsequent patches.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 include/linux/damon.h    |  5 +++++
 mm/damon/sysfs-schemes.c | 45 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 6f7edb3590ef..5a0587556573 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -572,6 +572,11 @@ struct damos_migrate_dests {
 struct damos {
 	struct damos_access_pattern pattern;
 	enum damos_action action;
+	/*
+	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
+	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
+	 */
+	unsigned int target_order;
 	unsigned long apply_interval_us;
 /* private: internal use only */
 	/*
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 329cfd0bbe9f..735970717048 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -6,7 +6,9 @@
  */
 
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/numa.h>
+#include <linux/huge_mm.h>
 
 #include "sysfs-common.h"
 
@@ -2257,6 +2259,7 @@ struct damon_sysfs_scheme {
 	struct damon_sysfs_stats *stats;
 	struct damon_sysfs_scheme_regions *tried_regions;
 	int target_nid;
+	unsigned int target_order;
 	struct damos_sysfs_dests *dests;
 };
 
@@ -2642,6 +2645,34 @@ static ssize_t target_nid_store(struct kobject *kobj,
 	return err ? err : count;
 }
 
+static ssize_t target_order_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+
+	return sysfs_emit(buf, "%u\n", scheme->target_order);
+}
+
+static ssize_t target_order_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 0, &val);
+	if (err)
+		return err;
+
+	if (val != 0 && (val < 2 || val > HPAGE_PMD_ORDER))
+		return -EINVAL;
+
+	scheme->target_order = val;
+	return count;
+}
+
 static void damon_sysfs_scheme_release(struct kobject *kobj)
 {
 	kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
@@ -2656,10 +2687,14 @@ static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr =
 static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
 		__ATTR_RW_MODE(target_nid, 0600);
 
+static struct kobj_attribute damon_sysfs_scheme_target_order_attr =
+		__ATTR_RW_MODE(target_order, 0600);
+
 static struct attribute *damon_sysfs_scheme_attrs[] = {
 	&damon_sysfs_scheme_action_attr.attr,
 	&damon_sysfs_scheme_apply_interval_us_attr.attr,
 	&damon_sysfs_scheme_target_nid_attr.attr,
+	&damon_sysfs_scheme_target_order_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(damon_sysfs_scheme);
@@ -3005,6 +3040,16 @@ static struct damos *damon_sysfs_mk_scheme(
 	if (!scheme)
 		return NULL;
 
+	if (sysfs_scheme->action == DAMOS_COLLAPSE &&
+	    sysfs_scheme->target_order != 0 &&
+	    sysfs_scheme->target_order != HPAGE_PMD_ORDER) {
+		pr_warn("DAMON collapse: target_order %u not supported, only PMD order (%u) is available. Use 0 or %u.\n",
+			sysfs_scheme->target_order,
+			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
+		sysfs_scheme->target_order = 0;
+	}
+	scheme->target_order = sysfs_scheme->target_order;
+
 	err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota);
 	if (err) {
 		damon_destroy_scheme(scheme);
-- 
2.50.1 (Apple Git-155)



* [RFC PATCH 2/6] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 1/6] mm/damon: add target_order field for DAMOS_COLLAPSE Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 3/6] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Wang Lian
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

Export a thin wrapper around collapse_huge_page() that allows external
subsystems such as DAMON to trigger THP collapse on a target address
range.

Currently restricted to PMD order (HPAGE_PMD_ORDER), since
collapse_huge_page() does not yet support arbitrary mTHP orders.
The restriction can be relaxed when khugepaged gains mTHP support.

The caller must hold a reference to @mm and must not hold the mmap lock:
collapse_huge_page() acquires mmap_read_lock for validation, releases
it, then acquires mmap_write_lock for the actual collapse.  Holding
an outer mmap_read_lock would cause a self-deadlock when the same
thread attempts the inner mmap_write_lock.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 include/linux/khugepaged.h |  3 +++
 mm/khugepaged.c            | 39 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index d7a9053ff4fe..6fb8a6857790 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -20,6 +20,9 @@ extern bool current_is_khugepaged(void);
 void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		bool install_pmd);
 
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order);
+
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..0387841ba2e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -3272,3 +3272,42 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
 			: madvise_collapse_errno(last_fail);
 }
+
+/**
+ * damon_collapse_folio_range() - Collapse base pages in range into a THP
+ * @mm:         mm_struct of the target process
+ * @start_addr: start address (must be order-aligned)
+ * @target_order: page order of the collapse result (currently only
+ *                HPAGE_PMD_ORDER is supported)
+ *
+ * Thin wrapper around collapse_huge_page() for external callers such as
+ * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
+ * lock: collapse_huge_page() acquires mmap_read_lock for validation,
+ * releases it, then acquires mmap_write_lock for the collapse.  Holding
+ * an outer mmap_read_lock would self-deadlock.
+ *
+ * Return: 0 on success, -EINVAL on bad arguments, negative error from
+ *         madvise_collapse_errno() otherwise.
+ */
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order)
+{
+	struct collapse_control cc = {
+		.is_khugepaged = false,
+	};
+	enum scan_result result;
+
+	if (target_order != HPAGE_PMD_ORDER) {
+		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
+			     __func__, HPAGE_PMD_ORDER, target_order);
+		return -EINVAL;
+	}
+	if (start_addr & ((PAGE_SIZE << target_order) - 1))
+		return -EINVAL;
+
+	result = collapse_huge_page(mm, start_addr, 1, 0, &cc, target_order);
+	if (result == SCAN_SUCCEED)
+		return 0;
+	return madvise_collapse_errno(result);
+}
+EXPORT_SYMBOL_GPL(damon_collapse_folio_range);
-- 
2.50.1 (Apple Git-155)



* [RFC PATCH 3/6] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 1/6] mm/damon: add target_order field for DAMOS_COLLAPSE Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 2/6] mm/khugepaged: add damon_collapse_folio_range() for external callers Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 4/6] mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold Wang Lian
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

When target_order is set (non-zero), the DAMOS_COLLAPSE handler now calls
damon_collapse_folio_range() to collapse pages into the requested mTHP
size, iterating over the target region in order-aligned chunks.  When
target_order is 0 (default), the existing madvise(MADV_COLLAPSE) path is
used, preserving backwards compatibility.

Region boundaries are expanded outward to the covering aligned range
(ALIGN_DOWN start, ALIGN end) so that collapse works even after
kdamond_split_regions reduces region sizes below the chunk size.
collapse_huge_page() internally validates VMA bounds, so expanding
beyond the original region is safe.
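
The outward expansion can be illustrated in userspace with stand-ins for
the kernel's ALIGN_DOWN() and ALIGN() macros. This is a sketch only: the
helper names are hypothetical and 4KB base pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for ALIGN_DOWN()/ALIGN(); @sz must be a power of two. */
uint64_t align_down(uint64_t x, uint64_t sz)
{
	return x & ~(sz - 1);
}

uint64_t align_up(uint64_t x, uint64_t sz)
{
	return (x + sz - 1) & ~(sz - 1);
}

/* Number of order-aligned chunks visited for the region [start, end). */
uint64_t nr_collapse_chunks(uint64_t start, uint64_t end, unsigned int order)
{
	uint64_t chunk_sz = 4096ULL << order;	/* assumption: 4KB pages */

	assert(start < end);
	return (align_up(end, chunk_sz) - align_down(start, chunk_sz)) /
		chunk_sz;
}
```

For example, a region [0x201000, 0x3ff000) that is slightly smaller than
2MB still maps to exactly one PMD-order chunk, [0x200000, 0x400000), while
an 8KB region straddling a 2MB boundary maps to two.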

No external mmap lock is held: collapse_huge_page() acquires
mmap_read_lock internally for validation, releases it, then acquires
mmap_write_lock for the actual collapse.  Holding an outer
mmap_read_lock would cause a self-deadlock when the same thread
attempts the inner mmap_write_lock.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 mm/damon/vaddr.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index d27147603564..2a3757c13bf0 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -14,6 +14,7 @@
 #include <linux/page_idle.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
+#include <linux/khugepaged.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -899,6 +900,40 @@ static unsigned long damos_va_stat(struct damon_target *target,
 	return 0;
 }
 
+static unsigned long damos_va_collapse(struct damon_target *target,
+		struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	unsigned long addr, end, chunk_sz;
+	unsigned int target_order = s->target_order;
+	unsigned long applied = 0;
+	struct mm_struct *mm;
+	int ret;
+
+	if (target_order < 2 || target_order > HPAGE_PMD_ORDER)
+		return 0;
+
+	chunk_sz = PAGE_SIZE << target_order;
+	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
+	end = ALIGN(r->ar.end, chunk_sz);
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		return 0;
+
+	while (addr < end) {
+		ret = damon_collapse_folio_range(mm, addr, target_order);
+		if (!ret)
+			applied += chunk_sz;
+		*sz_filter_passed += chunk_sz;
+		addr += chunk_sz;
+		cond_resched();
+	}
+
+	mmput(mm);
+	return applied;
+}
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme, unsigned long *sz_filter_passed)
@@ -922,6 +957,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		madv_action = MADV_NOHUGEPAGE;
 		break;
 	case DAMOS_COLLAPSE:
+		if (scheme->target_order)
+			return damos_va_collapse(t, r, scheme,
+						 sz_filter_passed);
 		madv_action = MADV_COLLAPSE;
 		break;
 	case DAMOS_MIGRATE_HOT:
-- 
2.50.1 (Apple Git-155)



* [RFC PATCH 4/6] mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
                   ` (2 preceding siblings ...)
  2026-06-18  9:48 ` [RFC PATCH 3/6] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 5/6] mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler Wang Lian
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

Add a new DAMOS_MTHP_SPLIT action to split a large folio to the
specified target_order, and a hot_threshold parameter to control
split decisions based on sub-page access heatmap.

target_order: For MTHP_SPLIT, valid range is 2..HPAGE_PMD_ORDER-1,
allowing splits to e.g. order-2 (16KB) or order-3 (32KB) mTHP.
An invalid value (0 or >= PMD_ORDER) defaults to order-2; 0 would
mean "split to base page" which defeats the purpose of mTHP split.
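
The fallback rule can be written as a small stand-alone function. This is a
userspace sketch mirroring the scheme-creation fixup described above, with
hypothetical names; it assumes 4KB base pages, where HPAGE_PMD_ORDER is 9.
An order of 1 never reaches this point because the sysfs store handler
already rejects it.

```c
#include <assert.h>

#define PMD_ORDER_4K 9	/* assumption: HPAGE_PMD_ORDER with 4KB base pages */

/* 0 or anything >= the PMD order falls back to order 2. */
unsigned int split_order_or_default(unsigned int order,
				    unsigned int pmd_order)
{
	assert(pmd_order > 2);
	if (order == 0 || order >= pmd_order)
		return 2;
	return order;
}

/* Folio size in KB for a given order, assuming 4KB base pages. */
unsigned int order_to_kb(unsigned int order)
{
	return 4u << order;
}
```

With these definitions, order 2 corresponds to 16KB folios and order 3 to
32KB, matching the sizes given above.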

hot_threshold: Minimum percentage (0-100) of hot subpages required
to preserve a THP.  THPs with hot_fraction >= hot_threshold are
kept intact; below it, the THP is split to target_order.  Default
is 30%, based on ARM SPE profiling on Kunpeng 920 which showed:

  - 97% of THPs have <10% hot subpages (clearly cold, split)
  - 1-2% have 10-30% (borderline, tunable)
  - <1% have >30% (genuinely hot, preserve)

The 30% default catches genuinely hot THPs while splitting the vast
majority of cold THPs.  Exposed as sysfs attribute for per-scheme
tuning (e.g. lower for memory-pressure scenarios, higher for
latency-sensitive workloads).

sysfs interface:
  /sys/kernel/mm/damon/admin/kdamonds/.../schemes/0/hot_threshold

The actual split implementation follows in subsequent patches.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 include/linux/damon.h    | 17 ++++++++++++--
 mm/damon/sysfs-schemes.c | 51 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5a0587556573..982057bbce3b 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -121,6 +121,7 @@ struct damon_target {
  * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_COLLAPSE:	Call ``madvise()`` for the region with MADV_COLLAPSE.
+ * @DAMOS_MTHP_SPLIT:	Split large folios to the target mTHP order.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
  * @DAMOS_MIGRATE_HOT:  Migrate the regions prioritizing warmer regions.
@@ -141,6 +142,7 @@ enum damos_action {
 	DAMOS_HUGEPAGE,
 	DAMOS_NOHUGEPAGE,
 	DAMOS_COLLAPSE,
+	DAMOS_MTHP_SPLIT,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
 	DAMOS_MIGRATE_HOT,
@@ -573,10 +575,21 @@ struct damos {
 	struct damos_access_pattern pattern;
 	enum damos_action action;
 	/*
-	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
-	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
+	 * @target_order: target mTHP order for DAMOS_COLLAPSE and
+	 * DAMOS_MTHP_SPLIT.  For COLLAPSE, 0 means PMD order default,
+	 * valid values: 0, 2..HPAGE_PMD_ORDER.  For MTHP_SPLIT,
+	 * valid values: 2..HPAGE_PMD_ORDER-1; 0 and HPAGE_PMD_ORDER
+	 * are rejected at scheme creation time (defaulting to 2).
 	 */
 	unsigned int target_order;
+	/*
+	 * @hot_threshold: minimum hot subpage percentage (0-100) to
+	 * preserve a THP during DAMOS_MTHP_SPLIT.  A THP with
+	 * hot_fraction >= hot_threshold is kept intact; below it, the
+	 * THP is split to @target_order.  Default 30 based on SPE
+	 * profiling showing 97% of THPs have <10% hot subpages.
+	 */
+	unsigned int hot_threshold;
 	unsigned long apply_interval_us;
 /* private: internal use only */
 	/*
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 735970717048..823f1ca9bd90 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -2260,6 +2260,7 @@ struct damon_sysfs_scheme {
 	struct damon_sysfs_scheme_regions *tried_regions;
 	int target_nid;
 	unsigned int target_order;
+	unsigned int hot_threshold;
 	struct damos_sysfs_dests *dests;
 };
 
@@ -2293,6 +2294,10 @@ static struct damos_sysfs_action_name damos_sysfs_action_names[] = {
 		.action = DAMOS_COLLAPSE,
 		.name = "collapse",
 	},
+	{
+		.action = DAMOS_MTHP_SPLIT,
+		.name = "mthp_split",
+	},
 	{
 		.action = DAMOS_LRU_PRIO,
 		.name = "lru_prio",
@@ -2673,6 +2678,34 @@ static ssize_t target_order_store(struct kobject *kobj,
 	return count;
 }
 
+static ssize_t hot_threshold_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+
+	return sysfs_emit(buf, "%u\n", scheme->hot_threshold);
+}
+
+static ssize_t hot_threshold_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 0, &val);
+	if (err)
+		return err;
+
+	if (val > 100)
+		return -EINVAL;
+
+	scheme->hot_threshold = val;
+	return count;
+}
+
 static void damon_sysfs_scheme_release(struct kobject *kobj)
 {
 	kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
@@ -2690,11 +2723,15 @@ static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
 static struct kobj_attribute damon_sysfs_scheme_target_order_attr =
 		__ATTR_RW_MODE(target_order, 0600);
 
+static struct kobj_attribute damon_sysfs_scheme_hot_threshold_attr =
+		__ATTR_RW_MODE(hot_threshold, 0600);
+
 static struct attribute *damon_sysfs_scheme_attrs[] = {
 	&damon_sysfs_scheme_action_attr.attr,
 	&damon_sysfs_scheme_apply_interval_us_attr.attr,
 	&damon_sysfs_scheme_target_nid_attr.attr,
 	&damon_sysfs_scheme_target_order_attr.attr,
+	&damon_sysfs_scheme_hot_threshold_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(damon_sysfs_scheme);
@@ -3048,8 +3085,22 @@ static struct damos *damon_sysfs_mk_scheme(
 			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
 		sysfs_scheme->target_order = 0;
 	}
+	if (sysfs_scheme->action == DAMOS_MTHP_SPLIT &&
+	    (sysfs_scheme->target_order == 0 ||
+	     sysfs_scheme->target_order >= HPAGE_PMD_ORDER)) {
+		pr_warn("DAMON mthp_split: target_order %u invalid, need 2..%u. Defaulting to 2.\n",
+			sysfs_scheme->target_order,
+			HPAGE_PMD_ORDER - 1);
+		sysfs_scheme->target_order = 2;
+	}
 	scheme->target_order = sysfs_scheme->target_order;
 
+	if (sysfs_scheme->action == DAMOS_MTHP_SPLIT) {
+		if (sysfs_scheme->hot_threshold == 0)
+			sysfs_scheme->hot_threshold = 30;
+		scheme->hot_threshold = sysfs_scheme->hot_threshold;
+	}
+
 	err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota);
 	if (err) {
 		damon_destroy_scheme(scheme);
-- 
2.50.1 (Apple Git-155)



* [RFC PATCH 5/6] mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
                   ` (3 preceding siblings ...)
  2026-06-18  9:48 ` [RFC PATCH 4/6] mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18  9:48 ` [RFC PATCH 6/6] mm/damon: add SPE feedback for sub-THP split decisions Wang Lian
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

Implement the DAMOS_MTHP_SPLIT action for vaddr-based DAMON operations.
Walk the region in PMD-sized aligned chunks, use folio_walk_start() to
locate THP folios, and call split_folio_to_order() when the folio order
exceeds the target_order.

Unlike COLLAPSE, which is limited to anonymous memory via
collapse_huge_page(), split_folio_to_order() supports both anon and
shmem folios.  This is critical for tmpfs THP-backed KVM guest memory,
where cold and hot pages bundled together in a single PMD THP cause
DAMON to overestimate hot regions.

The handler takes mmap_read_lock per chunk, covering the VMA lookup,
folio_walk_start() and the split itself, and releases it before the
next iteration.  split_folio_to_order() does not reacquire mmap locks
internally, so this pattern is safe.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 mm/damon/vaddr.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 2a3757c13bf0..1957e390a277 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -934,6 +934,71 @@ static unsigned long damos_va_collapse(struct damon_target *target,
 	return applied;
 }
 
+static unsigned long damos_va_mthp_split(struct damon_target *target,
+		struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	unsigned long addr, end, chunk_sz;
+	unsigned int target_order = s->target_order;
+	unsigned long applied = 0;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct folio *folio;
+	struct folio_walk fw;
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		return 0;
+
+	chunk_sz = PAGE_SIZE << HPAGE_PMD_ORDER;
+	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
+	end = ALIGN(r->ar.end, chunk_sz);
+
+	while (addr < end) {
+		mmap_read_lock(mm);
+		vma = find_vma(mm, addr);
+		/*
+		 * split_folio_to_order() supports both anon and shmem
+		 * folios, so we accept any VMA that has a folio at @addr.
+		 * This covers important use cases like tmpfs THP-backed
+		 * KVM guest memory where cold and hot pages are bundled
+		 * together in a single PMD THP.
+		 */
+		if (!vma || addr < vma->vm_start)
+			goto unlock;
+
+		folio = folio_walk_start(&fw, vma, addr, 0);
+		if (!folio)
+			goto unlock;
+
+		if (folio_order(folio) > target_order) {
+			if (!folio_trylock(folio)) {
+				folio_walk_end(&fw, vma);
+				goto unlock;
+			}
+			folio_get(folio);
+			folio_walk_end(&fw, vma);
+
+			if (!split_folio_to_order(folio, target_order))
+				applied += chunk_sz;
+
+			folio_unlock(folio);
+			folio_put(folio);
+		} else {
+			folio_walk_end(&fw, vma);
+		}
+
+unlock:
+		*sz_filter_passed += chunk_sz;
+		addr += chunk_sz;
+		mmap_read_unlock(mm);
+		cond_resched();
+	}
+
+	mmput(mm);
+	return applied;
+}
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme, unsigned long *sz_filter_passed)
@@ -967,6 +1032,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		return damos_va_migrate(t, r, scheme, sz_filter_passed);
 	case DAMOS_STAT:
 		return damos_va_stat(t, r, scheme, sz_filter_passed);
+	case DAMOS_MTHP_SPLIT:
+		return damos_va_mthp_split(t, r, scheme,
+					  sz_filter_passed);
 	default:
 		/*
 		 * DAMOS actions that are not yet supported by 'vaddr'.
-- 
2.50.1 (Apple Git-155)



* [RFC PATCH 6/6] mm/damon: add SPE feedback for sub-THP split decisions
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
                   ` (4 preceding siblings ...)
  2026-06-18  9:48 ` [RFC PATCH 5/6] mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler Wang Lian
@ 2026-06-18  9:48 ` Wang Lian
  2026-06-18 11:03 ` [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Gutierrez Asier
  2026-06-19  1:47 ` SeongJae Park
  7 siblings, 0 replies; 16+ messages in thread
From: Wang Lian @ 2026-06-18  9:48 UTC (permalink / raw)
  To: sj, akpm
  Cc: npache, gutierrez.asier, daichaobing, linux-mm, linux-kernel,
	lianux.mm, kunwu.chan

Add a sub-THP access heatmap that enables data-driven split decisions
in DAMOS_MTHP_SPLIT.  The split handler queries damon_spe_hot_fraction()
and compares against the scheme's configurable hot_threshold (default
30%, set in patch 4) to preserve genuinely hot THPs while splitting
cold ones.

Key data-driven design decisions from Kunpeng 920 SPE profiling:

  1. Signal vs noise threshold (this patch):
     Raw SPE data shows most THPs have scattered 1-2 sample hits across
     many subpages — noise, not genuine access patterns.  The heatmap
     now uses a two-pass signal threshold: a subpage chunk must have
     >= 1/10 of the peak chunk's access count to be considered hot.
     This reduces false hot classification from ~50% to <5% of subpages.

  2. hot_threshold 30% (patch 4, sysfs-configurable):
     With the signal filter applied, 97% of THPs have <10% hot
     subpages (clearly cold), 1-2% have 10-30% (borderline), and
     <1% have >30% (genuinely hot).  The 30% default catches hot THPs
     while allowing the vast majority to be split.
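
As a rough illustration of the two decisions above, the filter and the
threshold comparison can be modelled in plain userspace C.  This is a
simplified sketch, not the kernel code: hot_fraction_pct(), chunk_sum() and
should_split() are hypothetical names, and the model assumes a 512-entry
per-subpage counter array and a chunk size given in 4KB pages.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SUBPAGES 512	/* 4KB subpages in one 2MB THP (assumed) */

/* Sum the per-subpage counters that belong to one chunk. */
static unsigned long chunk_sum(const unsigned long *counts,
			       unsigned int chunk,
			       unsigned int pages_per_chunk)
{
	unsigned long sum = 0;
	unsigned int j;

	for (j = 0; j < pages_per_chunk; j++)
		sum += counts[chunk * pages_per_chunk + j];
	return sum;
}

/* Percentage (0-100) of chunks at or above 1/10 of the peak chunk. */
static int hot_fraction_pct(const unsigned long *counts,
			    unsigned int pages_per_chunk)
{
	unsigned int nr_chunks, i, hot = 0;
	unsigned long peak = 0, thresh;

	assert(pages_per_chunk > 0 && NR_SUBPAGES % pages_per_chunk == 0);
	nr_chunks = NR_SUBPAGES / pages_per_chunk;

	/* Pass 1: find the peak per-chunk access count. */
	for (i = 0; i < nr_chunks; i++) {
		unsigned long sum = chunk_sum(counts, i, pages_per_chunk);

		if (sum > peak)
			peak = sum;
	}

	/* Signal threshold: 1/10 of the peak, but never below one sample. */
	thresh = peak / 10 > 1 ? peak / 10 : 1;

	/* Pass 2: count the chunks that reach the threshold. */
	for (i = 0; i < nr_chunks; i++)
		if (chunk_sum(counts, i, pages_per_chunk) >= thresh)
			hot++;

	return (int)(hot * 100 / nr_chunks);
}

/* Split when there is no heatmap data or the folio is mostly cold. */
static bool should_split(int hot_pct, unsigned int hot_threshold)
{
	if (hot_pct < 0)	/* negative errno: no data available */
		return true;
	return (unsigned int)hot_pct < hot_threshold;
}
```

With 16KB chunks (pages_per_chunk = 4) and samples concentrated in two
chunks, only 2 of 128 chunks pass the filter.  That rounds down to 1%,
which is below a 30% hot_threshold, so the folio would be split.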

Architecture (three-phase):

  Phase 2a (current fallback):
    Walk PTE access bits via folio_walk for THPs already split to PTEs.
    For PMD-mapped THPs (the common case), return -EOPNOTSUPP, which
    causes the split handler to split unconditionally.

  Phase 2b (userspace daemon -> kernel, ready for validation):
    Userspace SPE daemon decodes ARM SPE records, feeds PFNs via debugfs
    (/sys/kernel/debug/damon/spe_feed).  The kernel aggregates accesses
    into a per-folio rbtree keyed by THP-aligned PFN.

  Phase 2c (kernel-native, future):
    perf_event_create_kernel_counter for ARM SPE.  Overflow handler
    calls damon_spe_record_access() directly.

Data structure (mm/damon/spe.c):
  - Per-folio rbtree keyed by PFN, storing access_count[512] (one
    counter per 4KB subpage)
  - Max 1024 entries, entries older than 30s are pruned periodically
  - Global spinlock-protected rbtree with GFP_ATOMIC allocation
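
To make the keying concrete, the standalone sketch below shows how a sampled
PFN maps to its rbtree key and counter slot, mirroring the masking done in
damon_spe_record_access().  The struct and function names are hypothetical,
and 512 subpages per THP (4KB base pages, 2MB PMD) is assumed.

```c
#include <assert.h>

#define SUBPAGES_PER_THP 512UL	/* assumed: 2MB THP / 4KB base page */

static_assert((SUBPAGES_PER_THP & (SUBPAGES_PER_THP - 1)) == 0,
	      "the mask arithmetic below needs a power of two");

struct thp_slot {
	unsigned long base_pfn;	/* THP-aligned PFN, used as the tree key */
	unsigned int subpage;	/* index into the per-subpage counters */
};

static struct thp_slot thp_slot_from_pfn(unsigned long pfn)
{
	struct thp_slot s;

	s.base_pfn = pfn & ~(SUBPAGES_PER_THP - 1);
	s.subpage = (unsigned int)(pfn & (SUBPAGES_PER_THP - 1));
	return s;
}
```

For instance, PFN 0x820db803 lands in the entry keyed by 0x820db800 at
subpage index 3.  With 8-byte counters the access_count[] array alone is
512 * 8 = 4096 bytes per entry, so the 1024-entry cap bounds the tree at a
little over 4MB.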

Debugfs interface:
  - /sys/kernel/debug/damon/spe_feed  (write): accept one PFN per write()
  - /sys/kernel/debug/damon/spe_stats (read):  rbtree stats + top entries
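
spe_feed_write() in this patch parses a single number from at most the first
31 bytes of each write() and then reports the whole buffer as consumed, so a
feeder is safest issuing one write() per PFN.  A minimal userspace sketch
follows; feed_pfns() is a hypothetical helper, not part of the spe_hist
tool, and the caller is assumed to have opened the debugfs file for writing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write each PFN as its own "0x...\n" line using one write() per PFN.
 * Returns the number of PFNs written before the first failure.
 */
static size_t feed_pfns(int fd, const unsigned long *pfns, size_t n)
{
	char line[24];	/* "0x" + up to 16 hex digits + '\n' + NUL */
	size_t i;

	static_assert(sizeof(line) <= 32,
		      "a line must fit the handler's 32-byte buffer");

	for (i = 0; i < n; i++) {
		int len = snprintf(line, sizeof(line), "0x%lx\n", pfns[i]);

		if (len < 0 || (size_t)len >= sizeof(line))
			break;
		/* Treat a short or failed write as an error and stop. */
		if (write(fd, line, (size_t)len) != (ssize_t)len)
			break;
	}
	return i;
}
```

Typical use would be to open /sys/kernel/debug/damon/spe_feed with O_WRONLY
and pass the descriptor to feed_pfns().  Note that the handler ignores a PFN
of 0.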

When CONFIG_DAMON_SPE is disabled, all SPE functions are empty stubs
returning -EOPNOTSUPP, making the split unconditional.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
---
 mm/damon/Kconfig  |  12 ++
 mm/damon/Makefile |   1 +
 mm/damon/core.c   |   3 +
 mm/damon/spe.c    | 505 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/damon/spe.h    |  62 ++++++
 mm/damon/vaddr.c  |  16 +-
 6 files changed, 597 insertions(+), 2 deletions(-)
 create mode 100644 mm/damon/spe.c
 create mode 100644 mm/damon/spe.h

diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
index 34631a44cdec..ea75a8dab989 100644
--- a/mm/damon/Kconfig
+++ b/mm/damon/Kconfig
@@ -121,4 +121,16 @@ config DAMON_STAT_ENABLED_DEFAULT
 	  Whether to enable DAMON_STAT by default.  Users can disable it in
 	  boot or runtime using its 'enabled' parameter.
 
+config DAMON_SPE
+	bool "DAMON SPE feedback for sub-THP access monitoring (prototype)"
+	depends on DAMON_VADDR
+	help
+	  Enable sub-THP access heatmap feedback for DAMOS_MTHP_SPLIT.
+	  Currently a prototype: uses PTE access bits for THPs that have
+	  been split to PTEs, returns "no data" for PMD-mapped THPs.
+
+	  On hardware with ARM SPE (e.g. Kunpeng 920), this will be
+	  extended to provide per-subpage access data without needing to
+	  split the PMD first, enabling precise mTHP split decisions.
+
 endmenu
diff --git a/mm/damon/Makefile b/mm/damon/Makefile
index d8d6bf5f8bff..507b43a9f009 100644
--- a/mm/damon/Makefile
+++ b/mm/damon/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_DAMON_SYSFS)	+= sysfs-common.o sysfs-schemes.o sysfs.o
 obj-$(CONFIG_DAMON_RECLAIM)	+= modules-common.o reclaim.o
 obj-$(CONFIG_DAMON_LRU_SORT)	+= modules-common.o lru_sort.o
 obj-$(CONFIG_DAMON_STAT)	+= modules-common.o stat.o
+obj-$(CONFIG_DAMON_SPE)		+= spe.o
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 265d51ade25b..0805e71a90d8 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -20,6 +20,7 @@
 
 /* for damon_get_folio() used by node eligible memory metrics */
 #include "ops-common.h"
+#include "spe.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/damon.h>
@@ -2987,6 +2988,8 @@ static void kdamond_apply_schemes(struct damon_ctx *c)
 	if (!has_schemes_to_apply)
 		return;
 
+	damon_spe_prune();
+
 	max_region_sz = damon_region_sz_limit(c);
 	mutex_lock(&c->walk_control_lock);
 	damon_for_each_target(t, c) {
diff --git a/mm/damon/spe.c b/mm/damon/spe.c
new file mode 100644
index 000000000000..98f8d32053e4
--- /dev/null
+++ b/mm/damon/spe.c
@@ -0,0 +1,505 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAMON SPE (Statistical Profiling Extension) feedback
+ *
+ * Provides sub-THP access heatmap for intelligent split decisions.
+ *
+ * Architecture:
+ *   Phase 2a (current): PTE access bits via folio_walk.
+ *     Works only when a THP has been previously split to PTEs.
+ *     Returns -EOPNOTSUPP for PMD-mapped THPs.
+ *
+ *   Phase 2b (userspace): spe_hist daemon decodes SPE in userspace,
+ *     feeds PFNs via debugfs into the rbtree below.
+ *
+ *   Phase 2c (kernel): perf_event_create_kernel_counter for ARM SPE,
+ *     overflow handler aggregates into rbtree.  Requires SPE hardware.
+ *
+ * Data structure:
+ *   Per-folio rbtree keyed by PFN, storing per-subpage access counts.
+ *   Entries are aged and pruned periodically.
+ *
+ * Copyright (C) 2026 Wang Lian <lianux.mm@gmail.com>
+ */
+
+#define pr_fmt(fmt) "damon-spe: " fmt
+
+#include <linux/mm.h>
+#include <linux/pagewalk.h>
+#include <linux/huge_mm.h>
+#include <linux/bitmap.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/uaccess.h>
+#include <linux/init.h>
+#include "spe.h"
+
+/* Max sub-pages when querying at order 0 */
+#define DAMON_SPE_MAX_CHUNKS	512
+
+/* Max folio entries in the global rbtree */
+#define DAMON_SPE_MAX_ENTRIES	1024
+
+/* Entry considered stale after this many jiffies (default: 30s) */
+#define DAMON_SPE_ENTRY_TTL	(30 * HZ)
+
+/*
+ * Per-folio access histogram entry.
+ * Keyed by pfn in an rbtree.  Each entry tracks access count per subpage.
+ * The access_count array is sized for one PMD-order THP: 512 4KB subpages.
+ */
+struct damon_spe_entry {
+	struct rb_node		node;
+	unsigned long		pfn;		/* THP-aligned PFN */
+	pid_t			pid;		/* owner process */
+	unsigned long		access_count[DAMON_SPE_MAX_CHUNKS];
+	unsigned long		total_accesses;
+	unsigned long		last_access;	/* jiffies of last update */
+};
+
+static struct rb_root spe_tree = RB_ROOT;
+static DEFINE_SPINLOCK(spe_lock);
+static unsigned int spe_nr_entries;
+
+/* Forward declarations */
+static void __spe_prune(void);
+
+/*
+ * Find an entry by PFN.  Must be called with spe_lock held.
+ */
+static struct damon_spe_entry *spe_find(unsigned long pfn)
+{
+	struct rb_node *node = spe_tree.rb_node;
+
+	while (node) {
+		struct damon_spe_entry *e =
+			rb_entry(node, struct damon_spe_entry, node);
+
+		if (pfn < e->pfn)
+			node = node->rb_left;
+		else if (pfn > e->pfn)
+			node = node->rb_right;
+		else
+			return e;
+	}
+	return NULL;
+}
+
+/*
+ * Insert a new entry.  Must be called with spe_lock held.
+ * Returns the new entry, or NULL if the tree is full.
+ */
+static struct damon_spe_entry *spe_insert(unsigned long pfn, pid_t pid)
+{
+	struct rb_node **new = &spe_tree.rb_node, *parent = NULL;
+	struct damon_spe_entry *e;
+
+	if (spe_nr_entries >= DAMON_SPE_MAX_ENTRIES) {
+		__spe_prune();
+		if (spe_nr_entries >= DAMON_SPE_MAX_ENTRIES)
+			return NULL;
+	}
+
+	e = kzalloc(sizeof(*e), GFP_ATOMIC);
+	if (!e)
+		return NULL;
+
+	e->pfn = pfn;
+	e->pid = pid;
+	e->last_access = jiffies;
+
+	while (*new) {
+		struct damon_spe_entry *this =
+			rb_entry(*new, struct damon_spe_entry, node);
+
+		parent = *new;
+		if (pfn < this->pfn)
+			new = &((*new)->rb_left);
+		else if (pfn > this->pfn)
+			new = &((*new)->rb_right);
+		else {
+			/* Not expected: callers find+insert under spe_lock */
+			kfree(e);
+			return this;
+		}
+	}
+
+	rb_link_node(&e->node, parent, new);
+	rb_insert_color(&e->node, &spe_tree);
+	spe_nr_entries++;
+	return e;
+}
+
+/*
+ * Prune entries that haven't been updated for DAMON_SPE_ENTRY_TTL.
+ * Must be called with spe_lock held.
+ */
+static void __spe_prune(void)
+{
+	struct rb_node *node, *next;
+	unsigned long deadline = jiffies - DAMON_SPE_ENTRY_TTL;
+
+	node = rb_first(&spe_tree);
+	while (node) {
+		struct damon_spe_entry *e =
+			rb_entry(node, struct damon_spe_entry, node);
+
+		next = rb_next(node);
+
+		if (time_before(e->last_access, deadline)) {
+			rb_erase(&e->node, &spe_tree);
+			spe_nr_entries--;
+			kfree(e);
+		}
+		node = next;
+	}
+}
+
+/**
+ * damon_spe_record_access() - Record a single subpage access
+ * @pfn: Physical page frame number (any page within a THP)
+ * @pid: Process ID that performed the access
+ *
+ * The PFN is automatically aligned to the THP base.  The subpage index
+ * within the THP is derived from the low bits of the PFN.
+ *
+ * Context: Can be called from IRQ context.
+ */
+void damon_spe_record_access(unsigned long pfn, pid_t pid)
+{
+	unsigned long thp_pfn = pfn & ~(unsigned long)(DAMON_SPE_MAX_CHUNKS - 1);
+	unsigned int idx = pfn & (DAMON_SPE_MAX_CHUNKS - 1);
+	struct damon_spe_entry *e;
+	unsigned long flags;
+
+	spin_lock_irqsave(&spe_lock, flags);
+
+	e = spe_find(thp_pfn);
+	if (!e)
+		e = spe_insert(thp_pfn, pid);
+
+	if (e) {
+		e->access_count[idx]++;
+		e->total_accesses++;
+		e->last_access = jiffies;
+	}
+
+	spin_unlock_irqrestore(&spe_lock, flags);
+}
+EXPORT_SYMBOL_GPL(damon_spe_record_access);
+
+/**
+ * damon_spe_folio_heatmap() - Get sub-THP access bitmap for a folio
+ * @folio: The folio to query
+ * @vma: VMA containing the folio
+ * @addr: Virtual address of the folio start
+ * @target_order: Page order for each chunk in the bitmap
+ * @hot_bitmap: Output bitmap with one bit per chunk
+ *
+ * Queries the SPE rbtree first.  Falls back to PTE access bits if no
+ * SPE data is available (requires the THP to be split to PTEs).
+ *
+ * Return: Number of chunks on success, negative error on failure.
+ */
+int damon_spe_folio_heatmap(struct folio *folio, struct vm_area_struct *vma,
+			    unsigned long addr, unsigned int target_order,
+			    unsigned long *hot_bitmap)
+{
+	unsigned long chunk_sz = PAGE_SIZE << target_order;
+	unsigned long num_chunks, pfn, flags;
+	struct damon_spe_entry *e;
+	struct folio_walk fw;
+	struct folio *sub_folio;
+	int i;
+
+	if (!folio || !vma || !hot_bitmap)
+		return -EINVAL;
+	if (target_order >= folio_order(folio))
+		return -EINVAL;
+
+	/* Dereference @folio only after the NULL check above. */
+	num_chunks = folio_nr_pages(folio) >> target_order;
+	pfn = folio_pfn(folio);
+
+	/*
+	 * Phase 2b/2c path: query the SPE rbtree.
+	 * If we have aggregated SPE data for this folio, use it.
+	 */
+	spin_lock_irqsave(&spe_lock, flags);
+	e = spe_find(pfn);
+	if (e && e->total_accesses > 0) {
+		unsigned long max_sum = 0;
+		unsigned long sig_thresh;
+		unsigned int spp = chunk_sz >> PAGE_SHIFT;
+
+		/* First pass: find peak chunk access count */
+		for (i = 0; i < num_chunks; i++) {
+			unsigned long sum = 0;
+			int j;
+
+			for (j = 0; j < spp; j++) {
+				unsigned int idx = i * spp + j;
+
+				if (idx < DAMON_SPE_MAX_CHUNKS)
+					sum += e->access_count[idx];
+			}
+			if (sum > max_sum)
+				max_sum = sum;
+		}
+
+		/*
+		 * Signal threshold: a chunk needs >= 1/10 of peak access
+		 * count to be considered hot.  This filters SPE noise —
+		 * Kunpeng 920 data shows most THPs have scattered 1-2
+		 * sample hits across many subpages that don't represent
+		 * genuine hot access patterns.
+		 */
+		sig_thresh = max(max_sum / 10, 1UL);
+
+		/* Second pass: build hot bitmap using signal threshold */
+		bitmap_zero(hot_bitmap, num_chunks);
+		for (i = 0; i < num_chunks; i++) {
+			unsigned long sum = 0;
+			int j;
+
+			for (j = 0; j < spp; j++) {
+				unsigned int idx = i * spp + j;
+
+				if (idx < DAMON_SPE_MAX_CHUNKS)
+					sum += e->access_count[idx];
+			}
+			if (sum >= sig_thresh)
+				__set_bit(i, hot_bitmap);
+		}
+
+		spin_unlock_irqrestore(&spe_lock, flags);
+		return (int)num_chunks;
+	}
+	spin_unlock_irqrestore(&spe_lock, flags);
+
+	/*
+	 * Phase 2a fallback: walk PTEs to check access bits.
+	 * Only works when the THP has been split to PTEs.
+	 */
+	bitmap_zero(hot_bitmap, num_chunks);
+
+	for (i = 0; i < num_chunks; i++) {
+		unsigned long chunk_addr = addr + i * chunk_sz;
+
+		sub_folio = folio_walk_start(&fw, vma, chunk_addr, 0);
+		if (!sub_folio)
+			return -EOPNOTSUPP;
+
+		if (fw.level == FW_LEVEL_PMD) {
+			folio_walk_end(&fw, vma);
+			return -EOPNOTSUPP;
+		}
+
+		if (fw.level == FW_LEVEL_PTE && pte_young(fw.pte))
+			__set_bit(i, hot_bitmap);
+
+		folio_walk_end(&fw, vma);
+	}
+
+	return (int)num_chunks;
+}
+EXPORT_SYMBOL_GPL(damon_spe_folio_heatmap);
+
+/**
+ * damon_spe_hot_fraction() - Return hot chunk percentage of a folio
+ * @folio: The folio to query
+ * @vma: VMA containing the folio
+ * @addr: Virtual address of the folio start
+ * @target_order: Page order for each chunk
+ *
+ * Return: Percentage (0-100) on success, negative error on failure.
+ */
+int damon_spe_hot_fraction(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr, unsigned int target_order)
+{
+	unsigned long num_chunks = folio_nr_pages(folio) >> target_order;
+	DECLARE_BITMAP(hot_bitmap, DAMON_SPE_MAX_CHUNKS);
+	int ret, hot;
+
+	if (num_chunks > DAMON_SPE_MAX_CHUNKS)
+		return -ERANGE;
+
+	ret = damon_spe_folio_heatmap(folio, vma, addr, target_order,
+				      hot_bitmap);
+	if (ret < 0)
+		return ret;
+
+	hot = bitmap_weight(hot_bitmap, num_chunks);
+	return (hot * 100) / (int)num_chunks;
+}
+EXPORT_SYMBOL_GPL(damon_spe_hot_fraction);
+
+/**
+ * damon_spe_prune() - Remove stale entries from the SPE rbtree
+ *
+ * Called from kdamond_apply_schemes() before schemes are applied.  Removes
+ * entries not updated within DAMON_SPE_ENTRY_TTL jiffies.
+ */
+void damon_spe_prune(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&spe_lock, flags);
+	__spe_prune();
+	spin_unlock_irqrestore(&spe_lock, flags);
+}
+
+/**
+ * damon_spe_stats() - Return current SPE rbtree statistics
+ * @nr_entries: Output for number of entries, may be NULL
+ * @total_accesses: Output for total accumulated accesses, may be NULL
+ */
+void damon_spe_stats(unsigned int *nr_entries, unsigned long *total_accesses)
+{
+	struct rb_node *node;
+	unsigned long flags;
+	unsigned int count = 0;
+	unsigned long total = 0;
+
+	spin_lock_irqsave(&spe_lock, flags);
+	for (node = rb_first(&spe_tree); node; node = rb_next(node)) {
+		struct damon_spe_entry *e =
+			rb_entry(node, struct damon_spe_entry, node);
+		count++;
+		total += e->total_accesses;
+	}
+	spin_unlock_irqrestore(&spe_lock, flags);
+
+	if (nr_entries)
+		*nr_entries = count;
+	if (total_accesses)
+		*total_accesses = total;
+}
+EXPORT_SYMBOL_GPL(damon_spe_stats);
+
+/* ---- debugfs interface for Phase 2b (userspace daemon → kernel rbtree) ---- */
+
+static struct dentry *damon_spe_dentry;
+
+/*
+ * spe_feed write: accept one PFN per write() call (hex or decimal).
+ * The PFN is recorded as an access via damon_spe_record_access().
+ *
+ * Usage from userspace:
+ *   echo 0x12345678 > /sys/kernel/debug/damon/spe_feed
+ *
+ * Only the first 31 bytes of each write are parsed, as one number, so
+ * a bulk feeder must issue one write() per PFN.
+ */
+static ssize_t spe_feed_write(struct file *file, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	char line[32];
+	size_t len = min(count, sizeof(line) - 1);
+	unsigned long pfn;
+
+	if (copy_from_user(line, buf, len))
+		return -EFAULT;
+	line[len] = '\0';
+
+	/* Strip trailing newline */
+	if (len > 0 && line[len - 1] == '\n')
+		line[len - 1] = '\0';
+
+	if (kstrtoul(line, 0, &pfn) == 0 && pfn != 0)
+		damon_spe_record_access(pfn, 0);
+
+	return count;
+}
+
+/*
+ * spe_stats read: show current SPE rbtree statistics.
+ *
+ * Usage:
+ *   cat /sys/kernel/debug/damon/spe_stats
+ */
+static int spe_stats_show(struct seq_file *m, void *v)
+{
+	struct rb_node *node;
+	unsigned long flags;
+	unsigned int count = 0;
+	unsigned long total = 0;
+
+	spin_lock_irqsave(&spe_lock, flags);
+	for (node = rb_first(&spe_tree); node; node = rb_next(node)) {
+		struct damon_spe_entry *e =
+			rb_entry(node, struct damon_spe_entry, node);
+		count++;
+		total += e->total_accesses;
+	}
+	spin_unlock_irqrestore(&spe_lock, flags);
+
+	seq_printf(m, "nr_entries=%u total_accesses=%lu\n", count, total);
+
+	/* Show the first 10 entries in PFN order (limit output) */
+	spin_lock_irqsave(&spe_lock, flags);
+	count = 0;
+	for (node = rb_first(&spe_tree); node; node = rb_next(node)) {
+		struct damon_spe_entry *e =
+			rb_entry(node, struct damon_spe_entry, node);
+		unsigned int hot_pages = 0;
+		int i;
+
+		for (i = 0; i < DAMON_SPE_MAX_CHUNKS; i++)
+			if (e->access_count[i])
+				hot_pages++;
+
+		seq_printf(m, "  pfn=0x%lx pid=%d total=%lu hot_pages=%u/%d\n",
+			   e->pfn, e->pid, e->total_accesses,
+			   hot_pages, DAMON_SPE_MAX_CHUNKS);
+		if (++count >= 10)
+			break;
+	}
+	spin_unlock_irqrestore(&spe_lock, flags);
+
+	return 0;
+}
+
+static int spe_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, spe_stats_show, inode->i_private);
+}
+
+static const struct file_operations spe_feed_fops = {
+	.write = spe_feed_write,
+};
+
+static const struct file_operations spe_stats_fops = {
+	.open = spe_stats_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static int __init damon_spe_dbgfs_init(void)
+{
+	struct dentry *d;
+
+	d = debugfs_lookup("damon", NULL);
+	if (!d) {
+		d = debugfs_create_dir("damon", NULL);
+		if (IS_ERR(d))
+			return PTR_ERR(d);
+	}
+	damon_spe_dentry = d;
+
+	debugfs_create_file("spe_feed", 0200, damon_spe_dentry,
+			    NULL, &spe_feed_fops);
+	debugfs_create_file("spe_stats", 0400, damon_spe_dentry,
+			    NULL, &spe_stats_fops);
+
+	pr_info("debugfs interface ready: /sys/kernel/debug/damon/spe_{feed,stats}\n");
+	return 0;
+}
+
+late_initcall(damon_spe_dbgfs_init);
diff --git a/mm/damon/spe.h b/mm/damon/spe.h
new file mode 100644
index 000000000000..38799688b5af
--- /dev/null
+++ b/mm/damon/spe.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON SPE (Statistical Profiling Extension) feedback
+ *
+ * Provides sub-THP access heatmap for intelligent split decisions.
+ *
+ * Three-phase architecture:
+ *   Phase 2a: PTE access bits via folio_walk (current fallback)
+ *   Phase 2b: Userspace SPE daemon feeds PFNs via debugfs
+ *   Phase 2c: Kernel perf_event_create_kernel_counter for ARM SPE
+ *
+ * Copyright (C) 2026 Wang Lian <lianux.mm@gmail.com>
+ */
+
+#ifndef _DAMON_SPE_H
+#define _DAMON_SPE_H
+
+#include <linux/mm_types.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_DAMON_SPE
+
+/* ---- Sub-page heatmap query ---- */
+
+int damon_spe_folio_heatmap(struct folio *folio, struct vm_area_struct *vma,
+			    unsigned long addr, unsigned int target_order,
+			    unsigned long *hot_bitmap);
+
+int damon_spe_hot_fraction(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr, unsigned int target_order);
+
+/* ---- Recording (called from SPE event handler or userspace daemon) ---- */
+
+void damon_spe_record_access(unsigned long pfn, pid_t pid);
+
+/* ---- Maintenance ---- */
+
+void damon_spe_prune(void);
+void damon_spe_stats(unsigned int *nr_entries, unsigned long *total_accesses);
+
+#else /* !CONFIG_DAMON_SPE */
+
+static inline int damon_spe_folio_heatmap(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long addr,
+		unsigned int target_order, unsigned long *hot_bitmap)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int damon_spe_hot_fraction(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long addr,
+		unsigned int target_order)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void damon_spe_record_access(unsigned long pfn, pid_t pid) {}
+static inline void damon_spe_prune(void) {}
+static inline void damon_spe_stats(unsigned int *nr, unsigned long *total) {}
+
+#endif /* CONFIG_DAMON_SPE */
+#endif /* _DAMON_SPE_H */
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 1957e390a277..cb3ea2766b9e 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -18,6 +18,7 @@
 
 #include "../internal.h"
 #include "ops-common.h"
+#include "spe.h"
 
 #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
 #undef DAMON_MIN_REGION_SZ
@@ -945,6 +946,7 @@ static unsigned long damos_va_mthp_split(struct damon_target *target,
 	struct vm_area_struct *vma;
 	struct folio *folio;
 	struct folio_walk fw;
+	int hot_pct;
 
 	mm = damon_get_mm(target);
 	if (!mm)
@@ -979,8 +981,18 @@ static unsigned long damos_va_mthp_split(struct damon_target *target,
 			folio_get(folio);
 			folio_walk_end(&fw, vma);
 
-			if (!split_folio_to_order(folio, target_order))
-				applied += chunk_sz;
+			hot_pct = damon_spe_hot_fraction(folio, vma, addr,
+						 target_order);
+			/*
+			 * hot_pct < 0: no heatmap data (no SPE, PMD-mapped),
+			 * split unconditionally — DAMON access pattern already
+			 * identified this region as cold.
+			 */
+			if (hot_pct < 0 ||
+			    (unsigned int)hot_pct < s->hot_threshold) {
+				if (!split_folio_to_order(folio, target_order))
+					applied += chunk_sz;
+			}
 
 			folio_unlock(folio);
 			folio_put(folio);
-- 
2.50.1 (Apple Git-155)




* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
                   ` (5 preceding siblings ...)
  2026-06-18  9:48 ` [RFC PATCH 6/6] mm/damon: add SPE feedback for sub-THP split decisions Wang Lian
@ 2026-06-18 11:03 ` Gutierrez Asier
  2026-06-18 13:13   ` wang lian
  2026-06-19  1:47 ` SeongJae Park
  7 siblings, 1 reply; 16+ messages in thread
From: Gutierrez Asier @ 2026-06-18 11:03 UTC (permalink / raw)
  To: Wang Lian, sj, akpm
  Cc: npache, daichaobing, linux-mm, linux-kernel, kunwu.chan

Hi Wang,

On 6/18/2026 12:48 PM, Wang Lian wrote:
> Received an off-list report that DAMON significantly overestimates
> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> running Oracle workloads.
> 
> The root cause is structural: a PMD entry covers 512 4KB subpages with
> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> subsequent accesses. x86 is not subject to this specific blindness under similar
> conditions.

Have you tried setting the minimum region size to 2MB?

> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> reports only a few hundred MB -- a 512x overestimate relative to the actual
> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.

THP=always will just collapse the process's entire address space into huge
pages anyway. This is outside DAMON's control.

Have you tried setting THP to never and running DAMON with the
DAMOS_COLLAPSE action?

> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> mTHP-aware via a new target_order field, and introduces a new
> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> into smaller mTHPs when most subpages are probed as cold, and collapse them
> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> path can incorporate fine-grained hardware feedback from ARM SPE.
> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> signal filter: it first identifies the peak chunk access count, and then marks
> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> split decision: only folios with a hot fraction below this threshold are
> eligible for splitting. When no SPE data is available, the infrastructure
> gracefully falls back to explicit PTE-level scanning via folio_walk.
> 
> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> 
> Collapse path (patches 1-3):
>   DAMON scheme action=COLLAPSE, target_order=N
>   -> damos_va_collapse() -> damon_collapse_folio_range()
>   -> collapse_huge_page()
> 
> Split path (patches 4-5):
>   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>   -> split_folio_to_order()
> 
> SPE feedback infrastructure (patch 6):
>   perf script -> spe_hist -> debugfs spe_feed
>   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> 
> The userspace helper tools (including the spe_hist histogram builder and
> validation scripts) are archived at:
>   https://github.com/lianux-mm/damon_spe
> 
> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> 7.1.0-rc5+):
> 
>   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>      DAMON to function normally.
> 
>   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>      THP=always: DAMON reported 8GB hot (512x vs ground truth);
>      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>      between the two modes was ~33x.
> 
>   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>      behaved normally. We could not reproduce THP inflation with RocksDB.
>      The workloads fundamentally vulnerable to this structural issue remain KVM
>      guests, JVM large heaps, and PostgreSQL shared_buffers.
> 
>   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> 
>   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>      concentrated across only 3 out of 512 subpages.

The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is
something we should keep in user space and let the kernel provide only the
API to add different metrics, including PMU and SPE.

>   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> 
> Known limitations:
> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>   While individual component verification is complete, full integration testing
>   is planned in collaboration with Sangfor.
> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>   coordination/back-off mechanism is required to avoid ping-pong effects.
> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>   kernel-side perf_event sampling integration is planned as a follow-up.
> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>   defaults subject to further tuning.
> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>   characteristic, not introduced by this series. Setting nr_accesses/min=0
>   serves as an effective workaround for the split path.
> 
> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> 
> Wang Lian (6):
>   mm/damon: add target_order field for DAMOS_COLLAPSE
>   mm/khugepaged: add damon_collapse_folio_range() for external callers
>   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>   mm/damon: add SPE feedback for sub-THP split decisions
> 
>  include/linux/damon.h      |  18 ++
>  include/linux/khugepaged.h |   3 +
>  mm/damon/Kconfig           |  12 +
>  mm/damon/Makefile          |   1 +
>  mm/damon/core.c            |   3 +
>  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>  mm/damon/spe.h             |  62 +++++
>  mm/damon/sysfs-schemes.c   |  96 +++++++
>  mm/damon/vaddr.c           | 118 +++++++++
>  mm/khugepaged.c            |  39 +++
>  10 files changed, 857 insertions(+)
>  create mode 100644 mm/damon/spe.c
>  create mode 100644 mm/damon/spe.h
> 
> --
> 2.50.1 (Apple Git-155)
> 

-- 
Asier Gutierrez
Huawei




* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18 11:03 ` [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Gutierrez Asier
@ 2026-06-18 13:13   ` wang lian
  2026-06-19  1:52     ` SeongJae Park
  0 siblings, 1 reply; 16+ messages in thread
From: wang lian @ 2026-06-18 13:13 UTC (permalink / raw)
  To: Gutierrez Asier
  Cc: sj, akpm, npache, daichaobing, linux-mm, linux-kernel, kunwu.chan




> On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> 
> Hi Wang,
> 
> On 6/18/2026 12:48 PM, Wang Lian wrote:
>> Received an off-list report that DAMON significantly overestimates
>> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
>> running Oracle workloads.
>> 
>> The root cause is structural: a PMD entry covers 512 4KB subpages with
>> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
>> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
>> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
>> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
>> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
>> subsequent accesses. x86 is not subject to this specific blindness under similar
>> conditions.
> 
> Have you tried setting the minimum region size to 2MB?
> 
>> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
>> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
>> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
>> reports only a few hundred MB -- a 512x overestimate relative to the actual
>> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
>> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
>> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> 
> THP always will just collapse the entire PID into huge pages anyway. This
> is outside DAMON's control.
> 
> Have you tried setting THP to never and running DAMON with DAMON_COLLAPSE
> action?
> 
>> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
>> mTHP-aware via a new target_order field, and introduces a new
>> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
>> into smaller mTHPs when most subpages are probed as cold, and collapse them
>> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
>> path can incorporate fine-grained hardware feedback from ARM SPE.
>> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
>> signal filter: it first identifies the peak chunk access count, and then marks
>> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
>> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
>> split decision: only folios with a hot fraction below this threshold are
>> eligible for splitting. When no SPE data is available, the infrastructure
>> gracefully falls back to explicit PTE-level scanning via folio_walk.
>> 
>> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
>> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
>> 
>> Collapse path (patches 1-3):
>>  DAMON scheme action=COLLAPSE, target_order=N
>>  -> damos_va_collapse() -> damon_collapse_folio_range()
>>  -> collapse_huge_page()
>> 
>> Split path (patches 4-5):
>>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>>  -> split_folio_to_order()
>> 
>> SPE feedback infrastructure (patch 6):
>>  perf script -> spe_hist -> debugfs spe_feed
>>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
>> 
>> The userspace helper tools (including the spe_hist histogram builder and
>> validation scripts) are archived at:
>>  https://github.com/lianux-mm/damon_spe
>> 
>> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
>> 7.1.0-rc5+):
>> 
>>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>>     DAMON to function normally.
>> 
>>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
>>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>>     between the two modes was ~33x.
>> 
>>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>>     behaved normally. We could not reproduce THP inflation with RocksDB.
>>     The workloads fundamentally vulnerable to this structural issue remain KVM
>>     guests, JVM large heaps, and PostgreSQL shared_buffers.
>> 
>>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
>> 
>>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>>     concentrated across only 3 out of 512 subpages.
> The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> we should keep in the user space and let the kernel provide only the API to add
> different metrics, including PMU and SPE.

Hi Asier,

Thanks for your prompt and constructive reply. I really appreciate your 
detailed analysis of the mTHP and SPE interaction.

Your point regarding the design boundary—whether this fits better in 
user space or aligned with DAMON-X—is highly valuable. 

Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
then. I look forward to cooperating with both of you to refine this 
design and find the best architectural fit for the subsystem.

Thanks,
Wang Lian
>>  End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>>     hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
>> 
>> Known limitations:
>> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>>  While individual component verification is complete, full integration testing
>>  is planned in collaboration with Sangfor.
>> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>>  coordination/back-off mechanism is required to avoid ping-pong effects.
>> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>>  kernel-side perf_event sampling integration is planned as a follow-up.
>> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>>  defaults subject to further tuning.
>> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>>  characteristic, not introduced by this series. Setting nr_accesses/min=0
>>  serves as an effective workaround for the split path.
>> 
>> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
>> 
>> Wang Lian (6):
>>  mm/damon: add target_order field for DAMOS_COLLAPSE
>>  mm/khugepaged: add damon_collapse_folio_range() for external callers
>>  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>>  mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>>  mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>>  mm/damon: add SPE feedback for sub-THP split decisions
>> 
>> include/linux/damon.h      |  18 ++
>> include/linux/khugepaged.h |   3 +
>> mm/damon/Kconfig           |  12 +
>> mm/damon/Makefile          |   1 +
>> mm/damon/core.c            |   3 +
>> mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>> mm/damon/spe.h             |  62 +++++
>> mm/damon/sysfs-schemes.c   |  96 +++++++
>> mm/damon/vaddr.c           | 118 +++++++++
>> mm/khugepaged.c            |  39 +++
>> 10 files changed, 857 insertions(+)
>> create mode 100644 mm/damon/spe.c
>> create mode 100644 mm/damon/spe.h
>> 
>> --
>> 2.50.1 (Apple Git-155)
>> 
> 
> -- 
> Asier Gutierrez
> Huawei



* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18 13:13   ` wang lian
@ 2026-06-19  1:52     ` SeongJae Park
  0 siblings, 0 replies; 16+ messages in thread
From: SeongJae Park @ 2026-06-19  1:52 UTC (permalink / raw)
  To: wang lian
  Cc: SeongJae Park, Gutierrez Asier, akpm, npache, daichaobing,
	linux-mm, linux-kernel, kunwu.chan

On Thu, 18 Jun 2026 21:13:07 +0800 wang lian <lianux.mm@gmail.com> wrote:

> 
> 
> > On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> > 
> > Hi Wang,
> > 
> > On 6/18/2026 12:48 PM, Wang Lian wrote:
> >> Received an off-list report that DAMON significantly overestimates
> >> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> >> running Oracle workloads.
> >> 
> >> The root cause is structural: a PMD entry covers 512 4KB subpages with
> >> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> >> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
> >> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> >> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> >> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> >> subsequent accesses. x86 is not subject to this specific blindness under similar
> >> conditions.
> > 
> > Have you tried setting the minimum region size to 2MB?
> > 
> >> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> >> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> >> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> >> reports only a few hundred MB -- a 512x overestimate relative to the actual
> >> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> >> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> >> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > THP always will just collapse the entire PID into huge pages anyway. This
> > is outside DAMON's control.
> > 
> > Have you tried setting THP to never and running DAMON with DAMON_COLLAPSE
> > action?
> > 
> >> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> >> mTHP-aware via a new target_order field, and introduces a new
> >> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> >> into smaller mTHPs when most subpages are probed as cold, and collapse them
> >> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> >> path can incorporate fine-grained hardware feedback from ARM SPE.
> >> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> >> signal filter: it first identifies the peak chunk access count, and then marks
> >> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> >> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> >> split decision: only folios with a hot fraction below this threshold are
> >> eligible for splitting. When no SPE data is available, the infrastructure
> >> gracefully falls back to explicit PTE-level scanning via folio_walk.
> >> 
> >> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> >> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> >> 
> >> Collapse path (patches 1-3):
> >>  DAMON scheme action=COLLAPSE, target_order=N
> >>  -> damos_va_collapse() -> damon_collapse_folio_range()
> >>  -> collapse_huge_page()
> >> 
> >> Split path (patches 4-5):
> >>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> >>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> >>  -> split_folio_to_order()
> >> 
> >> SPE feedback infrastructure (patch 6):
> >>  perf script -> spe_hist -> debugfs spe_feed
> >>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> >>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> >> 
> >> The userspace helper tools (including the spe_hist histogram builder and
> >> validation scripts) are archived at:
> >>  https://github.com/lianux-mm/damon_spe
> >> 
> >> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> >> 7.1.0-rc5+):
> >> 
> >>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> >>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> >>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> >>     DAMON to function normally.
> >> 
> >>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> >>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
> >>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> >>     between the two modes was ~33x.
> >> 
> >>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> >>     behaved normally. We could not reproduce THP inflation with RocksDB.
> >>     The workloads fundamentally vulnerable to this structural issue remain KVM
> >>     guests, JVM large heaps, and PostgreSQL shared_buffers.
> >> 
> >>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> >>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> >>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> >> 
> >>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> >>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> >>     concentrated across only 3 out of 512 subpages.
> > The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> > we should keep in the user space and let the kernel provide only the API to add
> > different metrics, including PMU and SPE.
> 
> Hi Asier,
> 
> Thanks for your prompt and constructive reply. I really appreciate your 
> detailed analysis of the mTHP and SPE interaction.

Indeed, very helpful comments.  Thank you Asier!

> 
> Your point regarding the design boundary—whether this fits better in 
> user space or aligned with DAMON-X—is highly valuable. 

Actually, Asier is talking about the perf event-based monitoring extension [1].
DAMON-X [2] is a different project.

> 
> Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
> then. I look forward to cooperating with both of you to refine this 
> design and find the best architectural fit for the subsystem.

As I also replied elsewhere in this thread, I'd prefer this to be aligned with
the perf event-based extension roadmap.

[1] https://lore.kernel.org/all/20260525225208.1179-1-sj@kernel.org/
[2] https://lwn.net/Articles/1071256/


Thanks,
SJ

[...]



* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
                   ` (6 preceding siblings ...)
  2026-06-18 11:03 ` [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Gutierrez Asier
@ 2026-06-19  1:47 ` SeongJae Park
  2026-06-19  1:54   ` SeongJae Park
  2026-06-19  3:40   ` Wang Lian
  7 siblings, 2 replies; 16+ messages in thread
From: SeongJae Park @ 2026-06-19  1:47 UTC (permalink / raw)
  To: Wang Lian
  Cc: SeongJae Park, akpm, npache, gutierrez.asier, daichaobing,
	linux-mm, linux-kernel, kunwu.chan

Hello Lian,

On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:

> Received an off-list report that DAMON significantly overestimates
> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> running Oracle workloads.
> 
> The root cause is structural: a PMD entry covers 512 4KB subpages with
> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> 2MB region appears "hot" to DAMON. On ARM64,

This makes sense to me.  I also agree this could have caused the reported
problem.  This is a known limitation of DAMON.  A straightforward workaround I
would suggest is using DAMON's 'age' information to better identify the hot
memory.

That is, I don't expect the really hot data in real production systems to be
evenly scattered.  Even if it is, I don't expect all of it to be accessed
equally frequently.  Only a small part of it would be accessed frequently for
long.  Even if that is not the case, some data would stay frequently accessed
for longer than the rest.  You could look at the distribution of the pattern
and treat the hottest X % of memory as hot.

We invented idle time percentiles [1] for a similar purpose, though they focus
more on finding cold memory.
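
As a rough illustration of the "hottest X %" idea above, the selection could be
sketched as follows.  This is only a sketch under stated assumptions: the
Region tuple and hottest_fraction() are hypothetical names made up for this
example and are not DAMON interfaces, and ranking by (nr_accesses, age) is just
one possible way to combine the two values.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical snapshot of one monitored region (not a DAMON struct)."""
    start: int        # start address in bytes
    end: int          # end address in bytes (exclusive)
    nr_accesses: int  # access frequency observed for the region
    age: int          # how long the region has kept this access pattern


def hottest_fraction(regions: List[Region], percent: float) -> List[Region]:
    """Pick the hottest regions until `percent` of the total size is used.

    Regions are ranked by access frequency first and by age second, so a
    region that has stayed hot for long wins over one that just became hot.
    """
    total = sum(r.end - r.start for r in regions)
    budget = total * percent / 100
    ranked = sorted(regions, key=lambda r: (r.nr_accesses, r.age), reverse=True)

    picked, used = [], 0
    for r in ranked:
        size = r.end - r.start
        if used + size > budget:
            break
        picked.append(r)
        used += size
    return picked


MIB = 1024 * 1024
regions = [
    Region(0 * MIB, 2 * MIB, nr_accesses=20, age=50),   # hot for long
    Region(2 * MIB, 4 * MIB, nr_accesses=20, age=3),    # hot only recently
    Region(4 * MIB, 6 * MIB, nr_accesses=1, age=100),
    Region(6 * MIB, 8 * MIB, nr_accesses=0, age=200),   # cold for long
]
hot = hottest_fraction(regions, 25)
print([hex(r.start) for r in hot])
```

With a 25 % budget only the region that has been hot for long is selected, even
though the second region shows the same access frequency.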

I understand this patch series is trying to build a more fundamental and better
solution for hardware that can do better.  Makes sense to me.

> this is compounded by the
> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> subsequent accesses.

This makes sense to me.  However, I don't get how this is contributing to the
problem.  Could you please elaborate?
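
For reference, the TLB-reach arithmetic behind the quoted claim can be sketched
as below.  The 2048-entry figure and the working set sizes come from the cover
letter.  Treating TLB reach as simply "entries times page size", with one entry
per mapping, is a simplifying assumption of this sketch and not a description
of how the Kunpeng 920 TLB is actually organized.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

L2_TLB_ENTRIES = 2048  # figure quoted in the cover letter


def tlb_entries_needed(working_set: int, page_size: int) -> int:
    """Translations needed to map the whole working set (ceiling division)."""
    return -(-working_set // page_size)


def fits_in_l2_tlb(working_set: int, page_size: int) -> bool:
    """True if every page of the working set can stay resident in the TLB."""
    return tlb_entries_needed(working_set, page_size) <= L2_TLB_ENTRIES


# Reach with 2MB THP under the one-entry-per-mapping assumption.
thp_reach = L2_TLB_ENTRIES * 2 * MIB

for name, size, page in [
    ("16MB working set, 2MB THP", 16 * MIB, 2 * MIB),
    ("512MB working set, 4KB pages", 512 * MIB, 4 * KIB),
    ("16GB working set, 2MB THP", 16 * GIB, 2 * MIB),
]:
    print(name, tlb_entries_needed(size, page), fits_in_l2_tlb(size, page))
```

Under these assumptions only the first case fits (8 entries), which is
consistent with the three T1 configurations described in the cover letter.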

> x86 is not subject to this specific blindness under similar
> conditions.

To my understanding, the same issue exists on x86.  If the TLB hits, the
Accessed bit is not set, and DAMON shows the memory as unaccessed.  Am I
missing something?

> 
> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> reports only a few hundred MB -- a 512x overestimate relative to the actual
> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.

I don't think real-world production systems have this very artificial access
pattern.  I believe (or hope) that using 'age' can work around the issue to a
reasonable level in many cases.  I understand this setup is only for a PoC, and
I think it is a well-designed test for that purpose.  Thank you for sharing
this.
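
The ratios quoted above can be re-derived with a few lines.  This is only a
sanity check of the arithmetic, using the 8GB, 16MB and ~245MB values from the
cover letter (the ~245MB figure is the THP=never result listed under T2); the
variable names are made up for this example.

```python
MIB = 1024 * 1024

mapped = 8 * 1024 * MIB          # 8GB mmap
truly_hot = 16 * MIB             # ground-truth hotspot
reported_thp_always = mapped     # DAMON reports the whole mapping as hot
reported_thp_never = 245 * MIB   # "~245MB" from T2

hot_percent = 100 * truly_hot / mapped
overestimate_always = reported_thp_always / truly_hot
overestimate_never = reported_thp_never / truly_hot
gap_between_modes = reported_thp_always / reported_thp_never

print(f"hot fraction:        {hot_percent:.2f} %")
print(f"THP=always vs truth: {overestimate_always:.0f}x")
print(f"THP=never vs truth:  {overestimate_never:.1f}x")
print(f"always vs never:     {gap_between_modes:.1f}x")
```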

> 
> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> mTHP-aware via a new target_order field,

Makes sense, and sounds nice.  Definitely no one size fits all!

> and introduces a new
> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> into smaller mTHPs

Nice!  Asier was planning to do similar work in the future.  I think you could
collaborate to avoid duplicated work!

I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE, though.
Say, DAMOS_SPLIT?

> when most subpages are probed as cold, and collapse them
> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> path can incorporate fine-grained hardware feedback from ARM SPE.
> 
> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> signal filter: it first identifies the peak chunk access count, and then marks
> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> split decision: only folios with a hot fraction below this threshold are
> eligible for splitting. When no SPE data is available, the infrastructure
> gracefully falls back to explicit PTE-level scanning via folio_walk.
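
The two-pass filter and the split decision quoted above could be sketched
roughly as follows.  This is only an illustration based on the description in
the cover letter, not the code in mm/damon/spe.c: the function names are made
up for this example, the percentage is computed with integer division as an
assumption, and the PTE-scanning fallback is represented only by returning
None when there are no SPE samples.

```python
from typing import List, Optional

NOISE_DIVISOR = 10          # subpages with >= 1/10 of the peak count are hot
DEFAULT_HOT_THRESHOLD = 30  # percent, default quoted in the cover letter


def hot_bitmap(access_count: List[int]) -> List[bool]:
    """Pass 1 finds the peak count, pass 2 marks counts >= peak/10 as hot."""
    peak = max(access_count, default=0)
    if peak == 0:
        return [False] * len(access_count)
    # Integer form of "count >= peak / 10", avoiding floating point.
    return [count * NOISE_DIVISOR >= peak for count in access_count]


def hot_fraction_percent(bitmap: List[bool]) -> int:
    """Percentage of subpages marked hot (integer division, an assumption)."""
    return 100 * sum(bitmap) // len(bitmap) if bitmap else 0


def eligible_for_split(access_count: List[int],
                       hot_threshold: int = DEFAULT_HOT_THRESHOLD
                       ) -> Optional[bool]:
    """True if the folio is cold enough to split, None if no SPE samples."""
    if not any(access_count):
        return None  # the series falls back to PTE-level scanning here
    return hot_fraction_percent(hot_bitmap(access_count)) < hot_threshold


sparse = [0] * 512
sparse[0], sparse[1], sparse[2], sparse[3] = 30000, 9000, 794, 5
dense = [100] * 461 + [0] * 51

print(eligible_for_split(sparse), eligible_for_split(dense))
```

In the sparse example the counts 794 and 5 fall below one tenth of the peak and
are treated as noise, so only two of the 512 subpages count as hot.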
> 
> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> through a histogram builder into /sys/kernel/debug/damon/spe_feed).

So you implemented a debugfs interface?  That is a nice approach for a PoC, but
it may be difficult to upstream as is.

You could build a control plane that decides the exact address ranges to split,
and feed them directly to DAMOS using the DAMOS address filter.
max_nr_snapshots can also be useful for making this kind of user-space control
more deterministic.

For simpler user-space control, the user_input DAMOS quota goal [2] is another
option.

We are also planning [3] to extend DAMON for perf events.  On top of that, we
might be able to extend it further so that DAMON itself utilizes ARM SPE, and
do all this with DAMOS alone, without user-space help.

Based on the 'limitations' section below, I understand this is only a PoC at
the moment, and that you plan to explore the perf event-based approach.  I'd
also recommend that.

> 
> Collapse path (patches 1-3):
>   DAMON scheme action=COLLAPSE, target_order=N
>   -> damos_va_collapse() -> damon_collapse_folio_range()
>   -> collapse_huge_page()
> 
> Split path (patches 4-5):
>   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
>   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
>   -> split_folio_to_order()
> 
> SPE feedback infrastructure (patch 6):
>   perf script -> spe_hist -> debugfs spe_feed
>   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
>   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
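
The per-folio histogram in the quoted SPE path can be pictured with the sketch
below.  It is a user-space illustration only: a plain dict stands in for the
rbtree, the input is assumed to be a list of 4KB-page PFNs already extracted
from SPE samples, and build_histogram() is a made-up name.  The actual
spe_feed input format is not described in this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

SUBPAGES_PER_THP = 512  # 2MB THP / 4KB base page


def build_histogram(sample_pfns: Iterable[int]) -> Dict[int, List[int]]:
    """Map each THP-aligned PFN to per-subpage access counts."""
    hist = defaultdict(lambda: [0] * SUBPAGES_PER_THP)
    for pfn in sample_pfns:
        base = pfn & ~(SUBPAGES_PER_THP - 1)  # THP-aligned PFN (the key)
        hist[base][pfn - base] += 1           # index of the 4KB subpage
    return dict(hist)


samples = [0x820db800, 0x820db800, 0x820db801, 0x820db9ff, 0x820dba00]
hist = build_histogram(samples)
for base, counts in sorted(hist.items()):
    touched = sum(1 for c in counts if c)
    print(hex(base), "samples:", sum(counts), "subpages touched:", touched)
```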
> 
> The userspace helper tools (including the spe_hist histogram builder and
> validation scripts) are archived at:
>   https://github.com/lianux-mm/damon_spe

Thank you for making all this great code open!

> 
> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> 7.1.0-rc5+):
> 
>   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
>      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
>      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
>      DAMON to function normally.
> 
>   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
>      THP=always: DAMON reported 8GB hot (512x vs ground truth);
>      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
>      between the two modes was ~33x.
> 
>   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
>      behaved normally. We could not reproduce THP inflation with RocksDB.
>      The workloads fundamentally vulnerable to this structural issue remain KVM
>      guests, JVM large heaps, and PostgreSQL shared_buffers.
> 
>   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
>      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
>      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> 
>   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
>      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
>      concentrated across only 3 out of 512 subpages.
> 
>   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
>      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
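
The folio counts quoted in T4 and in the end-to-end test follow from the folio
order.  The sketch below assumes a 4KB base page and infers target_order=2 from
the 16KB folio size; neither value is stated explicitly in the cover letter,
and folios_after_split() is a made-up helper name.

```python
PAGE_SIZE = 4096  # assumed 4KB base page
MIB = 1024 * 1024


def folio_bytes(order: int) -> int:
    """Size in bytes of a folio of the given order."""
    return PAGE_SIZE << order


def folios_after_split(region_bytes: int, target_order: int) -> int:
    """Number of folios when a region is fully split to target_order."""
    size = folio_bytes(target_order)
    if region_bytes % size:
        raise ValueError("region is not a multiple of the target folio size")
    return region_bytes // size


print(folio_bytes(9) // MIB, "MB per PMD-sized THP")
print(folios_after_split(256 * MIB, 2), "folios of 16KB in 256MB (T4)")
print(folios_after_split(2 * MIB, 2), "folios of 16KB in one 2MB THP")
```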
> 
> Known limitations:
> - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
>   While individual component verification is complete, full integration testing
>   is planned in collaboration with Sangfor.
> - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
>   coordination/back-off mechanism is required to avoid ping-pong effects.

Do you really need to run khugepaged together, when you already have
DAMOS_COLLAPSE and are running DAMON for hugepage splits anyway?

> - SPE data is currently funneled via a userspace daemon and debugfs. Direct
>   kernel-side perf_event sampling integration is planned as a follow-up.

Nice, I think this will keep our projects aligned and avoid duplicated work.
I'd encourage you to try this path.

> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
>   defaults subject to further tuning.

I don't fully understand this part.  Could you please elaborate?

> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
>   characteristic, not introduced by this series. Setting nr_accesses/min=0
>   serves as an effective workaround for the split path.

I don't fully understand this either.  Could you please elaborate and
enlighten me?

> 
> Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> 
> Wang Lian (6):
>   mm/damon: add target_order field for DAMOS_COLLAPSE
>   mm/khugepaged: add damon_collapse_folio_range() for external callers
>   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
>   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
>   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
>   mm/damon: add SPE feedback for sub-THP split decisions
> 
>  include/linux/damon.h      |  18 ++
>  include/linux/khugepaged.h |   3 +
>  mm/damon/Kconfig           |  12 +
>  mm/damon/Makefile          |   1 +
>  mm/damon/core.c            |   3 +
>  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
>  mm/damon/spe.h             |  62 +++++
>  mm/damon/sysfs-schemes.c   |  96 +++++++
>  mm/damon/vaddr.c           | 118 +++++++++
>  mm/khugepaged.c            |  39 +++
>  10 files changed, 857 insertions(+)
>  create mode 100644 mm/damon/spe.c
>  create mode 100644 mm/damon/spe.h

Because this is an RFC and we found a high-level TODO (trying the perf
event-based approach instead of debugfs), I will skip reviewing the details.
If there are specific parts for which you want my detailed review, let me know.

Also, the perf event-based monitoring is a long-term project.  The ETA is
LSFMMBPF'27.  If you cannot wait until then, trying the alternative approaches
(using the address filter or the user_input quota goal) and upstreaming the
dependent parts (DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first
could also be a nice approach, in my opinion.

[1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
[2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
[3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/

Thanks,
SJ

[...]


* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-19  1:47 ` SeongJae Park
@ 2026-06-19  1:54   ` SeongJae Park
  2026-06-19  1:59     ` SeongJae Park
  2026-06-19  3:40   ` Wang Lian
  1 sibling, 1 reply; 16+ messages in thread
From: SeongJae Park @ 2026-06-19  1:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Wang Lian, akpm, npache, gutierrez.asier, daichaobing, linux-mm,
	linux-kernel, kunwu.chan

On Thu, 18 Jun 2026 18:47:16 -0700 SeongJae Park <sj@kernel.org> wrote:

> Hello Lian,
> 
> On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:
> 
> > Received an off-list report that DAMON significantly overestimates
> > hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> > running Oracle workloads.
> > 
> > The root cause is structural: a PMD entry covers 512 4KB subpages with
> > a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> > 2MB region appears "hot" to DAMON. On ARM64,
> 
> This makes sense to me.  I also agree this could have caused the reported
> problem.  This is a known limitation of DAMON.  A straightforward workaround I
> would suggest is using DAMON's 'age' information to better identify the hot
> memory.
> 
> That is, I don't expect the really hot data in real production systems to be
> evenly scattered.  Even if it is, I don't expect all of it to be accessed
> equally frequently.  Only a small part of it would be accessed frequently for
> long.  Even if that is not the case, some data would stay frequently accessed
> for longer than the rest.  You could look at the distribution of the pattern
> and treat the hottest X % of memory as hot.
> 
> We invented idle time percentiles [1] for a similar purpose, though they focus
> more on finding cold memory.
> 
> I understand this patch series is trying to build a more fundamental and better
> solution for hardware that can do better.  Makes sense to me.
> 
> > this is compounded by the
> > hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> > working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> > running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> > subsequent accesses.
> 
> This makes sense to me.  However, I don't get how this is contributing to the
> problem.  Could you please elaborate?
> 
> > x86 is not subject to this specific blindness under similar
> > conditions.
> 
> To my understanding, the same issue exists on x86.  If the TLB hits, the
> Accessed bit is not set, and DAMON shows the memory as unaccessed.  Am I
> missing something?
> 
> > 
> > We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> > workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> > THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> > reports only a few hundred MB -- a 512x overestimate relative to the actual
> > 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> > independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> > over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> 
> I don't think real-world production systems have this very artificial access
> pattern.  I believe (or hope) that using 'age' can work around the issue to a
> reasonable level in many cases.  I understand this setup is only for a PoC, and
> I think it is a well-designed test for that purpose.  Thank you for sharing
> this.
> 
> > 
> > To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> > mTHP-aware via a new target_order field,
> 
> Makes sense, and sounds nice.  Definitely no one size fits all!
> 
> > and introduces a new
> > DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> > into smaller mTHPs
> 
> Nice!  Asier was planning to do similar work in the future.  I think you could
> collaborate to avoid duplicated work!
> 
> I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE, though.
> Say, DAMOS_SPLIT?
> 
> > when most subpages are probed as cold, and collapse them
> > back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> > path can incorporate fine-grained hardware feedback from ARM SPE.
> > 
> > The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> > signal filter: it first identifies the peak chunk access count, and then marks
> > sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> > SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> > split decision: only folios with a hot fraction below this threshold are
> > eligible for splitting. When no SPE data is available, the infrastructure
> > gracefully falls back to explicit PTE-level scanning via folio_walk.
> > 
> > Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> > through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> 
> So you implemented a debugfs interface?  That is a nice approach for a PoC, but
> it may be difficult to upstream as is.
> 
> You could build a control plane that decides the exact address ranges to split,
> and feed them directly to DAMOS using the DAMOS address filter.
> max_nr_snapshots can also be useful for making this kind of user-space control
> more deterministic.
> 
> For simpler user-space control, the user_input DAMOS quota goal [2] is another
> option.
> 
> We are also planning [3] to extend DAMON for perf events.  On top of that, we
> might be able to extend it further so that DAMON itself utilizes ARM SPE, and
> do all this with DAMOS alone, without user-space help.
> 
> Based on the 'limitations' section below, I understand this is only a PoC at
> the moment, and that you plan to explore the perf event-based approach.  I'd
> also recommend that.
> 
> > 
> > Collapse path (patches 1-3):
> >   DAMON scheme action=COLLAPSE, target_order=N
> >   -> damos_va_collapse() -> damon_collapse_folio_range()
> >   -> collapse_huge_page()
> > 
> > Split path (patches 4-5):
> >   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> >   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> >   -> split_folio_to_order()
> > 
> > SPE feedback infrastructure (patch 6):
> >   perf script -> spe_hist -> debugfs spe_feed
> >   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> >   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> > 
> > The userspace helper tools (including the spe_hist histogram builder and
> > validation scripts) are archived at:
> >   https://github.com/lianux-mm/damon_spe
> 
> Thank you for making all the grateful code open!
> 
> > 
> > Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> > 7.1.0-rc5+):
> > 
> >   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> >      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> >      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> >      DAMON to function normally.
> > 
> >   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> >      THP=always: DAMON reported 8GB hot (512x vs ground truth);
> >      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> >      between the two modes was ~33x.
> > 
> >   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> >      behaved normally. We could not reproduce THP inflation with RocksDB.
> >      The workloads fundamentally vulnerable to this structural issue remain KVM
> >      guests, JVM large heaps, and PostgreSQL shared_buffers.
> > 
> >   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> >      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> >      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> > 
> >   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> >      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> >      concentrated across only 3 out of 512 subpages.
> > 
> >   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
> >      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> > 
> > Known limitations:
> > - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
> >   While individual component verification is complete, full integration testing
> >   is planned in collaboration with Sangfor.
> > - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
> >   coordination/back-off mechanism is required to avoid ping-pong effects.
> 
> Do you really need to run khugepaged together, when you already have
> DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?
> 
> > - SPE data is currently funneled via a userspace daemon and debugfs. Direct
> >   kernel-side perf_event sampling integration is planned as a follow-up.
> 
> Nice, I think this will make our projects aligned and reduce unnecessary
> duplicates.  I'd encourage you to try this path.
> 
> > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
> >   defaults subject to further tuning.
> 
> I don't fully understand this part.  Could you please elaborate?
> 
> > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
> >   characteristic, not introduced by this series. Setting nr_accesses/min=0
> >   serves as an effective workaround for the split path.
> 
> I don't fully understand this, too.  Could you please elaborate and enlighten
> me?
> 
> > 
> > Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> > Cc: SeongJae Park <sj@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Nico Pache <npache@redhat.com>
> > Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> > 
> > Wang Lian (6):
> >   mm/damon: add target_order field for DAMOS_COLLAPSE
> >   mm/khugepaged: add damon_collapse_folio_range() for external callers
> >   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
> >   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
> >   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
> >   mm/damon: add SPE feedback for sub-THP split decisions
> > 
> >  include/linux/damon.h      |  18 ++
> >  include/linux/khugepaged.h |   3 +
> >  mm/damon/Kconfig           |  12 +
> >  mm/damon/Makefile          |   1 +
> >  mm/damon/core.c            |   3 +
> >  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
> >  mm/damon/spe.h             |  62 +++++
> >  mm/damon/sysfs-schemes.c   |  96 +++++++
> >  mm/damon/vaddr.c           | 118 +++++++++
> >  mm/khugepaged.c            |  39 +++
> >  10 files changed, 857 insertions(+)
> >  create mode 100644 mm/damon/spe.c
> >  create mode 100644 mm/damon/spe.h
> 
> Because this is an RFC and we found a high level TODO (trying the perf event
> based approach instead of debugfs), I will skip reviewing the details.  If there
> are specific parts that you want my detailed review on, let me know.
> 
> Also, the perf event based monitoring is a long term project.  The ETA is
> LSFMMBPF'27.  If you cannot wait until then, trying the alternative approaches
> (using address filter or user_input quota goal) and upstreaming the dependent
> parts (DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first could also be a
> nice approach, in my opinion.
> 
> [1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
> [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> [3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/

The above link ([3]) is wrong, sorry.  Please use below.

[3] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/


Thanks,
SJ

[...]



* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-19  1:54   ` SeongJae Park
@ 2026-06-19  1:59     ` SeongJae Park
  0 siblings, 0 replies; 16+ messages in thread
From: SeongJae Park @ 2026-06-19  1:59 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Wang Lian, akpm, npache, gutierrez.asier, daichaobing, linux-mm,
	linux-kernel, kunwu.chan, damon

+ damon@lists.linux.dev

Please Cc damon@lists.linux.dev from the next revision, and all DAMON patches
in future.


Thanks,
SJ

On Thu, 18 Jun 2026 18:54:23 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Thu, 18 Jun 2026 18:47:16 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > Hello Lian,
> > 
> > On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:
> > 
> > > Received an off-list report that DAMON significantly overestimates
> > > hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> > > running Oracle workloads.
> > > 
> > > The root cause is structural: a PMD entry covers 512 4KB subpages with
> > > a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> > > 2MB region appears "hot" to DAMON. On ARM64,
> > 
> > This makes sense to me.  I also agree this could have caused the reported problem.
> > And this is a known limitation of DAMON.  My suggestion for straightforward
> > workaround of this problem is, using 'age' information of DAMON for better
> > identification of the hot memory.
> > 
> > That is, I don't expect real hot data in real production systems to be evenly
> > scattered.  Even if it is, I don't expect all of it to be accessed equally
> > frequently.  Only a few parts would be accessed frequently for long.  Even if
> > that is not the case, there would be data that is accessed frequently for
> > longer.  You could show the distribution of the pattern and treat the X % of
> > hottest memory as hot.
> > 
> > We invented idle time percentiles [1] for a similar purpose, though it is more
> > focusing on finding cold memory.
> > 
> > I understand this patch series is trying to make more fundamental and better
> > solution on hardware that can do better.  Makes sense to me.
> > 
> > > this is compounded by the
> > > hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> > > working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> > > running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> > > subsequent accesses.
> > 
> > This makes sense to me.  However, I don't get how this is contributing to the
> > problem.  Could you please elaborate?
> > 
> > > x86 is not subject to this specific blindness under similar
> > > conditions.
> > 
> > To my understanding on x86, same issue exists.  If TLB hits, Accessed bit is
> > not set, and DAMON shows it as unaccessed.  Am I missing something?
> > 
> > > 
> > > We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> > > workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> > > THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> > > reports only a few hundred MB -- a 512x overestimate relative to the actual
> > > 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> > > independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> > > over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > I don't think the real world production systems to have this very artificial
> > access pattern.  I believe (or, hope) use of 'age' can work around the issue in
> > a reasonable level for many cases.  I understand this setup is only for PoC,
> > and I think this is well designed test for the purpose.  Thank you for sharing
> > this.
> > 
> > > 
> > > To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> > > mTHP-aware via a new target_order field,
> > 
> > Makes sense, and sounds nice.  Definitely no one size fits all!
> > 
> > > and introduces a new
> > > DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> > > into smaller mTHPs
> > 
> > Nice!  Asier was planning to do similar work in future.  I think you could
> > collaborate to reduce unnecessary duplicates!
> > 
> > I'd suggest making the name simpler and consistent to DAMOS_COLLAPSE, though.
> > Say, DAMOS_SPLIT ?
> > 
> > > when most subpages are probed as cold, and collapse them
> > > back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> > > path can incorporate fine-grained hardware feedback from ARM SPE.
> > > 
> > > The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> > > signal filter: it first identifies the peak chunk access count, and then marks
> > > sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> > > SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> > > split decision: only folios with a hot fraction below this threshold are
> > > eligible for splitting. When no SPE data is available, the infrastructure
> > > gracefully falls back to explicit PTE-level scanning via folio_walk.
> > > 
> > > Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> > > through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> > 
> > So you implemented a debugfs interface?  That must be a nice approach for PoC.
> > But it may be difficult to be upstreamed as is.
> > 
> > You could build a control plane that decides the exact address ranges to split,
> > and directly feed it to DAMOS using DAMOS address filter.  max_nr_snapshots can
> > also be useful for making such kind of user space controls more deterministic.
> > 
> > For simpler user-space control, utilizing user_input DAMOS quota goal [2]
> > should also be another option.
> > 
> > We are also planning [3] to extend DAMON for perf events.  On top of it, we
> > might be able to extend it further to utilize ARM SPE by DAMON itself, and do
> > all this without the user space help but only DAMOS.
> > 
> > Based on the 'limitations' section below, I understand this is only for PoC at the
> > moment, and you plan to explore the perf event based approach.  I'd also
> > recommend that.
> > 
> > > 
> > > Collapse path (patches 1-3):
> > >   DAMON scheme action=COLLAPSE, target_order=N
> > >   -> damos_va_collapse() -> damon_collapse_folio_range()
> > >   -> collapse_huge_page()
> > > 
> > > Split path (patches 4-5):
> > >   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> > >   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> > >   -> split_folio_to_order()
> > > 
> > > SPE feedback infrastructure (patch 6):
> > >   perf script -> spe_hist -> debugfs spe_feed
> > >   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> > >   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> > > 
> > > The userspace helper tools (including the spe_hist histogram builder and
> > > validation scripts) are archived at:
> > >   https://github.com/lianux-mm/damon_spe
> > 
> > Thank you for making all the grateful code open!
> > 
> > > 
> > > Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> > > 7.1.0-rc5+):
> > > 
> > >   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> > >      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> > >      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> > >      DAMON to function normally.
> > > 
> > >   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> > >      THP=always: DAMON reported 8GB hot (512x vs ground truth);
> > >      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> > >      between the two modes was ~33x.
> > > 
> > >   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> > >      behaved normally. We could not reproduce THP inflation with RocksDB.
> > >      The workloads fundamentally vulnerable to this structural issue remain KVM
> > >      guests, JVM large heaps, and PostgreSQL shared_buffers.
> > > 
> > >   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> > >      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> > >      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> > > 
> > >   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> > >      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> > >      concentrated across only 3 out of 512 subpages.
> > > 
> > >   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
> > >      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> > > 
> > > Known limitations:
> > > - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
> > >   While individual component verification is complete, full integration testing
> > >   is planned in collaboration with Sangfor.
> > > - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
> > >   coordination/back-off mechanism is required to avoid ping-pong effects.
> > 
> > Do you really need to run khugepaged together, when you already have
> > DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?
> > 
> > > - SPE data is currently funneled via a userspace daemon and debugfs. Direct
> > >   kernel-side perf_event sampling integration is planned as a follow-up.
> > 
> > Nice, I think this will make our projects aligned and reduce unnecessary
> > duplicates.  I'd encourage you to try this path.
> > 
> > > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
> > >   defaults subject to further tuning.
> > 
> > I don't fully understand this part.  Could you please elaborate?
> > 
> > > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
> > >   characteristic, not introduced by this series. Setting nr_accesses/min=0
> > >   serves as an effective workaround for the split path.
> > 
> > I don't fully understand this, too.  Could you please elaborate and enlighten
> > me?
> > 
> > > 
> > > Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> > > Cc: SeongJae Park <sj@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Nico Pache <npache@redhat.com>
> > > Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> > > 
> > > Wang Lian (6):
> > >   mm/damon: add target_order field for DAMOS_COLLAPSE
> > >   mm/khugepaged: add damon_collapse_folio_range() for external callers
> > >   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
> > >   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
> > >   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
> > >   mm/damon: add SPE feedback for sub-THP split decisions
> > > 
> > >  include/linux/damon.h      |  18 ++
> > >  include/linux/khugepaged.h |   3 +
> > >  mm/damon/Kconfig           |  12 +
> > >  mm/damon/Makefile          |   1 +
> > >  mm/damon/core.c            |   3 +
> > >  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
> > >  mm/damon/spe.h             |  62 +++++
> > >  mm/damon/sysfs-schemes.c   |  96 +++++++
> > >  mm/damon/vaddr.c           | 118 +++++++++
> > >  mm/khugepaged.c            |  39 +++
> > >  10 files changed, 857 insertions(+)
> > >  create mode 100644 mm/damon/spe.c
> > >  create mode 100644 mm/damon/spe.h
> > 
> > Because this is an RFC and we found a high level TODO (trying the perf event
> > based approach instead of debugfs), I will skip reviewing the details.  If there
> > are specific parts that you want my detailed review on, let me know.
> > 
> > Also, the perf event based monitoring is a long term project.  The ETA is
> > LSFMMBPF'27.  If you cannot wait until then, trying the alternative approaches
> > (using address filter or user_input quota goal) and upstreaming the dependent
> > parts (DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first could also be a
> > nice approach, in my opinion.
> > 
> > [1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
> > [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> > [3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/
> 
> The above link ([3]) is wrong, sorry.  Please use below.
> 
> [3] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/
> 
> 
> Thanks,
> SJ
> 
> [...]
> 

Sent using hkml (https://github.com/sjp38/hackermail)


* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-19  1:47 ` SeongJae Park
  2026-06-19  1:54   ` SeongJae Park
@ 2026-06-19  3:40   ` Wang Lian
  2026-06-19 14:31     ` Gutierrez Asier
  1 sibling, 1 reply; 16+ messages in thread
From: Wang Lian @ 2026-06-19  3:40 UTC (permalink / raw)
  To: sj
  Cc: akpm, daichaobing, gutierrez.asier, kunwu.chan, lianux.mm,
	linux-kernel, linux-mm, npache

Hi SeongJae,

Thank you for the thorough and thoughtful review.  Your feedback on the
x86 AF behavior was an important correction -- I'll address that and
your other questions below.

On Thu, 18 Jun 2026 SeongJae Park <sj@kernel.org> wrote:

> This makes sense to me.  I also agree this could have caused the reported
> problem.  And this is a known limitation of DAMON.  My suggestion for
> straightforward workaround of this problem is, using 'age' information
> of DAMON for better identification of the hot memory.

Thank you for pointing out idle time percentiles [1].  We agree that 'age'
helps differentiate frequently-accessed from occasionally-accessed regions,
and it is a good workaround for many cases.

However, age operates at region granularity, which is still at or above
PMD level for THP-mapped memory.  When only a few 4KB subpages within a
2MB THP are hot, age tells us the region has been accessed recently, but
not which subpages are hot.  The split decision needs sub-PMD information,
which is what the SPE heatmap provides.

That said, combining age with split could be valuable: split only regions
that have been consistently hot (high age) AND have sparse sub-page access
patterns.  We will explore this.

> > On ARM64, this is compounded by the hardware AF mechanism -- the AF
> > is only set on a TLB miss.
>
> This makes sense to me.  However, I don't get how this is contributing
> to the problem.  Could you please elaborate?

The AF-on-TLB-miss behavior creates a second-order problem that directly
exacerbates the overestimation. 

When DAMON's mkold path clears the PMD AF, it deliberately skips the TLB 
flush to minimize overhead. If the dense working set fits entirely within 
the L2 TLB (e.g., 16MB workload using 8 PMD entries on Kunpeng 920's 2048-entry 
L2 TLB), subsequent hardware accesses hit the valid, stale TLB entries 
directly. The hardware MMU never generates a page table walk, so the 
in-memory PMD AF stays 0. 

Consequently, DAMON sees `nr_accesses = 0` and assumes the region is completely
cold, making it impossible to track shifts in sub-page usage. When
sporadic/noise accesses later hit other parts of this "seemingly cold" PMD
and trigger an isolated TLB refill, DAMON abruptly sees the whole 2MB
as hot. This binary oscillation (completely blind vs. fully hot) is what
drives the massive overestimation under THP.
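
The stale-TLB effect described above can be illustrated with a toy model.
The sketch below is illustrative Python, not code from this series.  It
assumes a fully associative LRU TLB with one entry per PMD, sets the AF only
on a page table walk, and clears the AF without flushing.  The names ToyMMU
and observed_hot_pmds are hypothetical.

```python
from collections import OrderedDict


class ToyMMU:
    """Toy model: one AF bit per PMD and an LRU TLB (an assumption, not real HW)."""

    def __init__(self, tlb_entries):
        self.tlb_entries = tlb_entries
        self.tlb = OrderedDict()  # PMD index -> cached translation, in LRU order
        self.af = {}              # PMD index -> in-memory Access Flag

    def access(self, pmd):
        if pmd in self.tlb:
            # TLB hit: no page table walk, so the in-memory AF is untouched.
            self.tlb.move_to_end(pmd)
            return
        # TLB miss: the walk sets the AF and fills the TLB.
        self.af[pmd] = True
        self.tlb[pmd] = True
        if len(self.tlb) > self.tlb_entries:
            self.tlb.popitem(last=False)  # evict the least recently used entry

    def mkold(self):
        # Clear the AF but skip the TLB flush, as the mkold path does.
        for pmd in self.af:
            self.af[pmd] = False


def observed_hot_pmds(nr_pmds, tlb_entries):
    """Count PMDs whose AF is set after one sampling interval."""
    mmu = ToyMMU(tlb_entries)
    for pmd in range(nr_pmds):  # warm-up pass populates the TLB
        mmu.access(pmd)
    mmu.mkold()                 # start of the sampling interval
    for pmd in range(nr_pmds):  # the workload keeps touching every PMD
        mmu.access(pmd)
    return sum(mmu.af.values())
```

In this model, observed_hot_pmds(8, 2048) leaves every AF clear, while
observed_hot_pmds(8192, 2048) sets the AF on all 8192 PMDs, because each
entry is evicted before it is touched again.  Real TLBs are set-associative
and multi-level, so the crossover is less sharp than this.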

We confirmed this TLB-reach aspect empirically via our T1 test:
  16MB THP (8 PMDs, 0.4% of L2 TLB reach) -> DAMON tracks 0 accesses (blind)
  16GB THP (8192 PMDs, 400% of L2 TLB reach) -> DAMON tracks normally due to natural eviction
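
The percentages above follow from dividing the number of PMD mappings by the
L2 TLB entry count.  A small Python check, assuming one TLB entry per 2MB PMD
mapping (the helper names are hypothetical):

```python
PMD_SIZE = 2 * 1024 * 1024  # bytes mapped by one PMD-level THP entry
L2_TLB_ENTRIES = 2048       # Kunpeng 920 L2 TLB size quoted in this thread

MB = 1024 * 1024
GB = 1024 * MB


def pmd_count(working_set_bytes):
    """Number of PMD entries needed to map a THP-backed working set."""
    return working_set_bytes // PMD_SIZE


def tlb_occupancy_percent(working_set_bytes):
    """PMD entries as a percentage of L2 TLB entries (assumes 1 entry per PMD)."""
    return 100.0 * pmd_count(working_set_bytes) / L2_TLB_ENTRIES


small = tlb_occupancy_percent(16 * MB)  # 8 PMDs, well under 1% of the TLB
large = tlb_occupancy_percent(16 * GB)  # 8192 PMDs, four times the TLB
```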

> > x86 is not subject to this specific blindness under similar
> > conditions.
>
> To my understanding on x86, same issue exists.  If TLB hits, Aceessed
> bit is not set, and DAMON shows it as unaccessed.  Am I missing
> something?

You are entirely right, and I was wrong on this point. I re-checked the 
kernel source and verified that x86's ptep_test_and_clear_young() does NOT 
flush the TLB. Even ptep_clear_flush_young() on x86 deliberately skips the 
flush as a performance optimization (arch/x86/mm/pgtable.c:486-502). The 
same optimization exists on PowerPC and RISC-V.

Therefore, both architectures are theoretically vulnerable to this stale-TLB 
blind spot under identical tightly-fit workloads. Our initial assumption 
was biased because T1 was only conducted on ARM64. We will reproduce the 
T1 setup on x86 to verify the exact behavior, and I will correct this 
claim in the v2 cover letter. Thank you for catching this mistake.

> Nice!  Asier was planning to do similar work in future.  I think you
> could collaborate to reduce unnecessary duplicates!

Great to hear! We would be happy to collaborate with Asier. I'll reach
out to him to coordinate our efforts.

> I'd suggest making the name simpler and consistent to DAMOS_COLLAPSE,
> though.  Say, DAMOS_SPLIT ?

Agreed. DAMOS_SPLIT is cleaner and fits the existing naming convention 
perfectly. Will rename in v2.

> So you implemented a debugfs interface?  That must be a nice approach
> for PoC.  But it may be difficult to be upstreamed as is.
>
> You could build a control plane that decides the exact address ranges
> to split, and directly feed it to DAMOS using DAMOS address filter.

The native perf event approach [3] aligns perfectly with our long-term 
Phase 2c plan, and we are highly interested in collaborating on it to 
eliminate the userspace daemon and debugfs bridge entirely.

However, since native kernel-side SPE handling is a long-term item, we 
will follow your pragmatic alternative suggestion for v2: use DAMOS address 
filters or user_input quota goals [2] to feed the split decisions from 
userspace cleanly. This allows us to upstream the core infrastructure 
(mTHP target_order for collapse and the new DAMOS_SPLIT action) first.

> Do you really need to khugepaged together, when you already have
> DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?

Excellent point. Running both concurrently on the same VMA introduces 
redundancy and heavy ping-pong effects. 

Option (b) is definitely cleaner: we will let DAMON handle both split and 
re-collapse decisions using its own access data. To make this robust in 
production environments where khugepaged is globally enabled, we will 
explore having the DAMOS_SPLIT path temporarily mark the target ranges 
(e.g., via a pseudo-VM_NOHUGEPAGE back-off mechanism) to prevent 
khugepaged from immediately undoing DAMON's work.

> > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are
> >   empirical defaults subject to further tuning.
>
> I don't fully understand this part.  Could you please elaborate?

Since ARM SPE statistically samples individual operations rather than recording
every access, the raw data is noisy.

The TTL (30s) defines the lifecycle of our per-folio rbtree tracking entries. 
Entries not updated within 30 seconds are pruned to prevent stale tracking data 
from corrupting split decisions after a workload phase change. 30s is selected 
to comfortably outlive DAMON's aggregation intervals while keeping the rbtree 
memory footprint tightly bounded.
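
The TTL rule can be sketched as follows.  This is illustrative Python under
stated assumptions, not the code in mm/damon/spe.c: a dict stands in for the
per-folio rbtree, and the entry layout and the name prune_stale are
hypothetical.

```python
TTL_SECONDS = 30.0  # empirical default quoted above


def prune_stale(entries, now, ttl=TTL_SECONDS):
    """Drop entries that were not updated within `ttl` seconds.

    `entries` maps a THP-aligned PFN to {"last_update": seconds, "counts": [...]}.
    Returns the number of entries removed.
    """
    stale = [pfn for pfn, entry in entries.items()
             if now - entry["last_update"] > ttl]
    for pfn in stale:
        del entries[pfn]
    return len(stale)
```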

The signal threshold (1/10 of peak) filters out the statistical sampling noise. 
Instead of treating any subpage with access > 0 as hot, the algorithm finds the 
peak access count inside the 2MB region and only marks sub-chunks with >= 1/10 
of that peak as genuinely hot. On Kunpeng 920, this specific threshold successfully 
reduced false-hot subpage classifications from ~50% to <5%. We plan to make 
these parameters sysfs-configurable.
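
The two-pass filter and the hot_threshold check can be sketched as below.
This is illustrative Python, not the kernel implementation.  It assumes one
counter per 4KB subpage, uses the 1/10-of-peak and 30% defaults from this
thread, and returns None when there are no samples so that a caller could
fall back to another signal.  All function names are hypothetical.

```python
def hot_bitmap(counts, peak_divisor=10):
    """Mark subpages whose count is at least 1/peak_divisor of the peak."""
    peak = max(counts, default=0)  # pass 1: find the peak access count
    if peak == 0:
        return [False] * len(counts)
    # Pass 2: integer comparison avoids rounding; zero counts are never hot.
    return [c * peak_divisor >= peak for c in counts]


def hot_fraction_percent(counts):
    """Percentage of subpages marked hot by the two-pass filter."""
    if not counts:
        return 0.0
    bitmap = hot_bitmap(counts)
    return 100.0 * sum(bitmap) / len(bitmap)


def eligible_for_split(counts, hot_threshold_percent=30):
    """True if the hot fraction is below the threshold, None if no samples."""
    if max(counts, default=0) == 0:
        return None
    return hot_fraction_percent(counts) < hot_threshold_percent
```

With a synthetic 512-entry histogram where three subpages carry almost all
samples and one subpage has a few stray hits, the stray subpage falls below
1/10 of the peak and is not counted as hot, so the folio stays eligible for
splitting.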

> > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing
> >   hardware-MMU characteristic, not introduced by this series.  Setting
> >   nr_accesses/min=0 serves as an effective workaround for the split path.
>
> I don't fully understand this, too.  Could you please elaborate and
> enlighten me?

The blind spot creates an operational deadlock for the split infrastructure:
  1. WSS < TLB reach -> All THP entries stay cached in TLB.
  2. DAMON's page-table scan yields `nr_accesses = 0` globally.
  3. A scheme requiring `nr_accesses.min = 1` never fires -> DAMOS_SPLIT is never invoked.
  4. THPs remain unsplit -> WSS remains within TLB reach -> Loop returns to step 1.

Setting `nr_accesses.min = 0` and `nr_accesses.max = 0` breaks this deadlock. It forces 
DAMON to evaluate these seemingly "dead/cold" regions. Once the split handler 
is invoked, it checks the ARM SPE telemetry (which captures data directly from the 
instruction pipeline, completely bypassing the MMU page-table AF limitation). 
If SPE reveals a sparse access heatmap, the split is executed. Once shattered into 
mTHP/base pages, the TLB reach drops, natural TLB misses resume, and DAMON's 
standard page-table tracking fully recovers.
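
The deadlock and the workaround can be sketched as a simple predicate.  This
is illustrative Python under stated assumptions, not DAMON code: a scheme
fires when the region's nr_accesses lies within [min, max], and the split
then depends on the SPE-derived hot percentage.  The function names are
hypothetical.

```python
def scheme_matches(nr_accesses, min_accesses, max_accesses):
    """A scheme applies only to regions whose nr_accesses is in [min, max]."""
    return min_accesses <= nr_accesses <= max_accesses


def split_decision(nr_accesses, min_accesses, max_accesses,
                   spe_hot_percent, hot_threshold_percent=30):
    """Return True if the region would be split in this toy model."""
    if not scheme_matches(nr_accesses, min_accesses, max_accesses):
        return False  # the split handler is never invoked
    # The handler consults SPE data, which does not depend on the AF bit.
    return spe_hot_percent < hot_threshold_percent
```

For a blind region with nr_accesses = 0, a scheme with min = 1 never reaches
the SPE check, while min = 0 and max = 0 lets the SPE hot percentage decide.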

Thanks again for your guidance. The action items for v2 are locked in:
  1. Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT.
  2. Drop debugfs in favor of DAMOS address filters / control plane.
  3. Correct x86 AF behavior statements in the cover letter.
  4. Coordinate with Asier on split/collapse unification.
  5. Implement back-off to prevent khugepaged ping-pong under Option (b).

Best regards,
Wang Lian


* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-19  3:40   ` Wang Lian
@ 2026-06-19 14:31     ` Gutierrez Asier
  2026-06-20 20:39       ` SeongJae Park
  0 siblings, 1 reply; 16+ messages in thread
From: Gutierrez Asier @ 2026-06-19 14:31 UTC (permalink / raw)
  To: Wang Lian, sj
  Cc: akpm, daichaobing, kunwu.chan, linux-kernel, linux-mm, npache



On 6/19/2026 6:40 AM, Wang Lian wrote:
> Hi SeongJae,
> 
> Thank you for the thorough and thoughtful review.  Your feedback on the
> x86 AF behavior was an important correction -- I'll address that and
> your other questions below.
> 
> On Thu, 18 Jun 2026 SeongJae Park <sj@kernel.org> wrote:
> 
>> This makes sense to me.  I also agree this could have caused the reported
>> problem.  And this is a known limitation of DAMON.  My suggestion for
>> straightforward workaround of this problem is, using 'age' information
>> of DAMON for better identification of the hot memory.
> 
> Thank you for pointing out idle time percentiles [1].  We agree that 'age'
> helps differentiate frequently-accessed from occasionally-accessed regions,
> and it is a good workaround for many cases.
> 
> However, age operates at region granularity, which is still at or above
> PMD level for THP-mapped memory.  When only a few 4KB subpages within a
> 2MB THP are hot, age tells us the region has been accessed recently, but
> not which subpages are hot.  The split decision needs sub-PMD information,
> which is what the SPE heatmap provides.
> 
> That said, combining age with split could be valuable: split only regions
> that have been consistently hot (high age) AND have sparse sub-page access
> patterns.  We will explore this.
> 
>>> On ARM64, this is compounded by the hardware AF mechanism -- the AF
>>> is only set on a TLB miss.
>>
>> This makes sense to me.  However, I don't get how this is contributing
>> to the problem.  Could you please elaborate?
> 
> The AF-on-TLB-miss behavior creates a second-order problem that directly
> exacerbates the overestimation. 
> 
> When DAMON's mkold path clears the PMD AF, it deliberately skips the TLB 
> flush to minimize overhead. If the dense working set fits entirely within 
> the L2 TLB (e.g., 16MB workload using 8 PMD entries on Kunpeng 920's 2048-entry 
> L2 TLB), subsequent hardware accesses hit the valid, stale TLB entries 
> directly. The hardware MMU never generates a page table walk, so the 
> in-memory PMD AF stays 0. 
> 
> Consequently, DAMON sees `nr_accesses = 0` and assumes the region is completely 
> cold, making it impossible to naturally track the sub-page usage shifts. When 
> sporadic/noise accesses later hit other parts of this "seemingly cold" PMD 
> and trigger an isolated TLB refilling, DAMON abruptly sees the whole 2MB 
> as hot. This binary oscillation (completely blind vs. fully hot) is what 
> drives the massive overestimation under THP.
> 
> We confirmed this TLB-reach aspect empirically via our T1 test:
>   16MB THP (8 PMDs, 0.4% of L2 TLB reach) -> DAMON tracks 0 accesses (blind)
>   16GB THP (8192 PMDs, 400% of L2 TLB reach) -> DAMON tracks normally due to natural eviction
> 
>>> x86 is not subject to this specific blindness under similar
>>> conditions.
>>
>> To my understanding on x86, same issue exists.  If TLB hits, Accessed
>> bit is not set, and DAMON shows it as unaccessed.  Am I missing
>> something?
> 
> You are entirely right, and I was wrong on this point. I re-checked the 
> kernel source and verified that x86's ptep_test_and_clear_young() does NOT 
> flush the TLB. Even ptep_clear_flush_young() on x86 deliberately skips the 
> flush as a performance optimization (arch/x86/mm/pgtable.c:486-502). The 
> same optimization exists on PowerPC and RISC-V.
> 
> Therefore, both architectures are theoretically vulnerable to this stale-TLB 
> blind spot under identical tightly-fit workloads. Our initial assumption 
> was biased because T1 was only conducted on ARM64. We will reproduce the 
> T1 setup on x86 to verify the exact behavior, and I will correct this 
> claim in the v2 cover letter. Thank you for catching this mistake.
> 
>> Nice!  Asier was planning to do similar work in future.  I think you
>> could collaborate to reduce unnecessary duplicates!
> 
> Great to hear! We would be happy to collaborate with Asier. I'll reach
> out to him to coordinate our efforts.

Sure, I will be happy to cooperate.

>> I'd suggest making the name simpler and consistent to DAMOS_COLLAPSE,
>> though.  Say, DAMOS_SPLIT ?
> 
> Agreed. DAMOS_SPLIT is cleaner and fits the existing naming convention 
> perfectly. Will rename in v2.
> 
>> So you implemented a debugfs interface?  That must be a nice approach
>> for PoC.  But it may be difficult to be upstreamed as is.
>>
>> You could build a control plane that decides the exact address ranges
>> to split, and directly feed it to DAMOS using DAMOS address filter.
> 
> The native perf event approach [3] aligns perfectly with our long-term 
> Phase 2c plan, and we are highly interested in collaborating on it to 
> eliminate the userspace daemon and debugfs bridge entirely.
> 
> However, since native kernel-side SPE handling is a long-term item, we 
> will follow your pragmatic alternative suggestion for v2: use DAMOS address 
> filters or user_input quota goals [2] to feed the split decisions from 
> userspace cleanly. This allows us to upstream the core infrastructure 
> (mTHP target_order for collapse and the new DAMOS_SPLIT action) first.
> 
>> Do you really need to run khugepaged together, when you already have
>> DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?
> 
> Excellent point. Running both concurrently on the same VMA introduces 
> redundancy and heavy ping-pong effects. 
> 
> Option (b) is definitely cleaner: we will let DAMON handle both split and 
> re-collapse decisions using its own access data. To make this robust in 
> production environments where khugepaged is globally enabled, we will 
> explore having the DAMOS_SPLIT path temporarily mark the target ranges 
> (e.g., via a pseudo-VM_NOHUGEPAGE backing off mechanism) to prevent 
> khugepaged from immediately undoing DAMON's work.
> 
>>> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are
>>>   empirical defaults subject to further tuning.
>>
>> I don't fully understand this part.  Could you please elaborate?
> 
> Since ARM SPE samples hardware accesses instruction-by-instruction, the raw 
> data is highly statistical and noisy. 
> 
> The TTL (30s) defines the lifecycle of our per-folio rbtree tracking entries. 
> Entries not updated within 30 seconds are pruned to prevent stale tracking data 
> from corrupting split decisions after a workload phase change. 30s is selected 
> to comfortably outlive DAMON's aggregation intervals while keeping the rbtree 
> memory footprint tightly bounded.
> 
> The signal threshold (1/10 of peak) filters out the statistical sampling noise. 
> Instead of treating any subpage with access > 0 as hot, the algorithm finds the 
> peak access count inside the 2MB region and only marks sub-chunks with >= 1/10 
> of that peak as genuinely hot. On Kunpeng 920, this specific threshold successfully 
> reduced false-hot subpage classifications from ~50% to <5%. We plan to make 
> these parameters sysfs-configurable.
> 
>>> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing
>>>   hardware-MMU characteristic, not introduced by this series.  Setting
>>>   nr_accesses/min=0 serves as an effective workaround for the split path.
>>
>> I don't fully understand this, too.  Could you please elaborate and
>> enlighten me?
> 
> The blind spot creates an operational deadlock for the split infrastructure:
>   1. WSS < TLB reach -> All THP entries stay cached in TLB.
>   2. DAMON's page-table scan yields `nr_accesses = 0` globally.
>   3. A scheme requiring `nr_accesses.min = 1` never fires -> DAMOS_SPLIT is never invoked.
>   4. THPs remain unsplit -> WSS remains within TLB reach -> Loop returns to step 1.
> 
> Setting `nr_accesses.min = 0` and `max = 0` breaks this deadlock. It forces 
> DAMON to evaluate these seemingly "dead/cold" regions. Once the split handler 
> is invoked, it checks the ARM SPE telemetry (which captures data directly from the 
> instruction pipeline, completely bypassing the MMU page-table AF limitation). 
> If SPE reveals a sparse access heatmap, the split is executed. Once shattered into 
> mTHP/base pages, the TLB reach drops, natural TLB misses resume, and DAMON's 
> standard page-table tracking fully recovers.
> 
> 
> Thanks again for your guidance. The action items for v2 are locked in:
>   1. Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT.
>   2. Drop debugfs in favor of DAMOS address filters / control plane.
>   3. Correct x86 AF behavior statements in the cover letter.
>   4. Coordinate with Asier on split/collapse unification.
>   5. Implement back-off to prevent khugepaged ping-pong under Option (b).
> 
> Best regards,
> Wang Lian

-- 
Asier Gutierrez
Huawei




* Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
  2026-06-19 14:31     ` Gutierrez Asier
@ 2026-06-20 20:39       ` SeongJae Park
  0 siblings, 0 replies; 16+ messages in thread
From: SeongJae Park @ 2026-06-20 20:39 UTC (permalink / raw)
  To: Gutierrez Asier
  Cc: SeongJae Park, Wang Lian, akpm, daichaobing, kunwu.chan,
	linux-kernel, linux-mm, npache, damon

+ damon@lists.linux.dev

On Fri, 19 Jun 2026 17:31:33 +0300 Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:

> 
> 
> On 6/19/2026 6:40 AM, Wang Lian wrote:
> > Hi SeongJae,
> > 
> > Thank you for the thorough and thoughtful review.  Your feedback on the
> > x86 AF behavior was an important correction -- I'll address that and
> > your other questions below.
> > 
> > On Thu, 18 Jun 2026 SeongJae Park <sj@kernel.org> wrote:
> > 
> >> This makes sense to me.  I also agree this could have caused the reported
> >> problem.  And this is a known limitation of DAMON.  My suggestion for
> >> straightforward workaround of this problem is, using 'age' information
> >> of DAMON for better identification of the hot memory.
> > 
> > Thank you for pointing out idle time percentiles [1].  We agree that 'age'
> > helps differentiate frequently-accessed from occasionally-accessed regions,
> > and it is a good workaround for many cases.
> > 
> > However, age operates at region granularity, which is still at or above
> > PMD level for THP-mapped memory.  When only a few 4KB subpages within a
> > 2MB THP are hot, age tells us the region has been accessed recently, but
> > not which subpages are hot.  The split decision needs sub-PMD information,
> > which is what the SPE heatmap provides.
> > 
> > That said, combining age with split could be valuable: split only regions
> > that have been consistently hot (high age) AND have sparse sub-page access
> > patterns.  We will explore this.

Yes, I agree.  Using features like SPE in addition to 'age' will make it much
better.

> > 
> >>> On ARM64, this is compounded by the hardware AF mechanism -- the AF
> >>> is only set on a TLB miss.
> >>
> >> This makes sense to me.  However, I don't get how this is contributing
> >> to the problem.  Could you please elaborate?
> > 
> > The AF-on-TLB-miss behavior creates a second-order problem that directly
> > exacerbates the overestimation. 
> > 
> > When DAMON's mkold path clears the PMD AF, it deliberately skips the TLB 
> > flush to minimize overhead. If the dense working set fits entirely within 
> > the L2 TLB (e.g., 16MB workload using 8 PMD entries on Kunpeng 920's 2048-entry 
> > L2 TLB), subsequent hardware accesses hit the valid, stale TLB entries 
> > directly. The hardware MMU never generates a page table walk, so the 
> > in-memory PMD AF stays 0. 

Yes, that all makes sense.  But let's not call it a "stale" TLB entry.  The TLB
is only for translation, and the entry is doing its job.  We skip the TLB flush
on purpose.  Nothing is stale here.

> > 
> > Consequently, DAMON sees `nr_accesses = 0` and assumes the region is completely 
> > cold, making it impossible to naturally track the sub-page usage shifts. When 
> > sporadic/noise accesses later hit other parts of this "seemingly cold" PMD 
> > and trigger an isolated TLB refilling, DAMON abruptly sees the whole 2MB 
> > as hot. This binary oscillation (completely blind vs. fully hot) is what 
> > drives the massive overestimation under THP.
> > 
> > We confirmed this TLB-reach aspect empirically via our T1 test:
> >   16MB THP (8 PMDs, 0.4% of L2 TLB reach) -> DAMON tracks 0 accesses (blind)
> >   16GB THP (8192 PMDs, 400% of L2 TLB reach) -> DAMON tracks normally due to natural eviction
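
For reference, the percentages quoted above follow from the L2 TLB reach with
2MB entries.  A minimal sketch of the arithmetic, assuming each of the 2048 L2
TLB entries can hold one PMD-level (2MB) mapping; the function name is only
illustrative and not from the series:

```python
MIB = 1 << 20
GIB = 1 << 30

PMD_SIZE = 2 * MIB        # one THP mapped by a single PMD entry
L2_TLB_ENTRIES = 2048     # Kunpeng 920 L2 TLB size quoted in the thread


def tlb_reach_usage(wss_bytes):
    """Return (PMD entries needed, percent of L2 TLB reach used)."""
    nr_pmds = wss_bytes // PMD_SIZE
    reach = L2_TLB_ENTRIES * PMD_SIZE  # 4 GiB with 2MB entries (assumption)
    return nr_pmds, 100.0 * wss_bytes / reach


for wss in (16 * MIB, 16 * GIB):
    pmds, pct = tlb_reach_usage(wss)
    print(f"{wss // MIB} MiB -> {pmds} PMDs, {pct:.1f}% of L2 TLB reach")
```

This reproduces the 8 PMDs / ~0.4% and 8192 PMDs / 400% figures of the T1 test.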

So, this is a different problem from the previously mentioned one (showing more
hot memory), correct?  Has it been reported from a real production system?

DAMON does not flush the TLB, on the assumption that real production systems
have working sets large enough to evict TLB entries naturally.  If there are
real production systems where this assumption doesn't hold, we may need to
think about this again.

Anyway, it would be better for the cover letter to make this point clear.

> > 
> >>> x86 is not subject to this specific blindness under similar
> >>> conditions.
> >>
> >> To my understanding on x86, same issue exists.  If TLB hits, Accessed
> >> bit is not set, and DAMON shows it as unaccessed.  Am I missing
> >> something?
> > 
> > You are entirely right, and I was wrong on this point. I re-checked the 
> > kernel source and verified that x86's ptep_test_and_clear_young() does NOT 
> > flush the TLB. Even ptep_clear_flush_young() on x86 deliberately skips the 
> > flush as a performance optimization (arch/x86/mm/pgtable.c:486-502). The 
> > same optimization exists on PowerPC and RISC-V.
> > 
> > Therefore, both architectures are theoretically vulnerable to this stale-TLB 
> > blind spot under identical tightly-fit workloads. Our initial assumption 
> > was biased because T1 was only conducted on ARM64. We will reproduce the 
> > T1 setup on x86 to verify the exact behavior, and I will correct this 
> > claim in the v2 cover letter. Thank you for catching this mistake.
> > 
> >> Nice!  Asier was planning to do similar work in future.  I think you
> >> could collaborate to reduce unnecessary duplicates!
> > 
> > Great to hear! We would be happy to collaborate with Asier. I'll reach
> > out to him to coordinate our efforts.
> Sure, I will be happy to cooperate.

Great, looking forward to your collaboration!

> >> I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE,
> >> though.  Say, DAMOS_SPLIT ?
> > 
> > Agreed. DAMOS_SPLIT is cleaner and fits the existing naming convention 
> > perfectly. Will rename in v2.
> > 
> >> So you implemented a debugfs interface?  That must be a nice approach
> >> for PoC.  But it may be difficult to be upstreamed as is.
> >>
> >> You could build a control plane that decides the exact address ranges
> >> to split, and directly feed it to DAMOS using DAMOS address filter.
> > 
> > The native perf event approach [3] aligns perfectly with our long-term 
> > Phase 2c plan, and we are highly interested in collaborating on it to 
> > eliminate the userspace daemon and debugfs bridge entirely.
> > 
> > However, since native kernel-side SPE handling is a long-term item, we 
> > will follow your pragmatic alternative suggestion for v2: use DAMOS address 
> > filters or user_input quota goals [2] to feed the split decisions from 
> > userspace cleanly. This allows us to upstream the core infrastructure 
> > (mTHP target_order for collapse and the new DAMOS_SPLIT action) first.
> > 
> >> Do you really need to run khugepaged together, when you already have
> >> DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?
> > 
> > Excellent point. Running both concurrently on the same VMA introduces 
> > redundancy and heavy ping-pong effects. 
> > 
> > Option (b) is definitely cleaner: we will let DAMON handle both split and 
> > re-collapse decisions using its own access data. To make this robust in 
> > production environments where khugepaged is globally enabled, we will 
> > explore having the DAMOS_SPLIT path temporarily mark the target ranges 
> > (e.g., via a pseudo-VM_NOHUGEPAGE backing off mechanism) to prevent 
> > khugepaged from immediately undoing DAMON's work.

Can't you simply turn off khugepaged?

> > 
> >>> - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are
> >>>   empirical defaults subject to further tuning.
> >>
> >> I don't fully understand this part.  Could you please elaborate?
> > 
> > Since ARM SPE samples hardware accesses instruction-by-instruction, the raw 
> > data is highly statistical and noisy. 
> > 
> > The TTL (30s) defines the lifecycle of our per-folio rbtree tracking entries. 
> > Entries not updated within 30 seconds are pruned to prevent stale tracking data 
> > from corrupting split decisions after a workload phase change. 30s is selected 
> > to comfortably outlive DAMON's aggregation intervals while keeping the rbtree 
> > memory footprint tightly bounded.
> > 
> > The signal threshold (1/10 of peak) filters out the statistical sampling noise. 
> > Instead of treating any subpage with access > 0 as hot, the algorithm finds the 
> > peak access count inside the 2MB region and only marks sub-chunks with >= 1/10 
> > of that peak as genuinely hot. On Kunpeng 920, this specific threshold successfully 
> > reduced false-hot subpage classifications from ~50% to <5%. We plan to make 
> > these parameters sysfs-configurable.
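
To check my understanding of the two parameters, here is a rough userspace
sketch of the pruning and the thresholding as described above.  It is not the
code from the series: a plain dict stands in for the per-folio rbtree, and all
names (prune_stale, hot_chunks, the constants) are hypothetical.

```python
import time

TTL_SECONDS = 30          # rbtree entry lifetime described in the thread
THRESHOLD_DIVISOR = 10    # "1/10 of peak" signal threshold


def prune_stale(entries, now=None):
    """Drop tracking entries not updated within TTL_SECONDS.

    entries maps a folio key to (last_update_seconds, per_chunk_counts).
    """
    if now is None:
        now = time.monotonic()
    return {key: val for key, val in entries.items()
            if now - val[0] <= TTL_SECONDS}


def hot_chunks(counts):
    """Return indices of sub-chunks whose count is >= 1/10 of the peak."""
    peak = max(counts, default=0)
    if peak == 0:
        return []  # no samples at all: nothing is hot
    # Integer form of "count >= peak / 10" avoids float rounding.
    return [i for i, c in enumerate(counts)
            if c * THRESHOLD_DIVISOR >= peak]
```

With per-chunk sample counts of [100, 3, 0, 12, 9, 1], the peak is 100, so only
the chunks at index 0 and 3 pass the threshold; the chunks with 3, 9 and 1
samples are treated as sampling noise.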
> > 
> >>> - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing
> >>>   hardware-MMU characteristic, not introduced by this series.  Setting
> >>>   nr_accesses/min=0 serves as an effective workaround for the split path.
> >>
> >> I don't fully understand this, too.  Could you please elaborate and
> >> enlighten me?
> > 
> > The blind spot creates an operational deadlock for the split infrastructure:
> >   1. WSS < TLB reach -> All THP entries stay cached in TLB.

But, is this really common on real production systems?

> >   2. DAMON's page-table scan yields `nr_accesses = 0` globally.
> >   3. A scheme requiring `nr_accesses.min = 1` never fires -> DAMOS_SPLIT is never invoked.

Why would you run the DAMOS_SPLIT action for regions having 1 or higher
nr_accesses?  Having nr_accesses of 1 or higher means the region was accessed,
and I assume you want to split THPs that are cold.  Shouldn't it rather target
regions having only 0 nr_accesses, as that means they are cold?

> >   4. THPs remain unsplit -> WSS remains within TLB reach -> Loop returns to step 1.
> > 
> > Setting `nr_accesses.min = 0` and `max = 0` breaks this deadlock.

Ah, yes, now I got it.  And this seems to be the right approach to me.  FYI,
min_nr_accesses and max_nr_accesses are the terms we usually use.

> > It forces 
> > DAMON to evaluate these seemingly "dead/cold" regions. Once the split handler 
> > is invoked, it checks the ARM SPE telemetry (which captures data directly from the 
> > instruction pipeline, completely bypassing the MMU page-table AF limitation). 
> > If SPE reveals a sparse access heatmap, the split is executed. Once shattered into 
> > mTHP/base pages, the TLB reach drops, natural TLB misses resume, and DAMON's 
> > standard page-table tracking fully recovers.
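
The selection logic discussed here can be sketched as follows.  This is a
simplified model, not DAMON code: a real scheme's access pattern also has size
and age ranges, and should_split() plus the spe_says_sparse flag are
hypothetical stand-ins for the split handler's SPE check.

```python
from dataclasses import dataclass


@dataclass
class AccessPattern:
    min_nr_accesses: int
    max_nr_accesses: int

    def matches(self, nr_accesses):
        """A region is selected only if nr_accesses is inside the range."""
        return self.min_nr_accesses <= nr_accesses <= self.max_nr_accesses


def should_split(pattern, nr_accesses, spe_says_sparse):
    """Split only if the scheme selects the region and SPE shows sparse use."""
    return pattern.matches(nr_accesses) and spe_says_sparse


blind_region_nr_accesses = 0          # what DAMON sees when WSS < TLB reach

hot_only = AccessPattern(min_nr_accesses=1, max_nr_accesses=20)
cold_only = AccessPattern(min_nr_accesses=0, max_nr_accesses=0)

print(should_split(hot_only, blind_region_nr_accesses, True))
print(should_split(cold_only, blind_region_nr_accesses, True))
```

The first scheme never selects the blind region, so the split handler is never
reached; the second one selects it and leaves the final decision to the SPE
data.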
> > 
> > 
> > Thanks again for your guidance. The action items for v2 are locked in:
> >   1. Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT.
> >   2. Drop debugfs in favor of DAMOS address filters / control plane.
> >   3. Correct x86 AF behavior statements in the cover letter.
> >   4. Coordinate with Asier on split/collapse unification.

Sounds good.  Looking forward to the next version!  Also, if the blind spot is
a problem reported from real production systems, please make that clear.

> >   5. Implement back-off to prevent khugepaged ping-pong under Option (b).

Again, why can't you just turn khugepaged off?


Thanks,
SJ

[...]


Thread overview: 16+ messages
2026-06-18  9:48 [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Wang Lian
2026-06-18  9:48 ` [RFC PATCH 1/6] mm/damon: add target_order field for DAMOS_COLLAPSE Wang Lian
2026-06-18  9:48 ` [RFC PATCH 2/6] mm/khugepaged: add damon_collapse_folio_range() for external callers Wang Lian
2026-06-18  9:48 ` [RFC PATCH 3/6] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Wang Lian
2026-06-18  9:48 ` [RFC PATCH 4/6] mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold Wang Lian
2026-06-18  9:48 ` [RFC PATCH 5/6] mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler Wang Lian
2026-06-18  9:48 ` [RFC PATCH 6/6] mm/damon: add SPE feedback for sub-THP split decisions Wang Lian
2026-06-18 11:03 ` [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback Gutierrez Asier
2026-06-18 13:13   ` wang lian
2026-06-19  1:52     ` SeongJae Park
2026-06-19  1:47 ` SeongJae Park
2026-06-19  1:54   ` SeongJae Park
2026-06-19  1:59     ` SeongJae Park
2026-06-19  3:40   ` Wang Lian
2026-06-19 14:31     ` Gutierrez Asier
2026-06-20 20:39       ` SeongJae Park
