[RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions

DAMON development mailing list
 help / color / mirror / Atom feed

* [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
@ 2026-07-02  9:46 Lian Wang
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
be RFC PATCH v2).

This resend also includes fixes for issues identified during review of
the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
races, BUILD_BUG guards, missing sysfs action name registration, and
stack allocation overflow.  The series has been re-tested on aarch64
(anonymous and file-backed THP split) and is checkpatch clean.

v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/

Changes since v1

 - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
   the existing actions (per SJ's review).
 - Drop the per-scheme hot_threshold field.  Hotness policy does not
   belong in the kernel; target selection now lives in user space and
   is expressed to DAMOS via the address filter (per SJ's review).
 - Drop the v1 SPE debugfs patch entirely.  debugfs is not the right
   interface for a feature, and the SPE profiler belongs in user space
   (see "User-space target selection" below).  v2 is kernel mechanism
   only: 5 patches.
 - Decouple T1 (a lab observation) from T2 (the production issue), and
   correct the architecture claim: ptep_test_and_clear_young() skips
   the TLB flush on both x86_64 and arm64, so the blind spot is
   architecture-independent rather than arm64-only.
 - Terminology: avoid "stale TLB".  A valid TLB entry is doing its
   job; the point is only that it lets the CPU satisfy a translation
   without a page-table walk, so the Accessed bit cleared by DAMON is
   not re-set.

Background

Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
in play.  Both are described here as motivation only; this series does
not change the AF monitoring path.

T2 -- PMD-granularity inflation (production issue)

A 2MB THP is tracked by a single PMD-level Accessed bit.  One access
to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
the entire THP as hot and cannot distinguish a genuinely hot 2MB
region from a 2MB region with a single hot 4KB page.  Cold memory
hides inside "hot" THPs, and access-driven pageout/migration becomes
coarse.

This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
hosts running Oracle.  ARM SPE sampling of that workload shows 94.6%
of THPs have fewer than 10% of their sub-pages actually accessed.

T1 -- TLB-reach blind spot (lab observation)

When the working set fits within L2 TLB reach (measured at 2048
entries x 2MB = 4GB on Kunpeng 920; no public data available), the
CPU satisfies translations entirely from the TLB,
preventing translation table walks.  Because
ptep_test_and_clear_young() does not flush
the TLB, valid TLB entries continue to satisfy translations and the
AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0 for
memory that is in fact hot, and no scheme triggers.  This reproduces
in the lab with small workloads; it is not something we have seen
reported from production, where working sets exceed TLB reach.

What this series adds

Rather than change AF monitoring, this series adds two order-aware
DAMOS actions so a policy layer can act at mTHP granularity:

 - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios
   up to a chosen mTHP order.  Patch 1 adds the target_order field and
   its sysfs file; patch 2 exports a khugepaged helper
   (damon_collapse_folio_range()); patch 3 wires the vaddr handler.

 - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
   to a chosen mTHP order via split_folio_to_order(), for both
   anonymous and file-backed (tmpfs/shmem) folios.

The two are complementary, not competing:

   THP=never  + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
   THP=always + DAMOS_SPLIT:    start at 2MB, shrink cold regions down.

This dual-path design aligns with ideas discussed with Asier
Gutierrez; we plan to unify our mTHP automation and evaluation
roadmaps under this standard DAMOS_SPLIT action.

A deployment can pick either baseline, or run both, and let DAMOS
manage the placement.  THP is still wanted for the hot working set
(fewer TLB misses, shallower walks); the goal is not "no THP" but
"THP where it is hot, small pages where it is cold."

User-space target selection

The decision of *which* regions to collapse or split is left to user
space and fed to DAMOS through the existing DAMOS address filter
(DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
The kernel provides the mechanism; user space provides the policy,
consistent with the perf/BPF "kernel samples, user space decides"
model and with the DAMON-X direction.

Because the AF signal is unreliable at PMD granularity (T1/T2), the
scheme is run with min_nr_accesses=0 so it does not gate on access
count, and the address filter selects targets.  min_nr_accesses=0 is
also what unblocks the T1 case, where nr_accesses is pinned at 0.

Why not just turn khugepaged off?  You can, but khugepaged is global
and usually left enabled because other workloads rely on it; it cannot
be disabled per region.  DAMOS_COLLAPSE gives per-region,
access-pattern-driven collapse -- a more precise, targeted complement
to khugepaged's global scan, not a replacement for it.  To handle the
runtime race where khugepaged might aggressively re-collapse what
DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
or back-off mechanism to prevent ping-pong effects in mixed
environments.

Two user-space data sources produce the candidate address ranges:

 1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
    histogram -> PA->VA via /proc/<pid>/pagemap -> sparse-THP VA
    ranges.  SPE reads physical addresses from the CPU pipeline,
    bypassing the TLB and page tables, so it is immune to T1 and T2.

 2. smaps fallback (no SPE): scan /proc/<pid>/smaps for THP-backed
    VMAs and treat the 2MB-aligned ranges as split candidates.

The SPE profiler stays in user space deliberately: the SPE PMU is a
single-consumer resource, so a kernel consumer would lock out
user-space perf and tooling (x86 PEBS / AMD IBS have the same
property).  Keeping it in user space avoids that and keeps the metric
source pluggable, in line with DAMON-X.  This is why v2 drops the v1
SPE debugfs patch.

Testing

Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
single DAMOS address filter selecting one 2MB-aligned range:

 - Anonymous THP: the filter splits exactly that one THP --
   sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
   256MB mapping untouched.
 - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
   splits exactly one 2MB shmem THP -- sz_applied=2MB and
   ShmemPmdMapped drops by 2MB.  This confirms split_folio_to_order()
   works for shmem folios (the KVM-guest-on-THP-tmpfs case).
 - The address filter is what bounds the action: sz_tried covers the
   whole ~2GB monitored region while sz_applied is exactly the 2MB the
   filter selected.
 - A smaps-based path (for hosts without SPE) enumerates THP-backed
   ranges and splits all THP in the target workload.
 - checkpatch clean on all 5 patches.

Test scripts and SPE-to-DAMON pipeline tools:
https://github.com/lianux-mm/damon_spe/tree/v2

Lian Wang (5):
  mm/damon: add target_order field for DAMOS_COLLAPSE
  mm/khugepaged: add damon_collapse_folio_range() for external callers
  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  mm/damon: introduce DAMOS_SPLIT action
  mm/damon/vaddr: implement DAMOS_SPLIT handler

 include/linux/damon.h      |  10 +++
 include/linux/khugepaged.h |   9 +++
 mm/damon/core.c            |   2 +
 mm/damon/sysfs-schemes.c   |  77 ++++++++++++++++++++++
 mm/damon/vaddr.c           | 128 +++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c            |  46 +++++++++++++
 6 files changed, 272 insertions(+)

--
2.50.1 (Apple Git-155)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
@ 2026-07-02  9:46 ` Lian Wang
  2026-07-02 10:01   ` sashiko-bot
  2026-07-02 18:51   ` SJ Park
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

DAMOS_COLLAPSE currently collapses into PMD-size THP only.  Add a
target_order field to express per-order mTHP collapse intent.  Zero
means system default (PMD order, same as current behavior).  Valid
values are 0 and 2..HPAGE_PMD_ORDER.

Wire up the sysfs interface: a per-scheme rw file "target_order".
Validate at store time that the value is in range, and warn at scheme
creation time if DAMOS_COLLAPSE is used with an unsupported non-PMD
order, resetting to 0.

The actual mTHP application via the khugepaged wrapper will be added
in subsequent patches.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 include/linux/damon.h    |  5 ++++
 mm/damon/core.c          |  2 ++
 mm/damon/sysfs-schemes.c | 61 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 6f7edb3590ef..5a0587556573 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -572,6 +572,11 @@ struct damos_migrate_dests {
 struct damos {
 	struct damos_access_pattern pattern;
 	enum damos_action action;
+	/*
+	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
+	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
+	 */
+	unsigned int target_order;
 	unsigned long apply_interval_us;
 /* private: internal use only */
 	/*
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 265d51ade25b..be54defd4646 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -579,6 +579,7 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 
 	scheme->migrate_dests = (struct damos_migrate_dests){};
 	scheme->target_nid = target_nid;
+	scheme->target_order = 0;
 
 	return scheme;
 }
@@ -1278,6 +1279,7 @@ static int damos_commit(struct damos *dst, struct damos *src)
 
 	dst->wmarks = src->wmarks;
 	dst->target_nid = src->target_nid;
+	dst->target_order = src->target_order;
 
 	err = damos_commit_dests(&dst->migrate_dests, &src->migrate_dests);
 	if (err)
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 329cfd0bbe9f..7dcd582ded86 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -6,7 +6,9 @@
  */
 
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/numa.h>
+#include <linux/huge_mm.h>
 
 #include "sysfs-common.h"
 
@@ -2257,6 +2259,7 @@ struct damon_sysfs_scheme {
 	struct damon_sysfs_stats *stats;
 	struct damon_sysfs_scheme_regions *tried_regions;
 	int target_nid;
+	unsigned int target_order;
 	struct damos_sysfs_dests *dests;
 };
 
@@ -2323,6 +2326,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
 	scheme->action = action;
 	scheme->apply_interval_us = apply_interval_us;
 	scheme->target_nid = NUMA_NO_NODE;
+	scheme->target_order = 0;
 	return scheme;
 }
 
@@ -2642,6 +2646,39 @@ static ssize_t target_nid_store(struct kobject *kobj,
 	return err ? err : count;
 }
 
+static ssize_t target_order_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+
+	return sysfs_emit(buf, "%u\n", scheme->target_order);
+}
+
+static ssize_t target_order_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 0, &val);
+	if (err)
+		return err;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (val != 0 && (val < 2 || val > HPAGE_PMD_ORDER))
+		return -EINVAL;
+#else
+	if (val != 0)
+		return -EINVAL;
+#endif
+
+	scheme->target_order = val;
+	return count;
+}
+
 static void damon_sysfs_scheme_release(struct kobject *kobj)
 {
 	kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
@@ -2656,10 +2693,14 @@ static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr =
 static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
 		__ATTR_RW_MODE(target_nid, 0600);
 
+static struct kobj_attribute damon_sysfs_scheme_target_order_attr =
+		__ATTR_RW_MODE(target_order, 0600);
+
 static struct attribute *damon_sysfs_scheme_attrs[] = {
 	&damon_sysfs_scheme_action_attr.attr,
 	&damon_sysfs_scheme_apply_interval_us_attr.attr,
 	&damon_sysfs_scheme_target_nid_attr.attr,
+	&damon_sysfs_scheme_target_order_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(damon_sysfs_scheme);
@@ -2970,6 +3011,7 @@ static struct damos *damon_sysfs_mk_scheme(
 	struct damon_sysfs_weights *sysfs_weights = sysfs_quotas->weights;
 	struct damon_sysfs_watermarks *sysfs_wmarks = sysfs_scheme->watermarks;
 	struct damos *scheme;
+	unsigned int target_order;
 	int err;
 
 	struct damos_access_pattern pattern = {
@@ -3005,6 +3047,25 @@ static struct damos *damon_sysfs_mk_scheme(
 	if (!scheme)
 		return NULL;
 
+	target_order = sysfs_scheme->target_order;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (sysfs_scheme->action == DAMOS_COLLAPSE &&
+	    target_order != 0 &&
+	    target_order != HPAGE_PMD_ORDER) {
+		pr_warn("DAMON collapse: target_order %u not supported, only PMD order (%u) is available. Use 0 or %u.\n",
+			target_order,
+			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
+		target_order = 0;
+	}
+#else
+	if (sysfs_scheme->action == DAMOS_COLLAPSE && target_order != 0) {
+		pr_warn("DAMON collapse: target_order not supported without THP. Use 0.\n");
+		target_order = 0;
+	}
+#endif
+	scheme->target_order = target_order;
+
 	err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota);
 	if (err) {
 		damon_destroy_scheme(scheme);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
@ 2026-07-02  9:46 ` Lian Wang
  2026-07-02 10:13   ` sashiko-bot
  2026-07-02 11:07   ` Lorenzo Stoakes
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

Export a thin wrapper around collapse_huge_page() that allows external
subsystems such as DAMON to trigger THP collapse on a target address
range.

Currently restricted to PMD order (HPAGE_PMD_ORDER), since
collapse_huge_page() does not yet support arbitrary mTHP orders.
The restriction can be relaxed when khugepaged gains mTHP support.

The caller must hold a reference to @mm.  Do not hold mmap lock:
collapse_huge_page() acquires mmap_read_lock for validation, releases
it, then acquires mmap_write_lock for the actual collapse.  Holding
an outer mmap_read_lock would cause a self-deadlock when the same
thread attempts the inner mmap_write_lock.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 include/linux/khugepaged.h |  9 ++++++++
 mm/khugepaged.c            | 46 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index d7a9053ff4fe..f7d49cba712f 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -20,6 +20,9 @@ extern bool current_is_khugepaged(void);
 void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		bool install_pmd);
 
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order);
+
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
@@ -47,6 +50,12 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 {
 }
 
+static inline int damon_collapse_folio_range(struct mm_struct *mm,
+		unsigned long start_addr, unsigned int target_order)
+{
+	return -EINVAL;
+}
+
 static inline void khugepaged_min_free_kbytes_update(void)
 {
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..7fe9ce1e0533 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
 			: madvise_collapse_errno(last_fail);
 }
+
+/**
+ * damon_collapse_folio_range() - Collapse base pages in range into a THP
+ * @mm:         mm_struct of the target process
+ * @start_addr: start address (must be order-aligned)
+ * @target_order: page order of the collapse result (currently only
+ *                HPAGE_PMD_ORDER is supported)
+ *
+ * Thin wrapper around collapse_huge_page() for external callers such as
+ * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
+ * lock: collapse_huge_page() acquires mmap_read_lock for validation,
+ * releases it, then acquires mmap_write_lock for the collapse.  Holding
+ * an outer mmap_read_lock would self-deadlock.
+ *
+ * Return: 0 on success, -EINVAL on bad arguments, negative error from
+ *         madvise_collapse_errno() otherwise.
+ */
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order)
+{
+	struct collapse_control *cc;
+	enum scan_result result;
+
+	if (target_order != HPAGE_PMD_ORDER) {
+		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
+			     __func__, HPAGE_PMD_ORDER, target_order);
+		return -EINVAL;
+	}
+	if (start_addr & ((PAGE_SIZE << target_order) - 1))
+		return -EINVAL;
+
+	cc = kmalloc_obj(*cc);
+	if (!cc)
+		return -ENOMEM;
+	cc->is_khugepaged = false;
+	cc->progress = 0;
+
+	lru_add_drain_all();
+
+	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);
+	kfree(cc);
+	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
+		return 0;
+	return madvise_collapse_errno(result);
+}
+EXPORT_SYMBOL_GPL(damon_collapse_folio_range);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
@ 2026-07-02  9:46 ` Lian Wang
  2026-07-02 10:26   ` sashiko-bot
  2026-07-02 19:56   ` SJ Park
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

When target_order is set (non-zero), the DAMOS_COLLAPSE handler now calls
damon_collapse_folio_range() to collapse pages into the requested mTHP
size, iterating over the target region in order-aligned chunks.  When
target_order is 0 (default), the existing madvise(MADV_COLLAPSE) path is
used, preserving backwards compatibility.

Region boundaries are expanded outward to the covering aligned range
(ALIGN_DOWN start, ALIGN end) so that collapse works even after
kdamond_split_regions reduces region sizes below the chunk size.
collapse_huge_page() internally validates VMA bounds, so expanding
beyond the original region is safe.

No external mmap lock is held: collapse_huge_page() acquires
mmap_read_lock internally for validation, releases it, then acquires
mmap_write_lock for the actual collapse.  Holding an outer
mmap_read_lock would cause a self-deadlock when the same thread
attempts the inner mmap_write_lock.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 mm/damon/vaddr.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index d27147603564..98a87609376b 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -14,6 +14,7 @@
 #include <linux/page_idle.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
+#include <linux/khugepaged.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -899,6 +900,50 @@ static unsigned long damos_va_stat(struct damon_target *target,
 	return 0;
 }
 
+static unsigned long damos_va_collapse(struct damon_target *target,
+		struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	unsigned long addr, end, chunk_sz;
+	unsigned long last_chunk = ULONG_MAX;
+	unsigned int target_order = s->target_order;
+	unsigned long applied = 0;
+	struct mm_struct *mm;
+	int ret;
+
+	if (target_order < 2 || target_order > HPAGE_PMD_ORDER)
+		return 0;
+
+	chunk_sz = PAGE_SIZE << target_order;
+	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
+	end = ALIGN(r->ar.end, chunk_sz);
+	if (end < addr)
+		return 0;
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		return 0;
+
+	while (addr < end) {
+		if (addr + chunk_sz < addr)
+			break;
+		if (addr == last_chunk)
+			goto next;
+		last_chunk = addr;
+
+		ret = damon_collapse_folio_range(mm, addr, target_order);
+		if (!ret)
+			applied += chunk_sz;
+		*sz_filter_passed += chunk_sz;
+next:
+		addr += chunk_sz;
+		cond_resched();
+	}
+
+	mmput(mm);
+	return applied;
+}
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme, unsigned long *sz_filter_passed)
@@ -922,6 +967,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		madv_action = MADV_NOHUGEPAGE;
 		break;
 	case DAMOS_COLLAPSE:
+		if (scheme->target_order)
+			return damos_va_collapse(t, r, scheme,
+						 sz_filter_passed);
 		madv_action = MADV_COLLAPSE;
 		break;
 	case DAMOS_MIGRATE_HOT:


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
                   ` (2 preceding siblings ...)
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
@ 2026-07-02  9:46 ` Lian Wang
  2026-07-02 10:41   ` sashiko-bot
  2026-07-02 20:00   ` SJ Park
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

Add DAMOS_SPLIT to the damos_action enum for splitting large folios
into smaller mTHP-order folios.  Add a target_order field to struct
damos to specify the desired split order.

Expose the action as "split" through the DAMON sysfs interface with
target_order validation (must be 2..HPAGE_PMD_ORDER-1).

Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 include/linux/damon.h    |  9 +++++++--
 mm/damon/sysfs-schemes.c | 16 ++++++++++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5a0587556573..30cf4afb212c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -121,6 +121,7 @@ struct damon_target {
  * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_COLLAPSE:	Call ``madvise()`` for the region with MADV_COLLAPSE.
+ * @DAMOS_SPLIT:	Split large folios to the target mTHP order.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
  * @DAMOS_MIGRATE_HOT:  Migrate the regions prioritizing warmer regions.
@@ -141,6 +142,7 @@ enum damos_action {
 	DAMOS_HUGEPAGE,
 	DAMOS_NOHUGEPAGE,
 	DAMOS_COLLAPSE,
+	DAMOS_SPLIT,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
 	DAMOS_MIGRATE_HOT,
@@ -573,8 +575,11 @@ struct damos {
 	struct damos_access_pattern pattern;
 	enum damos_action action;
 	/*
-	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
-	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
+	 * @target_order: target mTHP order for DAMOS_COLLAPSE and
+	 * DAMOS_SPLIT.  For COLLAPSE, 0 means PMD order default,
+	 * valid values: 0, 2..HPAGE_PMD_ORDER.  For SPLIT,
+	 * valid values: 2..HPAGE_PMD_ORDER-1; 0 and HPAGE_PMD_ORDER
+	 * are rejected at scheme creation time (defaulting to 2).
 	 */
 	unsigned int target_order;
 	unsigned long apply_interval_us;
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 7dcd582ded86..84a4617ca3d3 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -2293,6 +2293,10 @@ static struct damos_sysfs_action_name damos_sysfs_action_names[] = {
 		.action = DAMOS_COLLAPSE,
 		.name = "collapse",
 	},
+	{
+		.action = DAMOS_SPLIT,
+		.name = "split",
+	},
 	{
 		.action = DAMOS_LRU_PRIO,
 		.name = "lru_prio",
@@ -3058,11 +3062,23 @@ static struct damos *damon_sysfs_mk_scheme(
 			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
 		target_order = 0;
 	}
+	if (sysfs_scheme->action == DAMOS_SPLIT &&
+	    (target_order == 0 ||
+	     target_order >= HPAGE_PMD_ORDER)) {
+		pr_warn("DAMON split: target_order %u invalid, need 2..%u. Defaulting to 2.\n",
+			target_order,
+			HPAGE_PMD_ORDER - 1);
+		target_order = 2;
+	}
 #else
 	if (sysfs_scheme->action == DAMOS_COLLAPSE && target_order != 0) {
 		pr_warn("DAMON collapse: target_order not supported without THP. Use 0.\n");
 		target_order = 0;
 	}
+	if (sysfs_scheme->action == DAMOS_SPLIT) {
+		pr_warn("DAMON split: not supported without THP.\n");
+		target_order = 2;
+	}
 #endif
 	scheme->target_order = target_order;
 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
                   ` (3 preceding siblings ...)
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
@ 2026-07-02  9:46 ` Lian Wang
  2026-07-02 10:49   ` sashiko-bot
  2026-07-02 20:18   ` SJ Park
  2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
  2026-07-02 18:35 ` SJ Park
  6 siblings, 2 replies; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

Implement the vaddr operations layer handler for DAMOS_SPLIT.
For each folio in the target region that is larger than the
scheme's target_order, split it via split_folio_to_order().

This supports both anonymous and file-backed (e.g. tmpfs/shmem)
folios, covering KVM guest memory backed by THP tmpfs.

Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 mm/damon/vaddr.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 98a87609376b..a177703f7e0a 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -944,6 +944,83 @@ static unsigned long damos_va_collapse(struct damon_target *target,
 	return applied;
 }
 
+static unsigned long damos_va_split(struct damon_target *target,
+		struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	unsigned long addr, end, chunk_sz;
+	unsigned int target_order = s->target_order;
+	unsigned long applied = 0;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct folio *folio;
+	struct folio_walk fw;
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		return 0;
+
+	chunk_sz = PAGE_SIZE << HPAGE_PMD_ORDER;
+	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
+	end = ALIGN(r->ar.end, chunk_sz);
+	if (end < addr)
+		goto out_mmput;
+
+	while (addr < end) {
+		unsigned long folio_sz;
+
+		if (addr + chunk_sz < addr)
+			break;
+
+		mmap_read_lock(mm);
+		vma = find_vma(mm, addr);
+
+		if (!vma || addr < vma->vm_start ||
+		    vma->vm_flags & (VM_HUGETLB | VM_MIXEDMAP))
+			goto unlock;
+
+		folio = folio_walk_start(&fw, vma, addr, 0);
+		if (!folio)
+			goto unlock;
+
+		folio_sz = PAGE_SIZE << folio_order(folio);
+
+		if (folio_order(folio) > target_order) {
+			if (!folio_trylock(folio)) {
+				folio_walk_end(&fw, vma);
+				goto unlock;
+			}
+			folio_get(folio);
+			folio_walk_end(&fw, vma);
+
+			if (!split_folio_to_order(folio, target_order))
+				applied += folio_sz;
+
+			folio_unlock(folio);
+			folio_put(folio);
+			*sz_filter_passed += folio_sz;
+			addr += folio_sz;
+		} else {
+			folio_walk_end(&fw, vma);
+			*sz_filter_passed += chunk_sz;
+			addr += chunk_sz;
+		}
+		mmap_read_unlock(mm);
+		cond_resched();
+		continue;
+
+unlock:
+		*sz_filter_passed += chunk_sz;
+		addr += chunk_sz;
+		mmap_read_unlock(mm);
+		cond_resched();
+	}
+
+out_mmput:
+	mmput(mm);
+	return applied;
+}
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme, unsigned long *sz_filter_passed)
@@ -977,6 +1054,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		return damos_va_migrate(t, r, scheme, sz_filter_passed);
 	case DAMOS_STAT:
 		return damos_va_stat(t, r, scheme, sz_filter_passed);
+	case DAMOS_SPLIT:
+		return damos_va_split(t, r, scheme,
+					  sz_filter_passed);
 	default:
 		/*
 		 * DAMOS actions that are not yet supported by 'vaddr'.


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02  9:52 Lian Wang
@ 2026-07-02  9:52 ` Lian Wang
  2026-07-02 10:08   ` sashiko-bot
  0 siblings, 1 reply; 24+ messages in thread
From: Lian Wang @ 2026-07-02  9:52 UTC (permalink / raw)
  To: damon, linux-mm
  Cc: linux-kernel, sj, gutierrez.asier, daichaobing, lianux.wang,
	lianux.mm, kunwu.chan

Export a thin wrapper around collapse_huge_page() that allows external
subsystems such as DAMON to trigger THP collapse on a target address
range.

Currently restricted to PMD order (HPAGE_PMD_ORDER), since
collapse_huge_page() does not yet support arbitrary mTHP orders.
The restriction can be relaxed when khugepaged gains mTHP support.

The caller must hold a reference to @mm.  Do not hold mmap lock:
collapse_huge_page() acquires mmap_read_lock for validation, releases
it, then acquires mmap_write_lock for the actual collapse.  Holding
an outer mmap_read_lock would cause a self-deadlock when the same
thread attempts the inner mmap_write_lock.

Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Lian Wang <lianux.mm@gmail.com>
Signed-off-by: Lian Wang <lianux.wang@processmission.com>
---
 include/linux/khugepaged.h |  9 ++++++++
 mm/khugepaged.c            | 46 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index d7a9053ff4fe..f7d49cba712f 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -20,6 +20,9 @@ extern bool current_is_khugepaged(void);
 void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		bool install_pmd);
 
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order);
+
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
@@ -47,6 +50,12 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 {
 }
 
+static inline int damon_collapse_folio_range(struct mm_struct *mm,
+		unsigned long start_addr, unsigned int target_order)
+{
+	return -EINVAL;
+}
+
 static inline void khugepaged_min_free_kbytes_update(void)
 {
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..7fe9ce1e0533 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
 			: madvise_collapse_errno(last_fail);
 }
+
+/**
+ * damon_collapse_folio_range() - Collapse base pages in range into a THP
+ * @mm:         mm_struct of the target process
+ * @start_addr: start address (must be order-aligned)
+ * @target_order: page order of the collapse result (currently only
+ *                HPAGE_PMD_ORDER is supported)
+ *
+ * Thin wrapper around collapse_huge_page() for external callers such as
+ * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
+ * lock: collapse_huge_page() acquires mmap_read_lock for validation,
+ * releases it, then acquires mmap_write_lock for the collapse.  Holding
+ * an outer mmap_read_lock would self-deadlock.
+ *
+ * Return: 0 on success, -EINVAL on bad arguments, negative error from
+ *         madvise_collapse_errno() otherwise.
+ */
+int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
+			       unsigned int target_order)
+{
+	struct collapse_control *cc;
+	enum scan_result result;
+
+	if (target_order != HPAGE_PMD_ORDER) {
+		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
+			     __func__, HPAGE_PMD_ORDER, target_order);
+		return -EINVAL;
+	}
+	if (start_addr & ((PAGE_SIZE << target_order) - 1))
+		return -EINVAL;
+
+	cc = kmalloc_obj(*cc);
+	if (!cc)
+		return -ENOMEM;
+	cc->is_khugepaged = false;
+	cc->progress = 0;
+
+	lru_add_drain_all();
+
+	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);
+	kfree(cc);
+	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
+		return 0;
+	return madvise_collapse_errno(result);
+}
+EXPORT_SYMBOL_GPL(damon_collapse_folio_range);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
@ 2026-07-02 10:01   ` sashiko-bot
  2026-07-02 18:51   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:01 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Double-read of `sysfs_scheme->action` allows a concurrent sysfs write to bypass the `target_order` validation.
--

commit 8f068b9d900dccbf38f0673e711831fdf8cce5dc
Author: Lian Wang <lianux.mm@gmail.com>

mm/damon: add target_order field for DAMOS_COLLAPSE

This commit adds a target_order field to the DAMOS scheme structure to support
per-order mTHP collapse operations. It also exposes this new parameter via
sysfs with basic validation.

> diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> index dbf2b0515d582..9067cb66632ed 100644
> --- a/mm/damon/sysfs-schemes.c
> +++ b/mm/damon/sysfs-schemes.c

[ ... ]

> @@ -3010,6 +3052,25 @@ static struct damos *damon_sysfs_mk_scheme(
>  	if (!scheme)
>  		return NULL;
>  
> +	target_order = sysfs_scheme->target_order;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (sysfs_scheme->action == DAMOS_COLLAPSE &&

[Severity: Medium]
Is there a time-of-check to time-of-use (TOCTOU) race condition here?

The scheme action is read earlier and passed into damon_new_scheme(),
which saves it to the newly allocated scheme:

	scheme = damon_new_scheme(&pattern, sysfs_scheme->action, ...);

Then it is read again here directly from sysfs_scheme->action for the
target_order validation.

If a concurrent sysfs write modifies the action parameter after
damon_new_scheme() but before this validation, could it bypass the check
and allow the scheme to be created with an invalid target_order?

Would it be safer to check the snapshotted scheme->action instead?

> +	    target_order != 0 &&
> +	    target_order != HPAGE_PMD_ORDER) {
> +		pr_warn("DAMON collapse: target_order %u not supported, only PMD order (%u) is available. Use 0 or %u.\n",
> +			target_order,
> +			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
> +		target_order = 0;
> +	}
> +#else
> +	if (sysfs_scheme->action == DAMOS_COLLAPSE && target_order != 0) {
> +		pr_warn("DAMON collapse: target_order not supported without THP. Use 0.\n");
> +		target_order = 0;
> +	}
> +#endif
> +	scheme->target_order = target_order;
> +
>  	err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota);
>  	if (err) {
>  		damon_destroy_scheme(scheme);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702094633.75658-1-lianux.mm@gmail.com?part=1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02  9:52 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
@ 2026-07-02 10:08   ` sashiko-bot
  0 siblings, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:08 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] `damon_collapse_folio_range()` passes an uninitialized `struct collapse_control` to `collapse_huge_page()`, leaving fields like `node_load` and `alloc_nmask` with heap garbage, potentially corrupting NUMA node targeting during THP allocation.
- [High] Bypassing the read-only page table validation scan forces destructive PMD operations on un-collapsible memory ranges, creating a severe performance degradation vector.
- [Medium] `damon_collapse_folio_range()` silently fails to collapse file-backed memory ranges because it routes directly to the anonymous-only `collapse_huge_page()` function.
- [Medium] `damon_collapse_folio_range()` hardcodes the `unmapped` parameter to 0, which prevents the collapse operation from succeeding if there are any swapped-out pages in the PMD range.
--

commit e521d85e0499be054d0c31614f15b39332af3c13
Author: Lian Wang <lianux.mm@gmail.com>

mm/khugepaged: add damon_collapse_folio_range() for external callers

This commit introduces damon_collapse_folio_range(), providing a wrapper
around khugepaged's collapse_huge_page() for external subsystems like DAMON.
It allows targeted THP collapse on specific address ranges but is currently
limited to PMD order.

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 617bca76db49b..7fe9ce1e05335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c

[ ... ]

> @@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }
> +
> +/**
> + * damon_collapse_folio_range() - Collapse base pages in range into a THP
> + * @mm:         mm_struct of the target process
> + * @start_addr: start address (must be order-aligned)
> + * @target_order: page order of the collapse result (currently only
> + *                HPAGE_PMD_ORDER is supported)
> + *
> + * Thin wrapper around collapse_huge_page() for external callers such as
> + * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
> + * lock: collapse_huge_page() acquires mmap_read_lock for validation,
> + * releases it, then acquires mmap_write_lock for the collapse.  Holding
> + * an outer mmap_read_lock would self-deadlock.
> + *
> + * Return: 0 on success, -EINVAL on bad arguments, negative error from
> + *         madvise_collapse_errno() otherwise.
> + */
> +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> +			       unsigned int target_order)
> +{
> +	struct collapse_control *cc;
> +	enum scan_result result;
> +
> +	if (target_order != HPAGE_PMD_ORDER) {
> +		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
> +			     __func__, HPAGE_PMD_ORDER, target_order);
> +		return -EINVAL;
> +	}
> +	if (start_addr & ((PAGE_SIZE << target_order) - 1))
> +		return -EINVAL;
> +
> +	cc = kmalloc_obj(*cc);

[Severity: High]
Does this leave the collapse_control struct uninitialized?

When collapse_huge_page() eventually calls alloc_charge_folio(), it uses
collapse_find_target_node(). This function iterates over cc->node_load and
reads/writes cc->alloc_nmask:

collapse_find_target_node()
   ...
   node_set(nid, cc->alloc_nmask);

Since kmalloc_obj() does not zero memory, wouldn't this pass heap garbage
to the folio allocator and corrupt NUMA node targeting?

> +	if (!cc)
> +		return -ENOMEM;
> +	cc->is_khugepaged = false;
> +	cc->progress = 0;
> +
> +	lru_add_drain_all();
> +
> +	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);

[Severity: Medium]
Does this silently fail for file-backed memory regions?

By calling collapse_huge_page() directly instead of the dispatcher
collapse_single_pmd(), the flow reaches hugepage_vma_revalidate() with
expect_anon hardcoded to true:

hugepage_vma_revalidate()
    if (expect_anon && !vma_is_anonymous(vma))
        return SCAN_PAGE_ANON;

Wouldn't this unconditionally reject shmem/tmpfs mappings targeted by DAMON?

[Severity: Medium]
By passing 0 for the unmapped parameter here, does this prevent the collapse
operation if any pages are swapped out?

Because unmapped is 0, collapse_huge_page() skips
__collapse_huge_page_swapin(). Then, when __collapse_huge_page_isolate()
scans the PTEs, it will encounter swapped pages and fail:

__collapse_huge_page_isolate()
    if (!pte_present(pteval)) {
        result = SCAN_PTE_NON_PRESENT;
        goto out;
    }

Could we allow swap-in by passing 1 here instead?

[Severity: High]
Does bypassing the read-only scan phase cause severe performance issues?

Normally, collapse_scan_pmd() performs a read-only validation of the page
tables. By skipping straight to collapse_huge_page(), this code takes the
mmap_write_lock, destructively clears the PMD, and triggers a global TLB
flush before verifying if the PTEs can actually be isolated:

collapse_huge_page()
    _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
    ...
    tlb_remove_table_sync_one();

If isolation fails (e.g., due to pinned or unmapped pages), it restores the
PMD and aborts. Because DAMON repeatedly scans and targets memory, wouldn't
this create continuous lock contention and TLB flush storms on
un-collapsible memory ranges?

> +	kfree(cc);
> +	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +		return 0;
> +	return madvise_collapse_errno(result);
> +}
> +EXPORT_SYMBOL_GPL(damon_collapse_folio_range);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702095227.75866-1-lianux.mm@gmail.com?part=2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
@ 2026-07-02 10:13   ` sashiko-bot
  2026-07-02 11:07   ` Lorenzo Stoakes
  1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:13 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Uninitialized memory in `struct collapse_control` causes undefined behavior and random NUMA node constraints during THP allocation.
- [High] Calling `lru_add_drain_all()` for every PMD creates severe system-wide latency spikes and IPI storms.
- [High] Directly calling collapse_huge_page() bypasses PTE pre-scanning and swap-in, causing unconditional THP allocations and breaking file-backed THP support.
--

commit 03af0641474bb6262efed77c042c17e3c94d4f27
Author: Lian Wang <lianux.mm@gmail.com>

mm/khugepaged: add damon_collapse_folio_range() for external callers

Adds a thin wrapper around collapse_huge_page() to allow external
subsystems like DAMON to trigger THP collapse on a target address range.

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 617bca76db49b..7fe9ce1e05335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c

[ ... ]

> @@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }
> +
> +/**
> + * damon_collapse_folio_range() - Collapse base pages in range into a THP
> + * @mm:         mm_struct of the target process
> + * @start_addr: start address (must be order-aligned)
> + * @target_order: page order of the collapse result (currently only
> + *                HPAGE_PMD_ORDER is supported)
> + *
> + * Thin wrapper around collapse_huge_page() for external callers such as
> + * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
> + * lock: collapse_huge_page() acquires mmap_read_lock for validation,
> + * releases it, then acquires mmap_write_lock for the collapse.  Holding
> + * an outer mmap_read_lock would self-deadlock.
> + *
> + * Return: 0 on success, -EINVAL on bad arguments, negative error from
> + *         madvise_collapse_errno() otherwise.
> + */
> +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> +			       unsigned int target_order)
> +{
> +	struct collapse_control *cc;
> +	enum scan_result result;
> +
> +	if (target_order != HPAGE_PMD_ORDER) {
> +		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
> +			     __func__, HPAGE_PMD_ORDER, target_order);
> +		return -EINVAL;
> +	}
> +	if (start_addr & ((PAGE_SIZE << target_order) - 1))
> +		return -EINVAL;
> +
> +	cc = kmalloc_obj(*cc);

[Severity: High]
Does this leave the cc->node_load array and cc->alloc_nmask nodemask
uninitialized?

Since this code bypasses collapse_scan_pmd() which typically zeros these
fields, collapse_huge_page() will call alloc_charge_folio() and then
collapse_find_target_node().

In collapse_find_target_node(), the code reads cc->node_load[nid] to find
the maximum value and sets bits in cc->alloc_nmask. Passing a garbage
nodemask to __folio_alloc() could cause undefined behavior and random NUMA
node constraints during allocation.

> +	if (!cc)
> +		return -ENOMEM;
> +	cc->is_khugepaged = false;
> +	cc->progress = 0;
> +
> +	lru_add_drain_all();

[Severity: High]
Can this cause system-wide latency spikes and IPI storms?

Because damon_collapse_folio_range() is called for every 2MB chunk by DAMON
iteration logic, unconditionally invoking the heavy cross-CPU
lru_add_drain_all() here means it will execute for every single PMD in the
region.

Unlike madvise_collapse() which hoists this drain out of its scanning loop,
placing it inside the per-PMD API might severely impact system performance.

> +
> +	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);

[Severity: High]
Does this direct call bypass PTE pre-scanning and swap-in logic?

By skipping collapse_scan_pmd() and directly invoking collapse_huge_page(),
this forces an unconditional allocation via alloc_charge_folio() for every
chunk evaluated, before even checking if the PMD is collapsible.

Also, passing 0 for the unmapped parameter skips the swap-in step, meaning
any swapped-out pages in the target region will cause the isolation phase
to abort.

Furthermore, hugepage_vma_revalidate() inside collapse_huge_page()
hardcodes expect_anon=true, so this path will silently fail for any
file-backed memory.

> +	kfree(cc);
> +	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +		return 0;
> +	return madvise_collapse_errno(result);
> +}
> +EXPORT_SYMBOL_GPL(damon_collapse_folio_range);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702094633.75658-1-lianux.mm@gmail.com?part=2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
                   ` (4 preceding siblings ...)
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
@ 2026-07-02 10:23 ` Lorenzo Stoakes
  2026-07-02 16:52   ` SJ Park
  2026-07-02 18:35 ` SJ Park
  6 siblings, 1 reply; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-07-02 10:23 UTC (permalink / raw)
  To: Lian Wang
  Cc: sj, damon, linux-mm, daichaobing, kunwu.chan, Andrew Morton,
	David Hildenbrand, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	linux-kernel

+cc all those you missed.

I really need to write a bot to do this, because I'm getting a little tired of
pointing this out :))

On Thu, Jul 02, 2026 at 05:46:28PM +0800, Lian Wang wrote:
>  include/linux/damon.h      |  10 +++
>  include/linux/khugepaged.h |   9 +++
>  mm/damon/core.c            |   2 +
>  mm/damon/sysfs-schemes.c   |  77 ++++++++++++++++++++++
>  mm/damon/vaddr.c           | 128 +++++++++++++++++++++++++++++++++++++
>  mm/khugepaged.c            |  46 +++++++++++++
>  6 files changed, 272 insertions(+)

You are doing damon changes, and that belongs to SJ, sure.

But you're also changing core THP code? Please ensure you cc- THP people because
without our approval this cannot be merged:

$ scripts/get_maintainer.pl 20260702094633.75658-1-lianux.mm@gmail.com.mbx
SJ Park <sj@kernel.org> (maintainer:DAMON)
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Zi Yan <ziy@nvidia.com> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Baolin Wang <baolin.wang@linux.alibaba.com> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
"Liam R. Howlett" <liam@infradead.org> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Nico Pache <npache@redhat.com> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Ryan Roberts <ryan.roberts@arm.com> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Dev Jain <dev.jain@arm.com> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Barry Song <baohua@kernel.org> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
Lance Yang <lance.yang@linux.dev> (reviewer:MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE))
damon@lists.linux.dev (open list:DAMON)
linux-mm@kvack.org (open list:DAMON)
linux-kernel@vger.kernel.org (open list)

>
> --
> 2.50.1 (Apple Git-155)
>

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
@ 2026-07-02 10:26   ` sashiko-bot
  2026-07-02 19:56   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:26 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] damos_va_collapse() accepts mTHP target orders but calls damon_collapse_folio_range() which strictly rejects them.
- [High] Calling damon_collapse_folio_range() in a per-chunk loop triggers severe global lock contention and IPI storms via lru_add_drain_all().
- [High] damos_va_split() skips entire PMD-sized chunks when encountering unmapped pages or small folios, preventing valid mTHPs from being split.
- [Medium] damos_va_collapse() fails to deduplicate overlapping chunk collapses across adjacent DAMON regions, artificially inflating statistics.
--

commit bbcfac46aaf7ca5b18e45240fde04cb5fffba9c1
Author: Lian Wang <lianux.mm@gmail.com>
Subject: mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler

This commit implements the DAMOS_COLLAPSE action for mTHP by iterating
over target regions in order-aligned chunks. It expands region boundaries
to the covering aligned range and calls damon_collapse_folio_range()
without holding an external mmap lock to avoid self-deadlocks.

> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -895,6 +896,50 @@ static unsigned long damos_va_stat(struct damon_target *target,
>  	return 0;
>  }
>  
> +static unsigned long damos_va_collapse(struct damon_target *target,
> +		struct damon_region *r, struct damos *s,
> +		unsigned long *sz_filter_passed)
> +{
> +	unsigned long addr, end, chunk_sz;
> +	unsigned long last_chunk = ULONG_MAX;

[Severity: Medium]
Does this code fail to deduplicate overlapping chunk collapses across
adjacent DAMON regions?

When kdamond splits regions into smaller sizes that fall into the same
aligned chunk, damos_va_collapse() is called for each region. Because
last_chunk is initialized locally to ULONG_MAX on every call, it does
not persist across calls, which artificially inflates the DAMON
statistics (sz_filter_passed and applied) and prematurely exhausts
quotas.

Additionally, because addr is strictly incremented by chunk_sz within
the loop, the addr == last_chunk check appears to be dead code:

	while (addr < end) {
		...
		if (addr == last_chunk)
			goto next;
		last_chunk = addr;
		...
		addr += chunk_sz;
	}

> +	unsigned int target_order = s->target_order;
> +	unsigned long applied = 0;
> +	struct mm_struct *mm;
> +	int ret;
> +
> +	if (target_order < 2 || target_order > HPAGE_PMD_ORDER)
> +		return 0;
> +
> +	chunk_sz = PAGE_SIZE << target_order;
> +	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
> +	end = ALIGN(r->ar.end, chunk_sz);
> +	if (end < addr)
> +		return 0;
> +
> +	mm = damon_get_mm(target);
> +	if (!mm)
> +		return 0;
> +
> +	while (addr < end) {
> +		if (addr + chunk_sz < addr)
> +			break;
> +		if (addr == last_chunk)
> +			goto next;
> +		last_chunk = addr;
> +
> +		ret = damon_collapse_folio_range(mm, addr, target_order);

[Severity: High]
Does calling damon_collapse_folio_range() with mTHP orders fail silently?

While this loop accepts target_order > 0 and < HPAGE_PMD_ORDER,
damon_collapse_folio_range() explicitly rejects any order that is not
PMD_ORDER:

mm/khugepaged.c:damon_collapse_folio_range() {
	if (target_order != HPAGE_PMD_ORDER) {
		pr_warn_once("... only PMD order ... is supported ...");
		return -EINVAL;
	}
}

This means attempts to collapse regions into mTHP sizes will fail, while
still incrementing sz_filter_passed.

[Severity: High]
Can this tight loop cause severe global lock contention and IPI storms?

damos_va_collapse() iterates over a target memory region in chunk_sz
increments, calling damon_collapse_folio_range() sequentially for every
chunk. However, the wrapper damon_collapse_folio_range() unconditionally
executes lru_add_drain_all():

mm/khugepaged.c:damon_collapse_folio_range() {
	...
	lru_add_drain_all();
	...
}

For a multi-GB VMA region, this places a global synchronization point
(acquiring lru_add_drain_mutex and sending IPIs to all online CPUs)
inside a tight loop hundreds or thousands of times, which can cause
system-wide performance degradation, latency spikes, and potential RCU
lockups.

> +		if (!ret)
> +			applied += chunk_sz;
> +		*sz_filter_passed += chunk_sz;
> +next:
> +		addr += chunk_sz;
> +		cond_resched();
> +	}
> +
> +	mmput(mm);
> +	return applied;
> +}

[ ... ]

I also noticed an issue in the related damos_va_split() function:

[Severity: High]
Does damos_va_split() skip entire PMD-sized chunks when encountering
unmapped pages or small folios?

In damos_va_split(), if an unmapped page is encountered, the code jumps
to the unlock label:

	folio = folio_walk_start(&fw, vma, addr, 0);
	if (!folio)
		goto unlock;

And at the unlock label, addr is advanced by chunk_sz (typically 2MB)
instead of PAGE_SIZE:

unlock:
	*sz_filter_passed += chunk_sz;
	addr += chunk_sz;

A similar advancement occurs in the else block when a small folio is
encountered:

	} else {
		folio_walk_end(&fw, vma);
		*sz_filter_passed += chunk_sz;
		addr += chunk_sz;
	}

By advancing addr by 2MB upon encountering a single unmapped or small
page, this prevents valid mTHPs in the remainder of the chunk from being
split.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702094633.75658-1-lianux.mm@gmail.com?part=3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
@ 2026-07-02 10:41   ` sashiko-bot
  2026-07-02 20:00   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:41 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] `damos_va_collapse` and `damos_va_split` redundantly process memory and corrupt accounting quotas, while `split` skips memory chunks.
- [Medium] `damos_va_collapse` and `damos_va_split` silently ignore DAMON operations layer filters.
- [Low] Unconditional use of `HPAGE_PMD_ORDER` breaks the build on architectures without huge page support.
- [Low] Implicit default target_order for DAMOS_SPLIT triggers a kernel warning on scheme creation.
--

Note: This report is provided as a free-form summary due to recitation
restrictions preventing the inline quotation of the original patch code.

Commit: bad86cd869e24f6ecf3b06b8afd727d7561ea5eb
Author: Lian Wang <lianux.mm@gmail.com>
Subject: mm/damon: introduce DAMOS_SPLIT action

This commit adds a new DAMOS_SPLIT action to the DAMON operations layer.
It allows splitting large folios into smaller mTHP-order folios based on a
target_order provided via the sysfs interface.

[Severity: Low]
Could the unconditional use of HPAGE_PMD_ORDER in damos_va_collapse and
damos_va_split in mm/damon/vaddr.c break the build on architectures without
huge page support? On architectures where CONFIG_PGTABLE_HAS_HUGE_LEAVES is
not defined, HPAGE_PMD_ORDER evaluates to a macro containing a BUILD_BUG,
which might cause a compilation failure since vaddr.c does not depend on
CONFIG_TRANSPARENT_HUGEPAGE.

[Severity: High]
Does damos_va_collapse redundantly process memory and inflate the accounting
statistics? It uses a local variable last_chunk to prevent duplicate
processing, but because the function is called per-region, last_chunk is
reinitialized to ULONG_MAX each time, causing it to repeatedly process the
same chunks.

Additionally, in damos_va_split, if a small folio is encountered or the
address is unmapped, does the address iterator incorrectly skip memory?
It appears to increment the address by the full chunk_sz (2MB) instead of
the folio size or PAGE_SIZE, which would skip the remaining folios in the
PMD block.

[Severity: Medium]
Do damos_va_collapse and damos_va_split ignore DAMON operations layer
filters? Unlike other actions that evaluate damos_va_filter_out on each
folio, these functions appear to unconditionally increment the
sz_filter_passed statistic. This might falsely report to the DAMON core
that the memory passed user-configured filters like anon/file or age filters.

[Severity: Low]
Will creating a sysfs DAMON scheme for the split action trigger unnecessary
kernel log spam if the target_order is left at its default? In
damon_sysfs_mk_scheme in mm/damon/sysfs-schemes.c, the target_order is
zero-initialized. The validation logic treats 0 as invalid for DAMOS_SPLIT,
defaults it to 2, and emits a pr_warn, which might generate a warning every
time such a scheme is committed.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702094633.75658-1-lianux.mm@gmail.com?part=4

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
@ 2026-07-02 10:49   ` sashiko-bot
  2026-07-02 20:18   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-07-02 10:49 UTC (permalink / raw)
  To: Lian Wang; +Cc: damon

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Incorrect virtual address advancement skips valid memory regions, overshoots unaligned folios, and corrupts DAMOS statistics.
- [High] Missing validation for driver-allocated compound pages before calling split_folio_to_order().
--

commit 8064bd2b9a44cd475dbdec8fba8f8de5e1726797
Author: Lian Wang <lianux.mm@gmail.com>

mm/damon/vaddr: implement DAMOS_SPLIT handler

Implement the vaddr operations layer handler for DAMOS_SPLIT.
For each folio in the target region that is larger than the
scheme's target_order, split it via split_folio_to_order().

This supports both anonymous and file-backed (e.g. tmpfs/shmem)
folios, covering KVM guest memory backed by THP tmpfs.

> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 8bfc6dcc635d5..73cb61c655fc7 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c

[ ... ]

> @@ -940,6 +940,83 @@ static unsigned long damos_va_collapse(struct damon_target *target,
>  	return applied;
>  }
>  
> +static unsigned long damos_va_split(struct damon_target *target,

[ ... ]

> +		folio_sz = PAGE_SIZE << folio_order(folio);
> +
> +		if (folio_order(folio) > target_order) {
> +			if (!folio_trylock(folio)) {
> +				folio_walk_end(&fw, vma);
> +				goto unlock;
> +			}
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			if (!split_folio_to_order(folio, target_order))

[Severity: High]
Can this call to split_folio_to_order() trigger a kernel panic if the folio
is a driver-allocated compound page?

folio_walk_start() can return driver-allocated compound pages mapped into
normal VMAs. These pages will have a folio_order() greater than zero, but
they lack the expected LRU, anonymous, or pagecache metadata. 

Should there be a check here to ensure the folio is splittable (for example,
verifying it is an LRU page) before attempting to split it?

> +				applied += folio_sz;
> +
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			*sz_filter_passed += folio_sz;
> +			addr += folio_sz;

[Severity: High]
Could this advancement logic overshoot the folio boundary?

If the iterator starts in the middle of a large folio, advancing addr by the
unadjusted folio_sz might push the address past the actual end of the folio.

> +		} else {
> +			folio_walk_end(&fw, vma);
> +			*sz_filter_passed += chunk_sz;
> +			addr += chunk_sz;

[Severity: High]
Does advancing by chunk_sz here accidentally skip all subsequent memory
within the 2MB chunk?

If folio_walk_start() returns an order-0 folio or any folio smaller than
target_order, the else branch advances addr by chunk_sz (2MB). This appears
to skip over any remaining folios that might exist in the rest of the chunk.

> +		}
> +		mmap_read_unlock(mm);
> +		cond_resched();
> +		continue;
> +
> +unlock:
> +		*sz_filter_passed += chunk_sz;
> +		addr += chunk_sz;

[Severity: High]
Does advancing by chunk_sz in this error path inflate the DAMOS statistics
and skip over remaining memory?

If a VMA check fails, or if folio_trylock() fails above, the loop jumps
to unlock and adds chunk_sz (2MB) to both sz_filter_passed and addr. This
seems like it would bypass the rest of the memory in the chunk and report
incorrect statistics for the filtered size.

> +		mmap_read_unlock(mm);
> +		cond_resched();
> +	}
> +
> +out_mmput:
> +	mmput(mm);
> +	return applied;
> +}

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702094633.75658-1-lianux.mm@gmail.com?part=5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
  2026-07-02 10:13   ` sashiko-bot
@ 2026-07-02 11:07   ` Lorenzo Stoakes
  2026-07-02 19:43     ` SJ Park
  1 sibling, 1 reply; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-07-02 11:07 UTC (permalink / raw)
  To: Lian Wang
  Cc: sj, damon, linux-mm, daichaobing, kunwu.chan, Andrew Morton,
	David Hildenbrand, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	linux-kernel

(+cc missing people again)

Sorry but we're not going to accept anything that exports THP logic like this at
all.

And a damon wrapper in core mm code is just a non-starter, so you really need to
rethink your approach.

I think SJ already commented on this in your v1 from what I can see? I'd listen
to his advice on this :)

On Thu, Jul 02, 2026 at 05:46:30PM +0800, Lian Wang wrote:
> Export a thin wrapper around collapse_huge_page() that allows external
> subsystems such as DAMON to trigger THP collapse on a target address
> range.
>
> Currently restricted to PMD order (HPAGE_PMD_ORDER), since
> collapse_huge_page() does not yet support arbitrary mTHP orders.
> The restriction can be relaxed when khugepaged gains mTHP support.
>
> The caller must hold a reference to @mm.  Do not hold mmap lock:
> collapse_huge_page() acquires mmap_read_lock for validation, releases
> it, then acquires mmap_write_lock for the actual collapse.  Holding
> an outer mmap_read_lock would cause a self-deadlock when the same
> thread attempts the inner mmap_write_lock.
>
> Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> Signed-off-by: Lian Wang <lianux.wang@processmission.com>

This patch is exporting internal mm logic without proper safeguards so it's just
not something we're going to accept, sorry.

(Also not sure it's correct to have multiple S-o-b for the same person (unless
re-tagging a backport or something)? I'm not sure though)

> ---
>  include/linux/khugepaged.h |  9 ++++++++
>  mm/khugepaged.c            | 46 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index d7a9053ff4fe..f7d49cba712f 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -20,6 +20,9 @@ extern bool current_is_khugepaged(void);
>  void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		bool install_pmd);
>
> +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> +			       unsigned int target_order);

No thanks. We're not putting damon wrappers into core code. This is breaking the
abstraction and letting arbitrary users invoke internal mm logic.

Plus you're literally exporting this so it can be abused by drivers.

> +
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
>  	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> @@ -47,6 +50,12 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
>  {
>  }
>
> +static inline int damon_collapse_folio_range(struct mm_struct *mm,
> +		unsigned long start_addr, unsigned int target_order)
> +{
> +	return -EINVAL;
> +}
> +
>  static inline void khugepaged_min_free_kbytes_update(void)
>  {
>  }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 617bca76db49..7fe9ce1e0533 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }
> +
> +/**
> + * damon_collapse_folio_range() - Collapse base pages in range into a THP
> + * @mm:         mm_struct of the target process
> + * @start_addr: start address (must be order-aligned)
> + * @target_order: page order of the collapse result (currently only
> + *                HPAGE_PMD_ORDER is supported)
> + *
> + * Thin wrapper around collapse_huge_page() for external callers such as
> + * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap

This is really fragile and bug bait.

> + * lock: collapse_huge_page() acquires mmap_read_lock for validation,

This is just gross, you're now collapsing based on an outdated concept of
what the current VMA state is...

You're also losing literally everything that madvise_collapse() does.

AND you're overriding THP limitations like max_ptes_none, which is horrible...

> + * releases it, then acquires mmap_write_lock for the collapse.  Holding
> + * an outer mmap_read_lock would self-deadlock.

This is a sign the interface is wrong!

> + *
> + * Return: 0 on success, -EINVAL on bad arguments, negative error from
> + *         madvise_collapse_errno() otherwise.
> + */
> +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> +			       unsigned int target_order)
> +{
> +	struct collapse_control *cc;
> +	enum scan_result result;
> +
> +	if (target_order != HPAGE_PMD_ORDER) {
> +		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
> +			     __func__, HPAGE_PMD_ORDER, target_order);
> +		return -EINVAL;
> +	}
> +	if (start_addr & ((PAGE_SIZE << target_order) - 1))
> +		return -EINVAL;
> +
> +	cc = kmalloc_obj(*cc);
> +	if (!cc)
> +		return -ENOMEM;
> +	cc->is_khugepaged = false;
> +	cc->progress = 0;
> +
> +	lru_add_drain_all();

This is quite literally a copy/paste from madvise_collapse(). No no no :) code
duplication like this is also unacceptable.

> +
> +	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);
> +	kfree(cc);
> +	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +		return 0;
> +	return madvise_collapse_errno(result);
> +}
> +EXPORT_SYMBOL_GPL(damon_collapse_folio_range);

We _definitely_ cannot have internal mm logic _exported_.

Yeah sorry you need to rethink this.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
  2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
@ 2026-07-02 16:52   ` SJ Park
  0 siblings, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 16:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SJ Park, Lian Wang, damon, linux-mm, daichaobing, kunwu.chan,
	Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, linux-kernel

On Thu, 2 Jul 2026 11:23:55 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:

> +cc all those you missed.

Thank you for doing this, Lorenzo.

> 
> I really need to write a bot to do this, because I'm getting a little tired of
> pointing this out :))

Good idea.  I will also consider implementing this kind of checks to to my lzy
tool box [1] or hkml [2].

> 
> On Thu, Jul 02, 2026 at 05:46:28PM +0800, Lian Wang wrote:
> >  include/linux/damon.h      |  10 +++
> >  include/linux/khugepaged.h |   9 +++
> >  mm/damon/core.c            |   2 +
> >  mm/damon/sysfs-schemes.c   |  77 ++++++++++++++++++++++
> >  mm/damon/vaddr.c           | 128 +++++++++++++++++++++++++++++++++++++
> >  mm/khugepaged.c            |  46 +++++++++++++
> >  6 files changed, 272 insertions(+)
> 
> You are doing damon changes, and that belongs to SJ, sure.
> 
> But you're also changing core THP code? Please ensure you cc- THP people because
> without our approval this cannot be merged:

Thank you for calling out this, Lorenzo.

Lian, please do as Lorenzo kindly asked, from the next revision.  You don't
need to add those recipients to all the patches if you worry their inbox
volumes.  But do ensure adding them to at least patches that modifies
khugepaged.h and khugepaged.c, and the cover letter.

If it is cumbersome, consider using 'hkml patch format' [3].  It does that (run
get_maintainer.pl and add recipients to each patch and the coverletter) for its
users.

[1] https://github.com/sjp38/lazybox
[2] https://github.com/sjp38/hackermail
[3] https://github.com/sjp38/hackermail/blob/master/USAGE.md#formatting-patches


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
  2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
                   ` (5 preceding siblings ...)
  2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
@ 2026-07-02 18:35 ` SJ Park
  2026-07-02 20:50   ` SJ Park
  6 siblings, 1 reply; 24+ messages in thread
From: SJ Park @ 2026-07-02 18:35 UTC (permalink / raw)
  To: Lian Wang; +Cc: SJ Park, damon, linux-mm, daichaobing, kunwu.chan

On Thu,  2 Jul 2026 17:46:28 +0800 Lian Wang <lianux.mm@gmail.com> wrote:

> Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
> be RFC PATCH v2).
> 
> This resend also includes fixes for issues identified during review of
> the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
> races, BUILD_BUG guards, missing sysfs action name registration, and
> stack allocation overflow.  The series has been re-tested on aarch64
> (anonymous and file-backed THP split) and is checkpatch clean.
> 
> v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/

Let's call it 'RFC v1'.

> 
> Changes since v1

Ditto.

> 
>  - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
>    the existing actions (per SJ's review).
>  - Drop the per-scheme hot_threshold field.  Hotness policy does not
>    belong in the kernel; target selection now lives in user space and
>    is expressed to DAMOS via the address filter (per SJ's review).
>  - Drop the v1 SPE debugfs patch entirely.  debugfs is not the right
>    interface for a feature, and the SPE profiler belongs in user space
>    (see "User-space target selection" below).  v2 is kernel mechanism
>    only: 5 patches.
>  - Decouple T1 (a lab observation) from T2 (the production issue), and
>    correct the architecture claim: ptep_test_and_clear_young() skips
>    the TLB flush on both x86_64 and arm64, so the blind spot is
>    architecture-independent rather than arm64-only.
>  - Terminology: avoid "stale TLB".  A valid TLB entry is doing its
>    job; the point is only that it lets the CPU satisfy a translation
>    without a page-table walk, so the Accessed bit cleared by DAMON is
>    not re-set.

Thank you for detailed changelog.  This is helpful for reviewers.
> 
> Background
> 
> Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
> in play.  Both are described here as motivation only; this series does
> not change the AF monitoring path.
> 
> T2 -- PMD-granularity inflation (production issue)

I think it is better to call this T1, for readers.

> 
> A 2MB THP is tracked by a single PMD-level Accessed bit.  One access
> to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
> the entire THP as hot and cannot distinguish a genuinely hot 2MB
> region from a 2MB region with a single hot 4KB page.  Cold memory
> hides inside "hot" THPs, and access-driven pageout/migration becomes
> coarse.
> 
> This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
> hosts running Oracle.  ARM SPE sampling of that workload shows 94.6%
> of THPs have fewer than 10% of their sub-pages actually accessed.

Cool finding, thank you for sharing.  What DB workloads were running there?
Real production workload?  Or, synthetic benchmarks?

On the first read, I was wondering how you did ARM SPE sampling.  After reading
this mail to the end, I now understand you use perf.  Briefly mentioning that
here would  be nice.  E.g., "ARM SPE sampling of that worklaod using perf shows
..."

> 
> T1 -- TLB-reach blind spot (lab observation)

I think it is better to call this T2, for readers.

> 
> When the working set fits within L2 TLB reach (measured at 2048
> entries x 2MB = 4GB on Kunpeng 920; no public data available), the
> CPU satisfies translations entirely from the TLB,
> preventing translation table walks.  Because
> ptep_test_and_clear_young() does not flush

Wrapping text for the max columns is nice.  But let's not wrap it early when
there are spaces.  That could reduce space, and even carbon emissions from
people who want to read this nice cover letter after printing out on a paper.

> the TLB, valid TLB entries continue to satisfy translations and the
> AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0 for
> memory that is in fact hot, and no scheme triggers.  This reproduces
> in the lab with small workloads; it is not something we have seen
> reported from production, where working sets exceed TLB reach.
> 
> What this series adds
> 
> Rather than change AF monitoring, this series adds two order-aware
> DAMOS actions so a policy layer can act at mTHP granularity:

The background explained rooms to improve in DAMON's THP access "monitoring".
And this patch series is proposing adding new DAMOS actions for THP "handling".
Those are two unrelated things.

I really appreciate sharing your findings with the background, but as those are
not related to the proposal, I think it is better to be shared in a different
way.

I understand you are proposing this change because you know DAMON's hugepages
monitoring is imperfect, but still useful enough to get some benefits.  If
there were some findings that made you to think so, that could be good
background.

Also, you may have a reason to believe it is a good idea to use larger mTHP for
hot pages, and smaller mTHP for cold pages.  If so, and the description of the
reason is not trivial, that could be good materials to add on background.

> 
>  - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios

This reads like you are introducing a new DAMOS action.  You indeed mentioned
"this series adds two order-aware DAMOS actions".  That's not completely wrong
in a sense, but more technically speaking you are adding a new mode of
DAMOS_COLLAPSE.  I'd recommend rephrasing to "extend" DAMOS_COLLAPSE..

>    up to a chosen mTHP order.  Patch 1 adds the target_order field and
>    its sysfs file; patch 2 exports a khugepaged helper
>    (damon_collapse_folio_range());

So patch 2 modifies khugepaged?  As Lorenzo mentioned on the other reply, that
change should also be reviewed by THP developers on MAINTAINERS file.  Please
ensure adding THP developers to the recipients list of the patch and this cover
letter.

The patch adds damon_collapse_folio_range() to khugepaged.h.  I understand
DAMON is the only user for now, and therefore you are adding damon_ prefix to
the name.  Not necesasrily DAMON is the only user forever.  And having damon_
prefix in a land outside of DAMON feels weird.  To be consistent with other
functions like collapse_pte_mapped_thp(), I'd suggest dropping the prefix from
the name.

>    patch 3 wires the vaddr handler.
> 
>  - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
>    to a chosen mTHP order via split_folio_to_order(), for both
>    anonymous and file-backed (tmpfs/shmem) folios.
> 
> The two are complementary, not competing:
> 
>    THP=never  + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
>    THP=always + DAMOS_SPLIT:    start at 2MB, shrink cold regions down.
> 
> This dual-path design aligns with ideas discussed with Asier
> Gutierrez; we plan to unify our mTHP automation and evaluation
> roadmaps under this standard DAMOS_SPLIT action.
> 
> A deployment can pick either baseline, or run both, and let DAMOS
> manage the placement.  THP is still wanted for the hot working set
> (fewer TLB misses, shallower walks); the goal is not "no THP" but
> "THP where it is hot, small pages where it is cold."

I think this is a good idea.  Could you further elaborate what benefit users
can get from this in more detail, though?  Off the top of my head, I can expect
the benefits would be 1) less TLB miss from hot data, and 2) less mTHP
allocation failures from cold data occupying phsically contiguous memory.  But
you might showing even more benefits.    Anyway I think those are better to be
widely known by our kernel users.  Some of those may better to be put on the
background section.

> 
> User-space target selection
> 
> The decision of *which* regions to collapse or split is left to user
> space and fed to DAMOS through the existing DAMOS address filter
> (DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
> The kernel provides the mechanism; user space provides the policy,
> consistent with the perf/BPF "kernel samples, user space decides"
> model and with the DAMON-X direction.
> 
> Because the AF signal is unreliable at PMD granularity (T1/T2), the
> scheme is run with min_nr_accesses=0 so it does not gate on access
> count, and the address filter selects targets.  min_nr_accesses=0 is
> also what unblocks the T1 case, where nr_accesses is pinned at 0.

Oh, so you are saying DAMON's huge pages monitoring is too problematic to use
as-is, for your use case.  That's completely fair.  And that explains what you
really want to do.  But this whole pictur is better to be described earlier
than your changes proposal.

From the beginning, explain why using larger mTHP for hot pages and smaller
mTHP for cold pages are good idea.  After that, explain how DAMON can be
extended for doing that.  Then, you can further explain your T1 and T2 findings
that explain why DAMON-only appraoch is not feasible, and how user-space target
selection can overcome it.

Also, I understand DAMON-only approach is not optimum or just useless for your
aimed use case.  But, is it completely useless for every possible use case?  I
think it might still provide some benefit in some use cases.  Could you pleae
clarify this point more in detail?  If you have data showing how useless
DAMON-only appraoch is, and how user space approach improves, it would be
awesome.

> 
> Why not just turn khugepaged off?  You can, but khugepaged is global
> and usually left enabled because other workloads rely on it; it cannot
> be disabled per region.  DAMOS_COLLAPSE gives per-region,
> access-pattern-driven collapse -- a more precise, targeted complement
> to khugepaged's global scan, not a replacement for it.  To handle the
> runtime race where khugepaged might aggressively re-collapse what
> DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
> or back-off mechanism to prevent ping-pong effects in mixed
> environments.

Good reasoning.  However, khugepaged can be turned off per process, using
prctl().  How about turning khugepaged off for the process you want to use
DAMOS_COLLAPSE/SPLIT for?

> 
> Two user-space data sources produce the candidate address ranges:
> 
>  1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
>     histogram -> PA->VA via /proc/<pid>/pagemap -> sparse-THP VA
>     ranges.  SPE reads physical addresses from the CPU pipeline,
>     bypassing the TLB and page tables, so it is immune to T1 and T2.
> 
>  2. smaps fallback (no SPE): scan /proc/<pid>/smaps for THP-backed
>     VMAs and treat the 2MB-aligned ranges as split candidates.
> 
> The SPE profiler stays in user space deliberately: the SPE PMU is a
> single-consumer resource, so a kernel consumer would lock out
> user-space perf and tooling (x86 PEBS / AMD IBS have the same
> property).  Keeping it in user space avoids that and keeps the metric
> source pluggable, in line with DAMON-X.

Maybe you are mentioning the perf events based DAMON, not DAMON-X.

And I understand you plan to extend DAMON to use ARM SPE, on top of the perf
events based DAMON as a future work.  As I mentioned before, I think that makes
perfect sense and I'm aligned.  Maybe this paragraph can bit reworded to make
it more clear, though.

> This is why v2 drops the v1
> SPE debugfs patch.
> 
> Testing
> 
> Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
> using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
> single DAMOS address filter selecting one 2MB-aligned range:
> 
>  - Anonymous THP: the filter splits exactly that one THP --
>    sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
>    256MB mapping untouched.
>  - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
>    splits exactly one 2MB shmem THP -- sz_applied=2MB and
>    ShmemPmdMapped drops by 2MB.  This confirms split_folio_to_order()
>    works for shmem folios (the KVM-guest-on-THP-tmpfs case).
>  - The address filter is what bounds the action: sz_tried covers the
>    whole ~2GB monitored region while sz_applied is exactly the 2MB the
>    filter selected.
>  - A smaps-based path (for hosts without SPE) enumerates THP-backed
>    ranges and splits all THP in the target workload.
>  - checkpatch clean on all 5 patches.

So, you tested only split part, for functionality.  Do you have plans to
further test collapse part, and performance?

> 
> Test scripts and SPE-to-DAMON pipeline tools:
> https://github.com/lianux-mm/damon_spe/tree/v2

Thank you for sharing the code!

So, I find rooms to improve on this cover letter for the readability and
clarity of the idea.  But as I mentioned before, I like the overall idea of
this series.

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
  2026-07-02 10:01   ` sashiko-bot
@ 2026-07-02 18:51   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 18:51 UTC (permalink / raw)
  To: Lian Wang; +Cc: SJ Park, damon, linux-mm, daichaobing, kunwu.chan

On Thu,  2 Jul 2026 17:46:29 +0800 Lian Wang <lianux.mm@gmail.com> wrote:

> DAMOS_COLLAPSE currently collapses into PMD-size THP only.  Add a
> target_order field to express per-order mTHP collapse intent.  Zero
> means system default (PMD order, same as current behavior).  Valid
> values are 0 and 2..HPAGE_PMD_ORDER.
> 
> Wire up the sysfs interface: a per-scheme rw file "target_order".

Unless there is a name conflict, let's just call it "order".

> Validate at store time that the value is in range, and warn at scheme
> creation time if DAMOS_COLLAPSE is used with an unsupported non-PMD
> order, resetting to 0.

I'm working [1] on centralize the user parameter validations into core layer.
It is fine to have this for now.  But once the core layer validation is merged
into mm-new, let's use it for the validation.  That is, just let the user
writes invalidate target order.  But eventual attempt to use that via
damon_start() or damon_commit() will fail.

> 
> The actual mTHP application via the khugepaged wrapper will be added
> in subsequent patches.
> 
> Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> Signed-off-by: Lian Wang <lianux.wang@processmission.com>

Having two Signed-off-by: looks odd.  If you want to give a credit to your
employer, you can add that to your signature, like:

Signed-off-by: Lian Wang (Your employer name) <lianux.mm@gmail.com>

That's a de-facto standard in mm community.

> ---
>  include/linux/damon.h    |  5 ++++
>  mm/damon/core.c          |  2 ++
>  mm/damon/sysfs-schemes.c | 61 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 6f7edb3590ef..5a0587556573 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -572,6 +572,11 @@ struct damos_migrate_dests {
>  struct damos {
>  	struct damos_access_pattern pattern;
>  	enum damos_action action;
> +	/*
> +	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
> +	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
> +	 */
> +	unsigned int target_order;
>  	unsigned long apply_interval_us;

Let's rename to 'order'.  And put it in the union having 'target_nid' and
'migrate_dests'.

>  /* private: internal use only */
>  	/*
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 265d51ade25b..be54defd4646 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -579,6 +579,7 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
>  
>  	scheme->migrate_dests = (struct damos_migrate_dests){};
>  	scheme->target_nid = target_nid;
> +	scheme->target_order = 0;
>  
>  	return scheme;
>  }
> @@ -1278,6 +1279,7 @@ static int damos_commit(struct damos *dst, struct damos *src)
>  
>  	dst->wmarks = src->wmarks;
>  	dst->target_nid = src->target_nid;
> +	dst->target_order = src->target_order;
>  
>  	err = damos_commit_dests(&dst->migrate_dests, &src->migrate_dests);
>  	if (err)
> diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> index 329cfd0bbe9f..7dcd582ded86 100644
> --- a/mm/damon/sysfs-schemes.c
> +++ b/mm/damon/sysfs-schemes.c
> @@ -6,7 +6,9 @@
>   */
>  
>  #include <linux/slab.h>
> +#include <linux/mm.h>
>  #include <linux/numa.h>
> +#include <linux/huge_mm.h>
>  
>  #include "sysfs-common.h"
>  
> @@ -2257,6 +2259,7 @@ struct damon_sysfs_scheme {
>  	struct damon_sysfs_stats *stats;
>  	struct damon_sysfs_scheme_regions *tried_regions;
>  	int target_nid;
> +	unsigned int target_order;

Let's rename to 'order' if there is no problem.  Same for below.

>  	struct damos_sysfs_dests *dests;
>  };
>  
> @@ -2323,6 +2326,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
>  	scheme->action = action;
>  	scheme->apply_interval_us = apply_interval_us;
>  	scheme->target_nid = NUMA_NO_NODE;
> +	scheme->target_order = 0;
>  	return scheme;
>  }
>  
> @@ -2642,6 +2646,39 @@ static ssize_t target_nid_store(struct kobject *kobj,
>  	return err ? err : count;
>  }
>  
> +static ssize_t target_order_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	struct damon_sysfs_scheme *scheme = container_of(kobj,
> +			struct damon_sysfs_scheme, kobj);
> +
> +	return sysfs_emit(buf, "%u\n", scheme->target_order);
> +}
> +
> +static ssize_t target_order_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct damon_sysfs_scheme *scheme = container_of(kobj,
> +			struct damon_sysfs_scheme, kobj);
> +	unsigned int val;
> +	int err;
> +
> +	err = kstrtouint(buf, 0, &val);
> +	if (err)
> +		return err;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (val != 0 && (val < 2 || val > HPAGE_PMD_ORDER))
> +		return -EINVAL;
> +#else
> +	if (val != 0)
> +		return -EINVAL;
> +#endif

As I mentioned above, this should eventually removed in lieu of core layer
parameters validation.

> +
> +	scheme->target_order = val;
> +	return count;
> +}
> +
>  static void damon_sysfs_scheme_release(struct kobject *kobj)
>  {
>  	kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
> @@ -2656,10 +2693,14 @@ static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr =
>  static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
>  		__ATTR_RW_MODE(target_nid, 0600);
>  
> +static struct kobj_attribute damon_sysfs_scheme_target_order_attr =
> +		__ATTR_RW_MODE(target_order, 0600);
> +
>  static struct attribute *damon_sysfs_scheme_attrs[] = {
>  	&damon_sysfs_scheme_action_attr.attr,
>  	&damon_sysfs_scheme_apply_interval_us_attr.attr,
>  	&damon_sysfs_scheme_target_nid_attr.attr,
> +	&damon_sysfs_scheme_target_order_attr.attr,
>  	NULL,
>  };
>  ATTRIBUTE_GROUPS(damon_sysfs_scheme);
> @@ -2970,6 +3011,7 @@ static struct damos *damon_sysfs_mk_scheme(
>  	struct damon_sysfs_weights *sysfs_weights = sysfs_quotas->weights;
>  	struct damon_sysfs_watermarks *sysfs_wmarks = sysfs_scheme->watermarks;
>  	struct damos *scheme;
> +	unsigned int target_order;
>  	int err;
>  
>  	struct damos_access_pattern pattern = {
> @@ -3005,6 +3047,25 @@ static struct damos *damon_sysfs_mk_scheme(
>  	if (!scheme)
>  		return NULL;
>  
> +	target_order = sysfs_scheme->target_order;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (sysfs_scheme->action == DAMOS_COLLAPSE &&
> +	    target_order != 0 &&
> +	    target_order != HPAGE_PMD_ORDER) {
> +		pr_warn("DAMON collapse: target_order %u not supported, only PMD order (%u) is available. Use 0 or %u.\n",
> +			target_order,
> +			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
> +		target_order = 0;
> +	}

Ditto.  This validation should eventually be removed.

> +#else
> +	if (sysfs_scheme->action == DAMOS_COLLAPSE && target_order != 0) {
> +		pr_warn("DAMON collapse: target_order not supported without THP. Use 0.\n");
> +		target_order = 0;
> +	}
> +#endif
> +	scheme->target_order = target_order;
> +
>  	err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota);
>  	if (err) {
>  		damon_destroy_scheme(scheme);

Overall, looks good.

[1] https://lore.kerneel.org/20260701144815.113325-1-sj@kernel.org


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02 11:07   ` Lorenzo Stoakes
@ 2026-07-02 19:43     ` SJ Park
  2026-07-02 20:32       ` SJ Park
  0 siblings, 1 reply; 24+ messages in thread
From: SJ Park @ 2026-07-02 19:43 UTC (permalink / raw)
  To: Lian Wang
  Cc: SJ Park, Lorenzo Stoakes, damon, linux-mm, daichaobing,
	kunwu.chan, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, linux-kernel

On Thu, 2 Jul 2026 12:07:01 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:

> (+cc missing people again)

Thank you for adding the recipients, and review, Lorenzo!

> 
> Sorry but we're not going to accept anything that exports THP logic like this at
> all.
> 
> And a damon wrapper in core mm code is just a non-starter, so you really need to
> rethink your approach.
> 
> I think SJ already commented on this in your v1 from what I can see? I'd listen
> to his advice on this :)

Lorenzo is right.  Not disrupting the world outside of mm/damon/ is the first
principle of DAMON development.  Sometimes we may have to make some changes
outside of mm/damon/, but we MUST make it not disruptive, small, and perfectly
aligned with the developers of the area with respects.

> 
> On Thu, Jul 02, 2026 at 05:46:30PM +0800, Lian Wang wrote:
> > Export a thin wrapper around collapse_huge_page() that allows external
> > subsystems such as DAMON to trigger THP collapse on a target address
> > range.
> >
> > Currently restricted to PMD order (HPAGE_PMD_ORDER), since
> > collapse_huge_page() does not yet support arbitrary mTHP orders.
> > The restriction can be relaxed when khugepaged gains mTHP support.
> >
> > The caller must hold a reference to @mm.  Do not hold mmap lock:
> > collapse_huge_page() acquires mmap_read_lock for validation, releases
> > it, then acquires mmap_write_lock for the actual collapse.  Holding
> > an outer mmap_read_lock would cause a self-deadlock when the same
> > thread attempts the inner mmap_write_lock.
> >
> > Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
> > Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
> > Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> > Signed-off-by: Lian Wang <lianux.wang@processmission.com>
> 
> This patch is exporting internal mm logic without proper safeguards so it's just
> not something we're going to accept, sorry.
> 
> (Also not sure it's correct to have multiple S-o-b for the same person (unless
> re-tagging a backport or something)? I'm not sure though)

I agree to Lorenzo.

As I replied [1] to patch 1, if you just want to give a credit to your
employer, you could add your employer name on the single tag, like "Lian Wang
(processssmission) <lianux.mm@gmail.com>".

If it is for ensure the replies goes to not only your private inbox but also
corporate email inbox, you can simply Cc: your corporate email address.

> 
> > ---
> >  include/linux/khugepaged.h |  9 ++++++++
> >  mm/khugepaged.c            | 46 ++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 55 insertions(+)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index d7a9053ff4fe..f7d49cba712f 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -20,6 +20,9 @@ extern bool current_is_khugepaged(void);
> >  void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >  		bool install_pmd);
> >
> > +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> > +			       unsigned int target_order);
> 
> No thanks. We're not putting damon wrappers into core code. This is breaking the
> abstraction and letting arbitrary users invoke internal mm logic.

I agree.  First of all, having something with 'damon_' prefix in khugepaged.h
makes no sense.  If we have to expose something, maybe mm/internal.h could be
considered in my opinion.  But for only DAMON usage, include/linux/khugepaged.h
is definitely wrong.

> 
> Plus you're literally exporting this so it can be abused by drivers.

I was raising my eyebrows here.  I show this patch adds EXPORT_SYMBOL_GPL() at
the end.  Lorenzo is completely right.  I believe Lian added that only by a
mistake.

> 
> > +
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> >  	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > @@ -47,6 +50,12 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> >  {
> >  }
> >
> > +static inline int damon_collapse_folio_range(struct mm_struct *mm,
> > +		unsigned long start_addr, unsigned int target_order)
> > +{
> > +	return -EINVAL;
> > +}
> > +
> >  static inline void khugepaged_min_free_kbytes_update(void)
> >  {
> >  }
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 617bca76db49..7fe9ce1e0533 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -3272,3 +3272,49 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> >  			: madvise_collapse_errno(last_fail);
> >  }
> > +
> > +/**
> > + * damon_collapse_folio_range() - Collapse base pages in range into a THP
> > + * @mm:         mm_struct of the target process
> > + * @start_addr: start address (must be order-aligned)
> > + * @target_order: page order of the collapse result (currently only
> > + *                HPAGE_PMD_ORDER is supported)
> > + *
> > + * Thin wrapper around collapse_huge_page() for external callers such as
> > + * DAMON.  The caller must hold a reference to @mm.  Do not hold mmap
> 
> This is really fragile and bug bait.
> 
> > + * lock: collapse_huge_page() acquires mmap_read_lock for validation,
> 
> This is just gross, you're now collapsing based on an outdated concept of
> what the current VMA state is...
> 
> You're also losing literally everything that madvise_collapse() does.
> 
> AND you're overriding THP limitations like max_ptes_none, which is horrible...
> 
> > + * releases it, then acquires mmap_write_lock for the collapse.  Holding
> > + * an outer mmap_read_lock would self-deadlock.
> 
> This is a sign the interface is wrong!
> 
> > + *
> > + * Return: 0 on success, -EINVAL on bad arguments, negative error from
> > + *         madvise_collapse_errno() otherwise.
> > + */
> > +int damon_collapse_folio_range(struct mm_struct *mm, unsigned long start_addr,
> > +			       unsigned int target_order)
> > +{
> > +	struct collapse_control *cc;
> > +	enum scan_result result;
> > +
> > +	if (target_order != HPAGE_PMD_ORDER) {
> > +		pr_warn_once("%s: only PMD order (%u) is supported, got %u\n",
> > +			     __func__, HPAGE_PMD_ORDER, target_order);
> > +		return -EINVAL;
> > +	}
> > +	if (start_addr & ((PAGE_SIZE << target_order) - 1))
> > +		return -EINVAL;
> > +
> > +	cc = kmalloc_obj(*cc);
> > +	if (!cc)
> > +		return -ENOMEM;
> > +	cc->is_khugepaged = false;
> > +	cc->progress = 0;
> > +
> > +	lru_add_drain_all();
> 
> This is quite literally a copy/paste from madvise_collapse(). No no no :) code
> duplication like this is also unacceptable.

Thank you for kind review, Lorenzo.

I believe Lorenzo's concerns are valid and should all be addressed before
dropping RFC tag of this patch.  Also, if you "have to" post the next revision
before addressing those to get reviews of changes on other parts of this
series, let's make sure the fact is very clearly described.  In the way, we
can avoid wasting time of reviewers.  For example, you could add a note like
below at the beginning of the next revision of this patch.

"""
NOTE to THP reviewers: This patch is not yet addressing problems that are found
on the previous revision.  This version is posted for review of only other
parts in the series.  Please ignore this patch for now.
"""

> 
> > +
> > +	result = collapse_huge_page(mm, start_addr, 1, 0, cc, target_order);
> > +	kfree(cc);
> > +	if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> > +		return 0;
> > +	return madvise_collapse_errno(result);
> > +}
> > +EXPORT_SYMBOL_GPL(damon_collapse_folio_range);
> 
> We _definitely_ cannot have internal mm logic _exported_.

I completely agree.  Please drop this from the next version.

> 
> Yeah sorry you need to rethink this.

I wonder if we could extend madvise_collapse() for mTHP and use it.

I initially thought this is a silly idea, but
'Documentation/admin-guide/mm/transhuge.rst' is saying "currently",
madvise_collapse only supports collapsing to PMD-sized THPs and does not attemp
mTHP collapse.  It reads to me like supporting mTHP from MADV_COLLAPSE is a
possible TODO.  And it makes sense to me.  Users who do MADV_COLLAPSE may want
it to at least have a best-effort mTHP.  The interface of madvise_collapse() is
already receiving arbitrary address range, so fit with DAMON's purpose.

If supporting mTHP with MADV_THP is not a real TODO but just a silly idea that
better not to be implemented, I wonder if we can split out the core of
madvise_collapse(), extend it for mTHP, and expose it on mm/internal.h.

I concern if the above idea makes call depth unnecessarily deep, though.  A
shallower approach would be renaming madvise_collapse() to somewhat general (I
fail at making an example, but whatever without maddvise_ prefix) and extending
it for mTHP support.

Off the topic of this patch, but I find 'madvise_collapse()' is declared in
include/linux/huge_mm.h, while it is being used only inside mm/.  Would it make
sense to move the declaration to mm/internal.h?

[1] https://lore.kernel.org/20260702185137.91227-1-sj@kernel.org

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
  2026-07-02 10:26   ` sashiko-bot
@ 2026-07-02 19:56   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 19:56 UTC (permalink / raw)
  To: Lian Wang; +Cc: SJ Park, damon, linux-mm, daichaobing, kunwu.chan

On Thu,  2 Jul 2026 17:46:31 +0800 Lian Wang <lianux.mm@gmail.com> wrote:

> When target_order is set (non-zero), the DAMOS_COLLAPSE handler now calls
> damon_collapse_folio_range() to collapse pages into the requested mTHP
> size, iterating over the target region in order-aligned chunks.  When
> target_order is 0 (default), the existing madvise(MADV_COLLAPSE) path is
> used, preserving backwards compatibility.
> 
> Region boundaries are expanded outward to the covering aligned range
> (ALIGN_DOWN start, ALIGN end) so that collapse works even after
> kdamond_split_regions reduces region sizes below the chunk size.
> collapse_huge_page() internally validates VMA bounds, so expanding
> beyond the original region is safe.
> 
> No external mmap lock is held: collapse_huge_page() acquires
> mmap_read_lock internally for validation, releases it, then acquires
> mmap_write_lock for the actual collapse.  Holding an outer
> mmap_read_lock would cause a self-deadlock when the same thread
> attempts the inner mmap_write_lock.
> 
> Co-developed-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Kunwu Chan <kunwu.chan@gmail.com>
> Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> Signed-off-by: Lian Wang <lianux.wang@processmission.com>
> ---
>  mm/damon/vaddr.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index d27147603564..98a87609376b 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -14,6 +14,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
> +#include <linux/khugepaged.h>
>  
>  #include "../internal.h"
>  #include "ops-common.h"
> @@ -899,6 +900,50 @@ static unsigned long damos_va_stat(struct damon_target *target,
>  	return 0;
>  }
>  
> +static unsigned long damos_va_collapse(struct damon_target *target,
> +		struct damon_region *r, struct damos *s,
> +		unsigned long *sz_filter_passed)
> +{
> +	unsigned long addr, end, chunk_sz;
> +	unsigned long last_chunk = ULONG_MAX;
> +	unsigned int target_order = s->target_order;
> +	unsigned long applied = 0;
> +	struct mm_struct *mm;
> +	int ret;
> +
> +	if (target_order < 2 || target_order > HPAGE_PMD_ORDER)
> +		return 0;

I think this validation should be done by the caller, or whoever before this
function is called.  Let's remove this.

> +
> +	chunk_sz = PAGE_SIZE << target_order;
> +	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
> +	end = ALIGN(r->ar.end, chunk_sz);

How about ALIGN() for addr and ALIGN_DOWN() for end?  I show madvise_collapse()
is doing so and being consistent to that is maybe a good idea to less confusing
users.

> +	if (end < addr)
> +		return 0;

Let's early-return for end == addr case, too.

> +
> +	mm = damon_get_mm(target);
> +	if (!mm)
> +		return 0;
> +
> +	while (addr < end) {
> +		if (addr + chunk_sz < addr)
> +			break;
> +		if (addr == last_chunk)
> +			goto next;
> +		last_chunk = addr;
> +
> +		ret = damon_collapse_folio_range(mm, addr, target_order);
> +		if (!ret)
> +			applied += chunk_sz;
> +		*sz_filter_passed += chunk_sz;

Shouldn't this call damos_va_filter_out() before damos_collapse_folio_range()?

> +next:
> +		addr += chunk_sz;
> +		cond_resched();
> +	}
> +
> +	mmput(mm);
> +	return applied;
> +}
> +
>  static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		struct damon_target *t, struct damon_region *r,
>  		struct damos *scheme, unsigned long *sz_filter_passed)
> @@ -922,6 +967,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		madv_action = MADV_NOHUGEPAGE;
>  		break;
>  	case DAMOS_COLLAPSE:
> +		if (scheme->target_order)
> +			return damos_va_collapse(t, r, scheme,
> +						 sz_filter_passed);

Let's use two tabs indentation on the second line of parameters, instead of
aligning things with the mix of tabs and spaces.

>  		madv_action = MADV_COLLAPSE;
>  		break;
>  	case DAMOS_MIGRATE_HOT:


Thanks,
SJ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
  2026-07-02 10:41   ` sashiko-bot
@ 2026-07-02 20:00   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 20:00 UTC (permalink / raw)
  To: Lian Wang; +Cc: SJ Park, damon, linux-mm, daichaobing, kunwu.chan

On Thu,  2 Jul 2026 17:46:32 +0800 Lian Wang <lianux.mm@gmail.com> wrote:

> Add DAMOS_SPLIT to the damos_action enum for splitting large folios
> into smaller mTHP-order folios.  Add a target_order field to struct
> damos to specify the desired split order.

This commit is not adding but reusing the field.

> 
> Expose the action as "split" through the DAMON sysfs interface with
> target_order validation (must be 2..HPAGE_PMD_ORDER-1).
> 
> Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> Signed-off-by: Lian Wang <lianux.wang@processmission.com>

Let's have only one S-o-b: line if there is no reason to have two.

> ---
>  include/linux/damon.h    |  9 +++++++--
>  mm/damon/sysfs-schemes.c | 16 ++++++++++++++++
>  2 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 5a0587556573..30cf4afb212c 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -121,6 +121,7 @@ struct damon_target {
>   * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
>   * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
>   * @DAMOS_COLLAPSE:	Call ``madvise()`` for the region with MADV_COLLAPSE.
> + * @DAMOS_SPLIT:	Split large folios to the target mTHP order.
>   * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
>   * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
>   * @DAMOS_MIGRATE_HOT:  Migrate the regions prioritizing warmer regions.
> @@ -141,6 +142,7 @@ enum damos_action {
>  	DAMOS_HUGEPAGE,
>  	DAMOS_NOHUGEPAGE,
>  	DAMOS_COLLAPSE,
> +	DAMOS_SPLIT,
>  	DAMOS_LRU_PRIO,
>  	DAMOS_LRU_DEPRIO,
>  	DAMOS_MIGRATE_HOT,
> @@ -573,8 +575,11 @@ struct damos {
>  	struct damos_access_pattern pattern;
>  	enum damos_action action;
>  	/*
> -	 * @target_order: target order for mTHP actions (DAMOS_COLLAPSE).
> -	 * 0 means system default (PMD order).  Valid: 0, 2..HPAGE_PMD_ORDER.
> +	 * @target_order: target mTHP order for DAMOS_COLLAPSE and
> +	 * DAMOS_SPLIT.  For COLLAPSE, 0 means PMD order default,
> +	 * valid values: 0, 2..HPAGE_PMD_ORDER.  For SPLIT,
> +	 * valid values: 2..HPAGE_PMD_ORDER-1; 0 and HPAGE_PMD_ORDER
> +	 * are rejected at scheme creation time (defaulting to 2).
>  	 */
>  	unsigned int target_order;

As I commented to an earlier patch, let's use the union.

>  	unsigned long apply_interval_us;
> diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> index 7dcd582ded86..84a4617ca3d3 100644
> --- a/mm/damon/sysfs-schemes.c
> +++ b/mm/damon/sysfs-schemes.c
> @@ -2293,6 +2293,10 @@ static struct damos_sysfs_action_name damos_sysfs_action_names[] = {
>  		.action = DAMOS_COLLAPSE,
>  		.name = "collapse",
>  	},
> +	{
> +		.action = DAMOS_SPLIT,
> +		.name = "split",
> +	},
>  	{
>  		.action = DAMOS_LRU_PRIO,
>  		.name = "lru_prio",
> @@ -3058,11 +3062,23 @@ static struct damos *damon_sysfs_mk_scheme(
>  			HPAGE_PMD_ORDER, HPAGE_PMD_ORDER);
>  		target_order = 0;
>  	}
> +	if (sysfs_scheme->action == DAMOS_SPLIT &&
> +	    (target_order == 0 ||
> +	     target_order >= HPAGE_PMD_ORDER)) {
> +		pr_warn("DAMON split: target_order %u invalid, need 2..%u. Defaulting to 2.\n",
> +			target_order,
> +			HPAGE_PMD_ORDER - 1);
> +		target_order = 2;
> +	}

As I also commented to an earlier patch, this validation will eventually
dropped.

>  #else
>  	if (sysfs_scheme->action == DAMOS_COLLAPSE && target_order != 0) {
>  		pr_warn("DAMON collapse: target_order not supported without THP. Use 0.\n");
>  		target_order = 0;
>  	}
> +	if (sysfs_scheme->action == DAMOS_SPLIT) {
> +		pr_warn("DAMON split: not supported without THP.\n");
> +		target_order = 2;

Why set target_order here?

> +	}
>  #endif
>  	scheme->target_order = target_order;


Thanks,
SJ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
  2026-07-02 10:49   ` sashiko-bot
@ 2026-07-02 20:18   ` SJ Park
  1 sibling, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 20:18 UTC (permalink / raw)
  To: Lian Wang; +Cc: SJ Park, damon, linux-mm, daichaobing, kunwu.chan

On Thu,  2 Jul 2026 17:46:33 +0800 Lian Wang <lianux.mm@gmail.com> wrote:

> Implement the vaddr operations layer handler for DAMOS_SPLIT.
> For each folio in the target region that is larger than the
> scheme's target_order, split it via split_folio_to_order().
> 
> This supports both anonymous and file-backed (e.g. tmpfs/shmem)
> folios, covering KVM guest memory backed by THP tmpfs.
> 
> Signed-off-by: Lian Wang <lianux.mm@gmail.com>
> Signed-off-by: Lian Wang <lianux.wang@processmission.com>

Let's use single S-o-b: if it works.

> ---
>  mm/damon/vaddr.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 80 insertions(+)
> 
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 98a87609376b..a177703f7e0a 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -944,6 +944,83 @@ static unsigned long damos_va_collapse(struct damon_target *target,
>  	return applied;
>  }
>  
> +static unsigned long damos_va_split(struct damon_target *target,
> +		struct damon_region *r, struct damos *s,
> +		unsigned long *sz_filter_passed)
> +{
> +	unsigned long addr, end, chunk_sz;
> +	unsigned int target_order = s->target_order;
> +	unsigned long applied = 0;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct folio *folio;
> +	struct folio_walk fw;
> +
> +	mm = damon_get_mm(target);
> +	if (!mm)
> +		return 0;
> +
> +	chunk_sz = PAGE_SIZE << HPAGE_PMD_ORDER;
> +	addr = ALIGN_DOWN(r->ar.start, chunk_sz);
> +	end = ALIGN(r->ar.end, chunk_sz);

Like I commented to collapse part, would it make more sense to ALIGN() addr and
ALIGN_DOWN() end?

> +	if (end < addr)
> +		goto out_mmput;
> +
> +	while (addr < end) {

Above 'if (end < addr)' seems unnecessary, as while() here is covering it?

> +		unsigned long folio_sz;
> +
> +		if (addr + chunk_sz < addr)
> +			break;

What's the purpose here?  Preventing overflow?  How the overflow can happen,
and what is the consequence?

> +
> +		mmap_read_lock(mm);
> +		vma = find_vma(mm, addr);
> +
> +		if (!vma || addr < vma->vm_start ||
> +		    vma->vm_flags & (VM_HUGETLB | VM_MIXEDMAP))
> +			goto unlock;
> +
> +		folio = folio_walk_start(&fw, vma, addr, 0);

Why we need this?  Can't we use the simpler pattern like that on
damos_va_stat_pmd_entry()?

> +		if (!folio)
> +			goto unlock;
> +
> +		folio_sz = PAGE_SIZE << folio_order(folio);

You can use folio_size().

> +
> +		if (folio_order(folio) > target_order) {
> +			if (!folio_trylock(folio)) {
> +				folio_walk_end(&fw, vma);
> +				goto unlock;
> +			}
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			if (!split_folio_to_order(folio, target_order))
> +				applied += folio_sz;
> +
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			*sz_filter_passed += folio_sz;

Shoudn't this code calls damos_va_filter_out() before split_folio_to_order()?

> +			addr += folio_sz;
> +		} else {
> +			folio_walk_end(&fw, vma);
> +			*sz_filter_passed += chunk_sz;
> +			addr += chunk_sz;
> +		}
> +		mmap_read_unlock(mm);
> +		cond_resched();
> +		continue;
> +
> +unlock:
> +		*sz_filter_passed += chunk_sz;
> +		addr += chunk_sz;
> +		mmap_read_unlock(mm);
> +		cond_resched();
> +	}
> +
> +out_mmput:
> +	mmput(mm);
> +	return applied;
> +}
> +
>  static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		struct damon_target *t, struct damon_region *r,
>  		struct damos *scheme, unsigned long *sz_filter_passed)
> @@ -977,6 +1054,9 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		return damos_va_migrate(t, r, scheme, sz_filter_passed);
>  	case DAMOS_STAT:
>  		return damos_va_stat(t, r, scheme, sz_filter_passed);
> +	case DAMOS_SPLIT:
> +		return damos_va_split(t, r, scheme,
> +					  sz_filter_passed);

Let's keep the order same to that of damos_action.

>  	default:
>  		/*
>  		 * DAMOS actions that are not yet supported by 'vaddr'.


Thanks,
SJ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers
  2026-07-02 19:43     ` SJ Park
@ 2026-07-02 20:32       ` SJ Park
  0 siblings, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 20:32 UTC (permalink / raw)
  To: SJ Park
  Cc: Lian Wang, Lorenzo Stoakes, damon, linux-mm, daichaobing,
	kunwu.chan, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, linux-kernel

On Thu, 02 Jul 2026 12:43:59 -0700 SJ Park <sj@kernel.org> wrote:

> On Thu, 2 Jul 2026 12:07:01 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:
> 
> > (+cc missing people again)
> 
> Thank you for adding the recipients, and review, Lorenzo!
> 
> > 
> > Sorry but we're not going to accept anything that exports THP logic like this at
> > all.
> > 
> > And a damon wrapper in core mm code is just a non-starter, so you really need to
> > rethink your approach.
> > 
> > I think SJ already commented on this in your v1 from what I can see? I'd listen
> > to his advice on this :)
> 
> Lorenzo is right.  Not disrupting the world outside of mm/damon/ is the first
> principle of DAMON development.  Sometimes we may have to make some changes
> outside of mm/damon/, but we MUST make it not disruptive, small, and perfectly
> aligned with the developers of the area with respects.

The best option is just not doing this.  And that might be the case.

we already have pmd level DAMOS_COLLAPSE.  I find mTHP-supporting DAMOS_SPLIT
can be implemented without any changes on mm/damon/ external world.  If it is
true and there is no objection at doing that, mTHP collapse may not really
necessary.  That is, the users could collapse in pmd level first, and then
split in desired mTHP level to accomplish their goal.  I think that works for
common use cases, too.

It would be suboptimal to collapse in pmd level first and then split.  But the
efficiency is unclear.  I don't want to disrupt others for unclear gains.  We
can upstream split part first, measure the efficiency, and revisit mTHP
collapse if it turns out to be really needed.

What do you think, Lian?

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
  2026-07-02 18:35 ` SJ Park
@ 2026-07-02 20:50   ` SJ Park
  0 siblings, 0 replies; 24+ messages in thread
From: SJ Park @ 2026-07-02 20:50 UTC (permalink / raw)
  To: SJ Park; +Cc: Lian Wang, damon, linux-mm, daichaobing, kunwu.chan

Keeping full original mail, so that Lian can answer all comments in one reply.

On Thu,  2 Jul 2026 11:35:51 -0700 SJ Park <sj@kernel.org> wrote:

> On Thu,  2 Jul 2026 17:46:28 +0800 Lian Wang <lianux.mm@gmail.com> wrote:
> 
> > Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
> > be RFC PATCH v2).
> > 
> > This resend also includes fixes for issues identified during review of
> > the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
> > races, BUILD_BUG guards, missing sysfs action name registration, and
> > stack allocation overflow.  The series has been re-tested on aarch64
> > (anonymous and file-backed THP split) and is checkpatch clean.
> > 
> > v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/
> 
> Let's call it 'RFC v1'.
> 
> > 
> > Changes since v1
> 
> Ditto.
> 
> > 
> >  - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
> >    the existing actions (per SJ's review).
> >  - Drop the per-scheme hot_threshold field.  Hotness policy does not
> >    belong in the kernel; target selection now lives in user space and
> >    is expressed to DAMOS via the address filter (per SJ's review).
> >  - Drop the v1 SPE debugfs patch entirely.  debugfs is not the right
> >    interface for a feature, and the SPE profiler belongs in user space
> >    (see "User-space target selection" below).  v2 is kernel mechanism
> >    only: 5 patches.
> >  - Decouple T1 (a lab observation) from T2 (the production issue), and
> >    correct the architecture claim: ptep_test_and_clear_young() skips
> >    the TLB flush on both x86_64 and arm64, so the blind spot is
> >    architecture-independent rather than arm64-only.
> >  - Terminology: avoid "stale TLB".  A valid TLB entry is doing its
> >    job; the point is only that it lets the CPU satisfy a translation
> >    without a page-table walk, so the Accessed bit cleared by DAMON is
> >    not re-set.
> 
> Thank you for detailed changelog.  This is helpful for reviewers.
> > 
> > Background
> > 
> > Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
> > in play.  Both are described here as motivation only; this series does
> > not change the AF monitoring path.
> > 
> > T2 -- PMD-granularity inflation (production issue)
> 
> I think it is better to call this T1, for readers.
> 
> > 
> > A 2MB THP is tracked by a single PMD-level Accessed bit.  One access
> > to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
> > the entire THP as hot and cannot distinguish a genuinely hot 2MB
> > region from a 2MB region with a single hot 4KB page.  Cold memory
> > hides inside "hot" THPs, and access-driven pageout/migration becomes
> > coarse.
> > 
> > This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
> > hosts running Oracle.  ARM SPE sampling of that workload shows 94.6%
> > of THPs have fewer than 10% of their sub-pages actually accessed.
> 
> Cool finding, thank you for sharing.  What DB workloads were running there?
> Real production workload?  Or, synthetic benchmarks?
> 
> On the first read, I was wondering how you did ARM SPE sampling.  After reading
> this mail to the end, I now understand you use perf.  Briefly mentioning that
> here would  be nice.  E.g., "ARM SPE sampling of that worklaod using perf shows
> ..."
> 
> > 
> > T1 -- TLB-reach blind spot (lab observation)
> 
> I think it is better to call this T2, for readers.
> 
> > 
> > When the working set fits within L2 TLB reach (measured at 2048
> > entries x 2MB = 4GB on Kunpeng 920; no public data available), the
> > CPU satisfies translations entirely from the TLB,
> > preventing translation table walks.  Because
> > ptep_test_and_clear_young() does not flush
> 
> Wrapping text for the max columns is nice.  But let's not wrap it early when
> there are spaces.  That could reduce space, and even carbon emissions from
> people who want to read this nice cover letter after printing out on a paper.
> 
> > the TLB, valid TLB entries continue to satisfy translations and the
> > AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0 for
> > memory that is in fact hot, and no scheme triggers.  This reproduces
> > in the lab with small workloads; it is not something we have seen
> > reported from production, where working sets exceed TLB reach.
> > 
> > What this series adds
> > 
> > Rather than change AF monitoring, this series adds two order-aware
> > DAMOS actions so a policy layer can act at mTHP granularity:
> 
> The background explained rooms to improve in DAMON's THP access "monitoring".
> And this patch series is proposing adding new DAMOS actions for THP "handling".
> Those are two unrelated things.
> 
> I really appreciate sharing your findings with the background, but as those are
> not related to the proposal, I think it is better to be shared in a different
> way.
> 
> I understand you are proposing this change because you know DAMON's hugepages
> monitoring is imperfect, but still useful enough to get some benefits.  If
> there were some findings that made you to think so, that could be good
> background.
> 
> Also, you may have a reason to believe it is a good idea to use larger mTHP for
> hot pages, and smaller mTHP for cold pages.  If so, and the description of the
> reason is not trivial, that could be good materials to add on background.

Now I doubt if we really need two new DAMOS actions.  What happens if user asks
DAMOS_COLLAPSE of a target order for region that currently being backed by an
mTHP of an order that is larger than the newly asked one?  If we just ignore
the case, DAMOS_SPLIT will really nneeded.  But maybe we can just split the
large folio into the newly requested order mTHPs.  In this scenario,
DAMOS_SPLIT is not needed?

> 
> > 
> >  - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios
> 
> This reads like you are introducing a new DAMOS action.  You indeed mentioned
> "this series adds two order-aware DAMOS actions".  That's not completely wrong
> in a sense, but more technically speaking you are adding a new mode of
> DAMOS_COLLAPSE.  I'd recommend rephrasing to "extend" DAMOS_COLLAPSE..
> 
> >    up to a chosen mTHP order.  Patch 1 adds the target_order field and
> >    its sysfs file; patch 2 exports a khugepaged helper
> >    (damon_collapse_folio_range());
> 
> So patch 2 modifies khugepaged?  As Lorenzo mentioned on the other reply, that
> change should also be reviewed by THP developers on MAINTAINERS file.  Please
> ensure adding THP developers to the recipients list of the patch and this cover
> letter.
> 
> The patch adds damon_collapse_folio_range() to khugepaged.h.  I understand
> DAMON is the only user for now, and therefore you are adding damon_ prefix to
> the name.  Not necesasrily DAMON is the only user forever.  And having damon_
> prefix in a land outside of DAMON feels weird.  To be consistent with other
> functions like collapse_pte_mapped_thp(), I'd suggest dropping the prefix from
> the name.
> 
> >    patch 3 wires the vaddr handler.
> > 
> >  - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
> >    to a chosen mTHP order via split_folio_to_order(), for both
> >    anonymous and file-backed (tmpfs/shmem) folios.
> > 
> > The two are complementary, not competing:
> > 
> >    THP=never  + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
> >    THP=always + DAMOS_SPLIT:    start at 2MB, shrink cold regions down.
> > 
> > This dual-path design aligns with ideas discussed with Asier
> > Gutierrez; we plan to unify our mTHP automation and evaluation
> > roadmaps under this standard DAMOS_SPLIT action.
> > 
> > A deployment can pick either baseline, or run both, and let DAMOS
> > manage the placement.  THP is still wanted for the hot working set
> > (fewer TLB misses, shallower walks); the goal is not "no THP" but
> > "THP where it is hot, small pages where it is cold."
> 
> I think this is a good idea.  Could you further elaborate what benefit users
> can get from this in more detail, though?  Off the top of my head, I can expect
> the benefits would be 1) less TLB miss from hot data, and 2) less mTHP
> allocation failures from cold data occupying phsically contiguous memory.  But
> you might showing even more benefits.    Anyway I think those are better to be
> widely known by our kernel users.  Some of those may better to be put on the
> background section.
> 
> > 
> > User-space target selection
> > 
> > The decision of *which* regions to collapse or split is left to user
> > space and fed to DAMOS through the existing DAMOS address filter
> > (DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
> > The kernel provides the mechanism; user space provides the policy,
> > consistent with the perf/BPF "kernel samples, user space decides"
> > model and with the DAMON-X direction.
> > 
> > Because the AF signal is unreliable at PMD granularity (T1/T2), the
> > scheme is run with min_nr_accesses=0 so it does not gate on access
> > count, and the address filter selects targets.  min_nr_accesses=0 is
> > also what unblocks the T1 case, where nr_accesses is pinned at 0.
> 
> Oh, so you are saying DAMON's huge pages monitoring is too problematic to use
> as-is, for your use case.  That's completely fair.  And that explains what you
> really want to do.  But this whole pictur is better to be described earlier
> than your changes proposal.
> 
> From the beginning, explain why using larger mTHP for hot pages and smaller
> mTHP for cold pages are good idea.  After that, explain how DAMON can be
> extended for doing that.  Then, you can further explain your T1 and T2 findings
> that explain why DAMON-only appraoch is not feasible, and how user-space target
> selection can overcome it.
> 
> Also, I understand DAMON-only approach is not optimum or just useless for your
> aimed use case.  But, is it completely useless for every possible use case?  I
> think it might still provide some benefit in some use cases.  Could you pleae
> clarify this point more in detail?  If you have data showing how useless
> DAMON-only appraoch is, and how user space approach improves, it would be
> awesome.
> 
> > 
> > Why not just turn khugepaged off?  You can, but khugepaged is global
> > and usually left enabled because other workloads rely on it; it cannot
> > be disabled per region.  DAMOS_COLLAPSE gives per-region,
> > access-pattern-driven collapse -- a more precise, targeted complement
> > to khugepaged's global scan, not a replacement for it.  To handle the
> > runtime race where khugepaged might aggressively re-collapse what
> > DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
> > or back-off mechanism to prevent ping-pong effects in mixed
> > environments.
> 
> Good reasoning.  However, khugepaged can be turned off per process, using
> prctl().  How about turning khugepaged off for the process you want to use
> DAMOS_COLLAPSE/SPLIT for?
> 
> > 
> > Two user-space data sources produce the candidate address ranges:
> > 
> >  1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
> >     histogram -> PA->VA via /proc/<pid>/pagemap -> sparse-THP VA
> >     ranges.  SPE reads physical addresses from the CPU pipeline,
> >     bypassing the TLB and page tables, so it is immune to T1 and T2.
> > 
> >  2. smaps fallback (no SPE): scan /proc/<pid>/smaps for THP-backed
> >     VMAs and treat the 2MB-aligned ranges as split candidates.
> > 
> > The SPE profiler stays in user space deliberately: the SPE PMU is a
> > single-consumer resource, so a kernel consumer would lock out
> > user-space perf and tooling (x86 PEBS / AMD IBS have the same
> > property).  Keeping it in user space avoids that and keeps the metric
> > source pluggable, in line with DAMON-X.
> 
> Maybe you are mentioning the perf events based DAMON, not DAMON-X.
> 
> And I understand you plan to extend DAMON to use ARM SPE, on top of the perf
> events based DAMON as a future work.  As I mentioned before, I think that makes
> perfect sense and I'm aligned.  Maybe this paragraph can bit reworded to make
> it more clear, though.
> 
> > This is why v2 drops the v1
> > SPE debugfs patch.
> > 
> > Testing
> > 
> > Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
> > using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
> > single DAMOS address filter selecting one 2MB-aligned range:
> > 
> >  - Anonymous THP: the filter splits exactly that one THP --
> >    sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
> >    256MB mapping untouched.
> >  - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
> >    splits exactly one 2MB shmem THP -- sz_applied=2MB and
> >    ShmemPmdMapped drops by 2MB.  This confirms split_folio_to_order()
> >    works for shmem folios (the KVM-guest-on-THP-tmpfs case).
> >  - The address filter is what bounds the action: sz_tried covers the
> >    whole ~2GB monitored region while sz_applied is exactly the 2MB the
> >    filter selected.
> >  - A smaps-based path (for hosts without SPE) enumerates THP-backed
> >    ranges and splits all THP in the target workload.
> >  - checkpatch clean on all 5 patches.
> 
> So, you tested only split part, for functionality.  Do you have plans to
> further test collapse part, and performance?
> 
> > 
> > Test scripts and SPE-to-DAMON pipeline tools:
> > https://github.com/lianux-mm/damon_spe/tree/v2
> 
> Thank you for sharing the code!
> 
> So, I find rooms to improve on this cover letter for the readability and
> clarity of the idea.  But as I mentioned before, I like the overall idea of
> this series.
> 
> 
> Thanks,
> SJ
> 
> [...]

Sent using hkml (https://github.com/sjp38/hackermail)

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-07-02 20:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
2026-07-02 10:01   ` sashiko-bot
2026-07-02 18:51   ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
2026-07-02 10:13   ` sashiko-bot
2026-07-02 11:07   ` Lorenzo Stoakes
2026-07-02 19:43     ` SJ Park
2026-07-02 20:32       ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
2026-07-02 10:26   ` sashiko-bot
2026-07-02 19:56   ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
2026-07-02 10:41   ` sashiko-bot
2026-07-02 20:00   ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
2026-07-02 10:49   ` sashiko-bot
2026-07-02 20:18   ` SJ Park
2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
2026-07-02 16:52   ` SJ Park
2026-07-02 18:35 ` SJ Park
2026-07-02 20:50   ` SJ Park
  -- strict thread matches above, loose matches on Subject: below --
2026-07-02  9:52 Lian Wang
2026-07-02  9:52 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
2026-07-02 10:08   ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox