DAMON development mailing list
* [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
@ 2026-07-02  9:46 Lian Wang
  2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Lian Wang @ 2026-07-02  9:46 UTC (permalink / raw)
  To: sj; +Cc: damon, linux-mm, daichaobing, kunwu.chan

Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
be RFC PATCH v2).

This resend also includes fixes for issues identified during review of
the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
races, BUILD_BUG guards, missing sysfs action name registration, and
stack allocation overflow.  The series has been re-tested on aarch64
(anonymous and file-backed THP split) and is checkpatch clean.

v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/

Changes since v1

 - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
   the existing actions (per SJ's review).
 - Drop the per-scheme hot_threshold field.  Hotness policy does not
   belong in the kernel; target selection now lives in user space and
   is expressed to DAMOS via the address filter (per SJ's review).
 - Drop the v1 SPE debugfs patch entirely.  debugfs is not the right
   interface for a feature, and the SPE profiler belongs in user space
   (see "User-space target selection" below).  v2 is kernel mechanism
   only: 5 patches.
 - Decouple T1 (a lab observation) from T2 (the production issue), and
   correct the architecture claim: ptep_test_and_clear_young() skips
   the TLB flush on both x86_64 and arm64, so the blind spot is
   architecture-independent rather than arm64-only.
 - Terminology: avoid "stale TLB".  A valid TLB entry is doing its
   job; the point is only that it lets the CPU satisfy a translation
   without a page-table walk, so the Accessed bit cleared by DAMON is
   not re-set.

Background

Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
in play.  Both are described here as motivation only; this series does
not change the AF monitoring path.

T2 -- PMD-granularity inflation (production issue)

A 2MB THP is tracked by a single PMD-level Accessed bit.  One access
to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
the entire THP as hot and cannot distinguish a genuinely hot 2MB
region from a 2MB region with a single hot 4KB page.  Cold memory
hides inside "hot" THPs, and access-driven pageout/migration becomes
coarse.

This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
hosts running Oracle.  ARM SPE sampling of that workload shows 94.6%
of THPs have fewer than 10% of their sub-pages actually accessed.
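
As an illustration of how such a sparsity figure can be derived, here is
a minimal sketch, not the tooling used for the measurement above.  It
assumes 4KB base pages, 2MB THPs, and a list of sampled addresses that
already lie in one address space; the function names are hypothetical.

```python
from collections import defaultdict

THP_SIZE = 2 * 1024 * 1024                 # 2MB PMD-sized THP (assumed)
PAGE_SIZE = 4 * 1024                       # 4KB base page (assumed)
SUBPAGES_PER_THP = THP_SIZE // PAGE_SIZE   # 512 sub-pages per THP


def subpage_fractions(sample_addrs):
    """Map each 2MB-aligned base to the fraction of its sub-pages sampled."""
    touched = defaultdict(set)
    for addr in sample_addrs:
        base = addr & ~(THP_SIZE - 1)
        # Record the index of the 4KB sub-page within its 2MB range.
        touched[base].add((addr - base) // PAGE_SIZE)
    return {base: len(pages) / SUBPAGES_PER_THP
            for base, pages in touched.items()}


def sparse_share(fractions, threshold=0.10):
    """Share of sampled THPs whose accessed fraction is below threshold."""
    if not fractions:
        return 0.0
    sparse = sum(1 for frac in fractions.values() if frac < threshold)
    return sparse / len(fractions)
```

A 2MB range that never appears in the samples is not counted at all, so
a real pipeline would also need the list of THP-backed ranges to get the
denominator right.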

T1 -- TLB-reach blind spot (lab observation)

When the working set fits within L2 TLB reach (measured at 2048
entries x 2MB = 4GB on Kunpeng 920; no public figure is available), the
CPU satisfies translations entirely from the TLB and performs no
translation table walks.  Because ptep_test_and_clear_young() does not
flush the TLB, valid TLB entries continue to satisfy translations and
the AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0
for memory that is in fact hot, and no scheme triggers.  This
reproduces in the lab with small workloads; it is not something we have
seen reported from production, where working sets exceed TLB reach.
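
The reach arithmetic above can be checked with a short sketch.  The
entry count and page size are the figures quoted in this letter; the
helper names are hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def tlb_reach_bytes(entries, page_size):
    """Memory that a fully populated TLB can map without a table walk."""
    return entries * page_size


def fits_in_tlb(working_set_bytes, entries=2048, page_size=2 * MB):
    """True when the working set could, at best, be served from the TLB."""
    return working_set_bytes <= tlb_reach_bytes(entries, page_size)


# 2048 entries of 2MB each give the 4GB reach quoted above.
REACH_2MB = tlb_reach_bytes(2048, 2 * MB)
# The same entry count with 4KB pages covers only 8MB.
REACH_4KB = tlb_reach_bytes(2048, 4 * 1024)
```

This is an upper bound only: it ignores associativity conflicts and
entries used by other address spaces.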

What this series adds

Rather than change AF monitoring, this series adds two order-aware
DAMOS actions so a policy layer can act at mTHP granularity:

 - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios
   up to a chosen mTHP order.  Patch 1 adds the target_order field and
   its sysfs file; patch 2 exports a khugepaged helper
   (damon_collapse_folio_range()); patch 3 wires the vaddr handler.

 - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
   to a chosen mTHP order via split_folio_to_order(), for both
   anonymous and file-backed (tmpfs/shmem) folios.
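
For readers less used to thinking in orders, the following sketch shows
what a given target_order means in bytes.  It assumes 4KB base pages and
a PMD order of 9 (other arm64 page sizes give different values), and the
helper names are hypothetical.

```python
PAGE_SIZE = 4096   # assumed base page size
PMD_ORDER = 9      # 2MB THP with 4KB base pages (assumed)


def folio_size(order):
    """Size in bytes of a folio of the given order."""
    return PAGE_SIZE << order


def folios_after_split(from_order, target_order):
    """Number of target_order folios produced by splitting one folio."""
    if target_order > from_order:
        raise ValueError("cannot split to a larger order")
    return 1 << (from_order - target_order)


# Splitting one 2MB THP down to target_order=2 yields 16KB folios.
SPLIT_SIZE = folio_size(2)
SPLIT_COUNT = folios_after_split(PMD_ORDER, 2)
```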

The two are complementary, not competing:

   THP=never  + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
   THP=always + DAMOS_SPLIT:    start at 2MB, shrink cold regions down.

This dual-path design aligns with ideas discussed with Asier
Gutierrez; we plan to unify our mTHP automation and evaluation
roadmaps under this standard DAMOS_SPLIT action.

A deployment can pick either baseline, or run both, and let DAMOS
manage the placement.  THP is still wanted for the hot working set
(fewer TLB misses, shallower walks); the goal is not "no THP" but
"THP where it is hot, small pages where it is cold."

User-space target selection

The decision of *which* regions to collapse or split is left to user
space and fed to DAMOS through the existing DAMOS address filter
(DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
The kernel provides the mechanism; user space provides the policy,
consistent with the perf/BPF "kernel samples, user space decides"
model and with the DAMON-X direction.

Because the AF signal is unreliable at PMD granularity (T1/T2), the
scheme is run with min_nr_accesses=0 so it does not gate on access
count, and the address filter selects targets.  min_nr_accesses=0 is
also what unblocks the T1 case, where nr_accesses is pinned at 0.
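
A toy model of that selection rule may help.  It is a deliberate
simplification, not the kernel logic: it assumes a region qualifies when
its nr_accesses lies within the scheme's [min, max] bounds and that the
action is then limited to the part overlapping one address filter range.
All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int
    end: int
    nr_accesses: int


def passes_access_pattern(region, min_nr_accesses, max_nr_accesses):
    """Access-count gate; min_nr_accesses=0 lets nr_accesses=0 through."""
    return min_nr_accesses <= region.nr_accesses <= max_nr_accesses


def select_targets(regions, filt_start, filt_end,
                   min_nr_accesses=0, max_nr_accesses=2**32 - 1):
    """Return the sub-ranges the action would cover in this toy model."""
    targets = []
    for region in regions:
        if not passes_access_pattern(region, min_nr_accesses,
                                     max_nr_accesses):
            continue
        # Keep only the part of the region inside the filter range.
        lo = max(region.start, filt_start)
        hi = min(region.end, filt_end)
        if lo < hi:
            targets.append((lo, hi))
    return targets
```

With a region pinned at nr_accesses=0 (the T1 case), a scheme with
min_nr_accesses=1 selects nothing, while min_nr_accesses=0 hands the
decision entirely to the filter range.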

Why not just turn khugepaged off?  You can, but khugepaged is global
and usually left enabled because other workloads rely on it; it cannot
be disabled per region.  DAMOS_COLLAPSE gives per-region,
access-pattern-driven collapse -- a more precise, targeted complement
to khugepaged's global scan, not a replacement for it.  To handle the
runtime race where khugepaged might aggressively re-collapse what
DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
or back-off mechanism to prevent ping-pong effects in mixed
environments.

Two user-space data sources produce the candidate address ranges:

 1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
    histogram -> PA->VA via /proc/<pid>/pagemap -> sparse-THP VA
    ranges.  SPE reads physical addresses from the CPU pipeline,
    bypassing the TLB and page tables, so it is immune to T1 and T2.

 2. smaps fallback (no SPE): scan /proc/<pid>/smaps for THP-backed
    VMAs and treat the 2MB-aligned ranges as split candidates.
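
The smaps fallback can be sketched as follows.  This is a minimal
illustration, not the script from the repository linked below: it takes
the text of /proc/<pid>/smaps as a string, treats a VMA as THP-backed
when its AnonHugePages or ShmemPmdMapped field is non-zero, and returns
the 2MB-aligned span inside that VMA.  The function name is
hypothetical.

```python
import re

THP_SIZE = 2 * 1024 * 1024  # 2MB PMD-sized THP (assumed)

# VMA header line, e.g. "7f0000000000-7f0010000000 rw-p 00000000 00:00 0"
VMA_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s")
# Per-VMA counters that indicate PMD-mapped huge pages.
FIELD_RE = re.compile(r"^(AnonHugePages|ShmemPmdMapped):\s+(\d+) kB")


def thp_candidates(smaps_text):
    """Return 2MB-aligned (start, end) ranges inside THP-backed VMAs."""
    ranges = []
    current = None
    counted = False
    for line in smaps_text.splitlines():
        vma = VMA_RE.match(line)
        if vma:
            current = (int(vma.group(1), 16), int(vma.group(2), 16))
            counted = False
            continue
        field = FIELD_RE.match(line)
        if field and current and not counted and int(field.group(2)) > 0:
            # Round the start up and the end down to 2MB boundaries.
            start = (current[0] + THP_SIZE - 1) & ~(THP_SIZE - 1)
            end = current[1] & ~(THP_SIZE - 1)
            if start < end:
                ranges.append((start, end))
                counted = True
    return ranges
```

The result over-approximates: smaps reports THP usage per VMA, so the
returned span may include 2MB ranges that are not currently backed by a
THP.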

The SPE profiler stays in user space deliberately: the SPE PMU is a
single-consumer resource, so a kernel consumer would lock out
user-space perf and tooling (x86 PEBS / AMD IBS have the same
property).  Keeping it in user space avoids that and keeps the metric
source pluggable, in line with DAMON-X.  This is why v2 drops the v1
SPE debugfs patch.

Testing

Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
single DAMOS address filter selecting one 2MB-aligned range:

 - Anonymous THP: the filter splits exactly that one THP --
   sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
   256MB mapping untouched.
 - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
   splits exactly one 2MB shmem THP -- sz_applied=2MB and
   ShmemPmdMapped drops by 2MB.  This confirms split_folio_to_order()
   works for shmem folios (the KVM-guest-on-THP-tmpfs case).
 - The address filter is what bounds the action: sz_tried covers the
   whole ~2GB monitored region while sz_applied is exactly the 2MB the
   filter selected.
 - A smaps-based path (for hosts without SPE) enumerates THP-backed
   ranges and splits all THPs in the target workload.
 - checkpatch clean on all 5 patches.

Test scripts and SPE-to-DAMON pipeline tools:
https://github.com/lianux-mm/damon_spe/tree/v2

Lian Wang (5):
  mm/damon: add target_order field for DAMOS_COLLAPSE
  mm/khugepaged: add damon_collapse_folio_range() for external callers
  mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
  mm/damon: introduce DAMOS_SPLIT action
  mm/damon/vaddr: implement DAMOS_SPLIT handler

 include/linux/damon.h      |  10 +++
 include/linux/khugepaged.h |   9 +++
 mm/damon/core.c            |   2 +
 mm/damon/sysfs-schemes.c   |  77 ++++++++++++++++++++++
 mm/damon/vaddr.c           | 128 +++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c            |  46 +++++++++++++
 6 files changed, 272 insertions(+)

--
2.50.1 (Apple Git-155)

* [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
@ 2026-07-02  9:52 Lian Wang
  2026-07-02  9:52 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Lian Wang @ 2026-07-02  9:52 UTC (permalink / raw)
  To: damon, linux-mm
  Cc: linux-kernel, sj, gutierrez.asier, daichaobing, lianux.wang,
	lianux.mm, kunwu.chan



end of thread, other threads:[~2026-07-02 19:56 UTC | newest]

Thread overview: 20+ messages
2026-07-02  9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
2026-07-02  9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
2026-07-02 10:01   ` sashiko-bot
2026-07-02 18:51   ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
2026-07-02 10:13   ` sashiko-bot
2026-07-02 11:07   ` Lorenzo Stoakes
2026-07-02 19:43     ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
2026-07-02 10:26   ` sashiko-bot
2026-07-02 19:56   ` SJ Park
2026-07-02  9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
2026-07-02 10:41   ` sashiko-bot
2026-07-02  9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
2026-07-02 10:49   ` sashiko-bot
2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
2026-07-02 16:52   ` SJ Park
2026-07-02 18:35 ` SJ Park
  -- strict thread matches above, loose matches on Subject: below --
2026-07-02  9:52 Lian Wang
2026-07-02  9:52 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
2026-07-02 10:02   ` sashiko-bot
