From: SJ Park <sj@kernel.org>
To: SJ Park <sj@kernel.org>
Cc: Lian Wang <lianux.mm@gmail.com>,
damon@lists.linux.dev, linux-mm@kvack.org,
daichaobing@sangfor.com.cn, kunwu.chan@gmail.com
Subject: Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
Date: Thu, 2 Jul 2026 13:50:21 -0700 [thread overview]
Message-ID: <20260702205022.93030-1-sj@kernel.org> (raw)
In-Reply-To: <20260702183551.91007-1-sj@kernel.org>
Keeping full original mail, so that Lian can answer all comments in one reply.
On Thu, 2 Jul 2026 11:35:51 -0700 SJ Park <sj@kernel.org> wrote:
> On Thu, 2 Jul 2026 17:46:28 +0800 Lian Wang <lianux.mm@gmail.com> wrote:
>
> > Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
> > be RFC PATCH v2).
> >
> > This resend also includes fixes for issues identified during review of
> > the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
> > races, BUILD_BUG guards, missing sysfs action name registration, and
> > stack allocation overflow. The series has been re-tested on aarch64
> > (anonymous and file-backed THP split) and is checkpatch clean.
> >
> > v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/
>
> Let's call it 'RFC v1'.
>
> >
> > Changes since v1
>
> Ditto.
>
> >
> > - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
> > the existing actions (per SJ's review).
> > - Drop the per-scheme hot_threshold field. Hotness policy does not
> > belong in the kernel; target selection now lives in user space and
> > is expressed to DAMOS via the address filter (per SJ's review).
> > - Drop the v1 SPE debugfs patch entirely. debugfs is not the right
> > interface for a feature, and the SPE profiler belongs in user space
> > (see "User-space target selection" below). v2 is kernel mechanism
> > only: 5 patches.
> > - Decouple T1 (a lab observation) from T2 (the production issue), and
> > correct the architecture claim: ptep_test_and_clear_young() skips
> > the TLB flush on both x86_64 and arm64, so the blind spot is
> > architecture-independent rather than arm64-only.
> > - Terminology: avoid "stale TLB". A valid TLB entry is doing its
> > job; the point is only that it lets the CPU satisfy a translation
> > without a page-table walk, so the Accessed bit cleared by DAMON is
> > not re-set.
>
> Thank you for detailed changelog. This is helpful for reviewers.
> >
> > Background
> >
> > Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
> > in play. Both are described here as motivation only; this series does
> > not change the AF monitoring path.
> >
> > T2 -- PMD-granularity inflation (production issue)
>
> I think it is better to call this T1, for readers.
>
> >
> > A 2MB THP is tracked by a single PMD-level Accessed bit. One access
> > to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
> > the entire THP as hot and cannot distinguish a genuinely hot 2MB
> > region from a 2MB region with a single hot 4KB page. Cold memory
> > hides inside "hot" THPs, and access-driven pageout/migration becomes
> > coarse.
> >
> > This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
> > hosts running Oracle. ARM SPE sampling of that workload shows 94.6%
> > of THPs have fewer than 10% of their sub-pages actually accessed.
>
> Cool finding, thank you for sharing. What DB workloads were running there?
> Real production workload? Or, synthetic benchmarks?
>
> On the first read, I was wondering how you did ARM SPE sampling. After reading
> this mail to the end, I now understand you use perf. Briefly mentioning that
> here would be nice. E.g., "ARM SPE sampling of that worklaod using perf shows
> ..."
>
> >
> > T1 -- TLB-reach blind spot (lab observation)
>
> I think it is better to call this T2, for readers.
>
> >
> > When the working set fits within L2 TLB reach (measured at 2048
> > entries x 2MB = 4GB on Kunpeng 920; no public data available), the
> > CPU satisfies translations entirely from the TLB,
> > preventing translation table walks. Because
> > ptep_test_and_clear_young() does not flush
>
> Wrapping text for the max columns is nice. But let's not wrap it early when
> there are spaces. That could reduce space, and even carbon emissions from
> people who want to read this nice cover letter after printing out on a paper.
>
> > the TLB, valid TLB entries continue to satisfy translations and the
> > AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0 for
> > memory that is in fact hot, and no scheme triggers. This reproduces
> > in the lab with small workloads; it is not something we have seen
> > reported from production, where working sets exceed TLB reach.
> >
> > What this series adds
> >
> > Rather than change AF monitoring, this series adds two order-aware
> > DAMOS actions so a policy layer can act at mTHP granularity:
>
> The background explained rooms to improve in DAMON's THP access "monitoring".
> And this patch series is proposing adding new DAMOS actions for THP "handling".
> Those are two unrelated things.
>
> I really appreciate sharing your findings with the background, but as those are
> not related to the proposal, I think it is better to be shared in a different
> way.
>
> I understand you are proposing this change because you know DAMON's hugepages
> monitoring is imperfect, but still useful enough to get some benefits. If
> there were some findings that made you to think so, that could be good
> background.
>
> Also, you may have a reason to believe it is a good idea to use larger mTHP for
> hot pages, and smaller mTHP for cold pages. If so, and the description of the
> reason is not trivial, that could be good materials to add on background.
Now I doubt if we really need two new DAMOS actions. What happens if user asks
DAMOS_COLLAPSE of a target order for region that currently being backed by an
mTHP of an order that is larger than the newly asked one? If we just ignore
the case, DAMOS_SPLIT will really nneeded. But maybe we can just split the
large folio into the newly requested order mTHPs. In this scenario,
DAMOS_SPLIT is not needed?
>
> >
> > - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios
>
> This reads like you are introducing a new DAMOS action. You indeed mentioned
> "this series adds two order-aware DAMOS actions". That's not completely wrong
> in a sense, but more technically speaking you are adding a new mode of
> DAMOS_COLLAPSE. I'd recommend rephrasing to "extend" DAMOS_COLLAPSE..
>
> > up to a chosen mTHP order. Patch 1 adds the target_order field and
> > its sysfs file; patch 2 exports a khugepaged helper
> > (damon_collapse_folio_range());
>
> So patch 2 modifies khugepaged? As Lorenzo mentioned on the other reply, that
> change should also be reviewed by THP developers on MAINTAINERS file. Please
> ensure adding THP developers to the recipients list of the patch and this cover
> letter.
>
> The patch adds damon_collapse_folio_range() to khugepaged.h. I understand
> DAMON is the only user for now, and therefore you are adding damon_ prefix to
> the name. Not necesasrily DAMON is the only user forever. And having damon_
> prefix in a land outside of DAMON feels weird. To be consistent with other
> functions like collapse_pte_mapped_thp(), I'd suggest dropping the prefix from
> the name.
>
> > patch 3 wires the vaddr handler.
> >
> > - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
> > to a chosen mTHP order via split_folio_to_order(), for both
> > anonymous and file-backed (tmpfs/shmem) folios.
> >
> > The two are complementary, not competing:
> >
> > THP=never + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
> > THP=always + DAMOS_SPLIT: start at 2MB, shrink cold regions down.
> >
> > This dual-path design aligns with ideas discussed with Asier
> > Gutierrez; we plan to unify our mTHP automation and evaluation
> > roadmaps under this standard DAMOS_SPLIT action.
> >
> > A deployment can pick either baseline, or run both, and let DAMOS
> > manage the placement. THP is still wanted for the hot working set
> > (fewer TLB misses, shallower walks); the goal is not "no THP" but
> > "THP where it is hot, small pages where it is cold."
>
> I think this is a good idea. Could you further elaborate what benefit users
> can get from this in more detail, though? Off the top of my head, I can expect
> the benefits would be 1) less TLB miss from hot data, and 2) less mTHP
> allocation failures from cold data occupying phsically contiguous memory. But
> you might showing even more benefits. Anyway I think those are better to be
> widely known by our kernel users. Some of those may better to be put on the
> background section.
>
> >
> > User-space target selection
> >
> > The decision of *which* regions to collapse or split is left to user
> > space and fed to DAMOS through the existing DAMOS address filter
> > (DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
> > The kernel provides the mechanism; user space provides the policy,
> > consistent with the perf/BPF "kernel samples, user space decides"
> > model and with the DAMON-X direction.
> >
> > Because the AF signal is unreliable at PMD granularity (T1/T2), the
> > scheme is run with min_nr_accesses=0 so it does not gate on access
> > count, and the address filter selects targets. min_nr_accesses=0 is
> > also what unblocks the T1 case, where nr_accesses is pinned at 0.
>
> Oh, so you are saying DAMON's huge pages monitoring is too problematic to use
> as-is, for your use case. That's completely fair. And that explains what you
> really want to do. But this whole pictur is better to be described earlier
> than your changes proposal.
>
> From the beginning, explain why using larger mTHP for hot pages and smaller
> mTHP for cold pages are good idea. After that, explain how DAMON can be
> extended for doing that. Then, you can further explain your T1 and T2 findings
> that explain why DAMON-only appraoch is not feasible, and how user-space target
> selection can overcome it.
>
> Also, I understand DAMON-only approach is not optimum or just useless for your
> aimed use case. But, is it completely useless for every possible use case? I
> think it might still provide some benefit in some use cases. Could you pleae
> clarify this point more in detail? If you have data showing how useless
> DAMON-only appraoch is, and how user space approach improves, it would be
> awesome.
>
> >
> > Why not just turn khugepaged off? You can, but khugepaged is global
> > and usually left enabled because other workloads rely on it; it cannot
> > be disabled per region. DAMOS_COLLAPSE gives per-region,
> > access-pattern-driven collapse -- a more precise, targeted complement
> > to khugepaged's global scan, not a replacement for it. To handle the
> > runtime race where khugepaged might aggressively re-collapse what
> > DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
> > or back-off mechanism to prevent ping-pong effects in mixed
> > environments.
>
> Good reasoning. However, khugepaged can be turned off per process, using
> prctl(). How about turning khugepaged off for the process you want to use
> DAMOS_COLLAPSE/SPLIT for?
>
> >
> > Two user-space data sources produce the candidate address ranges:
> >
> > 1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
> > histogram -> PA->VA via /proc/<pid>/pagemap -> sparse-THP VA
> > ranges. SPE reads physical addresses from the CPU pipeline,
> > bypassing the TLB and page tables, so it is immune to T1 and T2.
> >
> > 2. smaps fallback (no SPE): scan /proc/<pid>/smaps for THP-backed
> > VMAs and treat the 2MB-aligned ranges as split candidates.
> >
> > The SPE profiler stays in user space deliberately: the SPE PMU is a
> > single-consumer resource, so a kernel consumer would lock out
> > user-space perf and tooling (x86 PEBS / AMD IBS have the same
> > property). Keeping it in user space avoids that and keeps the metric
> > source pluggable, in line with DAMON-X.
>
> Maybe you are mentioning the perf events based DAMON, not DAMON-X.
>
> And I understand you plan to extend DAMON to use ARM SPE, on top of the perf
> events based DAMON as a future work. As I mentioned before, I think that makes
> perfect sense and I'm aligned. Maybe this paragraph can bit reworded to make
> it more clear, though.
>
> > This is why v2 drops the v1
> > SPE debugfs patch.
> >
> > Testing
> >
> > Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
> > using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
> > single DAMOS address filter selecting one 2MB-aligned range:
> >
> > - Anonymous THP: the filter splits exactly that one THP --
> > sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
> > 256MB mapping untouched.
> > - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
> > splits exactly one 2MB shmem THP -- sz_applied=2MB and
> > ShmemPmdMapped drops by 2MB. This confirms split_folio_to_order()
> > works for shmem folios (the KVM-guest-on-THP-tmpfs case).
> > - The address filter is what bounds the action: sz_tried covers the
> > whole ~2GB monitored region while sz_applied is exactly the 2MB the
> > filter selected.
> > - A smaps-based path (for hosts without SPE) enumerates THP-backed
> > ranges and splits all THP in the target workload.
> > - checkpatch clean on all 5 patches.
>
> So, you tested only split part, for functionality. Do you have plans to
> further test collapse part, and performance?
>
> >
> > Test scripts and SPE-to-DAMON pipeline tools:
> > https://github.com/lianux-mm/damon_spe/tree/v2
>
> Thank you for sharing the code!
>
> So, I find rooms to improve on this cover letter for the readability and
> clarity of the idea. But as I mentioned before, I like the overall idea of
> this series.
>
>
> Thanks,
> SJ
>
> [...]
Sent using hkml (https://github.com/sjp38/hackermail)
next prev parent reply other threads:[~2026-07-02 20:50 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-02 9:46 [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lian Wang
2026-07-02 9:46 ` [RESEND RFC PATCH v2 1/5] mm/damon: add target_order field for DAMOS_COLLAPSE Lian Wang
2026-07-02 10:01 ` sashiko-bot
2026-07-02 18:51 ` SJ Park
2026-07-02 9:46 ` [RESEND RFC PATCH v2 2/5] mm/khugepaged: add damon_collapse_folio_range() for external callers Lian Wang
2026-07-02 10:13 ` sashiko-bot
2026-07-02 11:07 ` Lorenzo Stoakes
2026-07-02 19:43 ` SJ Park
2026-07-02 20:32 ` SJ Park
2026-07-02 9:46 ` [RESEND RFC PATCH v2 3/5] mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler Lian Wang
2026-07-02 10:26 ` sashiko-bot
2026-07-02 19:56 ` SJ Park
2026-07-02 9:46 ` [RESEND RFC PATCH v2 4/5] mm/damon: introduce DAMOS_SPLIT action Lian Wang
2026-07-02 10:41 ` sashiko-bot
2026-07-02 20:00 ` SJ Park
2026-07-02 9:46 ` [RESEND RFC PATCH v2 5/5] mm/damon/vaddr: implement DAMOS_SPLIT handler Lian Wang
2026-07-02 10:49 ` sashiko-bot
2026-07-02 20:18 ` SJ Park
2026-07-02 10:23 ` [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions Lorenzo Stoakes
2026-07-02 16:52 ` SJ Park
2026-07-02 18:35 ` SJ Park
2026-07-02 20:50 ` SJ Park [this message]
-- strict thread matches above, loose matches on Subject: below --
2026-07-02 9:52 Lian Wang
2026-07-02 16:39 ` SJ Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260702205022.93030-1-sj@kernel.org \
--to=sj@kernel.org \
--cc=daichaobing@sangfor.com.cn \
--cc=damon@lists.linux.dev \
--cc=kunwu.chan@gmail.com \
--cc=lianux.mm@gmail.com \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox