From: SJ Park
To: SJ Park
Cc: Lian Wang, damon@lists.linux.dev, linux-mm@kvack.org, daichaobing@sangfor.com.cn, kunwu.chan@gmail.com
Subject: Re: [RESEND RFC PATCH v2 0/5] mm/damon: add mTHP collapse and split actions
Date: Thu, 2 Jul 2026 13:50:21 -0700
Message-ID: <20260702205022.93030-1-sj@kernel.org>
In-Reply-To: <20260702183551.91007-1-sj@kernel.org>

Keeping full original mail, so that Lian can answer all comments in one reply.

On Thu, 2 Jul 2026 11:35:51 -0700 SJ Park wrote:

> On Thu, 2 Jul 2026 17:46:28 +0800 Lian Wang wrote:
>
> > Resend of v2 with the RFC tag restored (v1 was RFC PATCH, so v2 should
> > be RFC PATCH v2).
> >
> > This resend also includes fixes for issues identified during review of
> > the earlier mis-sent PATCH v2 thread: uninitialized memory, TOCTOU
> > races, BUILD_BUG guards, missing sysfs action name registration, and
> > stack allocation overflow. The series has been re-tested on aarch64
> > (anonymous and file-backed THP split) and is checkpatch clean.
> >
> > v1: https://lore.kernel.org/linux-mm/20260618094838.32805-1-lianux.mm@gmail.com/
>
> Let's call it 'RFC v1'.
>
> >
> > Changes since v1
>
> Ditto.
>
> >
> > - Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT for naming consistency with
> >   the existing actions (per SJ's review).
> > - Drop the per-scheme hot_threshold field. Hotness policy does not
> >   belong in the kernel; target selection now lives in user space and
> >   is expressed to DAMOS via the address filter (per SJ's review).
> > - Drop the v1 SPE debugfs patch entirely. debugfs is not the right
> >   interface for a feature, and the SPE profiler belongs in user space
> >   (see "User-space target selection" below). v2 is kernel mechanism
> >   only: 5 patches.
> > - Decouple T1 (a lab observation) from T2 (the production issue), and
> >   correct the architecture claim: ptep_test_and_clear_young() skips
> >   the TLB flush on both x86_64 and arm64, so the blind spot is
> >   architecture-independent rather than arm64-only.
> > - Terminology: avoid "stale TLB". A valid TLB entry is doing its
> >   job; the point is only that it lets the CPU satisfy a translation
> >   without a page-table walk, so the Accessed bit cleared by DAMON is
> >   not re-set.
>
> Thank you for the detailed changelog. This is helpful for reviewers.
>
> >
> > Background
> >
> > Two effects degrade DAMON's PTE-Accessed-bit (AF) signal once THP is
> > in play.
> > Both are described here as motivation only; this series does
> > not change the AF monitoring path.
> >
> > T2 -- PMD-granularity inflation (production issue)
>
> I think it is better to call this T1, for readers.
>
> >
> > A 2MB THP is tracked by a single PMD-level Accessed bit. One access
> > to any 4KB sub-page sets the AF for the whole 2MB, so DAMON reports
> > the entire THP as hot and cannot distinguish a genuinely hot 2MB
> > region from a 2MB region with a single hot 4KB page. Cold memory
> > hides inside "hot" THPs, and access-driven pageout/migration becomes
> > coarse.
> >
> > This is the workload that drove the work: Sangfor's Kunpeng 920 KVM
> > hosts running Oracle. ARM SPE sampling of that workload shows 94.6%
> > of THPs have fewer than 10% of their sub-pages actually accessed.
>
> Cool finding, thank you for sharing. What DB workloads were running there?
> Real production workloads? Or synthetic benchmarks?
>
> On the first read, I was wondering how you did ARM SPE sampling. After reading
> this mail to the end, I now understand you use perf. Briefly mentioning that
> here would be nice. E.g., "ARM SPE sampling of that workload using perf shows
> ..."
>
> >
> > T1 -- TLB-reach blind spot (lab observation)
>
> I think it is better to call this T2, for readers.
>
> >
> > When the working set fits within L2 TLB reach (measured at 2048
> > entries x 2MB = 4GB on Kunpeng 920; no public data available), the
> > CPU satisfies translations entirely from the TLB,
> > preventing translation table walks. Because
> > ptep_test_and_clear_young() does not flush
>
> Wrapping text at the max columns is nice. But let's not wrap it early when
> there is still room on the line. That could reduce space, and even carbon
> emissions from people who want to read this nice cover letter after printing
> it out on paper.
>
> > the TLB, valid TLB entries continue to satisfy translations and the
> > AF that DAMON cleared is never re-set, so DAMON sees nr_accesses=0 for
> > memory that is in fact hot, and no scheme triggers. This reproduces
> > in the lab with small workloads; it is not something we have seen
> > reported from production, where working sets exceed TLB reach.
> >
> > What this series adds
> >
> > Rather than change AF monitoring, this series adds two order-aware
> > DAMOS actions so a policy layer can act at mTHP granularity:
>
> The background explained room for improvement in DAMON's THP access
> "monitoring". And this patch series is proposing adding new DAMOS actions for
> THP "handling". Those are two unrelated things.
>
> I really appreciate you sharing your findings in the background, but as those
> are not related to the proposal, I think it is better to share them in a
> different way.
>
> I understand you are proposing this change because you know DAMON's hugepages
> monitoring is imperfect, but still useful enough to get some benefits. If
> there were some findings that made you think so, that could be good
> background.
>
> Also, you may have a reason to believe it is a good idea to use larger mTHP for
> hot pages, and smaller mTHP for cold pages. If so, and the description of the
> reason is not trivial, that could be good material to add to the background.

Now I doubt if we really need two new DAMOS actions. What happens if a user
asks DAMOS_COLLAPSE with a target order for a region that is currently backed
by an mTHP of a larger order than the requested one? If we just ignore the
case, DAMOS_SPLIT will really be needed. But maybe we can just split the large
folio into mTHPs of the newly requested order. In that scenario, DAMOS_SPLIT
would not be needed?

>
> >
> > - DAMOS_COLLAPSE + target_order (patches 1-3): collapse small folios
>
> This reads like you are introducing a new DAMOS action. You indeed mentioned
> "this series adds two order-aware DAMOS actions".
> That's not completely wrong
> in a sense, but more technically speaking you are adding a new mode of
> DAMOS_COLLAPSE. I'd recommend rephrasing to "extend" DAMOS_COLLAPSE.
>
> >   up to a chosen mTHP order. Patch 1 adds the target_order field and
> >   its sysfs file; patch 2 exports a khugepaged helper
> >   (damon_collapse_folio_range());
>
> So patch 2 modifies khugepaged? As Lorenzo mentioned on the other reply, that
> change should also be reviewed by the THP developers listed in the MAINTAINERS
> file. Please make sure to add the THP developers to the recipients list of the
> patch and this cover letter.
>
> The patch adds damon_collapse_folio_range() to khugepaged.h. I understand
> DAMON is the only user for now, and therefore you are adding the damon_ prefix
> to the name. DAMON will not necessarily be the only user forever. And having a
> damon_ prefix in a land outside of DAMON feels weird. To be consistent with
> other functions like collapse_pte_mapped_thp(), I'd suggest dropping the prefix
> from the name.
>
> >   patch 3 wires the vaddr handler.
> >
> > - DAMOS_SPLIT + target_order (patches 4-5): split large folios down
> >   to a chosen mTHP order via split_folio_to_order(), for both
> >   anonymous and file-backed (tmpfs/shmem) folios.
> >
> > The two are complementary, not competing:
> >
> >   THP=never + DAMOS_COLLAPSE: start at 4KB, grow hot regions up.
> >   THP=always + DAMOS_SPLIT: start at 2MB, shrink cold regions down.
> >
> > This dual-path design aligns with ideas discussed with Asier
> > Gutierrez; we plan to unify our mTHP automation and evaluation
> > roadmaps under this standard DAMOS_SPLIT action.
> >
> > A deployment can pick either baseline, or run both, and let DAMOS
> > manage the placement. THP is still wanted for the hot working set
> > (fewer TLB misses, shallower walks); the goal is not "no THP" but
> > "THP where it is hot, small pages where it is cold."
>
> I think this is a good idea.
> Could you further elaborate what benefit users
> can get from this in more detail, though? Off the top of my head, I can expect
> the benefits would be 1) fewer TLB misses for hot data, and 2) fewer mTHP
> allocation failures caused by cold data occupying physically contiguous
> memory. But you might be able to show even more benefits. Anyway, I think
> those are better to be widely known by our kernel users. Some of those may be
> better placed in the background section.
>
> >
> > User-space target selection
> >
> > The decision of *which* regions to collapse or split is left to user
> > space and fed to DAMOS through the existing DAMOS address filter
> > (DAMOS_FILTER_TYPE_ADDR) -- the interface suggested during v1 review.
> > The kernel provides the mechanism; user space provides the policy,
> > consistent with the perf/BPF "kernel samples, user space decides"
> > model and with the DAMON-X direction.
> >
> > Because the AF signal is unreliable at PMD granularity (T1/T2), the
> > scheme is run with min_nr_accesses=0 so it does not gate on access
> > count, and the address filter selects targets. min_nr_accesses=0 is
> > also what unblocks the T1 case, where nr_accesses is pinned at 0.
>
> Oh, so you are saying DAMON's huge pages monitoring is too problematic to use
> as-is, for your use case. That's completely fair. And that explains what you
> really want to do. But this whole picture is better described before the
> proposed changes.
>
> From the beginning, explain why using larger mTHP for hot pages and smaller
> mTHP for cold pages is a good idea. After that, explain how DAMON can be
> extended for doing that. Then, you can further explain your T1 and T2 findings
> that show why a DAMON-only approach is not feasible, and how user-space target
> selection can overcome it.
>
> Also, I understand the DAMON-only approach is suboptimal, or even useless, for
> your target use case. But, is it completely useless for every possible use
> case?
> I think it might still provide some benefit in some use cases. Could you
> please clarify this point in more detail? If you have data showing how useless
> the DAMON-only approach is, and how much the user-space approach improves on
> it, it would be awesome.
>
> >
> > Why not just turn khugepaged off? You can, but khugepaged is global
> > and usually left enabled because other workloads rely on it; it cannot
> > be disabled per region. DAMOS_COLLAPSE gives per-region,
> > access-pattern-driven collapse -- a more precise, targeted complement
> > to khugepaged's global scan, not a replacement for it. To handle the
> > runtime race where khugepaged might aggressively re-collapse what
> > DAMOS_SPLIT just split, we are evaluating a precise VMA-level handshake
> > or back-off mechanism to prevent ping-pong effects in mixed
> > environments.
>
> Good reasoning. However, khugepaged can be turned off per process, using
> prctl(). How about turning khugepaged off for the process you want to use
> DAMOS_COLLAPSE/SPLIT for?
>
> >
> > Two user-space data sources produce the candidate address ranges:
> >
> > 1. ARM SPE (ARMv8.2+): perf record (SPE) -> per-2MB hot-fraction
> >    histogram -> PA->VA via /proc//pagemap -> sparse-THP VA
> >    ranges. SPE reads physical addresses from the CPU pipeline,
> >    bypassing the TLB and page tables, so it is immune to T1 and T2.
> >
> > 2. smaps fallback (no SPE): scan /proc//smaps for THP-backed
> >    VMAs and treat the 2MB-aligned ranges as split candidates.
> >
> > The SPE profiler stays in user space deliberately: the SPE PMU is a
> > single-consumer resource, so a kernel consumer would lock out
> > user-space perf and tooling (x86 PEBS / AMD IBS have the same
> > property). Keeping it in user space avoids that and keeps the metric
> > source pluggable, in line with DAMON-X.
>
> Maybe you mean the perf events based DAMON, not DAMON-X.
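
For readers following the pipeline in data source 1 above, here is a minimal
sketch of what the "per-2MB hot-fraction histogram" step could look like. It
assumes the SPE samples are already decoded and translated to virtual
addresses. The function names, the 4KB/2MB constants, and the 10% cutoff are
illustrative assumptions, not taken from the damon_spe tools. Note that a 2MB
range with no samples at all never appears in the result, so fully cold ranges
would need separate handling.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # assumption: 4KB base pages
PMD_SHIFT = 21   # assumption: 2MB PMD-sized THP
SUBPAGES_PER_THP = 1 << (PMD_SHIFT - PAGE_SHIFT)  # 512 sub-pages per 2MB


def hot_fractions(sampled_vaddrs):
    """Map each sampled 2MB-aligned range start to the fraction of its
    4KB sub-pages that appear in at least one sample."""
    touched = defaultdict(set)
    for addr in sampled_vaddrs:
        base = (addr >> PMD_SHIFT) << PMD_SHIFT
        touched[base].add((addr - base) >> PAGE_SHIFT)
    return {base: len(pages) / SUBPAGES_PER_THP
            for base, pages in touched.items()}


def sparse_ranges(sampled_vaddrs, max_fraction=0.10):
    """Return sorted (start, end) ranges whose hot fraction is below
    max_fraction (hypothetical cutoff); these are the split candidates."""
    return sorted((base, base + (1 << PMD_SHIFT))
                  for base, frac in hot_fractions(sampled_vaddrs).items()
                  if frac < max_fraction)
```

The (start, end) pairs returned by sparse_ranges() are the kind of ranges that
user space would then hand to a DAMOS address filter.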
>
> And I understand you plan to extend DAMON to use ARM SPE, on top of the perf
> events based DAMON, as future work. As I mentioned before, I think that makes
> perfect sense and I'm aligned. Maybe this paragraph can be reworded a bit to
> make it more clear, though.
>
> > This is why v2 drops the v1
> > SPE debugfs patch.
> >
> > Testing
> >
> > Tested on aarch64 with this series applied to 7.1.0-rc5, THP=always,
> > using a DAMOS_SPLIT scheme (target_order=2, min_nr_accesses=0) and a
> > single DAMOS address filter selecting one 2MB-aligned range:
> >
> > - Anonymous THP: the filter splits exactly that one THP --
> >   sz_applied=2MB and AnonHugePages drops by 2MB, the rest of the
> >   256MB mapping untouched.
> > - File-backed THP (tmpfs/shmem mounted huge=always): the same setup
> >   splits exactly one 2MB shmem THP -- sz_applied=2MB and
> >   ShmemPmdMapped drops by 2MB. This confirms split_folio_to_order()
> >   works for shmem folios (the KVM-guest-on-THP-tmpfs case).
> > - The address filter is what bounds the action: sz_tried covers the
> >   whole ~2GB monitored region while sz_applied is exactly the 2MB the
> >   filter selected.
> > - A smaps-based path (for hosts without SPE) enumerates THP-backed
> >   ranges and splits all THP in the target workload.
> > - checkpatch clean on all 5 patches.
>
> So, you tested only the split part, for functionality. Do you have plans to
> further test the collapse part, and performance?
>
> >
> > Test scripts and SPE-to-DAMON pipeline tools:
> > https://github.com/lianux-mm/damon_spe/tree/v2
>
> Thank you for sharing the code!
>
> So, I find room for improvement in this cover letter for the readability and
> clarity of the idea. But as I mentioned before, I like the overall idea of
> this series.
>
>
> Thanks,
> SJ
>
> [...]

Sent using hkml (https://github.com/sjp38/hackermail)