From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2351D846F;
	Fri, 19 Jun 2026 01:59:42 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781834383; cv=none; b=jLG9MGf3ycUYYLOeKTIrrw/ILEa99vUp26isnhikjJXvf2gpNQ13qdG69S7eBteAfBgE9E9zoEo/ri0E+yxyKG8F2djfimN14/A2fl/4oKHyZhBdFHtURx54KEED+EtKmLDoPKIovH3fuPS7Q8Sf3l57KzCqjGo5thK0uuTXR0w=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781834383; c=relaxed/simple;
	bh=pBTilIW+R9GWm795s2n1zmSpp+KKvzREtlIHs06iTDU=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version; b=eQfKJCBE8pyMpYtayBwQu8IqDFRrwVP6O7JR1YDpCSq/8DlLDsk+IYQizr5ws0WM2v5qxjAH4Mqa7PSLNjRgR0MbxdH7atVUWB4kmu8tjc9SNnbjCqCa69nMruF0sKoHrYxLP+FEs1RNrnbh1nSxc5IPqeNkBmQ0lDQ3NmrYcNI=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=jVOgvpQq; arc=none smtp.client-ip=100.103.45.18
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="jVOgvpQq"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id F154D1F000E9;
	Fri, 19 Jun 2026 01:59:39 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org;
	s=k20260515; t=1781834381;
	bh=7Ofmb9frWuZSVPum7XAfwUBl7J7p0AxyClhSStJ7K28=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References;
	b=jVOgvpQqTuaIoN9yLw03Z01TcQunNRa0inlrc7z8zb4Vnpvaw2feFLmgP+jo3BFqb
	 cvYeNWdTgfstWZFeJXwT5EiWt797tKvWML3w5W7SrcLgndRKXp4B+9vw14qydbvy67
	 RaEOBqELsnZSHK1f28uOVTMmG4yR5GqcR1sUweiCSU4mZegYGwzOUgY/ZerJbHscE+
	 p+jMlH44AyMw4K0trskFoJLHZsoknyAunS2uaUvrXIGgGWFmzrMakmPZ4RfSmFsG6s
	 Jcbk1AVpiyEH1pSdeJIsQ+s/S/OMpzN8c4MrEiYC7d7QG2IGsF8H1OwJCY/3wOpiHH
	 0Rwv+Oz9MHsnw==
From: SeongJae Park <sj@kernel.org>
To: SeongJae Park <sj@kernel.org>
Cc: Wang Lian <lianux.mm@gmail.com>,
	akpm@linux-foundation.org,
	npache@redhat.com,
	gutierrez.asier@huawei-partners.com,
	daichaobing@sangfor.com.cn,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	kunwu.chan@gmail.com,
	damon@lists.linux.dev
Subject: Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
Date: Thu, 18 Jun 2026 18:59:30 -0700
Message-ID: <20260619015931.9690-1-sj@kernel.org>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260619015411.9554-1-sj@kernel.org>
References: 
Precedence: bulk
X-Mailing-List: damon@lists.linux.dev
List-Id: <damon.lists.linux.dev>
List-Subscribe: <mailto:damon+subscribe@lists.linux.dev>
List-Unsubscribe: <mailto:damon+unsubscribe@lists.linux.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

+ damon@lists.linux.dev

Please Cc damon@lists.linux.dev from the next revision on, and for all DAMON
patches in the future.


Thanks,
SJ

On Thu, 18 Jun 2026 18:54:23 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Thu, 18 Jun 2026 18:47:16 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > Hello Lian,
> > 
> > On Thu, 18 Jun 2026 17:48:32 +0800 Wang Lian <lianux.mm@gmail.com> wrote:
> > 
> > > Received an off-list report that DAMON significantly overestimates
> > > hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> > > running Oracle workloads.
> > > 
> > > The root cause is structural: a PMD entry covers 512 4KB subpages with
> > > a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> > > 2MB region appears "hot" to DAMON. On ARM64,
> > 
> > This makes sense to me.  I also agree this could have caused the reported
> > problem.  And this is a known limitation of DAMON.  My suggestion for a
> > straightforward workaround is to use the 'age' information of DAMON for
> > better identification of the hot memory.
> > 
> > That is, I don't expect really hot data in real production systems to be
> > evenly scattered.  Even if it is, I don't expect all of it to be accessed
> > equally frequently.  Only a few parts would be accessed frequently for long.
> > Even then, there would be data that is accessed frequently for longer.  You
> > could show the distribution of the pattern and treat the X % of hottest
> > memory as hot.
> > 
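
A minimal sketch of the "X % of hottest memory" idea, assuming a user-space
tool already has DAMON region snapshots with their access frequency and age.
The Region type, the hottest_fraction() helper and the ranking key are
hypothetical names made up for illustration; they are not an existing DAMON
or damo interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical snapshot of one DAMON region."""
    start: int        # start address in bytes
    end: int          # end address in bytes (exclusive)
    nr_accesses: int  # observed access frequency
    age: int          # aggregation intervals the pattern has been kept

    @property
    def size(self) -> int:
        return self.end - self.start


def hottest_fraction(regions, percent):
    """Return the hottest regions covering up to `percent` % of total bytes.

    Regions are ranked by access frequency first and by age second, so
    among equally frequently accessed regions, those that stayed hot for
    longer win.  Regions with no observed access are never selected.
    """
    total = sum(r.size for r in regions)
    budget = total * percent / 100
    ranked = sorted(regions, key=lambda r: (r.nr_accesses, r.age),
                    reverse=True)
    picked, used = [], 0
    for r in ranked:
        if r.nr_accesses == 0 or used + r.size > budget:
            break
        picked.append(r)
        used += r.size
    return picked


MB = 1 << 20
regions = [
    Region(0 * MB, 2 * MB, nr_accesses=20, age=3),
    Region(2 * MB, 4 * MB, nr_accesses=20, age=50),
    Region(4 * MB, 6 * MB, nr_accesses=5, age=80),
    Region(6 * MB, 8 * MB, nr_accesses=0, age=100),
]
hot_25 = hottest_fraction(regions, percent=25)
hot_50 = hottest_fraction(regions, percent=50)
```

With a 25 % budget only the region that was accessed frequently for the
longest time is picked, even though another region shows the same access
frequency.  Using a fixed percentage instead of an absolute nr_accesses
threshold is the point of the sketch; the right ranking key would depend on
the workload.
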
> > We invented idle time percentiles [1] for a similar purpose, though it
> > focuses more on finding cold memory.
> > 
> > I understand this patch series is trying to make a more fundamental and
> > better solution on hardware that can do better.  Makes sense to me.
> > 
> > > this is compounded by the
> > > hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> > > working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> > > running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> > > subsequent accesses.
> > 
> > This makes sense to me.  However, I don't get how this is contributing to the
> > problem.  Could you please elaborate?
> > 
> > > x86 is not subject to this specific blindness under similar
> > > conditions.
> > 
> > To my understanding, the same issue exists on x86.  If the TLB hits, the
> > Accessed bit is not set, and DAMON shows it as unaccessed.  Am I missing
> > something?
> > 
> > > 
> > > We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> > > workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> > > THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> > > reports only a few hundred MB -- a 512x overestimate relative to the actual
> > > 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> > > independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> > > over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > I don't think real-world production systems have this very artificial
> > access pattern.  I believe (or, hope) use of 'age' can work around the issue
> > to a reasonable level for many cases.  I understand this setup is only for
> > PoC, and I think this is a well-designed test for the purpose.  Thank you for
> > sharing this.
> > 
> > > 
> > > To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> > > mTHP-aware via a new target_order field,
> > 
> > Makes sense, and sounds nice.  Definitely no one size fits all!
> > 
> > > and introduces a new
> > > DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> > > into smaller mTHPs
> > 
> > Nice!  Asier was planning to do similar work in future.  I think you could
> > collaborate to reduce unnecessary duplicates!
> > 
> > I'd suggest making the name simpler and consistent with DAMOS_COLLAPSE,
> > though.  Say, DAMOS_SPLIT ?
> > 
> > > when most subpages are probed as cold, and collapse them
> > > back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> > > path can incorporate fine-grained hardware feedback from ARM SPE.
> > > 
> > > The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> > > signal filter: it first identifies the peak chunk access count, and then marks
> > > sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> > > SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> > > split decision: only folios with a hot fraction below this threshold are
> > > eligible for splitting. When no SPE data is available, the infrastructure
> > > gracefully falls back to explicit PTE-level scanning via folio_walk.
> > > 
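
To check my reading of the two-pass filter and the hot_threshold decision
described above, here is a minimal user-space sketch.  It is not the code of
the patch: the function names, the use of ">=" for the 1/10-of-peak test and
the None return for the "no SPE data" case are my assumptions, and the
counts are synthetic.

```python
def hot_bitmap(access_counts, peak_divisor=10):
    """Two-pass filter over per-chunk access counts.

    Pass 1 finds the peak count.  Pass 2 marks a chunk hot when its count
    is at least peak / peak_divisor, which drops low-count sampling noise.
    """
    peak = max(access_counts, default=0)
    if peak == 0:
        return [False] * len(access_counts)
    # c >= peak / peak_divisor, written without division to avoid rounding.
    return [c * peak_divisor >= peak for c in access_counts]


def should_split(access_counts, hot_threshold_percent=30):
    """Return True if the folio looks mostly cold and is a split candidate.

    Returns None when there are no samples at all, standing in for the
    fallback path (PTE-level scanning in the cover letter).
    """
    if not access_counts or max(access_counts) == 0:
        return None
    bitmap = hot_bitmap(access_counts)
    hot_percent = 100 * sum(bitmap) / len(bitmap)
    return hot_percent < hot_threshold_percent


# Synthetic 2MB folio: 512 4KB subpages, accesses concentrated on a few.
sparse = [0] * 512
sparse[0], sparse[1], sparse[2] = 1000, 100, 99

# Synthetic folio where about 90% of the subpages are accessed equally.
dense = [1000] * 460 + [0] * 52

sparse_hot = sum(hot_bitmap(sparse))
split_sparse = should_split(sparse)
split_dense = should_split(dense)
split_nodata = should_split([0] * 512)
```

In the sparse case the subpage with 99 samples falls just below 1/10 of the
peak and is treated as noise, so 2 of 512 subpages are hot and the folio is
a split candidate.  The dense folio is kept intact.
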
> > > Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> > > through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> > 
> > So you implemented a debugfs interface?  That must be a nice approach for PoC.
> > But it may be difficult to be upstreamed as is.
> > 
> > You could build a control plane that decides the exact address ranges to
> > split, and directly feed it to DAMOS using the DAMOS address filter.
> > max_nr_snapshots can also be useful for making this kind of user-space
> > control more deterministic.
> > 
> > For simpler user-space control, utilizing the user_input DAMOS quota goal [2]
> > could be another option.
> > 
> > We are also planning [3] to extend DAMON for perf events.  On top of it, we
> > might be able to extend it further to utilize ARM SPE by DAMON itself, and do
> > all this without user-space help, using only DAMOS.
> > 
> > Based on the 'limitations' section below, I understand this is only for PoC
> > at the moment, and you plan to explore the perf event based approach.  I'd
> > also recommend that.
> > 
> > > 
> > > Collapse path (patches 1-3):
> > >   DAMON scheme action=COLLAPSE, target_order=N
> > >   -> damos_va_collapse() -> damon_collapse_folio_range()
> > >   -> collapse_huge_page()
> > > 
> > > Split path (patches 4-5):
> > >   DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> > >   -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> > >   -> split_folio_to_order()
> > > 
> > > SPE feedback infrastructure (patch 6):
> > >   perf script -> spe_hist -> debugfs spe_feed
> > >   -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> > >   -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> > > 
> > > The userspace helper tools (including the spe_hist histogram builder and
> > > validation scripts) are archived at:
> > >   https://github.com/lianux-mm/damon_spe
> > 
> > Thank you for making all the great code open!
> > 
> > > 
> > > Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> > > 7.1.0-rc5+):
> > > 
> > >   T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> > >      L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> > >      with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> > >      DAMON to function normally.
> > > 
> > >   T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> > >      THP=always: DAMON reported 8GB hot (512x vs ground truth);
> > >      THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> > >      between the two modes was ~33x.
> > > 
> > >   T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> > >      behaved normally. We could not reproduce THP inflation with RocksDB.
> > >      The workloads fundamentally vulnerable to this structural issue remain KVM
> > >      guests, JVM large heaps, and PostgreSQL shared_buffers.
> > > 
> > >   T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> > >      Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> > >      shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> > > 
> > >   T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> > >      A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> > >      concentrated across only 3 out of 512 subpages.
> > > 
> > >   End-to-end: Verified hot/cold discrimination. The SPE feed preserved a 90%
> > >      hot THP intact, while successfully splitting a 25% cold THP into 128x16KB folios.
> > > 
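
The ratios and folio counts quoted in T1, T2, T4 and the end-to-end test can
be reproduced from the sizes given in the cover letter.  The sketch below
only redoes that arithmetic.  One assumption of mine is labeled in the code:
the TLB reach figure treats every one of the 2048 L2 TLB entries as able to
hold a 2MB mapping.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

base_page = 4 * KB
pmd_size = 2 * MB
subpages_per_pmd = pmd_size // base_page        # 512

# T1: working sets vs. L2 TLB reach.
# Assumption: each of the 2048 entries can map one 2MB PMD.
l2_tlb_entries = 2048
tlb_reach = l2_tlb_entries * pmd_size           # 4 GiB
pmds_in_wss = 16 * MB // pmd_size               # 8 PMDs for the 16MB case
fits_16mb_thp = 16 * MB <= tlb_reach            # fits: blind spot
fits_16gb_thp = 16 * GB <= tlb_reach            # exceeds reach
pages_512mb = 512 * MB // base_page             # 4KB pages needed
fits_512mb_4k = pages_512mb <= l2_tlb_entries   # exceeds entry count

# T2: 8GB mapping with a 16MB hotspot.
truth = 16 * MB
hot_share_percent = 100 * truth / (8 * GB)      # about 0.2 %
thp_overestimate = 8 * GB / truth               # THP=always vs. truth
nothp_overestimate = 245 * MB / truth           # THP=never vs. truth
thp_gap = 8 * GB / (245 * MB)                   # THP=always vs. THP=never

# T4 and end-to-end: number of 16KB folios after splitting.
folios_256mb = 256 * MB // (16 * KB)
folios_per_pmd = pmd_size // (16 * KB)

print(f"TLB reach: {tlb_reach // GB} GiB, PMDs in 16MB: {pmds_in_wss}")
print(f"hot share: {hot_share_percent:.2f} %")
print(f"THP=always: {thp_overestimate:.0f}x, THP=never: "
      f"{nothp_overestimate:.1f}x, gap: {thp_gap:.1f}x")
print(f"16KB folios: {folios_256mb} per 256MB, {folios_per_pmd} per PMD")
```

The numbers match the report: 512x and about 15x against the 16MB ground
truth, about 33x between the two THP modes, 16384 folios for the 256MB
split and 128 folios per split PMD.
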
> > > Known limitations:
> > > - The full KVM + Oracle production chain has not yet been benchmarked end-to-end.
> > >   While individual component verification is complete, full integration testing
> > >   is planned in collaboration with Sangfor.
> > > - khugepaged may aggressively re-collapse the mTHPs that DAMON splits. A
> > >   coordination/back-off mechanism is required to avoid ping-pong effects.
> > 
> > Do you really need to run khugepaged together, when you already have
> > DAMOS_COLLAPSE, and you are running DAMON for hugepage splits anyway?
> > 
> > > - SPE data is currently funneled via a userspace daemon and debugfs. Direct
> > >   kernel-side perf_event sampling integration is planned as a follow-up.
> > 
> > Nice, I think this will make our projects aligned and reduce unnecessary
> > duplicates.  I'd encourage you to try this path.
> > 
> > > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are empirical
> > >   defaults subject to further tuning.
> > 
> > I don't fully understand this part.  Could you please elaborate?
> > 
> > > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing hardware-MMU
> > >   characteristic, not introduced by this series. Setting nr_accesses/min=0
> > >   serves as an effective workaround for the split path.
> > 
> > I don't fully understand this, too.  Could you please elaborate and enlighten
> > me?
> > 
> > > 
> > > Reported-by: Chaobing Dai <daichaobing@sangfor.com.cn>
> > > Cc: SeongJae Park <sj@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Nico Pache <npache@redhat.com>
> > > Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Wang Lian <lianux.mm@gmail.com>
> > > 
> > > Wang Lian (6):
> > >   mm/damon: add target_order field for DAMOS_COLLAPSE
> > >   mm/khugepaged: add damon_collapse_folio_range() for external callers
> > >   mm/damon/vaddr: implement mTHP-aware DAMOS_COLLAPSE handler
> > >   mm/damon: introduce DAMOS_MTHP_SPLIT action and hot_threshold
> > >   mm/damon/vaddr: implement DAMOS_MTHP_SPLIT handler
> > >   mm/damon: add SPE feedback for sub-THP split decisions
> > > 
> > >  include/linux/damon.h      |  18 ++
> > >  include/linux/khugepaged.h |   3 +
> > >  mm/damon/Kconfig           |  12 +
> > >  mm/damon/Makefile          |   1 +
> > >  mm/damon/core.c            |   3 +
> > >  mm/damon/spe.c             | 505 +++++++++++++++++++++++++++++++++++++
> > >  mm/damon/spe.h             |  62 +++++
> > >  mm/damon/sysfs-schemes.c   |  96 +++++++
> > >  mm/damon/vaddr.c           | 118 +++++++++
> > >  mm/khugepaged.c            |  39 +++
> > >  10 files changed, 857 insertions(+)
> > >  create mode 100644 mm/damon/spe.c
> > >  create mode 100644 mm/damon/spe.h
> > 
> > Because this is an RFC and we found a high-level TODO (trying a perf event
> > based approach instead of debugfs), I will skip reviewing the details.  If
> > there are specific parts for which you want my detailed review, let me know.
> > 
> > Also, the perf event based monitoring is a long term project.  The ETA is
> > LSFMMBPF'27.  If you cannot wait until then, trying the alternative
> > approaches (using address filter or user_input quota goal) and upstreaming
> > the dependent parts (DAMOS_COLLAPSE extension for mTHP and DAMOS_SPLIT) first
> > could also be a nice approach, in my opinion.
> > 
> > [1] https://origin.kernel.org/doc/html/latest/admin-guide/mm/damon/stat.html#memory-idle-ms-percentiles
> > [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> > [3] https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org/
> 
> The above link ([3]) is wrong, sorry.  Please use below.
> 
> [3] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/
> 
> 
> Thanks,
> SJ
> 
> [...]
> 

Sent using hkml (https://github.com/sjp38/hackermail)