From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2D605326941
	for <linux-kernel@vger.kernel.org>; Fri, 19 Jun 2026 01:52:46 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781833967; cv=none; b=H7xHXCOggpXj6kEoSsEba2YJ995hQPlkHHEnmmUTpwY9HxqA4f6Wb+VyB/E8VSsFiJlC9pVSv5/rRunUF4ZZeCdkX1+wW7FVKvdstqAY4DbjjVsyMxFRD1lySYq7esql6w8inmTzLI4ESKhEOZPvI0nQtx8PQEFZ0rFaEE992hg=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781833967; c=relaxed/simple;
	bh=8J7f5h3lktGWVDM0J/4nd7QTecoGWOIBR6YQ79SBCWg=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type; b=VbN/Xo4rvQDCMldW/yHWHX+JzwKRLRS5sXU+jz7y+J+F0+nXEvH2/mcXSgXzQxhVZG8D+/1t63ZzOuRo5rtSTei2Sx2PU4r1hRh1mDq+EuWUONOdvhJrERg77K4tEy3ncpGknUc32jwiL1YiHe/qYzAAMbbA6EuU9o7NJvm6zbQ=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=jQZUmupQ; arc=none smtp.client-ip=100.103.45.18
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="jQZUmupQ"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1B6361F000E9;
	Fri, 19 Jun 2026 01:52:45 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org;
	s=k20260515; t=1781833965;
	bh=VQ2qTSixByWjQJk/3F+mblT/opWAMFOv/2NZkPQXDWM=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References;
	b=jQZUmupQ80V2l/ML0oH0lCIJJe9pytlpzau+naiisWrC+YRKoAisfNy3MDrTIkm9j
	 d46qjLGDOCLkerarAA/UKGZL2ljFmGxbhZpmGxu8IhymeLc4bpmRoYjZGJDuP0a+Yk
	 IOyyc7Cru8rnqUIWDCto6AGGeIxMPH0vre+v/F6QMy0xTKIfWWyJ9N2OnhdQbB4i1H
	 lJGsZ9CrwMRIGTXp9qP3TMxxuP1FpKu4JJ3ZPPygo+pwonD2SGo+6beUUTW8GzGnI9
	 WWYKVJuJHiEb4EetEodmDZVyEORoQAZLf7IEZk0Cnb9YVyjpJzNzqmBscBiaZB7b26
	 GzTB0Ca63X4BQ==
From: SeongJae Park <sj@kernel.org>
To: wang lian <lianux.mm@gmail.com>
Cc: SeongJae Park <sj@kernel.org>,
	Gutierrez Asier <gutierrez.asier@huawei-partners.com>,
	akpm@linux-foundation.org,
	npache@redhat.com,
	daichaobing@sangfor.com.cn,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	kunwu.chan@gmail.com
Subject: Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback
Date: Thu, 18 Jun 2026 18:52:39 -0700
Message-ID: <20260619015241.9432-1-sj@kernel.org>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <459C0876-AC37-4A52-BF11-6436FF33CA90@gmail.com>
References: 
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On Thu, 18 Jun 2026 21:13:07 +0800 wang lian <lianux.mm@gmail.com> wrote:

> 
> 
> > On Jun 18, 2026, at 19:03, Gutierrez Asier <gutierrez.asier@huawei-partners.com> wrote:
> > 
> > Hi Wang,
> > 
> > On 6/18/2026 12:48 PM, Wang Lian wrote:
> >> Received an off-list report that DAMON significantly overestimates
> >> hot memory in KVM/QEMU deployments with THP-backed tmpfs guest memory
> >> running Oracle workloads.
> >> 
> >> The root cause is structural: a PMD entry covers 512 4KB subpages with
> >> a single Access Flag (AF) bit. When any one subpage is accessed, the entire
> >> 2MB region appears "hot" to DAMON. On ARM64, this is compounded by the
> >> hardware AF mechanism -- the AF is only set on a TLB miss. Consequently, when the
> >> working set fits entirely within the L2 TLB (e.g., a 16MB working set with 2MB THP
> >> running on a Kunpeng 920's 2048-entry L2 TLB), DAMON becomes completely blind to
> >> subsequent accesses. x86 is not subject to this specific blindness under similar
> >> conditions.
> > 
> > Have you tried setting the minimum region size to 2MB?
> > 
> >> We reproduced this memory inflation on a Kunpeng 920 platform using a synthetic
> >> workload (8GB mmap with a 0.2% sparse hotspot, i.e. 16MB actually hot):
> >> THP=always causes DAMON to report the entire 8GB as hot, while THP=never
> >> reports only a few hundred MB -- a 512x overestimate relative to the actual
> >> 16MB hotspot under THP, and a ~33x gap between the two THP modes. ARM SPE hardware profiling
> >> independently confirms this asymmetry: out of 2,005 THPs sampled system-wide
> >> over 10 seconds, 97% had fewer than 10% of their 4KB subpages actually accessed.
> > 
> > THP always will just collapse the entire PID into huge pages anyway. This
> > is outside DAMON's control.
> > 
> > Have you tried setting THP to never and running DAMON with DAMON_COLLAPSE
> > action?
> > 
> >> To mitigate this, this series extends the existing DAMOS_COLLAPSE action to be
> >> mTHP-aware via a new target_order field, and introduces a new
> >> DAMOS_MTHP_SPLIT action. This enables DAMON to proactively split PMD THPs
> >> into smaller mTHPs when most subpages are probed as cold, and collapse them
> >> back when beneficial. To resolve the sub-PMD monitoring blindness, the split
> >> path can incorporate fine-grained hardware feedback from ARM SPE.
> >> The hardware feedback loop (damon_spe_folio_heatmap) implements a two-pass
> >> signal filter: it first identifies the peak chunk access count, and then marks
> >> sub-chunks with >= 1/10 of the peak count as hot, effectively filtering out
> >> SPE sampling noise. A configurable hot_threshold (default 30%) controls the
> >> split decision: only folios with a hot fraction below this threshold are
> >> eligible for splitting. When no SPE data is available, the infrastructure
> >> gracefully falls back to explicit PTE-level scanning via folio_walk.
> >> 
> >> Currently, SPE data is fed from userspace via debugfs (e.g., perf script piped
> >> through a histogram builder into /sys/kernel/debug/damon/spe_feed).
> >> 
> >> Collapse path (patches 1-3):
> >>  DAMON scheme action=COLLAPSE, target_order=N
> >>  -> damos_va_collapse() -> damon_collapse_folio_range()
> >>  -> collapse_huge_page()
> >> 
> >> Split path (patches 4-5):
> >>  DAMON scheme action=MTHP_SPLIT, target_order=N, hot_threshold=M
> >>  -> damos_va_mthp_split() -> damon_spe_hot_fraction()
> >>  -> split_folio_to_order()
> >> 
> >> SPE feedback infrastructure (patch 6):
> >>  perf script -> spe_hist -> debugfs spe_feed
> >>  -> per-folio rbtree {THP-aligned PFN -> access_count[512]}
> >>  -> damon_spe_folio_heatmap() -> hot_bitmap -> split decision
> >> 
> >> The userspace helper tools (including the spe_hist histogram builder and
> >> validation scripts) are archived at:
> >>  https://github.com/lianux-mm/damon_spe
> >> 
> >> Testing was performed on a Kunpeng 920 system (256 cores, 249GB RAM, base kernel
> >> 7.1.0-rc5+):
> >> 
> >>  T1 ARM64 blind spot: A 16MB THP workload (where 8 PMDs fit entirely within the
> >>     L2 TLB) resulted in DAMON detecting 0 regions. Conversely, using 512MB
> >>     with 4KB base pages, or a 16GB THP layout (exceeding L2 TLB reach), allowed
> >>     DAMON to function normally.
> >> 
> >>  T2 THP inflation: With an 8GB mmap and 16MB actually hot (0.2%),
> >>     THP=always: DAMON reported 8GB hot (512x vs ground truth);
> >>     THP=never: ~245MB (15x vs ground truth).  The THP-induced gap
> >>     between the two modes was ~33x.
> >> 
> >>  T3 RocksDB: Fragmented malloc allocation prevented THP formation, and DAMON
> >>     behaved normally. We could not reproduce THP inflation with RocksDB.
> >>     The workloads fundamentally vulnerable to this structural issue remain KVM
> >>     guests, JVM large heaps, and PostgreSQL shared_buffers.
> >> 
> >>  T4 min=0 deadlock break: A 256MB THP induced the DAMON blind spot.
> >>     Triggering an unconditional mthp_split (via nr_accesses/min=0) successfully
> >>     shattered the space into 16384x16KB folios, allowing DAMON to fully recover.
> >> 
> >>  T5 ARM SPE histogram: Out of 2005 sampled THPs, 97% exhibited <10% hot subpages.
> >>     A typical trace showed PFN 0x820db800 accumulated 39,794 hardware accesses
> >>     concentrated across only 3 out of 512 subpages.
> > The SPE stuff fits SeongJae's goals for DAMON-X, I think. Maybe this is something
> > we should keep in the user space and let the kernel provide only the API to add
> > different metrics, including PMU and SPE.
> 
> Hi Asier,
> 
> Thanks for your prompt and constructive reply. I really appreciate your 
> detailed analysis of the mTHP and SPE interaction.

Indeed, very helpful comments.  Thank you, Asier!

> 
> Your point regarding the design boundary—whether this fits better in 
> user space or aligned with DAMON-X—is highly valuable. 

Actually, Asier is referring to the perf event-based monitoring extension [1].
DAMON-X [2] is a separate project.

> 
> Since SeongJae (SJ) will look into this thread tomorrow, let us sync up 
> then. I look forward to cooperating with both of you to refine this 
> design and find the best architectural fit for the subsystem.

As I replied, I'd also prefer this to be aligned with the perf event-based
extension roadmap.

[1] https://lore.kernel.org/all/20260525225208.1179-1-sj@kernel.org/
[2] https://lwn.net/Articles/1071256/


Thanks,
SJ

[...]