From: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
To: sj@kernel.org, akinobu.mita@gmail.com, damon@lists.linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org
Cc: akpm@linux-foundation.org, corbet@lwn.net, bijan311@gmail.com,
ajayjoshi@micron.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
ravis.opensrc@gmail.com
Subject: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports
Date: Fri, 29 May 2026 09:56:34 -0700
Message-ID: <20260529165640.820-1-ravis.opensrc@gmail.com>
This series introduces a vendor- and PMU-agnostic substrate inside DAMON
that consumes hardware-sampled access reports through the standard
perf-event interface. Userspace selects the PMU through sysfs (raw
type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
IBS Op sampling.
Why a unified perf-event substrate
An earlier hardware-sampled access-monitoring proposal [1] took an
AMD IBS-specific path: a module backend that owned its own probe
configuration, sysfs knobs, and lifecycle.
SeongJae Park has previously highlighted the advantage of Akinobu
Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
events and consume samples from any sampling PMU that perf core knows
about. This series builds on that direction with the changes we
needed to run it cross-vendor:
- a per-CPU lockless ring between the NMI sample handler and the
kdamond drain,
- per-CPU events that follow CPU hotplug cleanly,
- events fire only while the monitor is running -- created disabled,
armed when kdamond starts, disarmed and drained when it stops,
- all-or-nothing init across CPUs: a partial-CPU create failure rolls
the whole event back rather than leaving silent gaps,
- safe handling of vendor sample-validity flags so a stale or
unpopulated address is never mistaken for a valid sample.
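To make the first bullet concrete, here is a minimal userspace sketch
of a single-producer/single-consumer ring using C11 atomics. It is not
the code from patch 4: the names (`spsc_ring`, `access_report`,
`ring_push`, `ring_drain`) and the slot count are hypothetical, and the
kernel version would use per-CPU storage and kernel memory-barrier
primitives instead of <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u	/* hypothetical size; must be a power of two */

/* Hypothetical report layout; the series' real struct may differ. */
struct access_report {
	uint64_t addr;
};

struct spsc_ring {
	_Atomic uint32_t head;	/* written only by the producer */
	_Atomic uint32_t tail;	/* written only by the consumer */
	struct access_report slots[RING_SLOTS];
};

/* Producer side (stands in for the NMI overflow handler): never blocks. */
static bool ring_push(struct spsc_ring *r, struct access_report rep)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SLOTS)
		return false;	/* full: drop the sample instead of waiting */
	r->slots[head & (RING_SLOTS - 1)] = rep;
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side (stands in for the kdamond drain): copies out up to max. */
static size_t ring_drain(struct spsc_ring *r, struct access_report *out,
			 size_t max)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	size_t n = 0;

	while (tail != head && n < max)
		out[n++] = r->slots[tail++ & (RING_SLOTS - 1)];
	/* Hand the consumed slots back to the producer. */
	atomic_store_explicit(&r->tail, tail, memory_order_release);
	return n;
}
```

Because each index has exactly one writer, neither side takes a lock,
and a full ring costs the producer one dropped sample rather than a
wait, which is the property an NMI-context producer needs.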
What the series adds
Patch 1 introduces the substrate's data types: a per-event
configuration struct and a per-context list to hang them on. A
CONFIG_PERF_EVENTS=n build folds to no-op stubs.
Patch 2 exposes those types through sysfs. Each entry maps to one
perf event and lets userspace pick the PMU and how to sample it: the
raw PMU type/config, addressing flags, and period or frequency. The
defaults are tuned for Intel PEBS; userspace overrides them for other
PMUs.
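As a rough illustration of what such an entry carries, the sketch below
maps the knob names used in the setup listings later in this letter
onto a `struct perf_event_attr` from <linux/perf_event.h>. The
`pe_knobs` struct and `knobs_to_attr()` are hypothetical, and the choice
of `sample_type` bits is an assumption; the mapping the series actually
performs lives in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical mirror of the sysfs knobs named in this cover letter. */
struct pe_knobs {
	uint32_t type;
	uint64_t config;
	bool sample_phys_addr;
	bool freq;
	uint64_t sample_period;	/* used when freq is false */
	uint64_t sample_freq;	/* used when freq is true */
	unsigned int precise_ip;
	uint32_t wakeup_events;
	bool exclude_kernel;
	bool exclude_hv;
};

static void knobs_to_attr(const struct pe_knobs *k,
			  struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = k->type;
	attr->config = k->config;
	/* Assumption: one address flavor is requested per event. */
	attr->sample_type = k->sample_phys_addr ? PERF_SAMPLE_PHYS_ADDR
						: PERF_SAMPLE_ADDR;
	attr->freq = k->freq;
	if (k->freq)
		attr->sample_freq = k->sample_freq;
	else
		attr->sample_period = k->sample_period;
	attr->precise_ip = k->precise_ip & 3;	/* 2-bit field */
	attr->wakeup_events = k->wakeup_events;
	attr->exclude_kernel = k->exclude_kernel;
	attr->exclude_hv = k->exclude_hv;
	attr->disabled = 1;	/* created disabled, armed when kdamond starts */
}
```

In `struct perf_event_attr`, `sample_period` and `sample_freq` share
storage, so the `freq` bit decides which interpretation the perf core
applies.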
Patch 3 wires the sysfs apply path so configured events get attached
to the running monitoring context.
Patch 4 is the core of the series. It replaces the mutex-protected
report queue with a per-CPU lockless ring fed from NMI by the perf
overflow handler and drained once per sample tick by the kdamond.
Drained reports are matched to monitored regions by binary search
over a per-tick snapshot. The patch also wires the per-event
lifecycle into kdamond: events arm when the monitor starts, disarm
and drain when it stops, and roll back cleanly when per-CPU init
fails on some CPUs. A second context that asks for the substrate
while it is in use is rejected with -EBUSY.
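The region-matching step can be sketched as a binary search over a
sorted array of address ranges. The `region_snap` type and
`find_region()` below are hypothetical stand-ins, and the sketch assumes
the snapshot is sorted by start address with non-overlapping,
half-open [start, end) ranges; patch 4 may differ in all of these
details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot entry: one monitored region, [start, end). */
struct region_snap {
	uint64_t start;
	uint64_t end;
};

/*
 * Return the index of the region containing addr, or -1 if addr falls
 * in a gap. Assumes snap[] is sorted by start and does not overlap.
 */
static long find_region(const struct region_snap *snap, size_t nr,
			uint64_t addr)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < snap[mid].start)
			hi = mid;	/* look in the lower half */
		else if (addr >= snap[mid].end)
			lo = mid + 1;	/* look in the upper half */
		else
			return (long)mid;
	}
	return -1;
}
```

Each drained report then costs O(log n) in the number of regions, and
samples that land outside every monitored range are discarded.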
Patch 5 is the perf-event backend. Two stateless overflow handlers
(one vaddr-keyed, one paddr-keyed) are picked at event creation time
and submit samples into the per-CPU ring. Vendor-specific sample
validity is honored at this layer.
Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
evaluation so userspace can watch goal convergence without polling
sysfs.
Userspace setup model
Userspace selects the sampling PMU by pointing the perf event's
`type` / `config` at it, and chooses the scheme topology that suits
the address space the PMU reports on. No module load or unload step
is involved; `echo on > state` arms the substrate, `echo off > state`
disarms it.
Two configurations were used for validation.
Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
IBS Op stamps samples with physical addresses, so DAMON reasons over
every backing page in the system regardless of which task or guest
touched it -- the substrate becomes a system-wide tiering controller.
Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
echo 1 > $D/contexts/nr_contexts
echo paddr > $D/contexts/0/operations
# Two regions, one per NUMA node (DRAM + CXL). PA ranges
# are derived per host from /proc/iomem; omitted here.
echo 1 > $D/contexts/0/targets/nr_targets
echo 2 > $D/contexts/0/targets/0/regions/nr_regions
echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
echo <CXL_LO> > $D/contexts/0/targets/0/regions/1/start
echo <CXL_HI> > $D/contexts/0/targets/0/regions/1/end
# IBS Op event, period-based, paddr-stamped:
PE=$D/contexts/0/monitoring_attrs/sample/perf_events
echo 1 > $PE/nr_perf_events
echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
echo 0 > $PE/0/config
echo 1 > $PE/0/sample_phys_addr
echo 0 > $PE/0/freq
echo 262144 > $PE/0/sample_period
echo 0 > $PE/0/exclude_kernel
echo 0 > $PE/0/exclude_hv
# PULL scheme: migrate_hot toward DRAM, gated on
# node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
# addr filter restricts source to the CXL range.
# PUSH scheme: migrate_hot toward CXL, gated on
# node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
# addr filter restricts source to the DRAM range.
# Both schemes are migrate_hot; they converge from opposite
# directions on the same hot working set.
echo on > $D/state
Userspace tunes the steady-state DRAM:CXL split by writing the goal
`target_value`s; DAMON's quota autotuner drives migration intensity
to match.
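The two goal values in the scheme comments above are complementary in
basis points (10000 bp = 100%). A small sketch of that arithmetic, with
hypothetical names (`goal_pair`, `split_goals`) that do not come from
the series:

```c
#include <assert.h>

#define BP_FULL 10000u	/* 10000 basis points == 100% */

struct goal_pair {
	unsigned int pull_target_bp;	/* goal for nid=DRAM (PULL scheme) */
	unsigned int push_target_bp;	/* goal for nid=CXL (PUSH scheme) */
};

/* Returns 0 on success, -1 if dram_bp exceeds 100%. */
static int split_goals(unsigned int dram_bp, struct goal_pair *out)
{
	if (dram_bp > BP_FULL)
		return -1;
	out->pull_target_bp = dram_bp;
	out->push_target_bp = BP_FULL - dram_bp;
	return 0;
}
```

For example, a TARGET_BP of 7000 yields a PULL goal of 7000 bp and a
PUSH goal of 3000 bp, so the two schemes aim at the same 70:30 split
from opposite sides.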
Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
multichase multiload threads each touching a 4 GiB working set
(~128 GiB aggregate) with the memcpy-libc kernel. The guest sees
a flat single-NUMA layout and has no direct view of the host's
tiering topology, yet its hot pages are migrated to DRAM and cold
pages pushed to CXL by host-side DAMON acting on IBS-stamped
physical addresses -- the application inside the guest benefits
from tiering it never had to be aware of. Validated on AMD Turin
(132-CPU EPYC). The configuration converged to its target ratio
in seconds and remained stable for 7+ hours continuously, with no
perf core auto-throttle and no measurable drift in the achieved
interleave ratio.
Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
PEBS reports vaddr samples in the context of the running task.
DAMON's vaddr ops monitors a specific PID.
Setup (abridged; `D` and `PE` as in Configuration A):
echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
echo 1 > $D/contexts/nr_contexts
echo vaddr > $D/contexts/0/operations
echo 1 > $D/contexts/0/targets/nr_targets
echo $PID > $D/contexts/0/targets/0/pid_target
echo 0 > $D/contexts/0/targets/0/regions/nr_regions
# PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
echo 1 > $PE/nr_perf_events
echo 4 > $PE/0/type # PERF_TYPE_RAW
echo 0x20d1 > $PE/0/config # umask=0x20 event=0xd1
echo 0 > $PE/0/sample_phys_addr
echo 1 > $PE/0/freq
echo 5003 > $PE/0/sample_freq
echo 2 > $PE/0/precise_ip
echo 1 > $PE/0/wakeup_events
# Single migrate_hot scheme with two weighted destinations
# (DRAM + CXL). Userspace tunes the steady-state interleave by
# writing dests/{0,1}/weight.
echo on > $D/state
Workload: 32 multichase multiload threads with a 4 GiB working set
each (~128 GiB aggregate) running directly on the host, monitored
by DAMON via the multiload PID. Validated on Intel Granite Rapids
(144-CPU). Convergence is fast and the system is stable.
[1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
[2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
Ravi Jonnalagadda (6):
mm/damon: add struct damon_perf_event{,_attr} and per-ctx perf_events
list
mm/damon/sysfs-sample: expose perf_events configuration via sysfs
mm/damon/sysfs: install perf_events on apply
mm/damon/core: per-CPU SPSC ring drain and damon_perf_event lifecycle
mm/damon/vaddr: implement perf-event access check
mm/damon: add damos_node_eligible_mem_bp tracepoint
include/linux/damon.h | 80 +++++
include/trace/events/damon.h | 49 +++
mm/damon/core.c | 403 ++++++++++++++++++++----
mm/damon/ops-common.h | 39 +++
mm/damon/sysfs-common.h | 6 +
mm/damon/sysfs-sample.c | 579 +++++++++++++++++++++++++++++++++++
mm/damon/sysfs.c | 3 +
mm/damon/vaddr.c | 267 ++++++++++++++++
8 files changed, 1370 insertions(+), 56 deletions(-)
base-commit: 4c8ad15abf15eb480d3ad85f902001e35465ef18
--
2.43.0