* [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring
@ 2025-07-27 20:18 SeongJae Park
2025-07-27 20:18 ` [RFC v2 1/7] mm/damon/core: introduce damon_report_access() SeongJae Park
` (7 more replies)
0 siblings, 8 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jann Horn, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Pedro Falcato, Suren Baghdasaryan, Vlastimil Babka, damon,
kernel-team, linux-kernel, linux-mm
TL;DR: Extend the DAMON interface between core and operation sets for
operation set driven report-based monitoring such as per-CPU and
write-only access monitoring. Further introduce an example physical
address space monitoring operation set that uses page faults as the
source of the information.
Background
----------
Existing DAMON operations set implementations, namely paddr, vaddr, and
fvaddr, use Accessed bits of page tables as the main source of the
access information. Accessed bits have some restrictions, though. For
example, it cannot tell which CPU or GPU made the access, whether the
access was read or write, and which part of the mapped entity was really
accessed.
Depending on the use case, the limitations can be problematic. Because
the issue stems from the nature of the page table Accessed bit,
utilizing access information from different sources can mitigate the
issue. Page faults, memory access instruction sampling interrupts,
system calls, or
any information from other kernel space friends such as subsystems or
device drivers of CXL or GPUs could be examples of the different
sources.
DAMON separates its core and operation set layer for easy extensions.
The operation set layer does the low level access information
handling, and the core layer handles more high level work such as the
region-based overhead/accuracy control and access-aware system
operations. Hence we can extend DAMON to use the different sources by
implementing and using another DAMON operations set. The core layer
features will still be available with the new sources, without
additional changes.
Nevertheless, the current interface between the core and the operation
set layers is optimized for the Accessed bits case. Specifically, the
interface asks the operations set if a given part of memory has been
accessed or not in a given time period (last sampling interval). It is
easy for the Accessed bit use case, since the information is stored in
page tables. The operations set can simply read the current value of the
Accessed bit.
For some sources other than Accessed bits, such as page faults or
instruction sampling interrupts, the operations set may need to collect
and keep the access information in its internal memory until the core
layer asks for it. Only after answering the question can the
information be dropped.
Implementing such operation set internal memory management would not be
trivial. Also, it could end up with multiple similar operation set
implementations each having their own, unnecessarily duplicated,
internal memory management code.
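For a rough illustration, each such operations set would need to carry
something like below (a hypothetical sketch; every name in it is
illustrative, and the next section describes the centralized
alternative):

    /* hypothetical per-operations-set report buffering */
    #define MY_OPS_REPORTS_CAP 1000

    struct my_ops_access_event {
            unsigned long addr;
            unsigned long size;
    };

    static DEFINE_MUTEX(my_ops_reports_lock);
    static struct my_ops_access_event my_ops_reports[MY_OPS_REPORTS_CAP];
    static int my_ops_reports_len;

    /* called from the event source, e.g., a fault handling context */
    static void my_ops_record_access(unsigned long addr, unsigned long size)
    {
            mutex_lock(&my_ops_reports_lock);
            if (my_ops_reports_len < MY_OPS_REPORTS_CAP) {
                    my_ops_reports[my_ops_reports_len].addr = addr;
                    my_ops_reports[my_ops_reports_len].size = size;
                    my_ops_reports_len++;
            }
            mutex_unlock(&my_ops_reports_lock);
    }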
Core Layer Changes for Reporting-based Monitoring
-------------------------------------------------
Avoid such duplicated effort by updating the DAMON core layer to
support real time access reporting. The updated interface allows
operations set implementations to report (or, push) their information to
the core layer, on their preferred schedule. The DAMON core layer will
handle the reports by managing metadata and updating the final
monitoring results (DAMON regions) accordingly.
Also add another operations set callback to determine if a given access
report is eligible to be used for a given operations set. For example,
if the operations set implementation is for monitoring only a specific CPU
or writes, the operations set could ask the core layer to ignore
reported accesses that were made by other CPUs, or were made for reads.
paddr_fault: Page Faults-based Physical Address Space Access Monitoring
-----------------------------------------------------------------------
Using the core layer changes, implement a new DAMON operation set,
namely paddr_fault. It is the same as the page table Accessed bits
based physical address space monitoring, but uses page faults as the
source of the access information.
Specifically, it installs PAGE_NONE protection on access sampling pages
from the damon_operations->prepare_access_checks() callback. Then, it
captures the next access to the page in the page fault handling
context, and directly reports the findings to DAMON, using
damon_report_access().
For the PAGE_NONE protection use case, introduce a new
change_protection() flag, namely MM_CP_DAMON. To avoid interfering with
NUMA_BALANCING, the page fault handler invokes the fault handling logic
of either DAMON or NUMA_BALANCING, depending on whether NUMA_BALANCING
is enabled.
This operation set only serves as an example of how
damon_report_access() can be used with multiple sources of the
information, and allows easy testing. It is not meant to be merged into
the mainline as is. I'm currently planning to further develop it for
per-CPU access monitoring by the final version of this patch series.
How Per-CPU or Write-only Monitoring Can Be Implemented
-------------------------------------------------------
The paddr_fault can be extended for per-CPU or write-only monitoring.
We can get the access source CPU, or whether it was a write access, from
the page fault information, and put that into the DAMON report (struct
damon_access_report). The damon_access_report struct would need to be
extended with a few fields for storing the information. Then we can
make a new DAMON operation set that is similar to paddr_fault, but
checks the eligibility of each access report based on the CPU or write
information. Of course, extending the existing operation set could also
be an option. Then accesses made by CPUs of no interest, or reads, can
be ignored, and users can see the per-CPU or write-only accesses using
DAMON.
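For illustration, the extension could look like below. This is only a
hypothetical sketch: the cpu and write fields and the example callback
are illustrative additions that are not part of this series.

    /* hypothetical extension of struct damon_access_report */
    struct damon_access_report {
            struct pid *pid;
            unsigned long addr;
            unsigned long size;
            int cpu;        /* hypothetical: CPU that made the access */
            bool write;     /* hypothetical: whether it was a write */
    };

    /* hypothetical eligibility check for a write-only operations set */
    static bool damon_wo_eligible_report(struct damon_access_report *report,
                    struct damon_target *t)
    {
            return report->write;
    }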
Expected Users: Scheduling, VM Live Migration and NUMA Page Migrations
----------------------------------------------------------------------
We have ongoing off-list discussions of expected use cases of this patch
series. We expect this patch series can be used for implementing
per-CPU access monitoring, which can be useful for L3 cache
utilization-aware thread/process scheduling. Yet another expected use
case is write-only monitoring, for finding VM instances that are easier
live migration targets.
Also I believe this can be extended to not only per-CPU monitoring but
any access entities, including GPU-like accelerators, which expose their
memory as NUMA nodes in some setups. With that, I think we could make a
holistic and efficient access-aware NUMA page migration system.
Patches Sequence
----------------
The first patch introduces damon_report_access(), which any sleepable
kernel code can use to report access information on its own schedule.
The second patch adds a DAMON core-operations set interface for
ignoring specific types of data access reports for the given operations
set configuration. The third patch further implements the report
eligibility check logic for vaddr. The fourth patch updates the core
layer to really use the reported access information for making the
monitoring results (DAMON regions).
The fifth patch implements a new change_protection() flag, MM_CP_DAMON,
and its fault handling logic for reporting the access to DAMON. The
sixth patch implements a new page faults based physical address space
access monitoring operation set, namely paddr_fault, using MM_CP_DAMON.
Finally, the seventh patch updates DAMON sysfs interface to support
paddr_fault.
Plan for Dropping RFC
---------------------
This patch series is an RFC for early sharing of the idea, which was
also shared at the last LSFMMBPF[1] as the 'damon_report_access()' API
plan. We
will further optimize the core layer implementation and add one or more
real operations set implementations that utilize the report-based
interface, by the final version of this patch series.
Of course, concerns raised on the RFCs should be addressed.
Revision History
----------------
Changes from RFC v1
(https://lore.kernel.org/20250629201443.52569-1-sj@kernel.org)
- Fixup report reading logic for access absence accounting
- Implement page faults based operations set (paddr_fault)
[1] https://lwn.net/Articles/1016525/
SeongJae Park (7):
mm/damon/core: introduce damon_report_access()
mm/damon/core: add eligible_report() ops callback
mm/damon/vaddr: implement eligible_report()
mm/damon/core: read received access reports
mm/memory: implement MM_CP_DAMON
mm/damon: implement paddr_fault operations set
mm/damon/sysfs: support paddr_fault
include/linux/damon.h | 34 ++++++++++++++
include/linux/mm.h | 1 +
mm/damon/core.c | 101 ++++++++++++++++++++++++++++++++++++++++++
mm/damon/paddr.c | 77 +++++++++++++++++++++++++++++++-
mm/damon/sysfs.c | 4 ++
mm/damon/vaddr.c | 7 +++
mm/memory.c | 53 +++++++++++++++++++++-
mm/mprotect.c | 5 +++
8 files changed, 279 insertions(+), 3 deletions(-)
base-commit: 3452e05f01b2a3dd126bd08961cc0df8daa5beee
--
2.39.5
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC v2 1/7] mm/damon/core: introduce damon_report_access()
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-27 20:18 ` [RFC v2 2/7] mm/damon/core: add eligible_report() ops callback SeongJae Park
` (6 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
The DAMON core layer asks the operations set layer about past access
information, on the core layer's schedule. In other words, the core
layer "pulls" the information from the operations set layer. This is
problematic for a case where the operations set layer has no time and
space to save the information until the core layer queries.
Add a new DAMON API function for reporting identified data accesses to
DAMON, on the identifiers' schedule. In other words, it lets the
operations set layer "push" the information to the core layer. The
function internally uses a mutex, so the reporting kernel code should be
able to sleep.
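For example, a sleepable caller could report an access to a physical
address range like below (a usage sketch; 'paddr' is a hypothetical
variable of the caller):

    struct damon_access_report report = {
            .pid = NULL,            /* physical address space */
            .addr = paddr,          /* hypothetical physical address */
            .size = PAGE_SIZE,
    };

    damon_report_access(&report);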
This API was also discussed at LSFMM'25: https://lwn.net/Articles/1016525/
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/damon.h | 25 +++++++++++++++++++++++++
mm/damon/core.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 479ee52b79e1..1f7592147d92 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -104,6 +104,23 @@ struct damon_target {
struct list_head list;
};
+/**
+ * struct damon_access_report - Represent a single access report.
+ * @pid: The PID of the virtual address space of the address.
+ * NULL if it is of the physical address.
+ * @addr: The start address of the reporting region.
+ * @size: The size of the reporting region.
+ *
+ * @pid could be stale, and hence shouldn't be de-referenced.
+ */
+struct damon_access_report {
+ struct pid *pid;
+ unsigned long addr;
+ unsigned long size;
+/* private: */
+ unsigned long report_jiffies; /* when this report is made */
+};
+
/**
* enum damos_action - Represents an action of a Data Access Monitoring-based
* Operation Scheme.
@@ -941,9 +958,17 @@ bool damon_is_running(struct damon_ctx *ctx);
int damon_call(struct damon_ctx *ctx, struct damon_call_control *control);
int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control);
+void damon_report_access(struct damon_access_report *report);
+
int damon_set_region_biggest_system_ram_default(struct damon_target *t,
unsigned long *start, unsigned long *end);
+#else /* CONFIG_DAMON */
+
+static inline void damon_report_access(struct damon_access_report *report)
+{
+}
+
#endif /* CONFIG_DAMON */
#endif /* _DAMON_H */
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 64e59d15043a..4e25fe100b56 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -24,6 +24,8 @@
#define DAMON_MIN_REGION 1
#endif
+#define DAMON_ACCESS_REPORTS_CAP 1000
+
static DEFINE_MUTEX(damon_lock);
static int nr_running_ctxs;
static bool running_exclusive_ctxs;
@@ -33,6 +35,11 @@ static struct damon_operations damon_registered_ops[NR_DAMON_OPS];
static struct kmem_cache *damon_region_cache __ro_after_init;
+static DEFINE_MUTEX(damon_access_reports_lock);
+static struct damon_access_report damon_access_reports[
+ DAMON_ACCESS_REPORTS_CAP];
+static int damon_access_reports_len;
+
/* Should be called under damon_ops_lock with id smaller than NR_DAMON_OPS */
static bool __damon_is_registered_ops(enum damon_ops_id id)
{
@@ -1461,6 +1468,34 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control)
return 0;
}
+/**
+ * damon_report_access() - Report identified access events to DAMON.
+ * @report: The reporting access information.
+ *
+ * Report access events to DAMON.
+ *
+ * Context: May sleep.
+ *
+ * NOTE: we may be able to implement this as a lockless queue, and allow any
+ * context. As the overhead is unknown, and the region-based DAMON logic would
+ * guarantee that reports are not made that frequently, let's start with
+ * this simple implementation.
+ */
+void damon_report_access(struct damon_access_report *report)
+{
+ struct damon_access_report *dst;
+
+ /* silently fail for races */
+ if (!mutex_trylock(&damon_access_reports_lock))
+ return;
+ dst = &damon_access_reports[damon_access_reports_len++];
+ if (damon_access_reports_len == DAMON_ACCESS_REPORTS_CAP)
+ damon_access_reports_len = 0;
+ *dst = *report;
+ dst->report_jiffies = jiffies;
+ mutex_unlock(&damon_access_reports_lock);
+}
+
/*
* Warn and fix corrupted ->nr_accesses[_bp] for investigations and preventing
* the problem being propagated.
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 2/7] mm/damon/core: add eligible_report() ops callback
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
2025-07-27 20:18 ` [RFC v2 1/7] mm/damon/core: introduce damon_report_access() SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-27 20:18 ` [RFC v2 3/7] mm/damon/vaddr: implement eligible_report() SeongJae Park
` (5 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
Not every reported access information is eligible for every DAMON ops
and target. For example, virtual address access reports are not
eligible for the 'paddr' ops, or for monitoring targets of a process
different from the one the report is for. If the monitoring is for
accesses from specific CPUs or for writes, reports from other CPUs or
for reads should be ignored. Add an operations set callback for this
report eligibility checking.
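As an illustration, a physical address space operations set could filter
out virtual address space reports like below (a hypothetical sketch;
this series wires the callback up only for vaddr):

    static bool damon_pa_eligible_report(struct damon_access_report *report,
                    struct damon_target *t)
    {
            /* reports for the physical address space carry a NULL pid */
            return !report->pid;
    }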
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/damon.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 1f7592147d92..f3e5c585b5f1 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -590,6 +590,7 @@ enum damon_ops_id {
* @update: Update operations-related data structures.
* @prepare_access_checks: Prepare next access check of target regions.
* @check_accesses: Check the accesses to target regions.
+ * @eligible_report: Verify an access report for a target.
* @get_scheme_score: Get the score of a region for a scheme.
* @apply_scheme: Apply a DAMON-based operation scheme.
* @target_valid: Determine if the target is valid.
@@ -617,6 +618,8 @@ enum damon_ops_id {
* last preparation and update the number of observed accesses of each region.
* It should also return max number of observed accesses that made as a result
* of its update. The value will be used for regions adjustment threshold.
+ * @eligible_report should check if the given access report is eligible to be
+ * used by this operations set for the given target.
* @get_scheme_score should return the priority score of a region for a scheme
* as an integer in [0, &DAMOS_MAX_SCORE].
* @apply_scheme is called from @kdamond when a region for user provided
@@ -635,6 +638,8 @@ struct damon_operations {
void (*update)(struct damon_ctx *context);
void (*prepare_access_checks)(struct damon_ctx *context);
unsigned int (*check_accesses)(struct damon_ctx *context);
+ bool (*eligible_report)(struct damon_access_report *report,
+ struct damon_target *t);
int (*get_scheme_score)(struct damon_ctx *context,
struct damon_target *t, struct damon_region *r,
struct damos *scheme);
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 3/7] mm/damon/vaddr: implement eligible_report()
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
2025-07-27 20:18 ` [RFC v2 1/7] mm/damon/core: introduce damon_report_access() SeongJae Park
2025-07-27 20:18 ` [RFC v2 2/7] mm/damon/core: add eligible_report() ops callback SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-27 20:18 ` [RFC v2 4/7] mm/damon/core: read received access reports SeongJae Park
` (4 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
Implement the eligible_report() operation for the vaddr operations set.
It checks the eligibility of a report by comparing the pid of the report
against that of the given monitoring target.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/vaddr.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 87e825349bdf..a80cd34e5232 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -792,6 +792,12 @@ static int damos_va_migrate_pte_entry(pte_t *pte, unsigned long addr,
return 0;
}
+static bool damon_va_eligible_report(struct damon_access_report *report,
+ struct damon_target *t)
+{
+ return report->pid == t->pid;
+}
+
/*
* Functions for the target validity check and cleanup
*/
@@ -954,6 +960,7 @@ static int __init damon_va_initcall(void)
.update = damon_va_update,
.prepare_access_checks = damon_va_prepare_access_checks,
.check_accesses = damon_va_check_accesses,
+ .eligible_report = damon_va_eligible_report,
.target_valid = damon_va_target_valid,
.cleanup_target = damon_va_cleanup_target,
.cleanup = NULL,
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 4/7] mm/damon/core: read received access reports
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
` (2 preceding siblings ...)
2025-07-27 20:18 ` [RFC v2 3/7] mm/damon/vaddr: implement eligible_report() SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-27 20:18 ` [RFC v2 5/7] mm/memory: implement MM_CP_DAMON SeongJae Park
` (3 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
Reported access information is only saved in the core layer's internal
data structure; it is not actually used for the final monitoring
results. Update the core layer access information (DAMON regions) using
the reported access information.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/damon.h | 1 +
mm/damon/core.c | 66 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 67 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index f3e5c585b5f1..8ec49beac573 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -83,6 +83,7 @@ struct damon_region {
unsigned int age;
/* private: Internal value for age calculation. */
unsigned int last_nr_accesses;
+ bool access_reported;
};
/**
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 4e25fe100b56..428ac1b83118 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -141,6 +141,7 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end)
region->age = 0;
region->last_nr_accesses = 0;
+ region->access_reported = false;
return region;
}
@@ -2560,6 +2561,69 @@ static void kdamond_init_ctx(struct damon_ctx *ctx)
}
}
+static void kdamond_apply_access_report(struct damon_access_report *report,
+ struct damon_target *t, struct damon_ctx *ctx)
+{
+ struct damon_region *r;
+
+ if (ctx->ops.eligible_report && !ctx->ops.eligible_report(report, t))
+ return;
+
+ /* todo: make search faster, e.g., binary search? */
+ damon_for_each_region(r, t) {
+ if (report->addr < r->ar.start)
+ continue;
+ if (r->ar.end < report->addr + report->size)
+ continue;
+ if (!r->access_reported)
+ damon_update_region_access_rate(r, true, &ctx->attrs);
+ r->access_reported = true;
+ }
+}
+
+static unsigned int kdamond_apply_zero_access_report(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ struct damon_region *r;
+ unsigned int max_nr_accesses = 0;
+
+ damon_for_each_target(t, ctx) {
+ damon_for_each_region(r, t) {
+ if (r->access_reported)
+ r->access_reported = false;
+ else
+ damon_update_region_access_rate(r, false,
+ &ctx->attrs);
+ max_nr_accesses = max(max_nr_accesses, r->nr_accesses);
+ }
+ }
+ return max_nr_accesses;
+}
+
+static unsigned int kdamond_check_reported_accesses(struct damon_ctx *ctx)
+{
+ int i;
+ struct damon_access_report *report;
+ struct damon_target *t;
+
+ mutex_lock(&damon_access_reports_lock);
+ for (i = 0; i < damon_access_reports_len; i++) {
+ report = &damon_access_reports[i];
+ if (time_before(report->report_jiffies,
+ jiffies -
+ usecs_to_jiffies(
+ ctx->attrs.sample_interval)))
+ continue;
+ if (report->pid && !damon_target_has_pid(ctx))
+ continue;
+ damon_for_each_target(t, ctx)
+ kdamond_apply_access_report(report, t, ctx);
+ }
+ mutex_unlock(&damon_access_reports_lock);
+ /* For nr_accesses_bp, absence of access should also be reported. */
+ return kdamond_apply_zero_access_report(ctx);
+}
+
/*
* The monitoring daemon that runs as a kernel thread
*/
@@ -2607,6 +2671,8 @@ static int kdamond_fn(void *data)
if (ctx->ops.check_accesses)
max_nr_accesses = ctx->ops.check_accesses(ctx);
+ else
+ max_nr_accesses = kdamond_check_reported_accesses(ctx);
if (ctx->passed_sample_intervals >= next_aggregation_sis)
kdamond_merge_regions(ctx,
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
` (3 preceding siblings ...)
2025-07-27 20:18 ` [RFC v2 4/7] mm/damon/core: read received access reports SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-28 5:19 ` Lorenzo Stoakes
2025-07-27 20:18 ` [RFC v2 6/7] mm/damon: implement paddr_fault operations set SeongJae Park
` (2 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jann Horn, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Pedro Falcato, Suren Baghdasaryan, Vlastimil Babka, damon,
kernel-team, linux-kernel, linux-mm
DAMON uses Accessed bits of page table entries as the major source of
access information. That source lacks some additional information such
as which CPU made the access. Page faults could be another source that
provides such additional information.
Implement another change_protection() flag for such use cases, namely
MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
protection, pass the faults to DAMON only when NUMA_BALANCING is
disabled.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/mm.h | 1 +
mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
mm/mprotect.c | 5 +++++
3 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21270f1664a4..ad92b77bf782 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+#define MM_CP_DAMON (1UL << 4)
bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..656e610867b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/damon.h>
#include <trace/events/kmem.h>
@@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}
+static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
+{
+ struct damon_access_report access_report = {
+ .addr = vmf->address,
+ .size = 1,
+ };
+ struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
+ pte_t pte, old_pte;
+ bool writable = false, ignore_writable = false;
+ bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+
+ if (huge_pmd)
+ access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd));
+ else
+ access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte));
+
+ spin_lock(vmf->ptl);
+ old_pte = ptep_get(vmf->pte);
+ if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+ pte = pte_modify(old_pte, vma->vm_page_prot);
+ writable = pte_write(pte);
+ if (!writable && pte_write_upgrade &&
+ can_change_pte_writable(vma, vmf->address, pte))
+ writable = true;
+ folio = vm_normal_folio(vma, vmf->address, pte);
+ if (folio && folio_test_large(folio))
+ numa_rebuild_large_mapping(vmf, vma, folio, pte,
+ ignore_writable, pte_write_upgrade);
+ else
+ numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
+ writable);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ damon_report_access(&access_report);
+ return 0;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
- if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+ if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
+ return do_damon_page(vmf, false);
return do_numa_page(vmf);
+ }
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
@@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
- if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+ if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+ if (sysctl_numa_balancing_mode ==
+ NUMA_BALANCING_DISABLED)
+ return do_damon_page(&vmf, true);
return do_huge_pmd_numa_page(&vmf);
+ }
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!pmd_write(vmf.orig_pmd)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 78bded7acf79..e8a76114e4f9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb,
WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
#endif
+#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ if (cp_flags & MM_CP_DAMON)
+ newprot = PAGE_NONE;
+#endif
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot,
cp_flags);
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 6/7] mm/damon: implement paddr_fault operations set
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
` (4 preceding siblings ...)
2025-07-27 20:18 ` [RFC v2 5/7] mm/memory: implement MM_CP_DAMON SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-07-27 20:18 ` [RFC v2 7/7] mm/damon/sysfs: support paddr_fault SeongJae Park
2025-08-04 2:47 ` [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring Andrew Paniakin
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
Implement an example damon_report_access() based DAMON operations set,
paddr_fault. It monitors physical address space accesses, the same as
paddr. The only difference is that it uses page faults as its access
information source, using the damon_report_access() and MM_CP_DAMON
based mechanisms.
This is not to be merged into the mainline as-is, but only for giving an
example of how damon_report_access() based operation sets can be
implemented and extended.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/damon.h | 3 ++
mm/damon/paddr.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 79 insertions(+), 1 deletion(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 8ec49beac573..c35ed89371d0 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -574,12 +574,15 @@ struct damos {
* @DAMON_OPS_FVADDR: Monitoring operations for only fixed ranges of virtual
* address spaces
* @DAMON_OPS_PADDR: Monitoring operations for the physical address space
+ * @DAMON_OPS_PADDR_FAULT: Monitoring operations for the physical address
+ * space, using page faults as the source
* @NR_DAMON_OPS: Number of monitoring operations implementations
*/
enum damon_ops_id {
DAMON_OPS_VADDR,
DAMON_OPS_FVADDR,
DAMON_OPS_PADDR,
+ DAMON_OPS_PADDR_FAULT,
NR_DAMON_OPS,
};
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 53a55c5114fb..68c309ad1aa4 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -14,6 +14,7 @@
#include <linux/swap.h>
#include <linux/memory-tiers.h>
#include <linux/mm_inline.h>
+#include <asm/tlb.h>
#include "../internal.h"
#include "ops-common.h"
@@ -97,6 +98,65 @@ static unsigned int damon_pa_check_accesses(struct damon_ctx *ctx)
return max_nr_accesses;
}
+static bool damon_pa_fault_change_protection_one(struct folio *folio,
+ struct vm_area_struct *vma, unsigned long addr, void *arg)
+{
+ /* todo: batch or remove tlb flushing */
+ struct mmu_gather tlb;
+
+ if (!vma_is_accessible(vma))
+ return true;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+
+ change_protection(&tlb, vma, addr, addr + PAGE_SIZE, MM_CP_DAMON);
+
+ tlb_finish_mmu(&tlb);
+ return true;
+}
+
+static void damon_pa_fault_change_protection(unsigned long paddr)
+{
+ struct folio *folio = damon_get_folio(PHYS_PFN(paddr));
+ struct rmap_walk_control rwc = {
+ .rmap_one = damon_pa_fault_change_protection_one,
+ .anon_lock = folio_lock_anon_vma_read,
+ };
+ bool need_lock;
+
+ if (!folio)
+ return;
+ if (!folio_mapped(folio) || !folio_raw_mapping(folio))
+ return;
+
+ need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
+ if (need_lock && !folio_trylock(folio))
+ return;
+
+ rmap_walk(folio, &rwc);
+
+ if (need_lock)
+ folio_unlock(folio);
+}
+
+static void __damon_pa_fault_prepare_access_check(struct damon_region *r)
+{
+ r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+
+ damon_pa_fault_change_protection(r->sampling_addr);
+}
+
+static void damon_pa_fault_prepare_access_checks(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ struct damon_region *r;
+
+ damon_for_each_target(t, ctx) {
+ damon_for_each_region(r, t)
+ __damon_pa_fault_prepare_access_check(r);
+ }
+}
+
/*
* damos_pa_filter_out - Return true if the page should be filtered out.
*/
@@ -355,8 +415,23 @@ static int __init damon_pa_initcall(void)
.apply_scheme = damon_pa_apply_scheme,
.get_scheme_score = damon_pa_scheme_score,
};
+ struct damon_operations fault_ops = {
+ .id = DAMON_OPS_PADDR_FAULT,
+ .init = NULL,
+ .update = NULL,
+ .prepare_access_checks = damon_pa_fault_prepare_access_checks,
+ .check_accesses = NULL,
+ .target_valid = NULL,
+ .cleanup = NULL,
+ .apply_scheme = damon_pa_apply_scheme,
+ .get_scheme_score = damon_pa_scheme_score,
+ };
+ int err;
- return damon_register_ops(&ops);
+ err = damon_register_ops(&ops);
+ if (err)
+ return err;
+ return damon_register_ops(&fault_ops);
};
subsys_initcall(damon_pa_initcall);
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC v2 7/7] mm/damon/sysfs: support paddr_fault
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
` (5 preceding siblings ...)
2025-07-27 20:18 ` [RFC v2 6/7] mm/damon: implement paddr_fault operations set SeongJae Park
@ 2025-07-27 20:18 ` SeongJae Park
2025-08-04 2:47 ` [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring Andrew Paniakin
7 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-07-27 20:18 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, kernel-team, linux-kernel,
linux-mm
Extend DAMON sysfs interface to support the page faults based physical
address space access monitoring. Users can use it by writing
paddr_fault to the 'operations' file. For simple testing, the DAMON
user-space tool can be used as below, after applying the hack below.
$ git diff
--- a/src/_damon_sysfs.py
+++ b/src/_damon_sysfs.py
@@ -548,7 +548,7 @@ def write_monitoring_attrs_dir(dir_path, context):
def write_context_dir(dir_path, context):
err = _damo_fs.write_file(os.path.join(dir_path, 'operations'),
- context.ops)
+ 'paddr_fault')
if err is not None:
return err
$ sudo ./damo start
$ sudo ./damo report access
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/sysfs.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 6d2b0dab50cb..b1bf43972491 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -829,6 +829,10 @@ static const struct damon_sysfs_ops_name damon_sysfs_ops_names[] = {
.ops_id = DAMON_OPS_PADDR,
.name = "paddr",
},
+ {
+ .ops_id = DAMON_OPS_PADDR_FAULT,
+ .name = "paddr_fault",
+ },
};
struct damon_sysfs_context {
--
2.39.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-27 20:18 ` [RFC v2 5/7] mm/memory: implement MM_CP_DAMON SeongJae Park
@ 2025-07-28 5:19 ` Lorenzo Stoakes
2025-07-29 3:06 ` SeongJae Park
0 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-07-28 5:19 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Jann Horn,
Michal Hocko, Mike Rapoport, Pedro Falcato, Suren Baghdasaryan,
Vlastimil Babka, damon, kernel-team, linux-kernel, linux-mm
On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote:
> DAMON uses Accessed bits of page table entries as the major source of
> access information. That source lacks some additional information such
> as which CPU made the access. Page faults could be another source that
> provides such additional information.
>
> Implement another change_protection() flag for such use cases, namely
> MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
> protection, pass the faults to DAMON only when NUMA_BALANCING is
> disabled.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
This seems to not be an upstreamable series right now unless I'm missing
something.
Firstly, you're making a change in behaviour even when CONFIG_DAMON is not
specified, and Linus already told you we can't have that default-on.
I'm very very much not happy with us doing something for 'damon'
unconditionally when !CONFIG_DAMON on the assumption that an accessible
mapping has PROT_NONE set.
This change makes me nervous in general ANYWAY as you are now changing core
mm and introducing a new faulting mechanism specifically for DAMON.
And we are assuming that NUMA balancing being disabled is not racey in a
way that will cause things to break.
I really also dislike the idea of an _implicit_ assumption that we are good
to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff.
Is it really all that useful to provide a method that requires NUMA
balancing to be disabled?
Finally, you're making it so DAMON can mprotect() something to use the
DAMON/NUMA balancing fault handler, which doesn't appear to check to see if
NUMA balancing is disabled, but anyway it could be re-enabled?
And then now DAMON is making stuff fault as NUMA balancing faults
incorrectly?
This just seems broken.
Unless there's really good justification I'm really not inclined for us to
merge this as-is right now unfortunately.
> ---
> include/linux/mm.h | 1 +
> mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
> mm/mprotect.c | 5 +++++
> 3 files changed, 57 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21270f1664a4..ad92b77bf782 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> MM_CP_UFFD_WP_RESOLVE)
/* Comment here :) */
> +#define MM_CP_DAMON (1UL << 4)
Shouldn't this be something more specific like MM_CP_DAMON_REPORT_FAULT ?
>
> bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1..656e610867b0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -75,6 +75,7 @@
> #include <linux/ptrace.h>
> #include <linux/vmalloc.h>
> #include <linux/sched/sysctl.h>
> +#include <linux/damon.h>
>
> #include <trace/events/kmem.h>
>
> @@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> return VM_FAULT_FALLBACK;
> }
>
> +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
You need to explain what this function is doing at least.
> +{
> + struct damon_access_report access_report = {
> + .addr = vmf->address,
> + .size = 1,
> + };
> + struct vm_area_struct *vma = vmf->vma;
> + struct folio *folio;
> + pte_t pte, old_pte;
> + bool writable = false, ignore_writable = false;
> + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> +
> + if (huge_pmd)
> + access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd));
> + else
> + access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte));
> +
> + spin_lock(vmf->ptl);
> + old_pte = ptep_get(vmf->pte);
> + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return 0;
> + }
> + pte = pte_modify(old_pte, vma->vm_page_prot);
This is a little weird, you're applying VMA protection bits to the PTE then
later asking about protection bits? Can't you just tell from the VMA flags?
Seems like do_numa_page() copy/pasta.
> + writable = pte_write(pte);
> + if (!writable && pte_write_upgrade &&
> + can_change_pte_writable(vma, vmf->address, pte))
> + writable = true;
> + folio = vm_normal_folio(vma, vmf->address, pte);
> + if (folio && folio_test_large(folio))
> + numa_rebuild_large_mapping(vmf, vma, folio, pte,
> + ignore_writable, pte_write_upgrade);
> + else
> + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
> + writable);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
All of this seems to be duplicating aspects of do_numa_page().
Seems more sensible, if it makes sense to accept this change (I'm still a
bit dubious about changing the faulting mechanism here), we should probably
implement a 'do_non_present_accessible()' function that avoids duplication +
bit-rot, with #ifdef CONFIG_DAMON's in place, or even better, just a hook
into damon code that's static inline void {} if not enabled.
Note that the numa fault handling logic is called _even if numa balancing
is disabled_. So we'd avoid all the other changes too.
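E.g. the usual config-gated hook shape (just a sketch, names
hypothetical):

    /* include/linux/damon.h */
    #ifdef CONFIG_DAMON
    void damon_report_prot_none_fault(struct vm_fault *vmf);
    #else
    static inline void damon_report_prot_none_fault(struct vm_fault *vmf)
    {
    }
    #endif

Then the fault path can call the hook unconditionally and it compiles
away when !CONFIG_DAMON.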
> + damon_report_access(&access_report);
> + return 0;
> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (!pte_present(vmf->orig_pte))
> return do_swap_page(vmf);
>
> - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
> + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
> + return do_damon_page(vmf, false);
Making an assumption here that, even if damon is disabled, we should handle
a 'damon' fault is just icky, sorry.
Also are we 100% sure that there's no races here? When we disable numa
balancing do we correctly ensure that absolutely no racing NUMA balancing
faults can happen whatsoever at this juncture?
Have you checked that and ensured that to be the case?
You really have to be 100% certain you're not going to wrongly handle NUMA
page faults this way, on a potentially non-CONFIG_DAMON kernel to boot.
Keep in mind fault handling is verrrry racey (purposefully) and can be done
under VMA read lock alone (and that's only very loosely a lock!).
This is just, yeah, concerning.
> return do_numa_page(vmf);
> + }
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> @@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return 0;
> }
> if (pmd_trans_huge(vmf.orig_pmd)) {
> - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
> + if (sysctl_numa_balancing_mode ==
> + NUMA_BALANCING_DISABLED)
> + return do_damon_page(&vmf, true);
Now we have _more_ weirdness around what CONFIG_TRANSPARENT_HUGEPAGE
enables/disables.
> return do_huge_pmd_numa_page(&vmf);
This function will be called only if THP is enabled, but now damon
overrides that...
Let's try and make this consistent.
> + }
>
> if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
> !pmd_write(vmf.orig_pmd)) {
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 78bded7acf79..e8a76114e4f9 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb,
> WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
> #endif
>
> +#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
> + if (cp_flags & MM_CP_DAMON)
> + newprot = PAGE_NONE;
> +#endif
OK this just seems broken.
Firstly, again you're introducing behaviour that applies even if DAMON is
disabled. Let's please not.
And predicating on -the architecture even supporting NUMA balancing- seems
rather bizarre?
Secondly, somebody can disable/enable NUMA balancing, and thus you are now
allowing this function to, when somebody specifies 'DAMON', get NUMA
balancing fault handling??
If you don't bother checking whether NUMA balancing is disabled when you
set it, then boom - you've broken the faulting mechanism, but even if you
_do_, somebody can just switch it on again...
Sorry but unless I'm wildly missing something here we're going to need a
new faulting mechanism (if we even want to allow DAMON to have its own
fault handling that is!)
> +
> if (is_vm_hugetlb_page(vma))
> pages = hugetlb_change_protection(vma, start, end, newprot,
> cp_flags);
> --
> 2.39.5
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-28 5:19 ` Lorenzo Stoakes
@ 2025-07-29 3:06 ` SeongJae Park
2025-07-29 9:40 ` Lorenzo Stoakes
0 siblings, 1 reply; 15+ messages in thread
From: SeongJae Park @ 2025-07-29 3:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jann Horn, Michal Hocko, Mike Rapoport, Pedro Falcato,
Suren Baghdasaryan, Vlastimil Babka, damon, kernel-team,
linux-kernel, linux-mm
On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote:
> > DAMON uses Accessed bits of page table entries as the major source of
> > access information. That source lacks some additional information such
> > as which CPU made the access. Page faults could be another source that
> > provides such additional information.
> >
> > Implement another change_protection() flag for such use cases, namely
> > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
> > protection, pass the faults to DAMON only when NUMA_BALANCING is
> > disabled.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> This seems to not be an upstreamable series right now unless I'm missing
> something.
>
> Firstly, you're making a change in behaviour even when CONFIG_DAMON is not
> specified, and Linus already told you we can't have that default-on.
>
> I'm very very much not happy with us doing something for 'damon'
> unconditionally when !CONFIG_DAMON on the assumption that an accessible
> mapping has PROT_NONE set.
>
> This change makes me nervous in general ANYWAY as you are now changing core
> mm and introducing a new faulting mechanism specifically for DAMON.
>
> And we are assuming that NUMA balancing being disabled is not racey in a
> way that will cause things to break.
>
> I really also dislike the idea of an _implicit_ assumption that we are good
> to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff.
>
> Is it really all that useful to provide a method that requires NUMA
> balancing to be disabled?
>
> Finally, you're making it so DAMON can mprotect() something to use the
> DAMON/NUMA balancing fault handler, which doesn't appear to check to see if
> NUMA balancing is disabled, but anyway it could be re-enabled?
>
> And then now DAMON is making stuff fault as NUMA balancing faults
> incorrectly?
>
> This just seems broken.
>
> Unless there's really good justification I'm really not inclined for us to
> merge this as-is right now unfortunately.
Thank you for review and comments, Lorenzo. I fundamentally agree with
all your points. I don't aim to merge this as-is. Actually this patch
series is more like a POC, but apparently I was rushing. I will try to
address your concerns in the next version.
>
> > ---
> > include/linux/mm.h | 1 +
> > mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
> > mm/mprotect.c | 5 +++++
> > 3 files changed, 57 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 21270f1664a4..ad92b77bf782 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
> > #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
> > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> > MM_CP_UFFD_WP_RESOLVE)
>
> /* Comment here :) */
I will add the comments in the next version.
>
> > +#define MM_CP_DAMON (1UL << 4)
>
> Shouldn't this be something more specific like MM_CP_DAMON_REPORT_FAULT ?
Thank you for the suggestion, I will use it.
>
> >
> > bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > pte_t pte);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 92fd18a5d8d1..656e610867b0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -75,6 +75,7 @@
> > #include <linux/ptrace.h>
> > #include <linux/vmalloc.h>
> > #include <linux/sched/sysctl.h>
> > +#include <linux/damon.h>
> >
> > #include <trace/events/kmem.h>
> >
> > @@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> > return VM_FAULT_FALLBACK;
> > }
> >
> > +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
>
> You need to explain what this function is doing at least.
This function lifts the protection and reports the access information to DAMON.
I will add more comments in the next version.
>
> > +{
> > + struct damon_access_report access_report = {
> > + .addr = vmf->address,
> > + .size = 1,
> > + };
> > + struct vm_area_struct *vma = vmf->vma;
> > + struct folio *folio;
> > + pte_t pte, old_pte;
> > + bool writable = false, ignore_writable = false;
> > + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> > +
> > + if (huge_pmd)
> > + access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd));
> > + else
> > + access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte));
> > +
> > + spin_lock(vmf->ptl);
> > + old_pte = ptep_get(vmf->pte);
> > + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + return 0;
> > + }
> > + pte = pte_modify(old_pte, vma->vm_page_prot);
>
> This is a little weird, you're applying VMA protection bits to the PTE then
> later asking about protection bits? Can't you just tell from the VMA flags?
>
> Seems like do_numa_page() copy/pasta.
>
> > + writable = pte_write(pte);
> > + if (!writable && pte_write_upgrade &&
> > + can_change_pte_writable(vma, vmf->address, pte))
> > + writable = true;
> > + folio = vm_normal_folio(vma, vmf->address, pte);
> > + if (folio && folio_test_large(folio))
> > + numa_rebuild_large_mapping(vmf, vma, folio, pte,
> > + ignore_writable, pte_write_upgrade);
> > + else
> > + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
> > + writable);
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
>
> All of this seems to be duplicating aspects of do_numa_page().
You are correct.
>
> Seems more sensible, if it makes sense to accept this change (I'm still a
> bit dubious about changing the faulting mechanism here), we should probably
> implement a 'do_non_present_accessible()' function that avoids duplication +
> bit-rot, with #ifdef CONFIG_DAMON's in place, or even better, just a hook
> into damon code that's static inline void {} if not enabled.
I will do so in the next version.
>
> Note that the numa fault handling logic is called _even if numa balancing
> is disabled_. So we'd avoid all the other changes too.
>
> > + damon_report_access(&access_report);
> > + return 0;
> > +}
> > +
> > /*
> > * These routines also need to handle stuff like marking pages dirty
> > * and/or accessed for architectures that don't do it in hardware (most
> > @@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > if (!pte_present(vmf->orig_pte))
> > return do_swap_page(vmf);
> >
> > - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
> > + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
> > + return do_damon_page(vmf, false);
>
> Making an assumption here that, even if damon is disabled, we should handle
> a 'damon' fault is just icky, sorry.
Agreed. I will make this happen only when CONFIG_DAMON is enabled.
>
> Also are we 100% sure that there's no races here? When we disable numa
> balancing do we correctly ensure that absolutely no racing NUMA balancing
> faults can happen whatsoever at this juncture?
Enabling CONFIG_DAMON will not automatically invoke change_protection() with
MM_CP_DAMON. Such invocations will be made only if the user disables NUMA
balancing and runs DAMON in the reporting mode.
So there can be two racy cases.
If the user enables NUMA balancing and then disables it after a while, page
faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd, but
there should be no real problem in practice. DAMON's fault handling will
clean up the PAGE_NONE protection like NUMA balancing does, and DAMON has no
problem with receiving such additional reports from MM_CP_PROT_NUMA-caused
faults. DAMON may show a few more accesses than expected, but that's no
problem for DAMON.
Similarly, if the user starts DAMON in the report mode, stops DAMON, and then
enables NUMA balancing, faults for MM_CP_DAMON protections that were
installed while DAMON was running in the report mode can be handled by NUMA
balancing. This should also not cause real problems in practice, in my
opinion. NUMA balancing could see more accesses and migrate pages a little
more than expected, but that should be just for a moment.
Maybe I'm missing something, or not explaining my thoughts well. I will be
happy to learn what I'm missing, or to get chances to further clarify my
points.
>
> Have you checked that and ensured that to be the case?
>
> You really have to be 100% certain you're not going to wrongly handle NUMA
> page faults this way, on a potentially non-CONFIG_DAMON kernel to boot.
I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next
version.
>
> Keep in mind fault handling is verrrry racey (purposefully) and can be done
> under VMA read lock alone (and that's only very loosely a lock!).
>
> This is just, yeah, concerning.
Thank you for the caution. DAMON's fault handling code only saves the
information in its internal ring buffer. It doesn't touch VMAs. So I think
there should be no such problems. I will add the clarification in the next
version.
>
> > return do_numa_page(vmf);
> > + }
> >
> > spin_lock(vmf->ptl);
> > entry = vmf->orig_pte;
> > @@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > return 0;
> > }
> > if (pmd_trans_huge(vmf.orig_pmd)) {
> > - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> > + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
> > + if (sysctl_numa_balancing_mode ==
> > + NUMA_BALANCING_DISABLED)
> > + return do_damon_page(&vmf, true);
>
> Now we have _more_ weirdness around what CONFIG_TRANSPARENT_HUGEPAGE
> enables/disables.
>
> > return do_huge_pmd_numa_page(&vmf);
>
> This function will be called only if THP is enabled, but now damon
> overrides that...
>
> Let's try and make this consistent.
I will do so in the next version.
>
> > + }
> >
> > if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
> > !pmd_write(vmf.orig_pmd)) {
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 78bded7acf79..e8a76114e4f9 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb,
> > WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
> > #endif
> >
> > +#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
> > + if (cp_flags & MM_CP_DAMON)
> > + newprot = PAGE_NONE;
> > +#endif
>
> OK this just seems broken.
>
> Firstly, again you're introducing behaviour that applies even if DAMON is
> disabled. Let's please not.
>
> And predicating on -the architecture even supporting NUMA balancing- seems
> rather bizarre?
Agreed, I will ensure there will be no change on CONFIG_DAMON disabled
kernels in the next version.
>
> Secondly, somebody can disable/enable NUMA balancing, and thus you are now
> allowing this function to, when somebody specifies 'DAMON', get NUMA
> balancing fault handling??
>
> If you don't bother checking whether NUMA balancing is disabled when you
> set it, then boom - you've broken the faulting mechanism, but even if you
> _do_, somebody can just switch it on again...
As I explained in the two racy cases above, faults caused by DAMON or NUMA
balancing can be handled by the other's handling code, but only for a limited
time and under the user's control. To my understanding that should not cause
real problems in practice, and users wouldn't be advised to operate the
system in such a way. I will add more documents and cautions about that in
the next version.
Again, I may be missing something, and would be happy to be enlightened or
get more chances to clarify my points.
>
> Sorry but unless I'm wildly missing something here we're going to need a
> new faulting mechanism (if we even want to allow DAMON to have its own
> fault handling that is!)
No worries, I completely agree with your main points. I should have
explained more about the racy cases and ensured no interference in the
CONFIG_DAMON disabled case; thank you for letting me know that.
Maybe we need more discussions on the racy cases, but anyway I will try to
address all your concerns in the next version.
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-29 3:06 ` SeongJae Park
@ 2025-07-29 9:40 ` Lorenzo Stoakes
2025-07-30 4:21 ` SeongJae Park
0 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-07-29 9:40 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Jann Horn,
Michal Hocko, Mike Rapoport, Pedro Falcato, Suren Baghdasaryan,
Vlastimil Babka, damon, kernel-team, linux-kernel, linux-mm
On Mon, Jul 28, 2025 at 08:06:32PM -0700, SeongJae Park wrote:
> On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote:
> > > DAMON uses Accessed bits of page table entries as the major source of
> > > access information. That source lacks some additional information such
> > > as which CPU made the access. Page faults could be another source that
> > > provides such additional information.
> > >
> > > Implement another change_protection() flag for such use cases, namely
> > > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> > > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
> > > protection, pass the faults to DAMON only when NUMA_BALANCING is
> > > disabled.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> >
> > This seems to not be an upstreamable series right now unless I'm missing
> > something.
> >
> > Firstly, you're making a change in behaviour even when CONFIG_DAMON is not
> > specified, and Linus already told you we can't have that default-on.
> >
> > I'm very very much not happy with us doing something for 'damon'
> > unconditionally when !CONFIG_DAMON on the assumption that an accessible
> > mapping has PROT_NONE set.
> >
> > This change makes me nervous in general ANYWAY as you are now changing core
> > mm and introducing a new faulting mechanism specifically for DAMON.
> >
> > And we are assuming that NUMA balancing being disabled is not racey in a
> > way that will cause things to break.
> >
> > I really also dislike the idea of an _implicit_ assumption that we are good
> > to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff.
> >
> > Is it really all that useful to provide a method that requires NUMA
> > balancing to be disabled?
> >
> > Finally, you're making it so DAMON can mprotect() something to use the
> > DAMON/NUMA balancing fault handler, which doesn't appear to check to see if
> > NUMA balancing is disabled, but anyway it could be re-enabled?
> >
> > And then now DAMON is making stuff fault as NUMA balancing faults
> > incorrectly?
> >
> > This just seems broken.
> >
> > Unless there's really good justification I'm really not inclined for us to
> > merge this as-is right now unfortunately.
>
> Thank you for review and comments, Lorenzo. I fundamentally agree with all
> your points. I don't aim to merge this as-is. Actually this patch series is
> more like a POC, but apparently I was rushing. I will try to address your
> concerns in the next version.
Thanks.
I do wonder whether we really can have a whole new faulting mechanism just for
DAMON. Because if in future, we wanted to change how this worked, we'd be
constrained, and it is a very specific user.
The issue is you need the PTE to be restored to its previous state, just like
NUMA balancing.
And I really really do not like this 'oh if you turn it off you can use it for
DAMON' thing, it's just really odd and asking for trouble.
I think the only _workable_ version of this would be to convert the numa
handling to a generic 'fault then restore' type mechanism that could be hooked
in to by either NUMA or DAMON.
But really I think you'd _need_ to not have significantly different behaviour
between the two and _not_ constrain this to being only when NUMA balancing is
disabled.
But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for
this stuff.
I'm not really sure there is an upstreamable version of this, but it'd need to
be made generic like that if there were.
I think it might be worth taking some time to examine whether a version of this
that can be sensibly generic (which could have hooks for things) is _possible_
before sending a v2.
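To illustrate the shape of thing I mean - entirely hypothetical, none of these
names exist anywhere - something like:

	/*
	 * Hypothetical 'fault then restore' layer: on a fault against a
	 * PAGE_NONE-marked but accessible VMA, the core restores the PTE and
	 * then dispatches to whoever registered for it.
	 */
	struct prot_none_fault_handler {
		/* observe the fault, e.g. DAMON access reporting */
		void (*notify)(struct vm_fault *vmf);
		/* act on the fault, e.g. NUMA balancing migration */
		bool (*handle)(struct vm_fault *vmf);
	};

	int prot_none_register_handler(const struct prot_none_fault_handler *h);

The hard part, as I say above, is knowing which registered handler a given
prot-none PTE actually belongs to.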
> >
> > Also are we 100% sure that there's no races here? When we disable numa
> > balancing do we correctly ensure that absolutely no racing NUMA balancing
> > faults can happen whatsoever at this juncture?
>
> Enabling CONFIG_DAMON will not automatically invoke change_protection() with
> MM_CP_DAMON. Such invocations will be made only if the user disables NUMA
> balancing and runs DAMON in the reporting mode.
>
> So there can be two racy cases.
>
> If the user enables NUMA balancing and then disables it after a while, page
> faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd, but
> there should be no real problem in practice. DAMON's fault handling will
> clean up the PAGE_NONE protection like NUMA balancing would, and DAMON has
> no problem receiving such additional reports from MM_CP_PROT_NUMA-caused
> faults. DAMON may show a few more accesses than expected, but that's no
> problem for DAMON.
>
> Similarly, if the user starts DAMON in the report mode, stops DAMON, and then
> enables NUMA balancing, faults for MM_CP_DAMON protections installed while
> DAMON was running in the report mode can be handled by NUMA balancing. This
> should also not cause real problems in practice, in my opinion. NUMA
> balancing could see more accesses and migrate pages a little more than
> expected, but that should be just for a moment.
I'm really concerned about these.
We're now introducing unexpected behaviour based on a race and allowing faults
to be mis-handled.
I'm maybe not as confident as you are that everything will 'just work' and it
seems like we're asking for obscure bugs in the future.
> > You really have to be 100% certain you're not going to wrongly handle NUMA
> > page faults this way, on a potentially non-CONFIG_DAMON kernel to boot.
>
> I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next
> version.
Well, in the above you say that you can't help but do that when a race occurs?
>
> >
> > Keep in mind fault handling is verrrry racey (purposefully) and can be done
> > under VMA read lock alone (and that's only very loosely a lock!).
> >
> > This is just, yeah, concerning.
>
> Thank you for the caution. DAMON's fault handling code only saves the
> information in its internal ring buffer. It doesn't touch vmas. So I think
> there should be no such problems. I will add the clarification on the next
> version.
Right, I'm just saying that this all being super racey between NUMA
enabled/disabled seems pretty unavoidable, but we covered that above.
> >
> > Secondly, somebody can disable/enable NUMA balancing, and thus you are now
> > allowing this function to, when somebody specifies 'DAMON', get NUMA
> > balancing fault handling??
> >
> > If you don't bother checking whether NUMA balancing is disabled when you
> > set it, then boom - you've broken the faulting mechanism, but even if you
> > _do_, somebody can just switch it on again...
>
> As I explained in the two racy cases above, faults caused by DAMON or NUMA
> balancing can be handled by the other's handling code, but only for a limited
> time and under the user's control. To my understanding that should not cause
> real problems in practice, and users would not be advised to operate the
> system in such a way. I will add more documentation and cautions about that
> in the next version.
I really don't think a version of the code that results in the wrong handler
being used is upstreamable, sorry.
I've not dug into the nitty gritty details on what would happen in both cases,
but even if it were 100% fine _now_, this is _exactly_ the kind of thing that
results in horrible hard-to-debug issues later, should something change.
Implicitly 'just having to know' that we might be in the wrong fault handler
seems like asking for trouble, and the RoI on an optional profiling tool (this
is not to take away from DAMON which is a _great_ utility, I'm saying it's
simply not part of the _core_) isn't there.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-29 9:40 ` Lorenzo Stoakes
@ 2025-07-30 4:21 ` SeongJae Park
2025-07-31 12:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 15+ messages in thread
From: SeongJae Park @ 2025-07-30 4:21 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jann Horn, Michal Hocko, Mike Rapoport, Pedro Falcato,
Suren Baghdasaryan, Vlastimil Babka, damon, kernel-team,
linux-kernel, linux-mm
On Tue, 29 Jul 2025 10:40:11 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Mon, Jul 28, 2025 at 08:06:32PM -0700, SeongJae Park wrote:
> > On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote:
> > > > DAMON is using Accessed bits of page table entries as the major source
> > > > of the access information. It lacks some additional information such as
> > > > which CPU was making the access. Page faults could be another source of
> > > > such additional information.
> > > >
> > > > Implement another change_protection() flag for such use cases, namely
> > > > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> > > > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
> > > > protection, pass the faults to DAMON only when NUMA_BALANCING is
> > > > disabled.
> > > >
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > >
> > > This seems to not be an upstreamable series right now unless I'm missing
> > > something.
> > >
> > > Firstly, you're making a change in behaviour even when CONFIG_DAMON is not
> > > specified, and Linus already told you we can't have that default-on.
> > >
> > > I'm very very much not happy with us doing something for 'damon'
> > > unconditionally when !CONFIG_DAMON on the assumption that an accessible
> > > mapping has PROT_NONE set.
> > >
> > > This change makes me nervous in general ANYWAY as you are now changing core
> > > mm and introducing a new faulting mechanism specifically for DAMON.
> > >
> > > And we are assuming that NUMA balancing being disabled is not racey in a
> > > way that will cause things to break.
> > >
> > > I really also dislike the idea of an _implicit_ assumption that we are good
> > > to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff.
> > >
> > > Is it really all that useful to provide a method that requires NUMA
> > > balancing to be disabled?
> > >
> > > Finally, you're making it so DAMON can mprotect() something to use the
> > > DAMON/NUMA balancing fault handler, which doesn't appear to check to see if
> > > NUMA balancing is disabled, but anyway it could be re-enabled?
> > >
> > > And then now DAMON is making stuff fault as NUMA balancing faults
> > > incorrectly?
> > >
> > > This just seems broken.
> > >
> > > Unless there's really good justification I'm really not inclined for us to
> > > merge this as-is right now unfortunately.
> >
> > Thank you for review and comments, Lorenzo. I fundamentally agree with all
> > your points. I don't aim to merge this as-is. Actually this patch series
> > is more like a POC, but apparently I was rushing. I will try to address
> > your concerns in the next version.
>
> Thanks.
>
> I do wonder whether we really can have a whole new faulting mechanism just for
> DAMON. Because if in future, we wanted to change how this worked, we'd be
> constrained, and it is a very specific user.
>
> The issue is you need the PTE to be restored to its previous state, just like
> NUMA balancing.
>
> And I really really do not like this 'oh if you turn it off you can use it for
> DAMON' thing, it's just really odd and asking for trouble.
>
> I think the only _workable_ version of this would be to convert the numa
> handling to a generic 'fault then restore' type mechanism that could be hooked
> in to by either NUMA or DAMON.
I agree, and this is my current plan for the next version of this patch.
>
> But really I think you'd _need_ to not have significantly different behaviour
> between the two and _not_ constrain this to being only when NUMA balancing is
> disabled.
I agree with all the points. In particular, the current interface is ambiguous
and easy to misuse.
>
> But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for
> this stuff.
Yes, this would be ideal. But, memorizing something at the page level is ...
always an interesting challenge in my opinion, and I have no good idea for how
to do this for now :)
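The only direction I can imagine for now is a spare software PTE bit, and
those bits are scarce. Just to illustrate - _PAGE_SW_DAMON below is a made-up
name, and no such bit exists:

	/*
	 * Hypothetical: if an architecture had a spare software bit in its
	 * PTE format, DAMON-installed PAGE_NONE entries could be tagged with
	 * it, letting the fault path tell them apart from NUMA balancing
	 * ones.  _PAGE_SW_DAMON is invented for illustration only.
	 */
	static inline bool pte_damon_protnone(pte_t pte)
	{
		return pte_protnone(pte) && (pte_val(pte) & _PAGE_SW_DAMON);
	}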
>
> I'm not really sure there is an upstreamable version of this, but it'd need to
> be made generic like that if there were.
>
> I think it might be worth taking some time to examine whether a version of this
> that can be sensibly generic (which could have hooks for things) is _possible_
> before sending a v2.
Agreed, I don't need to rush. Let's take time and discuss sufficiently. :)
Nonetheless, I may post a followup version of this patch series that contains
this one, even before we reach a conclusion about this specific patch. I think
I may have to do that, to share the future idea in a way that is easy to
understand and test. I think it might also help us understand the real ROI of
this patch, and whether there is another option to move forward. In that case,
I will of course keep the RFC tag and clearly note that this patch is still
under discussion and not meant to be merged as-is before the discussion is
done.
>
> > >
> > > Also are we 100% sure that there's no races here? When we disable numa
> > > balancing do we correctly ensure that absolutely no racing NUMA balancing
> > > faults can happen whatsoever at this juncture?
> >
> > Enabling CONFIG_DAMON will not automatically invoke change_protection() with
> > MM_CP_DAMON. Such invocations will be made only if the user disables NUMA
> > balancing and runs DAMON in the reporting mode.
> >
> > So there can be two racy cases.
> >
> > If the user enables NUMA balancing and then disables it after a while, page
> > faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd,
> > but there should be no real problem in practice. DAMON's fault handling
> > will clean up the PAGE_NONE protection like NUMA balancing would, and
> > DAMON has no problem receiving such additional reports from
> > MM_CP_PROT_NUMA-caused faults. DAMON may show a few more accesses than
> > expected, but that's no problem for DAMON.
> >
> > Similarly, if the user starts DAMON in the report mode, stops DAMON, and
> > then enables NUMA balancing, faults for MM_CP_DAMON protections installed
> > while DAMON was running in the report mode can be handled by NUMA
> > balancing. This should also not cause real problems in practice, in my
> > opinion. NUMA balancing could see more accesses and migrate pages a little
> > more than expected, but that should be just for a moment.
>
> I'm really concerned about these.
>
> We're now introducing unexpected behaviour based on a race and allowing faults
> to be mis-handled.
>
> I'm maybe not as confident as you are that everything will 'just work' and it
> seems like we're asking for obscure bugs in the future.
>
> > > You really have to be 100% certain you're not going to wrongly handle NUMA
> > > page faults this way, on a potentially non-CONFIG_DAMON kernel to boot.
> >
> > I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next
> > version.
>
> Well, in the above you say that you can't help but do that when a race occurs?
I mean, I can't help it when CONFIG_DAMON is enabled.
The race (or, handling of faults caused by the other entity) can happen if and
only if all of the below hold.
1. CONFIG_DAMON is enabled.
2. CONFIG_NUMA_BALANCING is enabled.
3. The user repeatedly turns NUMA balancing and fault-based mode DAMON on and
off at runtime.
If any of these are not true, the race can be completely avoided. I was
talking about the case where the first condition is not met.
>
> >
> > >
> > > Keep in mind fault handling is verrrry racey (purposefully) and can be done
> > > under VMA read lock alone (and that's only very loosely a lock!).
> > >
> > > This is just, yeah, concerning.
> >
> > Thank you for the caution. DAMON's fault handling code only saves the
> > information in its internal ring buffer. It doesn't touch vmas. So I think
> > there should be no such problems. I will add the clarification on the next
> > version.
>
> Right, I'm just saying that this all being super racey between NUMA
> enabled/disabled seems pretty unavoidable, but we covered that above.
>
> > >
> > > Secondly, somebody can disable/enable NUMA balancing, and thus you are now
> > > allowing this function to, when somebody specifies 'DAMON', get NUMA
> > > balancing fault handling??
> > >
> > > If you don't bother checking whether NUMA balancing is disabled when you
> > > set it, then boom - you've broken the faulting mechanism, but even if you
> > > _do_, somebody can just switch it on again...
> >
> > As I explained in the two racy cases above, faults caused by DAMON or NUMA
> > balancing can be handled by the other's handling code, but only for a
> > limited time and under the user's control. To my understanding that should
> > not cause real problems in practice, and users would not be advised to
> > operate the system in such a way. I will add more documentation and
> > cautions about that in the next version.
>
> I really don't think a version of the code that results in the wrong handler
> being used is upstreamable, sorry.
>
> I've not dug into the nitty gritty details on what would happen in both cases,
> but even if it were 100% fine _now_, this is _exactly_ the kind of thing that
> results in horrible hard-to-debug issues later, should something change.
>
> Implicitly 'just having to know' that we might be in the wrong fault handler
> seems like asking for trouble, and the RoI on an optional profiling tool (this
> is not to take away from DAMON which is a _great_ utility, I'm saying it's
> simply not part of the _core_) isn't there.
I completely understand your concerns; thank you for nicely and patiently
keeping this discussion going. I don't need to upstream this in the short
term, and am open to every option. So let's take sufficient time for
discussion.
I will take more time to think about a few potential options to move forward,
and share those with a level of detail that can help us easily discuss
further, hopefully in a few days.
Thanks,
SJ
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
2025-07-30 4:21 ` SeongJae Park
@ 2025-07-31 12:18 ` Lorenzo Stoakes
0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 12:18 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Jann Horn,
Michal Hocko, Mike Rapoport, Pedro Falcato, Suren Baghdasaryan,
Vlastimil Babka, damon, kernel-team, linux-kernel, linux-mm
On Tue, Jul 29, 2025 at 09:21:06PM -0700, SeongJae Park wrote:
> > > Thank you for review and comments, Lorenzo. I fundamentally agree with
> > > all your points. I don't aim to merge this as-is. Actually this patch
> > > series is more like a POC, but apparently I was rushing. I will try to
> > > address your concerns in the next version.
> >
> > Thanks.
> >
> > I do wonder whether we really can have a whole new faulting mechanism just for
> > DAMON. Because if in future, we wanted to change how this worked, we'd be
> > constrained, and it is a very specific user.
> >
> > The issue is you need the PTE to be restored to its previous state, just like
> > NUMA balancing.
> >
> > And I really really do not like this 'oh if you turn it off you can use it for
> > DAMON' thing, it's just really odd and asking for trouble.
> >
> > I think the only _workable_ version of this would be to convert the numa
> > handling to a generic 'fault then restore' type mechanism that could be hooked
> > in to by either NUMA or DAMON.
>
> I agree, and this is my current plan for the next version of this patch.
I think we'd need some way to embed this information somehow, but I'll see what
you come up with :)
>
> >
> > But really I think you'd _need_ to not have significantly different behaviour
> > between the two and _not_ constrain this to being only when NUMA balancing is
> > disabled.
>
> I agree with all the points. In particular, the current interface is
> ambiguous and easy to misuse.
>
> >
> > But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for
> > this stuff.
>
> Yes, this would be ideal. But, memorizing something at the page level is ...
> always an interesting challenge in my opinion, and I have no good idea for
> how to do this for now :)
Yes this is the crux, or another approach will need to be taken.
>
> >
> > I'm not really sure there is an upstreamable version of this, but it'd need to
> > be made generic like that if there were.
> >
> > I think it might be worth taking some time to examine whether a version of this
> > that can be sensibly generic (which could have hooks for things) is _possible_
> > before sending a v2.
>
> Agreed, I don't need to rush. Let's take time and discuss sufficiently. :)
Worth just playing with different solutions.
>
> Nonetheless, I may post a followup version of this patch series that
> contains this one, even before we reach a conclusion about this specific
> patch. I think I may have to do that, to share the future idea in a way that
> is easy to understand and test. I think it might also help us understand the
> real ROI of this patch, and whether there is another option to move forward.
> In that case, I will of course keep the RFC tag and clearly note that this
> patch is still under discussion and not meant to be merged as-is before the
> discussion is done.
OK, as long as you please highlight this, indicating that you ack it's not
mergeable with the current approach.
> > Well, in the above you say that you can't help but do that when a race occurs?
>
> I mean, I can't help it when CONFIG_DAMON is enabled.
>
> The race (or, handling of faults caused by the other entity) can happen if
> and only if all of the below hold.
>
> 1. CONFIG_DAMON is enabled.
> 2. CONFIG_NUMA_BALANCING is enabled.
> 3. The user repeatedly turns NUMA balancing and fault-based mode DAMON on
> and off at runtime.
>
> If any of these are not true, the race can be completely avoided. I was
> talking about the case where the first condition is not met.
Well not necessarily continually, just at any stage.
I do think you're limiting yourself rather by requiring NUMA balancing to be
switched off.
But in any case, for reasons previously mentioned, I don't like this approach
and I don't think it's at all workable; rather, we either need a generic
'restore PTE state' mechanism (that can somehow store 'both OR either'
NUMA/DAMON state with them), or an entirely different approach.
Another idea might be that of converting the current NUMA balancing to something
generic, whose semantics are such that:
- It restores PTE state
- It 'happens' to relocate the mapping to the node the CPU from which the access
was made resides upon
- It also calls a hook.
So treat the NUMA balancing aspect as sort of implicit.
Then you can just use the machinery for marking ranges for NUMA balancing for
this, having defined a hook.
If you're indifferent as to whether you get the hook called for arbitrary NUMA
balancing faults, or have a way to identify whether or not those mappings were
marked for DAMON stats, then you have what you need without having to customise
_anything_ in the general faulting code for DAMON.
Instead you'd just set up the hook for DAMON.
This is just an untested and highly theoretical thought :)
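In rough, made-up code - nothing like this exists and I've not thought it
through - the shape would be:

	/*
	 * Theoretical sketch: keep the existing NUMA balancing fault
	 * semantics (restore the PTE, perhaps migrate towards the faulting
	 * CPU's node), but additionally notify a registered observer.  The
	 * observer and its setter are invented names.
	 */
	static void (*numa_fault_observer)(struct vm_fault *vmf);

	void set_numa_fault_observer(void (*fn)(struct vm_fault *vmf))
	{
		numa_fault_observer = fn;	/* DAMON would register here */
	}

	/* ... then, at the end of the existing NUMA fault handling ... */
	if (numa_fault_observer)
		numa_fault_observer(vmf);

DAMON would then just set the observer and reuse the existing machinery for
marking ranges, with no DAMON-specific code in the general fault path.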
> >
> > I really don't think a version of the code that results in the wrong handler
> > being used is upstreamable, sorry.
> >
> > I've not dug into the nitty gritty details on what would happen in both cases,
> > but even if it were 100% fine _now_, this is _exactly_ the kind of thing that
> > results in horrible hard-to-debug issues later, should something change.
> >
> > Implicitly 'just having to know' that we might be in the wrong fault handler
> > seems like asking for trouble, and the RoI on an optional profiling tool (this
> > is not to take away from DAMON which is a _great_ utility, I'm saying it's
> > simply not part of the _core_) isn't there.
>
> I completely understand your concerns; thank you for nicely and patiently
> keeping this discussion going. I don't need to upstream this in the short
> term, and am open to every option. So let's take sufficient time for
> discussion.
>
> I will take more time to think about a few potential options to move
> forward, and share those with a level of detail that can help us easily
> discuss further, hopefully in a few days.
Thanks!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
` (6 preceding siblings ...)
2025-07-27 20:18 ` [RFC v2 7/7] mm/damon/sysfs: support paddr_fault SeongJae Park
@ 2025-08-04 2:47 ` Andrew Paniakin
2025-08-04 16:57 ` SeongJae Park
7 siblings, 1 reply; 15+ messages in thread
From: Andrew Paniakin @ 2025-08-04 2:47 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Jann Horn,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Pedro Falcato,
Suren Baghdasaryan, Vlastimil Babka, damon, kernel-team,
linux-kernel, linux-mm, amazon-linux-kernel
On 27/07/2025, SeongJae Park wrote:
> TL; DR: Extend DAMON interface between core and operation sets for
> operation set driven report-based monitoring such as per-CPU and
> write-only access monitoring. Further introduce an example physical
> address space monitoring operation set that uses page faults as the
> source of the information.
Thank you very much for starting this update. The RFC mentions write-only
monitoring; this feature in particular would be really helpful in some of our
use cases, such as lightweight live migration target selection, so we are
looking forward to collaborating on development and testing!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring
2025-08-04 2:47 ` [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring Andrew Paniakin
@ 2025-08-04 16:57 ` SeongJae Park
0 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2025-08-04 16:57 UTC (permalink / raw)
To: Andrew Paniakin
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jann Horn, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Pedro Falcato, Suren Baghdasaryan, Vlastimil Babka, damon,
kernel-team, linux-kernel, linux-mm, amazon-linux-kernel
On Sun, 3 Aug 2025 19:47:41 -0700 Andrew Paniakin <apanyaki@amazon.com> wrote:
> On 27/07/2025, SeongJae Park wrote:
> > TL; DR: Extend DAMON interface between core and operation sets for
> > operation set driven report-based monitoring such as per-CPU and
> > write-only access monitoring. Further introduce an example physical
> > address space monitoring operation set that uses page faults as the
> > source of the information.
>
> Thank you very much for starting this update. The RFC mentions write-only
> monitoring; this feature in particular would be really helpful in some of
> our use cases, such as lightweight live migration target selection, so we
> are looking forward to collaborating on development and testing!
Thank you for letting us know your interest, Andrew. This should help with
better prioritization.
The development trees of DAMON and the DAMON user-space tool now support[1]
write-only monitoring. The implementation is dirty and not upstreamable for
now, but please feel free to test it and let me know what you see, if you
don't mind.
I will continue working on more testing and making it upstreamable.
[1] https://damonitor.github.io/posts/write_only_cpus_only_monitoring/
Thanks,
SJ
^ permalink raw reply [flat|nested] 15+ messages in thread
Thread overview: 15+ messages
2025-07-27 20:18 [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring SeongJae Park
2025-07-27 20:18 ` [RFC v2 1/7] mm/damon/core: introduce damon_report_access() SeongJae Park
2025-07-27 20:18 ` [RFC v2 2/7] mm/damon/core: add eligible_report() ops callback SeongJae Park
2025-07-27 20:18 ` [RFC v2 3/7] mm/damon/vaddr: implement eligible_report() SeongJae Park
2025-07-27 20:18 ` [RFC v2 4/7] mm/damon/core: read received access reports SeongJae Park
2025-07-27 20:18 ` [RFC v2 5/7] mm/memory: implement MM_CP_DAMON SeongJae Park
2025-07-28 5:19 ` Lorenzo Stoakes
2025-07-29 3:06 ` SeongJae Park
2025-07-29 9:40 ` Lorenzo Stoakes
2025-07-30 4:21 ` SeongJae Park
2025-07-31 12:18 ` Lorenzo Stoakes
2025-07-27 20:18 ` [RFC v2 6/7] mm/damon: implement paddr_fault operations set SeongJae Park
2025-07-27 20:18 ` [RFC v2 7/7] mm/damon/sysfs: support paddr_fault SeongJae Park
2025-08-04 2:47 ` [RFC v2 0/7] mm/damon: extend for page faults reporting based access monitoring Andrew Paniakin
2025-08-04 16:57 ` SeongJae Park