[RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes

* [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes
@ 2026-05-16 21:03 Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

Hi,

This series carries five fixes for the DAMOS quota controller and the
paddr migration walk.  All five were surfaced during closed-loop tiering
testing on a heterogeneous-memory system (DRAM + CXL on a separate
NUMA node), but each fix touches code paths shared by all callers and
is not scoped to closed-loop tiering or to any specific goal metric.

Test envelope: AMD EPYC dual-socket host with CXL.mem on a separate
NUMA node, two-scheme migrate_hot PULL+PUSH setup driven by
node_eligible_mem_bp (now in linux-next)[1].

What each patch does
====================

Patch 1 - mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum

  damon_moving_sum() can underflow when a region's access rate drops
  to zero faster than the moving-sum window length.  The internal
  accumulator subtracts an outgoing sample without a lower bound,
  producing a sentinel-large nr_accesses_bp that mis-classifies a
  cold region as hot.  Affects every DAMOS scheme, since
  nr_accesses_bp is used on every access-rate update for every
  region regardless of which scheme or goal metric is active.
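
  As a userspace illustration of the arithmetic only (not kernel code),
  the sketch below copies the pre-patch and post-patch expressions from
  patch 1 into two standalone functions.  The function names and the
  input values are hypothetical; whether such inputs actually arise in
  the kernel is a separate question.  Because the type is unsigned int,
  the wrapped result lands near UINT_MAX.

```c
#include <limits.h>

/* Pre-patch expression: wraps when nomvsum / len_window exceeds mvsum. */
static unsigned int moving_sum_unguarded(unsigned int mvsum,
		unsigned int nomvsum, unsigned int len_window,
		unsigned int new_value)
{
	return mvsum - nomvsum / len_window + new_value;
}

/* Post-patch expression: falls back to new_value instead of wrapping. */
static unsigned int moving_sum_guarded(unsigned int mvsum,
		unsigned int nomvsum, unsigned int len_window,
		unsigned int new_value)
{
	unsigned int subtrahend = nomvsum / len_window;

	if (subtrahend > mvsum)
		return new_value;
	return mvsum - subtrahend + new_value;
}
```

  With the made-up inputs mvsum=100, nomvsum=20000, len_window=10 and
  new_value=0, the unguarded form wraps to UINT_MAX - 1899 while the
  guarded form returns 0.  When the subtrahend does not exceed mvsum,
  both forms agree.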

Patch 2 - mm/damon/core: cap effective quota size to total monitored memory

  The DAMOS quota tuner can compute an effective size (esz) larger
  than the total monitored memory, because the tuner integrates over
  cumulative deltas without bounding by the actual workload size.
  Once esz exceeds total monitored memory the per-tick "remaining
  quota" arithmetic stops being meaningful: any scheme can apply to
  the entire monitored space and "remaining" stays positive
  indefinitely.  Cap esz at total monitored memory so the controller
  remains within physically realisable bounds.  Tuner-shape and
  goal-metric agnostic.

Patch 3 - mm/damon/core: floor effective quota size at minimum region size

  Symmetric to patch 2: the tuner can also compute esz < min_region_sz,
  causing schemes to attempt zero-byte migrations for many ticks before
  the tuner ramps esz back up.  Observed under the CONSIST tuner with
  a node_eligible_mem_bp goal: ftrace traced esz stuck at 1 byte for
  96 seconds before the first region was tried; the first acted-on
  region appeared at t=113s when esz crossed the min_region_sz
  threshold.  Floor esz at min_region_sz so schemes always have at
  least one region's worth of quota when the tuner asks them to act.
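
  A minimal userspace sketch of why a sub-region esz stalls a scheme,
  and what the floor changes.  quota_is_full() here is a simplified,
  hypothetical model (a quota counts as full when less than one
  minimum-sized region of headroom remains), not the kernel function;
  floor_esz() is likewise an illustrative name.

```c
/* Simplified model: full once less than min_region_sz of quota is left. */
static int quota_is_full(unsigned long esz, unsigned long charged_sz,
		unsigned long min_region_sz)
{
	if (charged_sz >= esz)
		return 1;
	return esz - charged_sz < min_region_sz;
}

/* Raise esz to at least one minimum-sized region. */
static unsigned long floor_esz(unsigned long esz, unsigned long min_region_sz)
{
	return esz < min_region_sz ? min_region_sz : esz;
}
```

  Under this model, esz = 1 with min_region_sz = 4096 reads as full
  before anything is charged, so no region is ever tried.  After
  flooring, the same quota admits one region.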

Patch 4 - mm/damon/paddr: skip free pageblocks in migration walk

  damon_pa_migrate() walks every 4KB PFN in a region.  On
  sparsely-populated lower-tier extents (e.g., a 549GB CXL region
  with only ~8GB populated), this is ~144M PFN iterations per scheme
  tick at ~40ns each = ~5.8 seconds of walk per tick.  Use
  pageblock-level free-page detection to skip unpopulated runs of
  pages: only enter the per-page loop for pageblocks that contain at
  least one allocated page.  This brings the walk to
  O(region_size / pageblock_size) skip-check cost plus
  O(populated_pages) per-page work.  On x86 pageblocks are 2MB, so
  the same 549GB/8GB example becomes ~281K pageblock skip-checks
  (microseconds total) plus ~2M per-page visits for the populated
  pages -- ~80ms expected.  Helps any migrate_hot/migrate_cold scheme
  on paddr ops, regardless of what drives them.
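
  The estimates above can be re-derived with a few lines of C.  This is
  a back-of-the-envelope sketch, assuming "GB" means GiB, 4KB base
  pages, 2MB pageblocks and a flat 40ns per visited page; the helper
  names are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SZ		4096ULL		/* 4KB base page */
#define PAGEBLOCK_SZ	(2ULL << 20)	/* 2MB pageblock, as on x86 */
#define GIB		(1ULL << 30)

/* Number of base pages (PFNs) covering the given byte range. */
static uint64_t nr_pfns(uint64_t bytes)
{
	return bytes / PAGE_SZ;
}

/* Number of pageblocks covering the given byte range. */
static uint64_t nr_pageblocks(uint64_t bytes)
{
	return bytes / PAGEBLOCK_SZ;
}

/* Walk time in nanoseconds at a fixed per-page cost. */
static uint64_t walk_ns(uint64_t nr_pages, uint64_t ns_per_page)
{
	return nr_pages * ns_per_page;
}
```

  For the 549GB region this gives 143,917,056 PFNs (about 5.76 seconds
  at 40ns each) and 281,088 pageblocks.  The 8GB populated part is
  2,097,152 pages, or about 84ms of per-page work.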

Patch 5 - mm/damon/paddr: add time budget to migration page walk

  Densely populated regions (e.g., a busy DRAM range where most
  pageblocks contain at least one allocated page) can still consume
  full ticks even with patch 4 applied.  Add a 100ms wall-clock
  budget with a ktime_get() check every 4096 pages walked
  (~16MB worth).  When the budget expires before reaching the end of
  a region, kdamond returns control; subsequent ticks re-walk the
  region from the start.  Folios already on the target node are
  dropped at migration time, so re-walks only re-do collection work,
  not the migrate itself.  Together with the per-scheme quota cap,
  per-tick work is bounded and the workload converges over multiple
  ticks for dense regions.

  Worst-case migration walk contribution to a tick is bounded at
  100ms per scheme regardless of region size or population density,
  preserving kdamond's ability to service other DAMOS schemes and
  user-space sysfs operations during heavy migration phases.
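
  The budget check can be sketched in userspace as follows.  This is
  not the patch's code: the clock is passed in as a function pointer so
  the example stays deterministic (the kernel side would read
  ktime_get()), and fake_clock()/fake_visit() are hypothetical
  stand-ins that charge 40ns per page.

```c
#include <stdint.h>

#define CHECK_INTERVAL	4096UL	/* pages walked between clock reads */

/* Hypothetical simulated clock: every visited page costs 40ns. */
static uint64_t fake_now_ns;

static uint64_t fake_clock(void)
{
	return fake_now_ns;
}

static void fake_visit(unsigned long pfn)
{
	(void)pfn;
	fake_now_ns += 40;
}

/*
 * Visit up to nr_pages pages, reading the clock once every
 * CHECK_INTERVAL pages.  Returns how many pages were visited before
 * the budget ran out (or nr_pages if the walk completed).
 */
static unsigned long walk_with_budget(unsigned long nr_pages,
		uint64_t budget_ns, uint64_t (*now)(void),
		void (*visit)(unsigned long pfn))
{
	uint64_t deadline = now() + budget_ns;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		if (i && i % CHECK_INTERVAL == 0 && now() >= deadline)
			break;
		visit(i);
	}
	return i;
}
```

  With the simulated 40ns per page and a 100ms budget, a 10M-page walk
  stops after 2,502,656 pages, the first multiple of 4096 at or past
  2,500,000.  The budget can therefore be overshot by up to one check
  interval's worth of work.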

Testing context
===============

  Hardware:  AMD EPYC dual-socket, CXL.mem on a separate NUMA node.
  Workload: 32GB hot working set across DRAM and CXL nodes.
  DAMON config: paddr ops, two migrate_hot schemes (PULL CXL->DRAM,
                PUSH DRAM->CXL) with complementary address filters,
                node_eligible_mem_bp goal per scheme, temporal
                quota tuner, 1s reset interval.

Each issue in this series was reproduced under the above setup, then
verified via ftrace and per-scheme stats after the fix was applied.

References
==========

[1] mm/damon: add node_eligible_mem_bp goal metric
https://lore.kernel.org/damon/20260428030520.701-1-ravis.opensrc@gmail.com/

Ravi Jonnalagadda (5):
  mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  mm/damon/core: cap effective quota size to total monitored memory
  mm/damon/core: floor effective quota size at minimum region size
  mm/damon/paddr: skip free pageblocks in migration walk
  mm/damon/paddr: add time budget to migration page walk

 mm/damon/core.c  | 29 ++++++++++++++++++++++++++++-
 mm/damon/paddr.c | 40 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 65 insertions(+), 4 deletions(-)

base-commit: 0cec77cfd5314c0b3b03530abe1a4b32e991f639
-- 
2.43.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 22:29   ` sashiko-bot
  2026-05-17 18:16   ` SeongJae Park
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

Guard against unsigned integer underflow when nomvsum/len_window
exceeds mvsum.  When that subtraction wraps, the moving sum returns a
near-UINT_MAX value and corrupts nr_accesses_bp.

If subtrahend > mvsum, return new_value: this clamps the moving-sum
estimate to the current observation rather than wrapping.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 3a8725e400c6b..9975f3d9ebfe9 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -3449,7 +3449,11 @@ int damon_set_region_system_rams_default(struct damon_target *t,
 static unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
 		unsigned int len_window, unsigned int new_value)
 {
-	return mvsum - nomvsum / len_window + new_value;
+	unsigned int subtrahend = nomvsum / len_window;
+
+	if (subtrahend > mvsum)
+		return new_value;
+	return mvsum - subtrahend + new_value;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
@ 2026-05-16 22:29   ` sashiko-bot
  2026-05-17 18:21     ` SeongJae Park
  2026-05-17 18:16   ` SeongJae Park
  1 sibling, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-05-16 22:29 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unsigned integer overflow in DAMON's moving sum and access rate tracking for large intervals.
--

commit 1d58ed9728ddbaf4579ec265e453a7b3a233285c
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>

mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum

Guard against unsigned integer underflow when nomvsum/len_window exceeds
mvsum. When that subtraction wraps, the moving sum returns a near maximum
value and corrupts nr_accesses_bp. If subtrahend > mvsum, return new_value.

> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 3a8725e400c6b..9975f3d9ebfe9 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -3449,7 +3449,11 @@ int damon_set_region_system_rams_default(struct damon_target *t,
>  static unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
>  		unsigned int len_window, unsigned int new_value)
>  {

This isn't a bug introduced by this patch, but could the nomvsum argument
passed here be silently overflowing?

Looking at damon_update_region_access_rate(), damon_moving_sum() is called
with r->last_nr_accesses * 10000:

mm/damon/core.c:damon_update_region_access_rate() {
	...
	if (attrs->sample_interval)
		len_window = damon_max_nr_accesses(attrs);
	r->nr_accesses_bp = damon_moving_sum(r->nr_accesses_bp,
			r->last_nr_accesses * 10000, len_window,
			accessed ? 10000 : 0);
	...
}

DAMON allows configuring aggr_interval / sample_interval up to the maximum
32-bit unsigned integer (for example, 10ms sampling and 1.5 hours
aggregation).

This means len_window, which bounds last_nr_accesses, can exceed 429,496.

If this happens, won't the multiplication r->last_nr_accesses * 10000
silently overflow the 32-bit unsigned int limit of 4,294,967,295?

If this overflows, the corrupted, wrapped-around value is passed as nomvsum,
which would cause subtrahend to be artificially small.

Additionally, would this cause r->nr_accesses_bp (which accumulates mvsum)
to also continuously wrap around 32-bit limits?

Could these silent integer overflows corrupt memory access tracking and lead
to erroneous memory management decisions, like reclaiming hot memory?

A proper fix might require bounding len_window to 429,496 or using 64-bit
integers.
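
The wrap-around threshold can be checked in userspace with the sketch
below.  It assumes unsigned int is 32 bits wide; the function names are
hypothetical and only mirror the "last_nr_accesses * 10000" expression
quoted above.

```c
#include <stdint.h>

/* 32-bit multiplication, as in the quoted call site: wraps silently. */
static unsigned int scale_bp_32(unsigned int last_nr_accesses)
{
	return last_nr_accesses * 10000;
}

/* Widened multiplication: the product always fits in 64 bits. */
static uint64_t scale_bp_64(unsigned int last_nr_accesses)
{
	return (uint64_t)last_nr_accesses * 10000;
}
```

429,496 is the largest input that survives the 32-bit multiplication
(4,294,960,000).  429,497 wraps to 2,704, and 540,000 (1.5 hours of
aggregation at 10ms sampling) wraps to 1,105,032,704.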

> -	return mvsum - nomvsum / len_window + new_value;
> +	unsigned int subtrahend = nomvsum / len_window;
> +
> +	if (subtrahend > mvsum)
> +		return new_value;
> +	return mvsum - subtrahend + new_value;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 22:29   ` sashiko-bot
@ 2026-05-17 18:21     ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 18:21 UTC (permalink / raw)
  To: sashiko-bot; +Cc: SeongJae Park, Ravi Jonnalagadda, damon

On Sat, 16 May 2026 22:29:51 +0000 sashiko-bot@kernel.org wrote:

> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Unsigned integer overflow in DAMON's moving sum and access rate tracking for large intervals.
> --
> 
> commit 1d58ed9728ddbaf4579ec265e453a7b3a233285c
> Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
> 
> mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
> 
> Guard against unsigned integer underflow when nomvsum/len_window exceeds
> mvsum. When that subtraction wraps, the moving sum returns a near maximum
> value and corrupts nr_accesses_bp. If subtrahend > mvsum, return new_value.
> 
> > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > index 3a8725e400c6b..9975f3d9ebfe9 100644
> > --- a/mm/damon/core.c
> > +++ b/mm/damon/core.c
> > @@ -3449,7 +3449,11 @@ int damon_set_region_system_rams_default(struct damon_target *t,
> >  static unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
> >  		unsigned int len_window, unsigned int new_value)
> >  {
> 
> This isn't a bug introduced by this patch, but could the nomvsum argument
> passed here be silently overflowing?
> 
> Looking at damon_update_region_access_rate(), damon_moving_sum() is called
> with r->last_nr_accesses * 10000:
> 
> mm/damon/core.c:damon_update_region_access_rate() {
> 	...
> 	if (attrs->sample_interval)
> 		len_window = damon_max_nr_accesses(attrs);
> 	r->nr_accesses_bp = damon_moving_sum(r->nr_accesses_bp,
> 			r->last_nr_accesses * 10000, len_window,
> 			accessed ? 10000 : 0);
> 	...
> }
> 
> DAMON allows configuring aggr_interval / sample_interval up to the maximum
> 32-bit unsigned integer (for example, 10ms sampling and 1.5 hours
> aggregation).
> 
> This means len_window, which bounds last_nr_accesses, can exceed 429,496.
> 
> If this happens, won't the multiplication r->last_nr_accesses * 10000
> silently overflow the 32-bit unsigned int limit of 4,294,967,295?
> 
> If this overflows, the corrupted, wrapped-around value is passed as nomvsum,
> which would cause subtrahend to be artificially small.
> 
> Additionally, would this cause r->nr_accesses_bp (which accumulates mvsum)
> to also continuously wrap around 32-bit limits?
> 
> Could these silent integer overflows corrupt memory access tracking and lead
> to erroneous memory management decisions, like reclaiming hot memory?
> 
> A proper fix might require bounding len_window to 429,496 or using 64-bit
> integers.

That could happen when a user sets aggr_interval > sample_interval * 429,496.  I
don't think that's a common setup, and a sane user would do some testing
before using such an arbitrary setup.  So I'd suggest revisiting this when we
have enough time, or if a real-world issue is reported.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
  2026-05-16 22:29   ` sashiko-bot
@ 2026-05-17 18:16   ` SeongJae Park
  1 sibling, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 18:16 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

Hello Ravi,

On Sat, 16 May 2026 14:03:53 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> Guard against unsigned integer underflow when nomvsum/len_window
> exceeds mvsum.

How could this happen?  mvsum is assumed to be the same as nomvsum at the
beginning of the window.  Hence, even if every new_value is zero, mvsum should
be exactly zero at the end of the window.  Of course, there could be a bug that
breaks the assumption.

> When that subtraction wraps, the moving sum returns a
> near-UINT_MAX value and corrupts nr_accesses_bp.
> 
> If subtrahend > mvsum, return new_value: this clamps the moving-sum
> estimate to the current observation rather than wrapping.

I guess you saw this issue in a real run, and this change should fix the issue.
But I think we should know why and how mvsum < nomvsum / len_window can
unexpectedly happen, and fix that.

Could you share more details about when and how the situation happens?

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 22:55   ` sashiko-bot
  2026-05-17 18:36   ` SeongJae Park
  2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

The DAMOS quota goal tuner can compute an effective size (esz) larger
than the total monitored memory because it integrates over cumulative
deltas without bounding by the actual workload size.  Once esz exceeds
total monitored memory, the per-tick "remaining quota" arithmetic
stops being meaningful: any scheme can apply to the entire monitored
space and "remaining" stays positive indefinitely.

Cap esz to the total size of all currently monitored regions as a
final bound after all other quota calculations.  Add
damon_ctx_total_monitored_sz() helper that sums region sizes across
all targets.

The helper runs only inside damos_set_effective_quota(), which is
called at most once per quota reset_interval (default 1s) per scheme,
not per kdamond tick.  Walk cost is O(nr_regions) at that frequency
and is dominated by the enclosing tuner work.

This bound is tuner-shape and goal-metric agnostic: it constrains the
quota controller to physically realisable values regardless of which
tuner or goal metric drives it.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 9975f3d9ebfe9..fd1db234ca304 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2614,6 +2614,19 @@ static void damos_goal_tune_esz_bp_temporal(struct damon_ctx *c,
 		quota->esz_bp = ULONG_MAX;
 }
 
+/* Sum of all monitored region sizes across all targets in @ctx. */
+static unsigned long damon_ctx_total_monitored_sz(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long total = 0;
+
+	damon_for_each_target(t, ctx)
+		damon_for_each_region(r, t)
+			total += damon_sz_region(r);
+	return total;
+}
+
 /*
  * Called only if quota->ms, or quota->sz are set, or quota->goals is not empty
  */
@@ -2621,6 +2634,7 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 {
 	struct damos_quota *quota = &s->quota;
 	unsigned long throughput;
+	unsigned long total_sz;
 	unsigned long esz = ULONG_MAX;
 
 	if (!quota->ms && list_empty(&quota->goals)) {
@@ -2649,6 +2663,11 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 	if (quota->sz && quota->sz < esz)
 		esz = quota->sz;
 
+	/* Safety cap: never migrate more than total monitored memory */
+	total_sz = damon_ctx_total_monitored_sz(ctx);
+	if (total_sz && esz > total_sz)
+		esz = total_sz;
+
 	quota->esz = esz;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
@ 2026-05-16 22:55   ` sashiko-bot
  2026-05-17 18:48     ` SeongJae Park
  2026-05-17 18:36   ` SeongJae Park
  1 sibling, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-05-16 22:55 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The safety cap limits the effective quota output (`esz`), but fails to cap the tuner's internal state (`quota->esz_bp`), leading to severe integral windup.
- [High] Accumulating the total monitored size into an `unsigned long` overflows on 32-bit architectures, permanently crippling the scheme quota.
- [High] Applying the safety cap can forcefully reduce the effective quota below `min_region_sz`, resulting in permanent scheme starvation for small targets.
- [Medium] The safety cap is completely bypassed for statically sized quotas due to an early return, causing inconsistent API behavior.
--

commit afc20f6b1154f371b12a0f1f9ad37023b5bba0ef
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
Subject: mm/damon/core: cap effective quota size to total monitored memory

This commit introduces a safety cap on the effective quota size computed by
the DAMOS quota goal tuner to prevent it from exceeding the total monitored
memory.

> diff --git a/mm/damon/core.c b/mm/damon/core.c
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c

[ ... ]

> @@ -2614,6 +2614,19 @@ static void damos_goal_tune_esz_bp_temporal(struct damon_ctx *c,
>  		quota->esz_bp = ULONG_MAX;
>  }
>  
> +/* Sum of all monitored region sizes across all targets in @ctx. */
> +static unsigned long damon_ctx_total_monitored_sz(struct damon_ctx *ctx)
> +{
> +	struct damon_target *t;
> +	struct damon_region *r;
> +	unsigned long total = 0;
> +
> +	damon_for_each_target(t, ctx)
> +		damon_for_each_region(r, t)
> +			total += damon_sz_region(r);

Could the total variable overflow on 32-bit architectures? Since a single
DAMON context can monitor multiple processes, the combined virtual memory size
might exceed 4 GiB. Would it be safer to use u64 to accumulate the total sizes
and avoid wrapping?

> +	return total;
> +}
> +

[ ... ]

> @@ -2621,6 +2634,7 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
>  {
>  	struct damos_quota *quota = &s->quota;
>  	unsigned long throughput;
> +	unsigned long total_sz;
>  	unsigned long esz = ULONG_MAX;
>  
>  	if (!quota->ms && list_empty(&quota->goals)) {
>  		quota->esz = quota->sz;
>  		return;
>  	}

The commit message mentions this is a final bound after all other quota
calculations. If a purely static size quota is configured with no time limit
and no tuning goals, the early return here bypasses the new safety cap. Is
this intended behavior, or should the static quota also be constrained by
total_sz?

> @@ -2649,6 +2663,11 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
>  	if (quota->sz && quota->sz < esz)
>  		esz = quota->sz;
>  
> +	/* Safety cap: never migrate more than total monitored memory */
> +	total_sz = damon_ctx_total_monitored_sz(ctx);
> +	if (total_sz && esz > total_sz)
> +		esz = total_sz;
> +

Can this cap force the effective quota below the minimum region size? If
total_sz happens to be smaller than ctx->min_region_sz, applying this cap at
the very end would make esz smaller than min_region_sz. If that occurs, will
operations layer checks like esz - charged_sz < min_region_sz always evaluate
to true and permanently starve the scheme?

Does the goal tuner's internal state also need to be bounded here? While the
effective output esz is capped, quota->esz_bp remains untouched.

If a workload consistently misses its tuning goal, quota->esz_bp might
exponentially increase until it hits ULONG_MAX. Once the workload behavior
changes, could the wound-up state take a prolonged time to decay back down
below the physical limit, leaving the scheme unresponsive during recovery?

>  	quota->esz = esz;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 22:55   ` sashiko-bot
@ 2026-05-17 18:48     ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 18:48 UTC (permalink / raw)
  To: sashiko-bot; +Cc: SeongJae Park, Ravi Jonnalagadda, damon

On Sat, 16 May 2026 22:55:28 +0000 sashiko-bot@kernel.org wrote:

> Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
> - [High] The safety cap limits the effective quota output (`esz`), but fails to cap the tuner's internal state (`quota->esz_bp`), leading to severe integral windup.
> - [High] Accumulating the total monitored size into an `unsigned long` overflows on 32-bit architectures, permanently crippling the scheme quota.
> - [High] Applying the safety cap can forcefully reduce the effective quota below `min_region_sz`, resulting in permanent scheme starvation for small targets.
> - [Medium] The safety cap is completely bypassed for statically sized quotas due to an early return, causing inconsistent API behavior.

I have a high-level question, which I posted as a reply to the patch.  I will
look into these deeper details after resolving the high-level discussion.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
  2026-05-16 22:55   ` sashiko-bot
@ 2026-05-17 18:36   ` SeongJae Park
  2026-05-18  5:22     ` Ravi Jonnalagadda
  1 sibling, 1 reply; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

Hello Ravi,

On Sat, 16 May 2026 14:03:54 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> The DAMOS quota goal tuner can compute an effective size (esz) larger
> than the total monitored memory because it integrates over cumulative
> deltas without bounding by the actual workload size.  Once esz exceeds
> total monitored memory, the per-tick "remaining quota" arithmetic
> stops being meaningful: any scheme can apply to the entire monitored
> space and "remaining" stays positive indefinitely.

Nice finding!

> 
> Cap esz to the total size of all currently monitored regions as a
> final bound after all other quota calculations.  Add
> damon_ctx_total_monitored_sz() helper that sums region sizes across
> all targets.

You could also make an arbitrary cap by setting the static size quota.  That
is, if a size quota and/or a time quota is set in addition to the quota goal,
and the different types of quotas disagree about the real quota, DAMOS uses the
smallest one.  You could read the damos_set_effective_quota() code and the
kernel-doc comment of 'struct damos_quota' for more details.

So you could apply the total monitored region size cap by setting the size
quota to the total monitored region size.  Could that work for you?

Adding the total monitored region size cap makes sense to me, and I think that
will make user experience better.  But, if the size quota based cap works, that
could also be handled in user space in an easier and even a better way.  If so,
I'd prefer that direction, to reduce kernel code complexity.  What do you think?
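
As a rough userspace sketch of the smallest-quota-wins rule described
above (a simplification, not the body of damos_set_effective_quota()):
each argument is one candidate quota in bytes, and treating 0 as "not
set" for all three is an assumption made for this example.

```c
#include <limits.h>

/* Pick the smallest candidate quota that is set (non-zero). */
static unsigned long smallest_quota(unsigned long goal_esz,
		unsigned long time_esz, unsigned long static_sz)
{
	unsigned long esz = ULONG_MAX;

	if (goal_esz && goal_esz < esz)
		esz = goal_esz;
	if (time_esz && time_esz < esz)
		esz = time_esz;
	if (static_sz && static_sz < esz)
		esz = static_sz;
	return esz;
}
```

Under this model, a goal-tuned value that has grown to ULONG_MAX is
still bounded by a static size quota set from user space, which is the
cap being suggested here.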

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-17 18:36   ` SeongJae Park
@ 2026-05-18  5:22     ` Ravi Jonnalagadda
  2026-05-19  0:38       ` SeongJae Park
  0 siblings, 1 reply; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-18  5:22 UTC (permalink / raw)
  To: SeongJae Park
  Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
	ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, May 17, 2026 at 11:37 AM SeongJae Park <sj@kernel.org> wrote:
>
> Hello Ravi,
>
> On Sat, 16 May 2026 14:03:54 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > The DAMOS quota goal tuner can compute an effective size (esz) larger
> > than the total monitored memory because it integrates over cumulative
> > deltas without bounding by the actual workload size.  Once esz exceeds
> > total monitored memory, the per-tick "remaining quota" arithmetic
> > stops being meaningful: any scheme can apply to the entire monitored
> > space and "remaining" stays positive indefinitely.
>
> Nice finding!
>
> >
> > Cap esz to the total size of all currently monitored regions as a
> > final bound after all other quota calculations.  Add
> > damon_ctx_total_monitored_sz() helper that sums region sizes across
> > all targets.
>
> You could also make an arbitrary cap by setting the static size quota.  That
> is, if there are not only quota goal but also the size quota and/or time quota,
> and the different types of quotas disagree about the real quota, DAMOS uses
> smallest quota.  You could read damos_set_effective_quota() code and kernel-doc
> comment of 'struct damos_quota' for more details.
>
> So you could apply the total monitoring region size cap by setting the size
> quota of the total monitoring region size.  Could that work for you?
>
> Adding the total monitoring region size cap makes sense to me, and I think that
> will make user experience better.  But, if the size quota based cap works, that
> could also be handled on user space in an easier and even a better way.  If so,
> I'd prefer the direction, to reduce kernel code complexity.  What do you think?

Hello SJ,

Agreed.  quota->sz combined with the smallest-quota-wins rule in
damos_set_effective_quota does express this cap from userspace
without kernel changes, and keeping the kernel side clean is the
right call.

If the UX argument carries weight later, I'm happy to respin v2
with sashiko fixes addressed.

Thanks,
Ravi

>
>
> Thanks,
> SJ
>
> [...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-18  5:22     ` Ravi Jonnalagadda
@ 2026-05-19  0:38       ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-19  0:38 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, 17 May 2026 22:22:34 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 11:37 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > Hello Ravi,
> >
> > On Sat, 16 May 2026 14:03:54 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > The DAMOS quota goal tuner can compute an effective size (esz) larger
> > > than the total monitored memory because it integrates over cumulative
> > > deltas without bounding by the actual workload size.  Once esz exceeds
> > > total monitored memory, the per-tick "remaining quota" arithmetic
> > > stops being meaningful: any scheme can apply to the entire monitored
> > > space and "remaining" stays positive indefinitely.
> >
> > Nice finding!
> >
> > >
> > > Cap esz to the total size of all currently monitored regions as a
> > > final bound after all other quota calculations.  Add
> > > damon_ctx_total_monitored_sz() helper that sums region sizes across
> > > all targets.
> >
> > You could also make an arbitrary cap by setting the static size quota.  That
> > is, if there are not only quota goal but also the size quota and/or time quota,
> > and the different types of quotas disagree about the real quota, DAMOS uses
> > smallest quota.  You could read damos_set_effective_quota() code and kernel-doc
> > comment of 'struct damos_quota' for more details.
> >
> > So you could apply the total monitoring region size cap by setting the size
> > quota of the total monitoring region size.  Could that work for you?
> >
> > Adding the total monitoring region size cap makes sense to me, and I think that
> > will make user experience better.  But, if the size quota based cap works, that
> > could also be handled in user space in an easier and even better way.  If so,
> > I'd prefer that direction, to reduce kernel code complexity.  What do you think?
> 
> Hello SJ,
> 
> Agreed.  quota->sz combined with the smallest-quota-wins rule in
> damos_set_effective_quota does express this cap from userspace
> without kernel changes, and keeping the kernel side clean is the
> right call.
> 
> If the UX argument carries weight later, I'm happy to respin v2
> with sashiko fixes addressed.

Makes sense.  I see no change in that weight for now.  If someone, including
myself or you, raises the point again in the future, we could revisit.
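
For reference, a minimal sketch of the smallest-quota-wins rule discussed
above.  This is not the actual damos_set_effective_quota() code: the function
names and the QUOTA_UNSET sentinel are made up for illustration, and how the
kernel represents an unset quota is not modeled here.

```c
#include <stdint.h>

/* Hypothetical sentinel meaning "this quota type is not configured". */
#define QUOTA_UNSET UINT64_MAX

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Return the effective quota in bytes: the smallest of the goal-tuned
 * size, the size derived from the time quota, and the static size quota.
 * An unset quota is UINT64_MAX, so it never wins the comparison.
 */
static uint64_t effective_quota(uint64_t goal_esz, uint64_t time_esz,
				uint64_t sz)
{
	return min_u64(min_u64(goal_esz, time_esz), sz);
}
```

With a goal-tuned esz of 1 TiB, no time quota, and a size quota equal to a
64 GiB monitored range, effective_quota() returns 64 GiB, which is the cap
that patch 2 was trying to enforce in the kernel.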


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-17 18:47   ` SeongJae Park
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
  4 siblings, 1 reply; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

The CONSIST quota goal tuner initializes esz_bp to 0, producing an
effective quota size (esz) of 1 byte on the first tick.
damos_quota_is_full() rejects all regions when esz < min_region_sz
(default PAGE_SIZE = 4096), so no regions can be tried and no
feedback reaches the tuner — a bootstrapping deadlock.

Floor esz at ctx->min_region_sz after the tuner computes it, guarded
by an esz != 0 check.  The guard preserves the temporal tuner's
intentional stop behavior: when score >= 10000 (goal met), temporal
sets esz_bp = 0 to halt migration; the floor must not override that.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index fd1db234ca304..d33c4360cbd60 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2650,6 +2650,10 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 		esz = quota->esz_bp / 10000;
 	}
 
+	/* avoid cold-start deadlock, but respect tuner stop signal (esz=0) */
+	if (esz)
+		esz = max_t(unsigned long, esz, ctx->min_region_sz);
+
 	if (quota->ms) {
 		if (quota->total_charged_ns)
 			throughput = mult_frac(quota->total_charged_sz,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size
  2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
@ 2026-05-17 18:47   ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 18:47 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sat, 16 May 2026 14:03:55 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> The CONSIST quota goal tuner initializes esz_bp to 0, producing an
> effective quota size (esz) of 1 byte on the first tick.
> damos_quota_is_full() rejects all regions when esz < min_region_sz
> (default PAGE_SIZE = 4096), so no regions can be tried and no
> feedback reaches the tuner — a bootstrapping deadlock.

That depends on whether the goal is already [over]-achieved.  If the goal is
achieved, the tuner will think no change is needed, so it keeps the
effectively-zero quota.  If the goal is over-achieved, the tuner will think the
DAMOS scheme should be less aggressive, but the quota is already effectively
zero, so it keeps the effectively-zero quota.

If the goal is under-achieved, the logic will iteratively increase the internal
esz (esz_bp) until it exceeds the min_region_sz, and finally start having some
effect.
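
The ramp-up described above can be illustrated with a small user-space sketch.
It assumes a simplified proportional step, where the input grows by the
fraction of the goal that is still missing.  The names next_esz_bp() and
ticks_until_effective() are hypothetical, and this is not the kernel's tuner
code; shrinking on over-achievement is deliberately not modeled.

```c
#include <stdint.h>

#define GOAL_BP 10000ULL	/* score at which the goal is exactly met */

/*
 * Hypothetical, simplified proportional step (not the kernel's tuner).
 * Ignores overflow, so it is only meant for small inputs.
 */
static uint64_t next_esz_bp(uint64_t last_bp, uint64_t score_bp)
{
	uint64_t diff, comp;

	if (last_bp < GOAL_BP)
		last_bp = GOAL_BP;	/* treat 0 as the 1-byte minimum */
	if (score_bp >= GOAL_BP)
		return last_bp;		/* goal [over]-achieved: no growth */
	diff = GOAL_BP - score_bp;
	comp = last_bp * diff / GOAL_BP;
	return last_bp + comp;
}

/*
 * Count the ticks until esz (esz_bp / 10000) reaches min_region_sz,
 * assuming the score stays constant.  Returns -1 if it never grows.
 */
static int ticks_until_effective(uint64_t score_bp, uint64_t min_region_sz)
{
	uint64_t esz_bp = 0;
	int ticks = 0;

	if (score_bp >= GOAL_BP)
		return -1;
	while (esz_bp / GOAL_BP < min_region_sz) {
		esz_bp = next_esz_bp(esz_bp, score_bp);
		ticks++;
	}
	return ticks;
}
```

Under these assumptions, a score of 0 doubles the input on every tick, so a
4096-byte min_region_sz is reached after 12 ticks.  A score at or above 10000
never grows the quota, which is the "no deadlock unless already achieved"
case.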

So, unless the goal is already [over]-achieved, there is no deadlock.  If the
goal is already [over]-achieved, why would we want to make DAMOS do something?

Am I missing something?

I'd like to discuss this high level thing first, before digging deep into the
details.

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
                   ` (2 preceding siblings ...)
  2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 23:36   ` sashiko-bot
  2026-05-17 23:37   ` SeongJae Park
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
  4 siblings, 2 replies; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

damon_pa_migrate() walks every PFN in a region linearly, calling
damon_get_folio() for each one.  On sparse physical address spaces
(e.g., CXL-attached memory), a single DAMON region can span hundreds
of gigabytes where most memory is free and sitting in the buddy
allocator.  Most page lookups are fruitless and dominate kdamond
tick time.

Check at pageblock boundaries (2MB on x86_64) whether the block is
entirely free.  If the first page of a pageblock is a buddy page at
pageblock_order or higher, the entire block is free and can be
skipped.  Similarly skip pageblocks where pfn_to_online_page() returns
NULL.

This reduces the iteration from O(region_sz / PAGE_SIZE) to
O(region_sz / pageblock_sz) + O(populated_pages).

buddy_order_unsafe() is used without zone->lock.  A transient false
positive (block becomes non-free between the PageBuddy and order
checks) costs at most one tick of missed candidates on that block;
the next tick re-scans.  No correctness consequence as DAMON walks
are best-effort.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/paddr.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index c4738cd5e221e..e844c990987b9 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -258,13 +258,32 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 		unsigned long addr_unit, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
-	phys_addr_t addr, applied;
+	phys_addr_t addr, end, applied;
 	LIST_HEAD(folio_list);
 	struct folio *folio = NULL;
+	unsigned long pfn;
 
 	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
-	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
-		folio = damon_get_folio(PHYS_PFN(addr));
+	end = damon_pa_phys_addr(r->ar.end, addr_unit);
+	while (addr < end) {
+		pfn = PHYS_PFN(addr);
+
+		/* Skip pageblocks that are entirely free. */
+		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+			struct page *page = pfn_to_online_page(pfn);
+
+			if (!page) {
+				addr += pageblock_nr_pages * PAGE_SIZE;
+				continue;
+			}
+			if (PageBuddy(page) &&
+			    buddy_order_unsafe(page) >= pageblock_order) {
+				addr += pageblock_nr_pages * PAGE_SIZE;
+				continue;
+			}
+		}
+
+		folio = damon_get_folio(pfn);
 		if (damon_pa_invalid_damos_folio(folio, s)) {
 			addr += PAGE_SIZE;
 			continue;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
@ 2026-05-16 23:36   ` sashiko-bot
  2026-05-17 23:41     ` SeongJae Park
  2026-05-17 23:37   ` SeongJae Park
  1 sibling, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-05-16 23:36 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] Calling `PageBuddy()` on a tail page triggers a kernel panic.
- [High] Unsafe address advancement skips valid memory when starting from an unaligned offset within a folio.
- [High] Speculative access of folio flags on an unreferenced folio risks `VM_BUG_ON_FOLIO`.
- [Medium] The optimization fails to skip buddy blocks larger than `pageblock_order` efficiently.
- [Medium] Skipping an entire pageblock when `pfn_to_online_page()` returns NULL misses valid memory.
--

commit 66e620b27c62d5fcd72a28187d8b381c33d31172
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
Subject: mm/damon/paddr: skip free pageblocks in migration walk

This patch optimizes DAMON physical memory walks on sparse address spaces
by checking if a pageblock is entirely free or offline, skipping the entire
block to reduce kdamond tick time.

> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index c4738cd5e221e..e844c990987b9 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -258,13 +258,32 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
>  		unsigned long addr_unit, struct damos *s,
>  		unsigned long *sz_filter_passed)
>  {
> -	phys_addr_t addr, applied;
> +	phys_addr_t addr, end, applied;
>  	LIST_HEAD(folio_list);
>  	struct folio *folio = NULL;
> +	unsigned long pfn;
>  
>  	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
> -	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
> -		folio = damon_get_folio(PHYS_PFN(addr));
> +	end = damon_pa_phys_addr(r->ar.end, addr_unit);
> +	while (addr < end) {
> +		pfn = PHYS_PFN(addr);
> +
> +		/* Skip pageblocks that are entirely free. */
> +		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> +			struct page *page = pfn_to_online_page(pfn);
> +
> +			if (!page) {
> +				addr += pageblock_nr_pages * PAGE_SIZE;
> +				continue;
> +			}

If a memory hole is smaller than a pageblock, can skipping the entire
pageblock miss valid, online pages in the remainder of the block?

> +			if (PageBuddy(page) &&
> +			    buddy_order_unsafe(page) >= pageblock_order) {
> +				addr += pageblock_nr_pages * PAGE_SIZE;
> +				continue;
> +			}

Does this code risk a kernel panic when calling PageBuddy() on a tail page?

If a pageblock-aligned PFN points to a tail page of a larger compound page
(such as a 1GB HugeTLB page), PageBuddy() would trigger a VM_BUG_ON_PGFLAGS
due to the PF_NO_TAIL policy.

Also, if a free buddy block is larger than pageblock_order, does the
optimization fall back to stepping by PAGE_SIZE?

The first pageblock is skipped, but on the next iteration, pfn points to the
start of the second pageblock, which is a tail page of the large buddy block.
Because the PG_buddy flag is only set on the head page, PageBuddy(page) will
evaluate to false.

> +		}
> +
> +		folio = damon_get_folio(pfn);

This isn't a bug introduced by this patch, but does damon_get_folio()
speculatively access folio flags before acquiring a reference?

mm/damon/ops-common.c:damon_get_folio() {
    ...
    if (!folio_test_lru(folio) || !folio_try_get(folio))
        return NULL;
    ...
}

If the page is concurrently freed and reallocated as a tail page of a new
compound page, calling folio_test_lru() before folio_try_get() risks a
VM_BUG_ON_FOLIO.

Additionally, this wasn't introduced by this patch, but can the physical
address advancement skip valid memory when starting from an unaligned offset?

If damon_pa_migrate() starts at an unaligned offset within a large folio, pfn
points to a tail page, and damon_get_folio() resolves this to the head folio.
The loop then unconditionally advances by the full folio size:

    put_folio:
        addr += folio_size(folio);
        folio_put(folio);

Adding the total folio size to an unaligned starting offset would push addr
past the actual end of the folio, skipping pages in the subsequent memory
region.
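
The arithmetic behind this last point can be sketched in plain C.  The two
helpers below are hypothetical and only model the address math, not the
folio handling; they assume the folio start address is known.

```c
#include <stdint.h>

/* Advance the way the quoted loop does: current address plus folio size. */
static uint64_t advance_by_folio_size(uint64_t addr, uint64_t folio_size)
{
	return addr + folio_size;
}

/* Advance to the first byte after the folio, wherever addr was inside it. */
static uint64_t advance_to_folio_end(uint64_t folio_start, uint64_t folio_size)
{
	return folio_start + folio_size;
}
```

For a 2 MiB folio starting at 0x200000 and a walk that enters it at 0x201000,
the first helper yields 0x401000 while the folio actually ends at 0x400000,
so one 4 KiB page of the following memory would be skipped.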

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=4

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 23:36   ` sashiko-bot
@ 2026-05-17 23:41     ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 23:41 UTC (permalink / raw)
  To: sashiko-bot; +Cc: SeongJae Park, Ravi Jonnalagadda, damon

On Sat, 16 May 2026 23:36:27 +0000 sashiko-bot@kernel.org wrote:

> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [High] Calling `PageBuddy()` on a tail page triggers a kernel panic.
> - [High] Unsafe address advancement skips valid memory when starting from an unaligned offset within a folio.
> - [High] Speculative access of folio flags on an unreferenced folio risks `VM_BUG_ON_FOLIO`.
> - [Medium] The optimization fails to skip buddy blocks larger than `pageblock_order` efficiently.
> - [Medium] Skipping an entire pageblock when `pfn_to_online_page()` returns NULL misses valid memory.
> --
> 
> commit 66e620b27c62d5fcd72a28187d8b381c33d31172
> Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
> Subject: mm/damon/paddr: skip free pageblocks in migration walk
> 
> This patch optimizes DAMON physical memory walks on sparse address spaces
> by checking if a pageblock is entirely free or offline, skipping the entire
> block to reduce kdamond tick time.
> 
> > diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> > index c4738cd5e221e..e844c990987b9 100644
> > --- a/mm/damon/paddr.c
> > +++ b/mm/damon/paddr.c
> > @@ -258,13 +258,32 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
> >  		unsigned long addr_unit, struct damos *s,
> >  		unsigned long *sz_filter_passed)
> >  {
> > -	phys_addr_t addr, applied;
> > +	phys_addr_t addr, end, applied;
> >  	LIST_HEAD(folio_list);
> >  	struct folio *folio = NULL;
> > +	unsigned long pfn;
> >  
> >  	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
> > -	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
> > -		folio = damon_get_folio(PHYS_PFN(addr));
> > +	end = damon_pa_phys_addr(r->ar.end, addr_unit);
> > +	while (addr < end) {
> > +		pfn = PHYS_PFN(addr);
> > +
> > +		/* Skip pageblocks that are entirely free. */
> > +		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> > +			struct page *page = pfn_to_online_page(pfn);
> > +
> > +			if (!page) {
> > +				addr += pageblock_nr_pages * PAGE_SIZE;
> > +				continue;
> > +			}
> 
> If a memory hole is smaller than a pageblock, can skipping the entire
> pageblock miss valid, online pages in the remainder of the block?
> 
> > +			if (PageBuddy(page) &&
> > +			    buddy_order_unsafe(page) >= pageblock_order) {
> > +				addr += pageblock_nr_pages * PAGE_SIZE;
> > +				continue;
> > +			}
> 
> Does this code risk a kernel panic when calling PageBuddy() on a tail page?
> 
> If a pageblock-aligned PFN points to a tail page of a larger compound page
> (such as a 1GB HugeTLB page), PageBuddy() would trigger a VM_BUG_ON_PGFLAGS
> due to the PF_NO_TAIL policy.
> 
> Also, if a free buddy block is larger than pageblock_order, does the
> optimization fall back to stepping by PAGE_SIZE?
> 
> The first pageblock is skipped, but on the next iteration, pfn points to the
> start of the second pageblock, which is a tail page of the large buddy block.
> Because the PG_buddy flag is only set on the head page, PageBuddy(page) will
> evaluate to false.

I will revisit above details after my high level questions are answered.

> 
> > +		}
> > +
> > +		folio = damon_get_folio(pfn);
> 
> This isn't a bug introduced by this patch, but does damon_get_folio()
> speculatively access folio flags before acquiring a reference?
> 
> mm/damon/ops-common.c:damon_get_folio() {
>     ...
>     if (!folio_test_lru(folio) || !folio_try_get(folio))
>         return NULL;
>     ...
> }
> 
> If the page is concurrently freed and reallocated as a tail page of a new
> compound page, calling folio_test_lru() before folio_try_get() risks a
> VM_BUG_ON_FOLIO.

Good finding.  I will work on this.

> 
> Additionally, this wasn't introduced by this patch, but can the physical
> address advancement skip valid memory when starting from an unaligned offset?
> 
> If damon_pa_migrate() starts at an unaligned offset within a large folio, pfn
> points to a tail page, and damon_get_folio() resolves this to the head folio.
> The loop then unconditionally advances by the full folio size:
> 
>     put_folio:
>         addr += folio_size(folio);
>         folio_put(folio);
> 
> Adding the total folio size to an unaligned starting offset would push addr
> past the actual end of the folio, skipping pages in the subsequent memory
> region.

Again, good finding.  I will work on this.

> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=4


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
  2026-05-16 23:36   ` sashiko-bot
@ 2026-05-17 23:37   ` SeongJae Park
  2026-05-18  5:38     ` Ravi Jonnalagadda
  1 sibling, 1 reply; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 23:37 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sat, 16 May 2026 14:03:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> damon_pa_migrate() walks every PFN in a region linearly, calling
> damon_get_folio() for each one.  On sparse physical address spaces
> (e.g., CXL-attached memory), a single DAMON region can span hundreds
> of gigabytes where most memory is free and sitting in the buddy
> allocator.  Most page lookups are fruitless and dominate kdamond
> tick time.

On sparse address spaces, the problem would be large DAMON regions of offlined
memory.  Large DAMON regions that are nearly all free memory are another
problem, which doesn't require sparse address spaces.  If I'm not wrong, the
above paragraph could be better clarified, in my opinion.

> 
> Check at pageblock boundaries (2MB on x86_64) whether the block is
> entirely free.  If the first page of a pageblock is a buddy page at
> pageblock_order or higher, the entire block is free and can be
> skipped.
> Similarly skip pageblocks where pfn_to_online_page() returns
> NULL.
> 
> This reduces the iteration from O(region_sz / PAGE_SIZE) to
> O(region_sz / pageblock_sz) + O(populated_pages).
> 
> buddy_order_unsafe() is used without zone->lock.  A transient false
> positive (block becomes non-free between the PageBuddy and order
> checks) costs at most one tick of missed candidates on that block;
> the next tick re-scans.  No correctness consequence as DAMON walks
> are best-effort.

I was initially thinking this is a good and reasonable optimization approach.
But on second thought, I have the questions below.

For the large offlined memory space problem, couldn't we simply tune DAMON's
monitoring region boundaries to ignore the holes?

For the large free memory area, is it reasonable to assume such situations?  In
production, users will try to utilize as much memory of the system as possible.
Then, would there be such a problematically large free memory area?

Could you please enlighten me?

I will hold off on digging deeper until these high-level questions are answered.

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-17 23:37   ` SeongJae Park
@ 2026-05-18  5:38     ` Ravi Jonnalagadda
  2026-05-19  1:14       ` SeongJae Park
  0 siblings, 1 reply; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-18  5:38 UTC (permalink / raw)
  To: SeongJae Park
  Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
	ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, May 17, 2026 at 4:38 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 16 May 2026 14:03:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > damon_pa_migrate() walks every PFN in a region linearly, calling
> > damon_get_folio() for each one.  On sparse physical address spaces
> > (e.g., CXL-attached memory), a single DAMON region can span hundreds
> > of gigabytes where most memory is free and sitting in the buddy
> > allocator.  Most page lookups are fruitless and dominate kdamond
> > tick time.
>
> On sparse address spaces, the problem would be large DAMON regions of offlined
> memory.  Large DAMON regions that are nearly all free memory are another
> problem, which doesn't require sparse address spaces.  If I'm not wrong, the
> above paragraph could be better clarified, in my opinion.
>
> >
> > Check at pageblock boundaries (2MB on x86_64) whether the block is
> > entirely free.  If the first page of a pageblock is a buddy page at
> > pageblock_order or higher, the entire block is free and can be
> > skipped.
> > Similarly skip pageblocks where pfn_to_online_page() returns
> > NULL.
> >
> > This reduces the iteration from O(region_sz / PAGE_SIZE) to
> > O(region_sz / pageblock_sz) + O(populated_pages).
> >
> > buddy_order_unsafe() is used without zone->lock.  A transient false
> > positive (block becomes non-free between the PageBuddy and order
> > checks) costs at most one tick of missed candidates on that block;
> > the next tick re-scans.  No correctness consequence as DAMON walks
> > are best-effort.
>
> I was initially thinking this is a good and reasonable optimization approach.
> But on second thought, I have the questions below.
>
> For the large offlined memory space problem, couldn't we simply tune DAMON's
> monitoring region boundaries to ignore the holes?
>
> For the large free memory area, is it reasonable to assume such situations?  In
> production, users will try to utilize as much memory of the system as possible.
> Then, would there be such a problematically large free memory area?
>
> Could you please enlighten me?
>

Hi SJ,

You're right on the first point.  For static offlined memory
holes (memory hotplug gaps, partial socket population, etc.) the
right answer is configuring the monitoring region boundaries to
exclude them upfront, not making the walk skip them at runtime.
The changelog is clearer if I narrow the patch to the free-but-
online case.

On the free-online case: I agree large free memory areas are
not the steady state on a fully-utilized system.  The cases I
had in mind are more limited:

  - A workload using a small part of a much larger range, with
    the rest left as headroom (e.g. 64 GB used of a 512 GB
    range).

  - Shared tiers where workloads are allocated and freed on their own
    timelines.  Any single piece of free memory doesn't last
    long, but on a busy system there's typically a meaningful
    free fraction in the range at any point -- especially on a
    slower tier, where workloads prefer faster memory first
    when it's available.

The patch as written is a narrow optimization for those cases:
the pageblock-aligned check is one extra read per
pageblock_nr_pages PFNs (about 1 per 512 on x86_64), so it's
effectively a no-op when the region is fully populated.
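
As a rough, best-case illustration of the numbers involved, the lookup counts
can be computed as below.  The helper names and constants are hypothetical,
and the sketch assumes 4 KiB pages, 2 MiB pageblocks, populated pages packed
into as few pageblocks as possible, and every free pageblock being skippable
by the check.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_PAGEBLOCK_SIZE	(2ULL << 20)

/* Lookups for a plain linear walk: one per base page in the region. */
static uint64_t lookups_linear(uint64_t region_sz)
{
	return region_sz / SKETCH_PAGE_SIZE;
}

/*
 * Lookups with the pageblock skip: one boundary check per pageblock,
 * plus one per-page lookup for each populated base page.
 */
static uint64_t lookups_with_skip(uint64_t region_sz, uint64_t populated_sz)
{
	return region_sz / SKETCH_PAGEBLOCK_SIZE +
		populated_sz / SKETCH_PAGE_SIZE;
}
```

For the 64 GB used of 512 GB example (taken as GiB here), the linear walk
does 134217728 lookups, while the skipping walk does 262144 boundary checks
plus 16777216 page lookups, about 17 million in total.  Fragmented free memory
would reduce the gain.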

If you don't see those workloads as warranting the change, I'm
happy to drop the patch.  If the framing is the issue more than
the change itself, I can respin a v2 with:

  - the changelog narrowed to the free-but-online case (no
    offlined-memory framing);
  - any suggestions from you on sashiko's review comments.

Thanks,
Ravi

> I will hold off on digging deeper until these high-level questions are answered.
>
>
> Thanks,
> SJ
>
> [...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-18  5:38     ` Ravi Jonnalagadda
@ 2026-05-19  1:14       ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-19  1:14 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, 17 May 2026 22:38:51 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 4:38 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sat, 16 May 2026 14:03:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > damon_pa_migrate() walks every PFN in a region linearly, calling
> > > damon_get_folio() for each one.  On sparse physical address spaces
> > > (e.g., CXL-attached memory), a single DAMON region can span hundreds
> > > of gigabytes where most memory is free and sitting in the buddy
> > > allocator.  Most page lookups are fruitless and dominate kdamond
> > > tick time.
> >
> > On sparse address spaces, the problem would be large DAMON regions of offlined
> > memory.  Large DAMON regions that are nearly all free memory are another
> > problem, which doesn't require sparse address spaces.  If I'm not wrong, the
> > above paragraph could be better clarified, in my opinion.
> >
> > >
> > > Check at pageblock boundaries (2MB on x86_64) whether the block is
> > > entirely free.  If the first page of a pageblock is a buddy page at
> > > pageblock_order or higher, the entire block is free and can be
> > > skipped.
> > > Similarly skip pageblocks where pfn_to_online_page() returns
> > > NULL.
> > >
> > > This reduces the iteration from O(region_sz / PAGE_SIZE) to
> > > O(region_sz / pageblock_sz) + O(populated_pages).
> > >
> > > buddy_order_unsafe() is used without zone->lock.  A transient false
> > > positive (block becomes non-free between the PageBuddy and order
> > > checks) costs at most one tick of missed candidates on that block;
> > > the next tick re-scans.  No correctness consequence as DAMON walks
> > > are best-effort.
> >
> > I was initially thinking this is a good and reasonable optimization approach.
> > But on second thought, I have the questions below.
> >
> > For the large offlined memory space problem, couldn't we simply tune DAMON's
> > monitoring region boundaries to ignore the holes?
> >
> > For the large free memory area, is it reasonable to assume such situations?  In
> > production, users will try to utilize as much memory of the system as possible.
> > Then, would there be such a problematically large free memory area?
> >
> > Could you please enlighten me?
> >
> 
> Hi SJ,
> 
> You're right on the first point.  For static offlined memory
> holes (memory hotplug gaps, partial socket population, etc.) the
> right answer is configuring the monitoring region boundaries to
> exclude them upfront, not making the walk skip them at runtime.
> The changelog is clearer if I narrow the patch to the free-but-
> online case.

Thank you for clarifying, Ravi.

> 
> On the free-online case: I agree large free memory areas are
> not the steady state on a fully-utilized system.  The cases I
> had in mind are more limited:
> 
>   - A workload using a small part of a much larger range, with
>     the rest left as headroom (e.g. 64 GB used of a 512 GB
>     range).

Why would the user have such a large amount of headroom?

> 
>   - Shared tiers where workloads are allocated and freed on their own
>     timelines.  Any single piece of free memory doesn't last
>     long, but on a busy system there's typically a meaningful
>     free fraction in the range at any point -- especially on a
>     slower tier, where workloads prefer faster memory first
>     when it's available.

I agree there could be a reasonable amount of free memory.  But I still find it
difficult to know whether that would be big enough to cause the issue in DAMOS.

> 
> The patch as written is a narrow optimization for those cases:
> the pageblock-aligned check is one extra read per
> pageblock_nr_pages PFNs (about 1 per 512 on x86_64), so it's
> effectively a no-op when the region is fully populated.
> 
> If you don't see those workloads as warranting the change, I'm
> happy to drop the patch.  If the framing is the issue more than
> the change itself, I can respin a v2 with:
> 
>   - the changelog narrowed to the free-but-online case (no
>     offlined-memory framing);
>   - any suggestions from you on sashiko's review comments.

I think your arguments make sense in general.  But I'm still not quite sure
what the realistic size of the problem is, so it is difficult to judge.  Having
a clearer and more detailed use case and backing data would be nice.

I also have a small, trivial concern about this approach.  The DAMOS quota
system assumes the cost of applying a DAMOS action is proportional to the size
of the memory it is applied to.  After this patch is applied, the cost will
depend on the amount of free or offline memory in the region.  That might make
it difficult for users to predict the overhead of DAMOS.  I might be too picky
or mistaken, but to be honest I'm not feeling 100% comfortable with this change.

For the long term, we are working on extending DAMON for general data attributes
monitoring.  I'm pretty sure you are also aware of that.  The v1 [1] was just
added to mm-new for more testing.  It currently supports anon page and belonging
memory cgroup attributes.  I'm planning to extend that a lot.  In the future,
DAMOS might be able to target and filter memory based on the attributes
monitoring results.  Then, we may be able to extend it to monitor the online
state or freeness of the memory, and ask DAMOS to filter out or de-prioritize
memory regions having a high proportion of free or offline memory.

So, long story short, I'd suggest revisiting this after a clear use case and
real problem are found, unless we have one right now.

[1] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
                   ` (3 preceding siblings ...)
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 23:55   ` sashiko-bot
  2026-05-17 23:43   ` SeongJae Park
  4 siblings, 2 replies; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

On populated physical address ranges the pageblock skip optimization
alone is insufficient — most pageblocks contain at least one allocated
page, so the walk still iterates millions of PFNs.

Add a 100ms wall-clock time budget to damon_pa_migrate().  Once the
deadline is reached, the walk breaks out and migrates whatever folios
have been collected so far.

The time check is amortized by only calling ktime_get() every 4096
pages (~16MB of address space), adding negligible overhead to the
fast path.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/paddr.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index e844c990987b9..a2565287bc10f 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -14,6 +14,7 @@
 #include <linux/swap.h>
 #include <linux/memory-tiers.h>
 #include <linux/mm_inline.h>
+#include <linux/ktime.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -254,6 +255,14 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
 	return damon_pa_de_activate(r, addr_unit, s, false, sz_filter_passed);
 }
 
+/* Maximum wall-clock time to spend in a single migration walk (ns) */
+#define DAMON_PA_MIGRATE_BUDGET_NS	(100 * NSEC_PER_MSEC)
+
+/* Check the time budget every 4096 pages (~16MB) to amortize ktime_get(). */
+#define DAMON_PA_MIGRATE_TIME_CHECK_PAGES	4096
+#define DAMON_PA_MIGRATE_TIME_CHECK_MASK	\
+	(DAMON_PA_MIGRATE_TIME_CHECK_PAGES - 1)
+
 static unsigned long damon_pa_migrate(struct damon_region *r,
 		unsigned long addr_unit, struct damos *s,
 		unsigned long *sz_filter_passed)
@@ -262,6 +271,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 	LIST_HEAD(folio_list);
 	struct folio *folio = NULL;
 	unsigned long pfn;
+	ktime_t deadline = ktime_add_ns(ktime_get(), DAMON_PA_MIGRATE_BUDGET_NS);
 
 	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
 	end = damon_pa_phys_addr(r->ar.end, addr_unit);
@@ -283,6 +293,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 			}
 		}
 
+		/* Time budget: keep kdamond responsive on long migration walks. */
+		if (!(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) &&
+		    ktime_after(ktime_get(), deadline))
+			break;
+
 		folio = damon_get_folio(pfn);
 		if (damon_pa_invalid_damos_folio(folio, s)) {
 			addr += PAGE_SIZE;
-- 
2.43.0



* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
@ 2026-05-16 23:55   ` sashiko-bot
  2026-05-17 23:46     ` SeongJae Park
  2026-05-17 23:43   ` SeongJae Park
  1 sibling, 1 reply; 25+ messages in thread
From: sashiko-bot @ 2026-05-16 23:55 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The amortized time budget check can be permanently bypassed due to variable PFN strides and loop continues.
- [High] Early termination causes permanent starvation for the tail end of large DAMON regions.
- [High] The time budget targets the wrong phase, failing to cap the actual latency spike of the migration.
- [High] PFN advancement after folio conversion permanently misaligns the scan.
--

commit 87860cbf6b4be5f161140aa9a1f9b7d0461955ba
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>

mm/damon/paddr: add time budget to migration page walk

This patch adds a 100ms wall-clock time budget to damon_pa_migrate() to
prevent long migration walks from causing high scheduling latencies. The
time check is amortized by checking ktime_get() every 4096 pages.

> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index e844c990987b9..a2565287bc10f 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
[ ... ]
> @@ -283,6 +293,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
>  			}
>  		}
>  
> +		/* Time budget: keep kdamond responsive on long migration walks. */
> +		if (!(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) &&
> +		    ktime_after(ktime_get(), deadline))
> +			break;

Can this time check be permanently bypassed?

If pfn advances by variable strides, such as when addr += folio_size(folio)
adds a multi-page stride that jumps over the 4096-page boundary, the exact
match !(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) is missed, skipping the
check for the entire next 4096-page cycle.

Additionally, if pfn does land exactly on a multiple of 4096, but the
pageblock is skipped via the continue statement just above this block:

        if (!page) {
            addr += pageblock_nr_pages * PAGE_SIZE;
            continue;
        }

the loop advances and leaps over the time check.
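The stride concern above can be reproduced in user space.  The following is
a minimal sketch, not kernel code: mask_checks() mirrors the mask test from
the patch, while counter_checks() is a hypothetical alternative that counts
pages walked since the last check.  The function names and the fixed stride
are assumptions made for illustration only.

```c
#define CHECK_PAGES	4096UL
#define CHECK_MASK	(CHECK_PAGES - 1)

/*
 * Count how often the mask test from the patch would fire while walking
 * [start_pfn, end_pfn) with a fixed stride given in pages.
 */
unsigned long mask_checks(unsigned long start_pfn, unsigned long end_pfn,
		unsigned long stride)
{
	unsigned long pfn, hits = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn += stride)
		if (!(pfn & CHECK_MASK))
			hits++;
	return hits;
}

/*
 * Hypothetical alternative: accumulate the pages walked since the last
 * check, so the test fires regardless of alignment or stride.
 */
unsigned long counter_checks(unsigned long start_pfn, unsigned long end_pfn,
		unsigned long stride)
{
	unsigned long pfn, since_check = 0, hits = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn += stride) {
		since_check += stride;
		if (since_check >= CHECK_PAGES) {
			since_check = 0;
			hits++;
		}
	}
	return hits;
}
```

With a walk that starts at PFN 1 and advances in 512-page steps, the mask
test never fires, while the counter variant fires at the same rate as a
page-by-page walk.  In the real loop the counter would also have to be
updated before any continue statement for the same guarantee to hold.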

If the loop terminates early here, is there a mechanism to save the PFN
where the scan aborted?

Since the DAMON core is unaware of this internal early return, the next time
the scheme applies to this region, it appears it will restart the walk from
the very beginning of the region.

Could this result in a permanent livelock for large regions, where memory
beyond the 100ms scan horizon is never reached or migrated?
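The restart behavior can be modelled with a page budget standing in for the
100ms deadline.  This is only a sketch under stated assumptions: the cursor
argument is a hypothetical per-region resume point that does not exist in
DAMON today, and it ignores that regions can be split or merged between
scheme applications.

```c
#include <stddef.h>

/*
 * Walk [start, end) for at most budget pages.  Pass cursor == NULL to model
 * the posted patch, which always restarts from start.  With a cursor, the
 * walk resumes where the previous call stopped and wraps after a full pass.
 * Returns the first PFN that this call did not visit.
 */
unsigned long walk_once(unsigned long start, unsigned long end,
		unsigned long budget, unsigned long *cursor)
{
	unsigned long pfn = start;

	if (cursor && *cursor > start && *cursor < end)
		pfn = *cursor;
	while (pfn < end && budget--)
		pfn++;
	if (cursor)
		*cursor = pfn < end ? pfn : start;
	return pfn;
}
```

For a 100-page region and a 30-page budget, calls without a cursor stop at
PFN 30 every time, so PFNs 30 to 99 are never visited.  With a cursor,
successive calls stop at 30, 60, 90 and 100, covering the whole region.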

>  		folio = damon_get_folio(pfn);
>  		if (damon_pa_invalid_damos_folio(folio, s)) {
[ ... ]
>  		if (!folio_isolate_lru(folio))
>  			goto put_folio;
>  		list_add(&folio->lru, &folio_list);
>  put_folio:
>  		addr += folio_size(folio);

This wasn't introduced by this patch, but does the address advancement
correctly handle tail pages?

Since the region start addr might not be folio-aligned, if it falls within
a tail page of a large folio, does adding the total folio_size(folio)
overshoot the end of the current folio and skip the head page of the next
folio?
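The arithmetic behind this question can be written out as below.  It is a
user-space sketch with plain integers standing in for physical addresses;
the helper names are hypothetical, and folio_base is assumed to be the
physical address of the folio's head page.

```c
/* Posted behavior: step by the whole folio size from wherever addr is. */
unsigned long next_addr_naive(unsigned long addr, unsigned long folio_bytes)
{
	return addr + folio_bytes;
}

/* Hypothetical alternative: step to the end of the folio containing addr. */
unsigned long next_addr_folio_end(unsigned long folio_base,
		unsigned long folio_bytes)
{
	return folio_base + folio_bytes;
}
```

For a 2 MiB folio based at 0x200000 and a walk that enters it at the tail
page 0x201000, the naive step lands on 0x401000 and so skips a head page at
0x400000, whereas stepping from the folio base lands exactly on 0x400000.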

>  		folio_put(folio);
>  	}
>  	applied = damon_migrate_pages(&folio_list, s->target_nid);

Does the time budget cap the actual latency spike of the migration?

The 100ms budget limits the fast folio isolation phase. The loop can isolate
gigabytes of memory within 100ms. Once the deadline is reached, all
collected folios are passed to damon_migrate_pages().

Could migrating this massive batch, which requires memory allocation and
page copying, take seconds to complete in a single blocking call and fail
to keep the migration responsive?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=5


* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 23:55   ` sashiko-bot
@ 2026-05-17 23:46     ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 23:46 UTC (permalink / raw)
  To: sashiko-bot; +Cc: SeongJae Park, Ravi Jonnalagadda, damon

On Sat, 16 May 2026 23:55:44 +0000 sashiko-bot@kernel.org wrote:

> Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
> - [High] The amortized time budget check can be permanently bypassed due to variable PFN strides and loop continues.
> - [High] Early termination causes permanent starvation for the tail end of large DAMON regions.
> - [High] The time budget targets the wrong phase, failing to cap the actual latency spike of the migration.
> - [High] PFN advancement after folio conversion permanently misaligns the scan.
> --
> 
> commit 87860cbf6b4be5f161140aa9a1f9b7d0461955ba
> Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
> 
> mm/damon/paddr: add time budget to migration page walk
> 
> This patch adds a 100ms wall-clock time budget to damon_pa_migrate() to
> prevent long migration walks from causing high scheduling latencies. The
> time check is amortized by checking ktime_get() every 4096 pages.
> 
> > diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> > index e844c990987b9..a2565287bc10f 100644
> > --- a/mm/damon/paddr.c
> > +++ b/mm/damon/paddr.c
> [ ... ]
> > @@ -283,6 +293,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
> >  			}
> >  		}
> >  
> > +		/* Time budget: keep kdamond responsive on long migration walks. */
> > +		if (!(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) &&
> > +		    ktime_after(ktime_get(), deadline))
> > +			break;
> 
> Can this time check be permanently bypassed?
> 
> If pfn advances by variable strides, such as when addr += folio_size(folio)
> adds a multi-page stride that jumps over the 4096-page boundary, the exact
> match !(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) is missed, skipping the
> check for the entire next 4096-page cycle.
> 
> Additionally, if pfn does land exactly on a multiple of 4096, but the
> pageblock is skipped via the continue statement just above this block:
> 
>         if (!page) {
>             addr += pageblock_nr_pages * PAGE_SIZE;
>             continue;
>         }
> 
> the loop advances and leaps over the time check.
> 
> If the loop terminates early here, is there a mechanism to save the PFN
> where the scan aborted?
> 
> Since the DAMON core is unaware of this internal early return, the next time
> the scheme applies to this region, it appears it will restart the walk from
> the very beginning of the region.
> 
> Could this result in a permanent livelock for large regions, where memory
> beyond the 100ms scan horizon is never reached or migrated?
> 
> >  		folio = damon_get_folio(pfn);
> >  		if (damon_pa_invalid_damos_folio(folio, s)) {
> [ ... ]
> >  		if (!folio_isolate_lru(folio))
> >  			goto put_folio;
> >  		list_add(&folio->lru, &folio_list);
> >  put_folio:
> >  		addr += folio_size(folio);
> 
> This wasn't introduced by this patch, but does the address advancement
> correctly handle tail pages?
> 
> Since the region start addr might not be folio-aligned, if it falls within
> a tail page of a large folio, does adding the total folio_size(folio)
> overshoot the end of the current folio and skip the head page of the next
> folio?

As I previously replied [1], I will work on this.

> 
> >  		folio_put(folio);
> >  	}
> >  	applied = damon_migrate_pages(&folio_list, s->target_nid);
> 
> Does the time budget cap the actual latency spike of the migration?
> 
> The 100ms budget limits the fast folio isolation phase. The loop can isolate
> gigabytes of memory within 100ms. Once the deadline is reached, all
> collected folios are passed to damon_migrate_pages().
> 
> Could migrating this massive batch, which requires memory allocation and
> page copying, take seconds to complete in a single blocking call and fail
> to keep the migration responsive?

I will revisit Sashiko's other detailed comments once we finish the high level
discussion and decide to pursue this direction.

[1] https://lore.kernel.org/20260517234112.89245-1-sj@kernel.org

> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=5


Thanks,
SJ


* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
  2026-05-16 23:55   ` sashiko-bot
@ 2026-05-17 23:43   ` SeongJae Park
  2026-05-18  5:54     ` Ravi Jonnalagadda
  1 sibling, 1 reply; 25+ messages in thread
From: SeongJae Park @ 2026-05-17 23:43 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sat, 16 May 2026 14:03:57 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On populated physical address ranges the pageblock skip optimization
> alone is insufficient — most pageblocks contain at least one allocated
> page, so the walk still iterates millions of PFNs.

So my questions on the fourth patch of this series also apply here,
especially about the assumption of systems having most memory free.  I will
hold off on digging deeper here until the high level discussion is completed.


Thanks,
SJ

[...]



* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-17 23:43   ` SeongJae Park
@ 2026-05-18  5:54     ` Ravi Jonnalagadda
  2026-05-19  1:27       ` SeongJae Park
  0 siblings, 1 reply; 25+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-18  5:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
	ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, May 17, 2026 at 4:43 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 16 May 2026 14:03:57 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > On populated physical address ranges the pageblock skip optimization
> > alone is insufficient — most pageblocks contain at least one allocated
> > page, so the walk still iterates millions of PFNs.
>
> So my questions on the fourth patch of this series also apply here,
> especially about the assumption of systems having most memory free.  I will
> hold off on digging deeper here until the high level discussion is completed.
>
Hello SJ,

Stepping back to look at this with fresh eyes, I think this
patch is in the same bucket as patches 1 and 3 (full background
on the patch 3 thread): it came out of the same parallel debug
effort, where I was seeing long walks during the startup
transient on a multi-hundred-GB monitored target -- before
kdamond_split_regions() and damon_apply_min_nr_regions() had
trimmed the initial regions down -- and was unsure whether
those long walks were contributing to the NMI-side
responsiveness issues I was chasing.

Once the actual NMI problem was fixed and the per-region work
in steady state is bounded by DAMON's region splitting (and by
the scheme's quota when one is set), the per-call cost in
damon_pa_migrate() is already small enough that the budget
isn't doing useful work.  cond_resched() after damon_migrate_pages()
covers the preemption case.

If a real workload later shows a per-region walk long
enough to matter, I'll re-evaluate then with concrete numbers.

Thanks,
Ravi

>
> Thanks,
> SJ
>
> [...]


* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-18  5:54     ` Ravi Jonnalagadda
@ 2026-05-19  1:27       ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2026-05-19  1:27 UTC (permalink / raw)
  To: Ravi Jonnalagadda
  Cc: SeongJae Park, damon, linux-mm, linux-kernel, linux-doc, akpm,
	corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun

On Sun, 17 May 2026 22:54:18 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Sun, May 17, 2026 at 4:43 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sat, 16 May 2026 14:03:57 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > On populated physical address ranges the pageblock skip optimization
> > > alone is insufficient — most pageblocks contain at least one allocated
> > > page, so the walk still iterates millions of PFNs.
> >
> > So my questions on the fourth patch of this series also apply here,
> > especially about the assumption of systems having most memory free.  I will
> > hold off on digging deeper here until the high level discussion is completed.
> >
> Hello SJ,
> 
> Stepping back to look at this with fresh eyes, I think this
> patch is in the same bucket as patches 1 and 3 (full background
> on the patch 3 thread): it came out of the same parallel debug
> effort, where I was seeing long walks during the startup
> transient on a multi-hundred-GB monitored target -- before
> kdamond_split_regions() and damon_apply_min_nr_regions() had
> trimmed the initial regions down -- and was unsure whether
> those long walks were contributing to the NMI-side
> responsiveness issues I was chasing.
> 
> Once the actual NMI problem was fixed and the per-region work
> in steady state is bounded by DAMON's region splitting (and by
> the scheme's quota when one is set), the per-call cost in
> damon_pa_migrate() is already small enough that the budget
> isn't doing useful work.  cond_resched() after damon_migrate_pages()
> covers the preemption case.
> 
> If a real workload later shows a per-region walk long
> enough to matter, I'll re-evaluate then with concrete numbers.

Sounds good!

FYI, many parts of DAMON are designed assuming it will be used in production
environments that have long-running workloads and prefer stability.  That helps
it make good results in the long run, but also makes it difficult to understand
in the short term, especially in lab environments.

I learned that thanks to users including you, and therefore recently
developed the multiple quota tuning logics and the failed regions charge ratio.
I feel this DAMON limitation may have contributed to the confusion in this
case.  Sorry if that was the case, and please feel free to share your pain
points and improvement ideas.  Every user's use case, including yours, does
matter!


Thanks,
SJ

[...]

