[RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes
@ 2026-05-16 21:03 Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hi,

This series carries five fixes for the DAMOS quota controller and the
paddr migration walk.  All five were surfaced during closed-loop tiering
testing on a heterogeneous-memory system (DRAM + CXL on a separate
NUMA node), but each fix is in code paths that benefit any caller --
not scoped to closed-loop tiering or to any specific goal metric.

Test envelope: AMD EPYC dual-socket host with CXL.mem on a separate
NUMA node, two-scheme migrate_hot PULL+PUSH setup driven by
node_eligible_mem_bp (now in linux-next)[1].

What each patch does
====================

Patch 1 - mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum

  damon_moving_sum() can underflow when a region's access rate drops
  to zero faster than the moving-sum window length.  The internal
  accumulator subtracts an outgoing sample without a lower bound,
  producing a sentinel-large nr_accesses_bp that mis-classifies a
  cold region as hot.  Affects every DAMOS scheme, since
  nr_accesses_bp is used on every access-rate update for every
  region regardless of which scheme or goal metric is active.

Patch 2 - mm/damon/core: cap effective quota size to total monitored memory

  The DAMOS quota tuner can compute an effective size (esz) larger
  than the total monitored memory, because the tuner integrates over
  cumulative deltas without bounding by the actual workload size.
  Once esz exceeds total monitored memory the per-tick "remaining
  quota" arithmetic stops being meaningful: any scheme can apply to
  the entire monitored space and "remaining" stays positive
  indefinitely.  Cap esz at total monitored memory so the controller
  remains within physically realisable bounds.  Tuner-shape and
  goal-metric agnostic.

Patch 3 - mm/damon/core: floor effective quota size at minimum region size

  Symmetric to patch 2: the tuner can also compute esz < min_region_sz,
  causing schemes to attempt zero-byte migrations for many ticks before
  the tuner ramps esz back up.  Observed under the CONSIST tuner with
  a node_eligible_mem_bp goal: ftrace traced esz stuck at 1 byte for
  96 seconds before the first region was tried; the first acted-on
  region appeared at t=113s when esz crossed the min_region_sz
  threshold.  Floor esz at min_region_sz so schemes always have at
  least one region's worth of quota when the tuner asks them to act.

Patch 4 - mm/damon/paddr: skip free pageblocks in migration walk

  damon_pa_migrate() walks every 4KB PFN in a region.  On
  sparsely-populated lower-tier extents (e.g., a 549GB CXL region
  with only ~8GB populated), this is ~144M PFN iterations per scheme
  tick at ~40ns each = ~5.6 seconds of walk per tick.  Use
  pageblock-level free-page detection to skip unpopulated runs of
  pages: only enter the per-page loop for pageblocks that contain at
  least one allocated page.  This brings the walk to
  O(region_size / pageblock_size) skip-check cost plus
  O(populated_pages) per-page work.  On x86 pageblocks are 2MB, so
  the same 549GB/8GB example becomes ~281K pageblock skip-checks
  (microseconds total) plus ~2M per-page visits for the populated
  pages -- ~80ms expected.  Helps any migrate_hot/migrate_cold scheme
  on paddr ops, regardless of what drives them.

Patch 5 - mm/damon/paddr: add time budget to migration page walk

  Densely populated regions (e.g., a busy DRAM range where most
  pageblocks contain at least one allocated page) can still consume
  full ticks even with patch 4 applied.  Add a 100ms wall-clock
  budget with a ktime_get() check every 4096 pages walked
  (~16MB worth).  When the budget expires before reaching the end of
  a region, kdamond returns control; subsequent ticks re-walk the
  region from the start.  Folios already on the target node are
  dropped at migration time, so re-walks only re-do collection work,
  not the migrate itself.  Together with the per-scheme quota cap,
  per-tick work is bounded and the workload converges over multiple
  ticks for dense regions.

  Worst-case migration walk contribution to a tick is bounded at
  100ms per scheme regardless of region size or population density,
  preserving kdamond's ability to service other DAMOS schemes and
  user-space sysfs operations during heavy migration phases.

Testing context
===============

  Hardware:  AMD EPYC dual-socket, CXL.mem on a separate NUMA node.
  Workload: 32GB hot working set across DRAM and CXL nodes.
  DAMON config: paddr ops, two migrate_hot schemes (PULL CXL->DRAM,
                PUSH DRAM->CXL) with complementary address filters,
                node_eligible_mem_bp goal per scheme, temporal
                quota tuner, 1s reset interval.

Each fix in this series was reproduced under the above setup, then
verified via ftrace and per-scheme stats after the fix landed.

References
==========

[1] mm/damon: add node_eligible_mem_bp goal metric
https://lore.kernel.org/damon/20260428030520.701-1-ravis.opensrc@gmail.com/

Ravi Jonnalagadda (5):
  mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  mm/damon/core: cap effective quota size to total monitored memory
  mm/damon/core: floor effective quota size at minimum region size
  mm/damon/paddr: skip free pageblocks in migration walk
  mm/damon/paddr: add time budget to migration page walk

 mm/damon/core.c  | 29 ++++++++++++++++++++++++++++-
 mm/damon/paddr.c | 40 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 65 insertions(+), 4 deletions(-)

base-commit: 0cec77cfd5314c0b3b03530abe1a4b32e991f639
-- 
2.43.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 22:29   ` sashiko-bot
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

Guard against unsigned integer underflow when nomvsum/len_window
exceeds mvsum.  When that subtraction wraps, the moving sum returns a
near-ULONG_MAX value and corrupts nr_accesses_bp.

If subtrahend > mvsum, return new_value: this clamps the moving-sum
estimate to the current observation rather than wrapping.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 3a8725e400c6b..9975f3d9ebfe9 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -3449,7 +3449,11 @@ int damon_set_region_system_rams_default(struct damon_target *t,
 static unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
 		unsigned int len_window, unsigned int new_value)
 {
-	return mvsum - nomvsum / len_window + new_value;
+	unsigned int subtrahend = nomvsum / len_window;
+
+	if (subtrahend > mvsum)
+		return new_value;
+	return mvsum - subtrahend + new_value;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
@ 2026-05-16 22:29   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-16 22:29 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unsigned integer overflow in DAMON's moving sum and access rate tracking for large intervals.
--

commit 1d58ed9728ddbaf4579ec265e453a7b3a233285c
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>

mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum

Guard against unsigned integer underflow when nomvsum/len_window exceeds
mvsum. When that subtraction wraps, the moving sum returns a near maximum
value and corrupts nr_accesses_bp. If subtrahend > mvsum, return new_value.

> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 3a8725e400c6b..9975f3d9ebfe9 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -3449,7 +3449,11 @@ int damon_set_region_system_rams_default(struct damon_target *t,
>  static unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
>  		unsigned int len_window, unsigned int new_value)
>  {

This isn't a bug introduced by this patch, but could the nomvsum argument
passed here be silently overflowing?

Looking at damon_update_region_access_rate(), damon_moving_sum() is called
with r->last_nr_accesses * 10000:

mm/damon/core.c:damon_update_region_access_rate() {
	...
	if (attrs->sample_interval)
		len_window = damon_max_nr_accesses(attrs);
	r->nr_accesses_bp = damon_moving_sum(r->nr_accesses_bp,
			r->last_nr_accesses * 10000, len_window,
			accessed ? 10000 : 0);
	...
}

DAMON allows configuring aggr_interval / sample_interval up to the maximum
32-bit unsigned integer (for example, 10ms sampling and 1.5 hours
aggregation).

This means len_window, which bounds last_nr_accesses, can exceed 429,496.

If this happens, won't the multiplication r->last_nr_accesses * 10000
silently overflow the 32-bit unsigned int limit of 4,294,967,295?

If this overflows, the corrupted, wrapped-around value is passed as nomvsum,
which would cause subtrahend to be artificially small.

Additionally, would this cause r->nr_accesses_bp (which accumulates mvsum)
to also continuously wrap around 32-bit limits?

Could these silent integer overflows corrupt memory access tracking and lead
to erroneous memory management decisions, like reclaiming hot memory?

A proper fix might require bounding len_window to 429,496 or using 64-bit
integers.

> -	return mvsum - nomvsum / len_window + new_value;
> +	unsigned int subtrahend = nomvsum / len_window;
> +
> +	if (subtrahend > mvsum)
> +		return new_value;
> +	return mvsum - subtrahend + new_value;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 22:55   ` sashiko-bot
  2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

The DAMOS quota goal tuner can compute an effective size (esz) larger
than the total monitored memory because it integrates over cumulative
deltas without bounding by the actual workload size.  Once esz exceeds
total monitored memory, the per-tick "remaining quota" arithmetic
stops being meaningful: any scheme can apply to the entire monitored
space and "remaining" stays positive indefinitely.

Cap esz to the total size of all currently monitored regions as a
final bound after all other quota calculations.  Add
damon_ctx_total_monitored_sz() helper that sums region sizes across
all targets.

The helper runs only inside damos_set_effective_quota(), which is
called at most once per quota reset_interval (default 1s) per scheme,
not per kdamond tick.  Walk cost is O(nr_regions) at that frequency
and is dominated by the enclosing tuner work.

This bound is tuner-shape and goal-metric agnostic: it constrains the
quota controller to physically realisable values regardless of which
tuner or goal metric drives it.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 9975f3d9ebfe9..fd1db234ca304 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2614,6 +2614,19 @@ static void damos_goal_tune_esz_bp_temporal(struct damon_ctx *c,
 		quota->esz_bp = ULONG_MAX;
 }

+/* Sum of all monitored region sizes across all targets in @ctx. */
+static unsigned long damon_ctx_total_monitored_sz(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long total = 0;
+
+	damon_for_each_target(t, ctx)
+		damon_for_each_region(r, t)
+			total += damon_sz_region(r);
+	return total;
+}
+
 /*
  * Called only if quota->ms, or quota->sz are set, or quota->goals is not empty
  */
@@ -2621,6 +2634,7 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 {
 	struct damos_quota *quota = &s->quota;
 	unsigned long throughput;
+	unsigned long total_sz;
 	unsigned long esz = ULONG_MAX;

 	if (!quota->ms && list_empty(&quota->goals)) {
@@ -2649,6 +2663,11 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 	if (quota->sz && quota->sz < esz)
 		esz = quota->sz;

+	/* Safety cap: never migrate more than total monitored memory */
+	total_sz = damon_ctx_total_monitored_sz(ctx);
+	if (total_sz && esz > total_sz)
+		esz = total_sz;
+
 	quota->esz = esz;
 }

-- 
2.43.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
@ 2026-05-16 22:55   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-16 22:55 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The safety cap limits the effective quota output (`esz`), but fails to cap the tuner's internal state (`quota->esz_bp`), leading to severe integral windup.
- [High] Accumulating the total monitored size into an `unsigned long` overflows on 32-bit architectures, permanently crippling the scheme quota.
- [High] Applying the safety cap can forcefully reduce the effective quota below `min_region_sz`, resulting in permanent scheme starvation for small targets.
- [Medium] The safety cap is completely bypassed for statically sized quotas due to an early return, causing inconsistent API behavior.
--

commit afc20f6b1154f371b12a0f1f9ad37023b5bba0ef
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
Subject: mm/damon/core: cap effective quota size to total monitored memory

This commit introduces a safety cap on the effective quota size computed by
the DAMOS quota goal tuner to prevent it from exceeding the total monitored
memory.

> diff --git a/mm/damon/core.c b/mm/damon/core.c
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c

[ ... ]

> @@ -2614,6 +2614,19 @@ static void damos_goal_tune_esz_bp_temporal(struct damon_ctx *c,
>  		quota->esz_bp = ULONG_MAX;
>  }
>  
> +/* Sum of all monitored region sizes across all targets in @ctx. */
> +static unsigned long damon_ctx_total_monitored_sz(struct damon_ctx *ctx)
> +{
> +	struct damon_target *t;
> +	struct damon_region *r;
> +	unsigned long total = 0;
> +
> +	damon_for_each_target(t, ctx)
> +		damon_for_each_region(r, t)
> +			total += damon_sz_region(r);

Could the total variable overflow on 32-bit architectures? Since a single
DAMON context can monitor multiple processes, the combined virtual memory size
might exceed 4 GiB. Would it be safer to use u64 to accumulate the total sizes
and avoid wrapping?

> +	return total;
> +}
> +

[ ... ]

> @@ -2621,6 +2634,7 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
>  {
>  	struct damos_quota *quota = &s->quota;
>  	unsigned long throughput;
> +	unsigned long total_sz;
>  	unsigned long esz = ULONG_MAX;
>  
>  	if (!quota->ms && list_empty(&quota->goals)) {
>  		quota->esz = quota->sz;
>  		return;
>  	}

The commit message mentions this is a final bound after all other quota
calculations. If a purely static size quota is configured with no time limit
and no tuning goals, the early return here bypasses the new safety cap. Is
this intended behavior, or should the static quota also be constrained by
total_sz?

> @@ -2649,6 +2663,11 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
>  	if (quota->sz && quota->sz < esz)
>  		esz = quota->sz;
>  
> +	/* Safety cap: never migrate more than total monitored memory */
> +	total_sz = damon_ctx_total_monitored_sz(ctx);
> +	if (total_sz && esz > total_sz)
> +		esz = total_sz;
> +

Can this cap force the effective quota below the minimum region size? If
total_sz happens to be smaller than ctx->min_region_sz, applying this cap at
the very end would make esz smaller than min_region_sz. If that occurs, will
operations layer checks like esz - charged_sz < min_region_sz always evaluate
to true and permanently starve the scheme?

Does the goal tuner's internal state also need to be bounded here? While the
effective output esz is capped, quota->esz_bp remains untouched.

If a workload consistently misses its tuning goal, quota->esz_bp might
exponentially increase until it hits ULONG_MAX. Once the workload behavior
changes, could the wound-up state take a prolonged time to decay back down
below the physical limit, leaving the scheme unresponsive during recovery?

>  	quota->esz = esz;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=2

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
  4 siblings, 0 replies; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

The CONSIST quota goal tuner initializes esz_bp to 0, producing an
effective quota size (esz) of 1 byte on the first tick.
damos_quota_is_full() rejects all regions when esz < min_region_sz
(default PAGE_SIZE = 4096), so no regions can be tried and no
feedback reaches the tuner — a bootstrapping deadlock.

Floor esz at ctx->min_region_sz after the tuner computes it, guarded
by an esz != 0 check.  The guard preserves the temporal tuner's
intentional stop behavior: when score >= 10000 (goal met), temporal
sets esz_bp = 0 to halt migration; the floor must not override that.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index fd1db234ca304..d33c4360cbd60 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2650,6 +2650,10 @@ static void damos_set_effective_quota(struct damon_ctx *ctx, struct damos *s)
 		esz = quota->esz_bp / 10000;
 	}
 
+	/* avoid cold-start deadlock, but respect tuner stop signal (esz=0) */
+	if (esz)
+		esz = max_t(unsigned long, esz, ctx->min_region_sz);
+
 	if (quota->ms) {
 		if (quota->total_charged_ns)
 			throughput = mult_frac(quota->total_charged_sz,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
                   ` (2 preceding siblings ...)
  2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 23:36   ` sashiko-bot
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
  4 siblings, 1 reply; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

damon_pa_migrate() walks every PFN in a region linearly, calling
damon_get_folio() for each one.  On sparse physical address spaces
(e.g., CXL-attached memory), a single DAMON region can span hundreds
of gigabytes where most memory is free and sitting in the buddy
allocator.  Most page lookups are fruitless and dominate kdamond
tick time.

Check at pageblock boundaries (2MB on x86_64) whether the block is
entirely free.  If the first page of a pageblock is a buddy page at
pageblock_order or higher, the entire block is free and can be
skipped.  Similarly skip pageblocks where pfn_to_online_page() returns
NULL.

This reduces the iteration from O(region_sz / PAGE_SIZE) to
O(region_sz / pageblock_sz) + O(populated_pages).

buddy_order_unsafe() is used without zone->lock.  A transient false
positive (block becomes non-free between the PageBuddy and order
checks) costs at most one tick of missed candidates on that block;
the next tick re-scans.  No correctness consequence as DAMON walks
are best-effort.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/paddr.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index c4738cd5e221e..e844c990987b9 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -258,13 +258,32 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 		unsigned long addr_unit, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
-	phys_addr_t addr, applied;
+	phys_addr_t addr, end, applied;
 	LIST_HEAD(folio_list);
 	struct folio *folio = NULL;
+	unsigned long pfn;

 	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
-	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
-		folio = damon_get_folio(PHYS_PFN(addr));
+	end = damon_pa_phys_addr(r->ar.end, addr_unit);
+	while (addr < end) {
+		pfn = PHYS_PFN(addr);
+
+		/* Skip pageblocks that are entirely free. */
+		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+			struct page *page = pfn_to_online_page(pfn);
+
+			if (!page) {
+				addr += pageblock_nr_pages * PAGE_SIZE;
+				continue;
+			}
+			if (PageBuddy(page) &&
+			    buddy_order_unsafe(page) >= pageblock_order) {
+				addr += pageblock_nr_pages * PAGE_SIZE;
+				continue;
+			}
+		}
+
+		folio = damon_get_folio(pfn);
 		if (damon_pa_invalid_damos_folio(folio, s)) {
 			addr += PAGE_SIZE;
 			continue;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
@ 2026-05-16 23:36   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-16 23:36 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] Calling `PageBuddy()` on a tail page triggers a kernel panic.
- [High] Unsafe address advancement skips valid memory when starting from an unaligned offset within a folio.
- [High] Speculative access of folio flags on an unreferenced folio risks `VM_BUG_ON_FOLIO`.
- [Medium] The optimization fails to skip buddy blocks larger than `pageblock_order` efficiently.
- [Medium] Skipping an entire pageblock when `pfn_to_online_page()` returns NULL misses valid memory.
--

commit 66e620b27c62d5fcd72a28187d8b381c33d31172
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
Subject: mm/damon/paddr: skip free pageblocks in migration walk

This patch optimizes DAMON physical memory walks on sparse address spaces
by checking if a pageblock is entirely free or offline, skipping the entire
block to reduce kdamond tick time.

> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index c4738cd5e221e..e844c990987b9 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -258,13 +258,32 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
>  		unsigned long addr_unit, struct damos *s,
>  		unsigned long *sz_filter_passed)
>  {
> -	phys_addr_t addr, applied;
> +	phys_addr_t addr, end, applied;
>  	LIST_HEAD(folio_list);
>  	struct folio *folio = NULL;
> +	unsigned long pfn;
>  
>  	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
> -	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
> -		folio = damon_get_folio(PHYS_PFN(addr));
> +	end = damon_pa_phys_addr(r->ar.end, addr_unit);
> +	while (addr < end) {
> +		pfn = PHYS_PFN(addr);
> +
> +		/* Skip pageblocks that are entirely free. */
> +		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> +			struct page *page = pfn_to_online_page(pfn);
> +
> +			if (!page) {
> +				addr += pageblock_nr_pages * PAGE_SIZE;
> +				continue;
> +			}

If a memory hole is smaller than a pageblock, can skipping the entire
pageblock miss valid, online pages in the remainder of the block?

> +			if (PageBuddy(page) &&
> +			    buddy_order_unsafe(page) >= pageblock_order) {
> +				addr += pageblock_nr_pages * PAGE_SIZE;
> +				continue;
> +			}

Does this code risk a kernel panic when calling PageBuddy() on a tail page?

If a pageblock-aligned PFN points to a tail page of a larger compound page
(such as a 1GB HugeTLB page), PageBuddy() would trigger a VM_BUG_ON_PGFLAGS
due to the PF_NO_TAIL policy.

Also, if a free buddy block is larger than pageblock_order, does the
optimization fall back to stepping by PAGE_SIZE?

The first pageblock is skipped, but on the next iteration, pfn points to the
start of the second pageblock, which is a tail page of the large buddy block.
Because the PG_buddy flag is only set on the head page, PageBuddy(page) will
evaluate to false.

> +		}
> +
> +		folio = damon_get_folio(pfn);

This isn't a bug introduced by this patch, but does damon_get_folio()
speculatively access folio flags before acquiring a reference?

mm/damon/ops-common.c:damon_get_folio() {
    ...
    if (!folio_test_lru(folio) || !folio_try_get(folio))
        return NULL;
    ...
}

If the page is concurrently freed and reallocated as a tail page of a new
compound page, calling folio_test_lru() before folio_try_get() risks a
VM_BUG_ON_FOLIO.

Additionally, this wasn't introduced by this patch, but can the physical
address advancement skip valid memory when starting from an unaligned offset?

If damon_pa_migrate() starts at an unaligned offset within a large folio, pfn
points to a tail page, and damon_get_folio() resolves this to the head folio.
The loop then unconditionally advances by the full folio size:

    put_folio:
        addr += folio_size(folio);
        folio_put(folio);

Adding the total folio size to an unaligned starting offset would push addr
past the actual end of the folio, skipping pages in the subsequent memory
region.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
                   ` (3 preceding siblings ...)
  2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
@ 2026-05-16 21:03 ` Ravi Jonnalagadda
  2026-05-16 23:55   ` sashiko-bot
  4 siblings, 1 reply; 10+ messages in thread
From: Ravi Jonnalagadda @ 2026-05-16 21:03 UTC (permalink / raw)
  To: sj, damon, linux-mm, linux-kernel, linux-doc
  Cc: akpm, corbet, bijan311, ajayjoshi, honggyu.kim, yunjeong.mun,
	ravis.opensrc

On populated physical address ranges the pageblock skip optimization
alone is insufficient — most pageblocks contain at least one allocated
page, so the walk still iterates millions of PFNs.

Add a 100ms wall-clock time budget to damon_pa_migrate().  Once the
deadline is reached, the walk breaks out and migrates whatever folios
have been collected so far.

The time check is amortized by only calling ktime_get() every 4096
pages (~16MB of address space), adding negligible overhead to the
fast path.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
 mm/damon/paddr.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index e844c990987b9..a2565287bc10f 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -14,6 +14,7 @@
 #include <linux/swap.h>
 #include <linux/memory-tiers.h>
 #include <linux/mm_inline.h>
+#include <linux/ktime.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -254,6 +255,14 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
 	return damon_pa_de_activate(r, addr_unit, s, false, sz_filter_passed);
 }
 
+/* Maximum wall-clock time to spend in a single migration walk (ns) */
+#define DAMON_PA_MIGRATE_BUDGET_NS	(100 * NSEC_PER_MSEC)
+
+/* Check the time budget every 4096 pages (~16MB) to amortize ktime_get(). */
+#define DAMON_PA_MIGRATE_TIME_CHECK_PAGES	4096
+#define DAMON_PA_MIGRATE_TIME_CHECK_MASK	\
+	(DAMON_PA_MIGRATE_TIME_CHECK_PAGES - 1)
+
 static unsigned long damon_pa_migrate(struct damon_region *r,
 		unsigned long addr_unit, struct damos *s,
 		unsigned long *sz_filter_passed)
@@ -262,6 +271,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 	LIST_HEAD(folio_list);
 	struct folio *folio = NULL;
 	unsigned long pfn;
+	ktime_t deadline = ktime_add_ns(ktime_get(), DAMON_PA_MIGRATE_BUDGET_NS);
 
 	addr = damon_pa_phys_addr(r->ar.start, addr_unit);
 	end = damon_pa_phys_addr(r->ar.end, addr_unit);
@@ -283,6 +293,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 			}
 		}
 
+		/* Time budget: keep kdamond responsive on long migration walks. */
+		if (!(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) &&
+		    ktime_after(ktime_get(), deadline))
+			break;
+
 		folio = damon_get_folio(pfn);
 		if (damon_pa_invalid_damos_folio(folio, s)) {
 			addr += PAGE_SIZE;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk
  2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
@ 2026-05-16 23:55   ` sashiko-bot
  0 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-16 23:55 UTC (permalink / raw)
  To: Ravi Jonnalagadda; +Cc: damon

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The amortized time budget check can be permanently bypassed due to variable PFN strides and loop continues.
- [High] Early termination causes permanent starvation for the tail end of large DAMON regions.
- [High] The time budget targets the wrong phase, failing to cap the actual latency spike of the migration.
- [High] PFN advancement after folio conversion permanently misaligns the scan.
--

commit 87860cbf6b4be5f161140aa9a1f9b7d0461955ba
Author: Ravi Jonnalagadda <ravis.opensrc@gmail.com>

mm/damon/paddr: add time budget to migration page walk

This patch adds a 100ms wall-clock time budget to damon_pa_migrate() to
prevent long migration walks from causing high scheduling latencies. The
time check is amortized by checking ktime_get() every 4096 pages.

> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index e844c990987b9..a2565287bc10f 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
[ ... ]
> @@ -283,6 +293,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
>  			}
>  		}
>  
> +		/* Time budget: keep kdamond responsive on long migration walks. */
> +		if (!(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) &&
> +		    ktime_after(ktime_get(), deadline))
> +			break;

Can this time check be permanently bypassed?

If pfn advances by variable strides, such as when addr += folio_size(folio)
adds a multi-page stride that jumps over the 4096-page boundary, the exact
match !(pfn & DAMON_PA_MIGRATE_TIME_CHECK_MASK) is missed, skipping the
check for the entire next 4096-page cycle.

Additionally, if pfn does land exactly on a multiple of 4096, but the
pageblock is skipped via the continue statement just above this block:

        if (!page) {
            addr += pageblock_nr_pages * PAGE_SIZE;
            continue;
        }

the loop advances and leaps over the time check.

If the loop terminates early here, is there a mechanism to save the PFN
where the scan aborted?

Since the DAMON core is unaware of this internal early return, the next time
the scheme applies to this region, it appears it will restart the walk from
the very beginning of the region.

Could this result in a permanent livelock for large regions, where memory
beyond the 100ms scan horizon is never reached or migrated?

>  		folio = damon_get_folio(pfn);
>  		if (damon_pa_invalid_damos_folio(folio, s)) {
[ ... ]
>  		if (!folio_isolate_lru(folio))
>  			goto put_folio;
>  		list_add(&folio->lru, &folio_list);
>  put_folio:
>  		addr += folio_size(folio);

This wasn't introduced by this patch, but does the address advancement
correctly handle tail pages?

Since the region start addr might not be folio-aligned, if it falls within
a tail page of a large folio, does adding the total folio_size(folio)
overshoot the end of the current folio and skip the head page of the next
folio?

>  		folio_put(folio);
>  	}
>  	applied = damon_migrate_pages(&folio_list, s->target_nid);

Does the time budget cap the actual latency spike of the migration?

The 100ms budget limits the fast folio isolation phase. The loop can isolate
gigabytes of memory within 100ms. Once the deadline is reached, all
collected folios are passed to damon_migrate_pages().

Could migrating this massive batch, which requires memory allocation and
page copying, take seconds to complete in a single blocking call and fail
to keep the migration responsive?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260516210357.2247-1-ravis.opensrc@gmail.com?part=5

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-05-16 23:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-16 21:03 [RFC PATCH 0/5] mm/damon: DAMOS quota controller and paddr migration walk fixes Ravi Jonnalagadda
2026-05-16 21:03 ` [RFC PATCH 1/5] mm/damon/core: fix nr_accesses_bp underflow in damon_moving_sum Ravi Jonnalagadda
2026-05-16 22:29   ` sashiko-bot
2026-05-16 21:03 ` [RFC PATCH 2/5] mm/damon/core: cap effective quota size to total monitored memory Ravi Jonnalagadda
2026-05-16 22:55   ` sashiko-bot
2026-05-16 21:03 ` [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size Ravi Jonnalagadda
2026-05-16 21:03 ` [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk Ravi Jonnalagadda
2026-05-16 23:36   ` sashiko-bot
2026-05-16 21:03 ` [RFC PATCH 5/5] mm/damon/paddr: add time budget to migration page walk Ravi Jonnalagadda
2026-05-16 23:55   ` sashiko-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.