[PATCH 0/2] mm/damon/core: performance optimizations for kdamond hot path

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH 0/2] mm/damon/core: performance optimizations for kdamond hot path
@ 2026-03-22 18:46 Josh Law
  2026-03-22 18:46 ` [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops Josh Law
  2026-03-22 18:46 ` [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses() Josh Law
  0 siblings, 2 replies; 13+ messages in thread
From: Josh Law @ 2026-03-22 18:46 UTC (permalink / raw)
  To: sj, akpm; +Cc: damon, linux-mm, linux-kernel

This patch series provides two performance optimizations for the DAMON
core, focusing on reducing unnecessary iterations and eliminating
expensive hardware integer divisions in the hot paths.

The first patch inverts the nested loops in kdamond_apply_schemes().
Previously, it iterated over all regions and then all schemes for each
region. By iterating over schemes on the outside, we can evaluate
scheme-level invariants (like activation status and quotas) once per
scheme and bypass the O(Regions) walk entirely when a scheme is inactive
or has fulfilled its quota.

The second patch eliminates a hardware integer division in
damon_max_nr_accesses(), which is called for every region during every
sample interval. It replaces the division with a read from a cached
field.

Together, these changes significantly reduce the CPU overhead of DAMON,
especially when monitoring large numbers of regions or using many
DAMOS schemes.

Josh Law (2):
  mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme
    and region loops
  mm/damon/core: eliminate hot-path integer division in
    damon_max_nr_accesses()

 include/linux/damon.h |  3 +-
 mm/damon/core.c       | 72 ++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 44 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops
  2026-03-22 18:46 [PATCH 0/2] mm/damon/core: performance optimizations for kdamond hot path Josh Law
@ 2026-03-22 18:46 ` Josh Law
  2026-03-22 21:44   ` SeongJae Park
  2026-03-22 18:46 ` [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses() Josh Law
  1 sibling, 1 reply; 13+ messages in thread
From: Josh Law @ 2026-03-22 18:46 UTC (permalink / raw)
  To: sj, akpm; +Cc: damon, linux-mm, linux-kernel

Currently, kdamond_apply_schemes() iterates over all targets, then over all
regions, and finally calls damon_do_apply_schemes() which iterates over
all schemes. This nested structure causes scheme-level invariants (such as
time intervals, activation status, and quota limits) to be evaluated inside
the innermost loop for every single region.

If a scheme is inactive, has not reached its apply interval, or has already
fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
needlessly iterates through thousands of regions only to repeatedly
evaluate these same scheme-level conditions and continue.

This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
and inverts the loop ordering. It now iterates over schemes on the outside,
and targets/regions on the inside.

This allows the code to evaluate scheme-level limits once per scheme.
If a scheme's quota is met or it is inactive, we completely bypass the
O(Targets * Regions) inner loop for that scheme. This drastically reduces
unnecessary branching, cache thrashing, and CPU overhead in the kdamond
hot path.

Signed-off-by: Josh Law <objecting@objecting.org>
---
 mm/damon/core.c | 72 +++++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 42 deletions(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index c884bb31c9b8..dece0da079c8 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2112,40 +2112,6 @@ static void damos_apply_scheme(struct damon_ctx *c, struct damon_target *t,
 	damos_update_stat(s, sz, sz_applied, sz_ops_filter_passed);
 }
 
-static void damon_do_apply_schemes(struct damon_ctx *c,
-				   struct damon_target *t,
-				   struct damon_region *r)
-{
-	struct damos *s;
-
-	damon_for_each_scheme(s, c) {
-		struct damos_quota *quota = &s->quota;
-
-		if (time_before(c->passed_sample_intervals, s->next_apply_sis))
-			continue;
-
-		if (!s->wmarks.activated)
-			continue;
-
-		/* Check the quota */
-		if (quota->esz && quota->charged_sz >= quota->esz)
-			continue;
-
-		if (damos_skip_charged_region(t, r, s, c->min_region_sz))
-			continue;
-
-		if (s->max_nr_snapshots &&
-				s->max_nr_snapshots <= s->stat.nr_snapshots)
-			continue;
-
-		if (damos_valid_target(c, r, s))
-			damos_apply_scheme(c, t, r, s);
-
-		if (damon_is_last_region(r, t))
-			s->stat.nr_snapshots++;
-	}
-}
-
 /*
  * damon_feed_loop_next_input() - get next input to achieve a target score.
  * @last_input	The last input.
@@ -2494,17 +2460,39 @@ static void kdamond_apply_schemes(struct damon_ctx *c)
 		return;
 
 	mutex_lock(&c->walk_control_lock);
-	damon_for_each_target(t, c) {
-		if (c->ops.target_valid && c->ops.target_valid(t) == false)
-			continue;
-
-		damon_for_each_region(r, t)
-			damon_do_apply_schemes(c, t, r);
-	}
-
 	damon_for_each_scheme(s, c) {
+		struct damos_quota *quota = &s->quota;
+
 		if (time_before(c->passed_sample_intervals, s->next_apply_sis))
 			continue;
+
+		if (!s->wmarks.activated)
+			continue;
+
+		damon_for_each_target(t, c) {
+			if (c->ops.target_valid && c->ops.target_valid(t) == false)
+				continue;
+
+			damon_for_each_region(r, t) {
+				/* Check the quota */
+				if (quota->esz && quota->charged_sz >= quota->esz)
+					goto next_scheme;
+
+				if (s->max_nr_snapshots &&
+						s->max_nr_snapshots <= s->stat.nr_snapshots)
+					goto next_scheme;
+
+				if (damos_skip_charged_region(t, r, s, c->min_region_sz))
+					continue;
+
+				if (damos_valid_target(c, r, s))
+					damos_apply_scheme(c, t, r, s);
+
+				if (damon_is_last_region(r, t))
+					s->stat.nr_snapshots++;
+			}
+		}
+next_scheme:
 		damos_walk_complete(c, s);
 		damos_set_next_apply_sis(s, c);
 		s->last_applied = NULL;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses()
  2026-03-22 18:46 [PATCH 0/2] mm/damon/core: performance optimizations for kdamond hot path Josh Law
  2026-03-22 18:46 ` [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops Josh Law
@ 2026-03-22 18:46 ` Josh Law
  2026-03-22 21:30   ` SeongJae Park
  1 sibling, 1 reply; 13+ messages in thread
From: Josh Law @ 2026-03-22 18:46 UTC (permalink / raw)
  To: sj, akpm; +Cc: damon, linux-mm, linux-kernel

Hardware integer division is slow. The function damon_max_nr_accesses(),
which is called very frequently (e.g., once per region per sample
interval inside damon_update_region_access_rate), performs an integer
division: attrs->aggr_interval / attrs->sample_interval.

However, the struct damon_attrs already caches this exact ratio in the
internal field aggr_samples (since earlier commits). We can eliminate
the hardware division in the hot path by simply returning aggr_samples.

This significantly reduces the CPU cycle overhead of updating the access
rates for thousands of regions.

Signed-off-by: Josh Law <objecting@objecting.org>
---
 include/linux/damon.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 6bd71546f7b2..fffdb08326a2 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -960,8 +960,7 @@ static inline bool damon_target_has_pid(const struct damon_ctx *ctx)
 static inline unsigned int damon_max_nr_accesses(const struct damon_attrs *attrs)
 {
 	/* {aggr,sample}_interval are unsigned long, hence could overflow */
-	return min(attrs->aggr_interval / attrs->sample_interval,
-			(unsigned long)UINT_MAX);
+	return min(attrs->aggr_samples, (unsigned long)UINT_MAX);
 }

-- 
2.34.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses()
  2026-03-22 18:46 ` [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses() Josh Law
@ 2026-03-22 21:30   ` SeongJae Park
  2026-03-22 21:32     ` Josh Law
  0 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2026-03-22 21:30 UTC (permalink / raw)
  To: Josh Law; +Cc: SeongJae Park, akpm, damon, linux-mm, linux-kernel

On Sun, 22 Mar 2026 18:46:41 +0000 Josh Law <objecting@objecting.org> wrote:

> Hardware integer division is slow. The function damon_max_nr_accesses(),
> which is called very frequently (e.g., once per region per sample
> interval inside damon_update_region_access_rate), performs an integer
> division: attrs->aggr_interval / attrs->sample_interval.
> 
> However, the struct damon_attrs already caches this exact ratio in the
> internal field aggr_samples (since earlier commits). We can eliminate
> the hardware division in the hot path by simply returning aggr_samples.
> 
> This significantly reduces the CPU cycle overhead of updating the access
> rates for thousands of regions.
> 
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  include/linux/damon.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 6bd71546f7b2..fffdb08326a2 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -960,8 +960,7 @@ static inline bool damon_target_has_pid(const struct damon_ctx *ctx)
>  static inline unsigned int damon_max_nr_accesses(const struct damon_attrs *attrs)
>  {
>  	/* {aggr,sample}_interval are unsigned long, hence could overflow */
> -	return min(attrs->aggr_interval / attrs->sample_interval,
> -			(unsigned long)UINT_MAX);
> +	return min(attrs->aggr_samples, (unsigned long)UINT_MAX);

checkpatch gives below warning:

WARNING: min() should probably be min_t(unsigned long, attrs->aggr_samples, UINT_MAX)
#39: FILE: include/linux/damon.h:963:
+       return min(attrs->aggr_samples, (unsigned long)UINT_MAX);

Can we use min_t() as suggested?

Otherwise, this patch looks good to me.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses()
  2026-03-22 21:30   ` SeongJae Park
@ 2026-03-22 21:32     ` Josh Law
  0 siblings, 0 replies; 13+ messages in thread
From: Josh Law @ 2026-03-22 21:32 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel



On 22 March 2026 21:30:15 GMT, SeongJae Park <sj@kernel.org> wrote:
>On Sun, 22 Mar 2026 18:46:41 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> Hardware integer division is slow. The function damon_max_nr_accesses(),
>> which is called very frequently (e.g., once per region per sample
>> interval inside damon_update_region_access_rate), performs an integer
>> division: attrs->aggr_interval / attrs->sample_interval.
>> 
>> However, the struct damon_attrs already caches this exact ratio in the
>> internal field aggr_samples (since earlier commits). We can eliminate
>> the hardware division in the hot path by simply returning aggr_samples.
>> 
>> This significantly reduces the CPU cycle overhead of updating the access
>> rates for thousands of regions.
>> 
>> Signed-off-by: Josh Law <objecting@objecting.org>
>> ---
>>  include/linux/damon.h | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/damon.h b/include/linux/damon.h
>> index 6bd71546f7b2..fffdb08326a2 100644
>> --- a/include/linux/damon.h
>> +++ b/include/linux/damon.h
>> @@ -960,8 +960,7 @@ static inline bool damon_target_has_pid(const struct damon_ctx *ctx)
>>  static inline unsigned int damon_max_nr_accesses(const struct damon_attrs *attrs)
>>  {
>>  	/* {aggr,sample}_interval are unsigned long, hence could overflow */
>> -	return min(attrs->aggr_interval / attrs->sample_interval,
>> -			(unsigned long)UINT_MAX);
>> +	return min(attrs->aggr_samples, (unsigned long)UINT_MAX);
>
>checkpatch gives below warning:
>
>WARNING: min() should probably be min_t(unsigned long, attrs->aggr_samples, UINT_MAX)
>#39: FILE: include/linux/damon.h:963:
>+       return min(attrs->aggr_samples, (unsigned long)UINT_MAX);
>
>Can we use min_t() as suggested?
>
>Otherwise, this patch looks good to me.
>
>
>Thanks,
>SJ
>
>[...]


On it!

V/R

Josh Law


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops
  2026-03-22 18:46 ` [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops Josh Law
@ 2026-03-22 21:44   ` SeongJae Park
  2026-03-22 21:47     ` Josh Law
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: SeongJae Park @ 2026-03-22 21:44 UTC (permalink / raw)
  To: Josh Law; +Cc: SeongJae Park, akpm, damon, linux-mm, linux-kernel

Hello Josh,

On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:

> Currently, kdamond_apply_schemes() iterates over all targets, then over all
> regions, and finally calls damon_do_apply_schemes() which iterates over
> all schemes. This nested structure causes scheme-level invariants (such as
> time intervals, activation status, and quota limits) to be evaluated inside
> the innermost loop for every single region.
> 
> If a scheme is inactive, has not reached its apply interval, or has already
> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
> needlessly iterates through thousands of regions only to repeatedly
> evaluate these same scheme-level conditions and continue.
> 
> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
> and inverts the loop ordering. It now iterates over schemes on the outside,
> and targets/regions on the inside.
> 
> This allows the code to evaluate scheme-level limits once per scheme.
> If a scheme's quota is met or it is inactive, we completely bypass the
> O(Targets * Regions) inner loop for that scheme. This drastically reduces
> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
> hot path.

That makes sense in high level.  But, this will make a kind of behavioral
difference that could be user-visible.  I am failing at finding a clear use
case that really depends on the old behavior.  But, still it feels like not a
small change to me.

So, I'd like to be conservative to this change, unless there are good evidences
showing very clear and impactful real world benefits.  Can you share such
evidences if you have?


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops
  2026-03-22 21:44   ` SeongJae Park
@ 2026-03-22 21:47     ` Josh Law
  2026-03-22 21:53     ` Josh Law
  2026-03-22 21:59     ` Josh Law
  2 siblings, 0 replies; 13+ messages in thread
From: Josh Law @ 2026-03-22 21:47 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel



On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
>Hello Josh,
>
>On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> regions, and finally calls damon_do_apply_schemes() which iterates over
>> all schemes. This nested structure causes scheme-level invariants (such as
>> time intervals, activation status, and quota limits) to be evaluated inside
>> the innermost loop for every single region.
>> 
>> If a scheme is inactive, has not reached its apply interval, or has already
>> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> needlessly iterates through thousands of regions only to repeatedly
>> evaluate these same scheme-level conditions and continue.
>> 
>> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> and inverts the loop ordering. It now iterates over schemes on the outside,
>> and targets/regions on the inside.
>> 
>> This allows the code to evaluate scheme-level limits once per scheme.
>> If a scheme's quota is met or it is inactive, we completely bypass the
>> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> hot path.
>
>That makes sense in high level.  But, this will make a kind of behavioral
>difference that could be user-visible.  I am failing at finding a clear use
>case that really depends on the old behavior.  But, still it feels like not a
>small change to me.
>
>So, I'd like to be conservative to this change, unless there are good evidences
>showing very clear and impactful real world benefits.  Can you share such
>evidences if you have?
>
>
>Thanks,
>SJ
>
>[...]


Hello,


Here are some benchmark results for both patches indicating the need


Benchmarking Patch 2 (Division elimination) with 1024 different attrs...
Old (with division): 0.376372 s
New (cached value): 0.265899 s
Speedup: 1.42x


Benchmarking Patch 1 (Loop inversion) with 10 schemes, 5 targets, 2000 regions...
Old (nested regions): 0.055627 s
New (inverted schemes): 0.016167 s
Speedup: 3.44x





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops
  2026-03-22 21:44   ` SeongJae Park
  2026-03-22 21:47     ` Josh Law
@ 2026-03-22 21:53     ` Josh Law
  2026-03-22 21:59     ` Josh Law
  2 siblings, 0 replies; 13+ messages in thread
From: Josh Law @ 2026-03-22 21:53 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel



On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
>Hello Josh,
>
>On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> regions, and finally calls damon_do_apply_schemes() which iterates over
>> all schemes. This nested structure causes scheme-level invariants (such as
>> time intervals, activation status, and quota limits) to be evaluated inside
>> the innermost loop for every single region.
>> 
>> If a scheme is inactive, has not reached its apply interval, or has already
>> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> needlessly iterates through thousands of regions only to repeatedly
>> evaluate these same scheme-level conditions and continue.
>> 
>> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> and inverts the loop ordering. It now iterates over schemes on the outside,
>> and targets/regions on the inside.
>> 
>> This allows the code to evaluate scheme-level limits once per scheme.
>> If a scheme's quota is met or it is inactive, we completely bypass the
>> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> hot path.
>
>That makes sense in high level.  But, this will make a kind of behavioral
>difference that could be user-visible.  I am failing at finding a clear use
>case that really depends on the old behavior.  But, still it feels like not a
>small change to me.
>
>So, I'd like to be conservative to this change, unless there are good evidences
>showing very clear and impactful real world benefits.  Can you share such
>evidences if you have?
>
>
>Thanks,
>SJ
>
>[...]


Hello, 


 Performance Scaling Scenarios (Speedup of New vs Old)                                                                       
 | Scenario Description      | Speedup |                                                                                     
 |---------------------------|---------|                                                                                     
 | Typical (5s, 1k regions)  |  1.9x |                                                                                       
 | Large Scale (5s, 10k reg) |  6.9x |                                                                                       
 | Multi-Scheme (20s, 1k reg) |  4.0x |                                                                                      
 | Idle Schemes (10s, 1k reg) |  6.9x |                                                                                      
 | No Quotas (All Active)    |  1.1x |


This here is a benchmark full of multiple real world scenarios (made in C) and the speedup


About real world usage, this improves performance massively with most consumer based systems.


V/R


Josh Law




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops
  2026-03-22 21:44   ` SeongJae Park
  2026-03-22 21:47     ` Josh Law
  2026-03-22 21:53     ` Josh Law
@ 2026-03-22 21:59     ` Josh Law
  2026-03-22 22:28       ` [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() " SeongJae Park
  2 siblings, 1 reply; 13+ messages in thread
From: Josh Law @ 2026-03-22 21:59 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel

On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
>Hello Josh,
>
>On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> regions, and finally calls damon_do_apply_schemes() which iterates over
>> all schemes. This nested structure causes scheme-level invariants (such as
>> time intervals, activation status, and quota limits) to be evaluated inside
>> the innermost loop for every single region.
>> 
>> If a scheme is inactive, has not reached its apply interval, or has already
>> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> needlessly iterates through thousands of regions only to repeatedly
>> evaluate these same scheme-level conditions and continue.
>> 
>> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> and inverts the loop ordering. It now iterates over schemes on the outside,
>> and targets/regions on the inside.
>> 
>> This allows the code to evaluate scheme-level limits once per scheme.
>> If a scheme's quota is met or it is inactive, we completely bypass the
>> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> hot path.
>
>That makes sense in high level.  But, this will make a kind of behavioral
>difference that could be user-visible.  I am failing at finding a clear use
>case that really depends on the old behavior.  But, still it feels like not a
>small change to me.
>
>So, I'd like to be conservative to this change, unless there are good evidences
>showing very clear and impactful real world benefits.  Can you share such
>evidences if you have?
>
>
>Thanks,
>SJ
>
>[...]

My last email:

Hi SeongJae,

I've looked into this further and ran some extra benchmarks on the kdamond hot path to see if the gains were actually meaningful.

The main issue right now is that kdamond spends a lot of time "spinning" through regions even when there's no work to do. For example, if a user has 10,000 regions and a few schemes that have already hit their quotas or are disabled by watermarks, the current code still iterates through every single region just to check those same flags 10,000 times.

In my tests:

Typical setup (10 schemes, 2k regions): ~3.4x faster.

Large scale (10k regions, hitting quotas): ~7x faster.

Idle schemes (watermarks off): ~7x faster.

It's also a cache locality win. Right now the CPU has to bounce between different scheme metadata inside the innermost loop for every region. Inverting the loops lets us process one scheme completely, which keeps the hot data in L1/L2 and gives about a 10% gain even when everything is active.

The goal isn't just to shave cycles, but to make DAMON scale better on high-memory systems (512GB+) where the region count is high. This keeps the background "CPU floor" much lower when DAMON is supposed to be idle or throttled.

V/R

Josh Law

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() by inverting scheme and region loops
  2026-03-22 21:59     ` Josh Law
@ 2026-03-22 22:28       ` SeongJae Park
  2026-03-22 22:39         ` Josh Law
  2026-03-22 22:44         ` Josh Law
  0 siblings, 2 replies; 13+ messages in thread
From: SeongJae Park @ 2026-03-22 22:28 UTC (permalink / raw)
  To: Josh Law; +Cc: SeongJae Park, akpm, damon, linux-mm, linux-kernel

On Sun, 22 Mar 2026 21:59:45 +0000 Josh Law <objecting@objecting.org> wrote:

> 
> 
> On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
> >Hello Josh,
> >
> >On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
> >
> >> Currently, kdamond_apply_schemes() iterates over all targets, then over all
> >> regions, and finally calls damon_do_apply_schemes() which iterates over
> >> all schemes. This nested structure causes scheme-level invariants (such as
> >> time intervals, activation status, and quota limits) to be evaluated inside
> >> the innermost loop for every single region.
> >> 
> >> If a scheme is inactive, has not reached its apply interval, or has already
> >> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
> >> needlessly iterates through thousands of regions only to repeatedly
> >> evaluate these same scheme-level conditions and continue.
> >> 
> >> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
> >> and inverts the loop ordering. It now iterates over schemes on the outside,
> >> and targets/regions on the inside.
> >> 
> >> This allows the code to evaluate scheme-level limits once per scheme.
> >> If a scheme's quota is met or it is inactive, we completely bypass the
> >> O(Targets * Regions) inner loop for that scheme. This drastically reduces
> >> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
> >> hot path.
> >
> >That makes sense in high level.  But, this will make a kind of behavioral
> >difference that could be user-visible.  I am failing at finding a clear use
> >case that really depends on the old behavior.  But, still it feels like not a
> >small change to me.
> >
> >So, I'd like to be conservative to this change, unless there are good evidences
> >showing very clear and impactful real world benefits.  Can you share such
> >evidences if you have?
> >
> >
> >Thanks,
> >SJ
> >
> >[...]
> 
> 
> My last email:
> 
> Hi SeongJae,
> 
> I've looked into this further and ran some extra benchmarks on the kdamond hot path to see if the gains were actually meaningful.
> 
> The main issue right now is that kdamond spends a lot of time "spinning" through regions even when there's no work to do. For example, if a user has 10,000 regions and a few schemes that have already hit their quotas or are disabled by watermarks, the current code still iterates through every single region just to check those same flags 10,000 times.
> 
> In my tests:
> 
> Typical setup (10 schemes, 2k regions): ~3.4x faster.
> 
> Large scale (10k regions, hitting quotas): ~7x faster.
> 
> Idle schemes (watermarks off): ~7x faster.

Thank you for sharing these.  This seems like not a real world workload test
but some micro-benchmarks for only the code path, though.

In real world DAMOS usages, I think most of time will be spent on applying
DAMOS action.  Compared to that, I think the time spent for the unnecessary
iteration will be quite small.

> 
> 
> It's also a cache locality win. Right now the CPU has to bounce between different scheme metadata inside the innermost loop for every region. Inverting the loops lets us process one scheme completely, which keeps the hot data in L1/L2 and gives about a 10% gain even when everything is active.
> 
> The goal isn't just to shave cycles, but to make DAMON scale better on high-memory systems (512GB+) where the region count is high. This keeps the background "CPU floor" much lower when DAMON is supposed to be idle or throttled.

DAMON does adaptive regions adjustment for such large memory system
scalability.  I understand some users might dislike the adaptive mechanism and
stick to a fixed granular monitoring, though.

So I'm not yet convinced to this change as is.

Meanwhile, I'm thinking about a way to make similar optimization without
changing the behavior.

We already have the first loop of kdamond_apply_schemes() to minimize some of
the inefficiency that this patch is aiming to optimize out.  Maybe we can
further optimize the first loop.  For example, modifying the first loop to
build a list or array that contains schemes that passed the next_apply_sis and
wmarks.activated test, and make damon_do_apply_schemes() to use the test-passed
schemes instead of the all schemes in the context.

This will keep the behavior but have a performance gain that similar to what
this patch is aiming to.  If this can be done with a fairly simple way that can
justify the maintenance burden, I think that's a way path forward.  But, from
this point, I realize I want it to be *very* simple, and I have no idea about
the simple way.

So I wanted to help making this be merged.  But I fail at finding a good path
forward on my own.

In my humble and frank opinion, finding other place to work on insted of this
specific code path optimization might be a better use of the time.

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() by inverting scheme and region loops
  2026-03-22 22:28       ` [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() " SeongJae Park
@ 2026-03-22 22:39         ` Josh Law
  2026-03-23 14:01           ` SeongJae Park
  2026-03-22 22:44         ` Josh Law
  1 sibling, 1 reply; 13+ messages in thread
From: Josh Law @ 2026-03-22 22:39 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel



On 22 March 2026 22:28:44 GMT, SeongJae Park <sj@kernel.org> wrote:
>On Sun, 22 Mar 2026 21:59:45 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> 
>> 
>> On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
>> >Hello Josh,
>> >
>> >On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
>> >
>> >> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> >> regions, and finally calls damon_do_apply_schemes() which iterates over
>> >> all schemes. This nested structure causes scheme-level invariants (such as
>> >> time intervals, activation status, and quota limits) to be evaluated inside
>> >> the innermost loop for every single region.
>> >> 
>> >> If a scheme is inactive, has not reached its apply interval, or has already
>> >> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> >> needlessly iterates through thousands of regions only to repeatedly
>> >> evaluate these same scheme-level conditions and continue.
>> >> 
>> >> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> >> and inverts the loop ordering. It now iterates over schemes on the outside,
>> >> and targets/regions on the inside.
>> >> 
>> >> This allows the code to evaluate scheme-level limits once per scheme.
>> >> If a scheme's quota is met or it is inactive, we completely bypass the
>> >> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> >> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> >> hot path.
>> >
>> >That makes sense in high level.  But, this will make a kind of behavioral
>> >difference that could be user-visible.  I am failing at finding a clear use
>> >case that really depends on the old behavior.  But, still it feels like not a
>> >small change to me.
>> >
>> >So, I'd like to be conservative to this change, unless there are good evidences
>> >showing very clear and impactful real world benefits.  Can you share such
>> >evidences if you have?
>> >
>> >
>> >Thanks,
>> >SJ
>> >
>> >[...]
>> 
>> 
>> My last email:
>> 
>> Hi SeongJae,
>> 
>> I've looked into this further and ran some extra benchmarks on the kdamond hot path to see if the gains were actually meaningful.
>> 
>> The main issue right now is that kdamond spends a lot of time "spinning" through regions even when there's no work to do. For example, if a user has 10,000 regions and a few schemes that have already hit their quotas or are disabled by watermarks, the current code still iterates through every single region just to check those same flags 10,000 times.
>> 
>> In my tests:
>> 
>> Typical setup (10 schemes, 2k regions): ~3.4x faster.
>> 
>> Large scale (10k regions, hitting quotas): ~7x faster.
>> 
>> Idle schemes (watermarks off): ~7x faster.
>
>Thank you for sharing these.  This seems like not a real world workload test
>but some micro-benchmarks for only the code path, though.
>
>In real world DAMOS usages, I think most of time will be spent on applying
>DAMOS action.  Compared to that, I think the time spent for the unnecessary
>iteration will be quite small.
>
>> 
>> 
>> It's also a cache locality win. Right now the CPU has to bounce between different scheme metadata inside the innermost loop for every region. Inverting the loops lets us process one scheme completely, which keeps the hot data in L1/L2 and gives about a 10% gain even when everything is active.
>> 
>> The goal isn't just to shave cycles, but to make DAMON scale better on high-memory systems (512GB+) where the region count is high. This keeps the background "CPU floor" much lower when DAMON is supposed to be idle or throttled.
>
>DAMON does adaptive regions adjustment for such large memory system
>scalability.  I understand some users might dislike the adaptive mechanism and
>stick to a fixed granular monitoring, though.
>
>So I'm not yet convinced to this change as is.
>
>Meanwhile, I'm thinking about a way to make similar optimization without
>changing the behavior.
>
>We already have the first loop of kdamond_apply_schemes() to minimize some of
>the inefficiency that this patch is aiming to optimize out.  Maybe we can
>further optimize the first loop.  For example, modifying the first loop to
>build a list or array that contains schemes that passed the next_apply_sis and
>wmarks.activated test, and make damon_do_apply_schemes() to use the test-passed
>schemes instead of the all schemes in the context.
>
>This will keep the behavior but have a performance gain that similar to what
>this patch is aiming to.  If this can be done with a fairly simple way that can
>justify the maintenance burden, I think that's a way path forward.  But, from
>this point, I realize I want it to be *very* simple, and I have no idea about
>the simple way.
>
>So I wanted to help making this be merged.  But I fail at finding a good path
>forward on my own.
>
>In my humble and frank opinion, finding other place to work on insted of this
>specific code path optimization might be a better use of the time.
>
>
>Thanks,
>SJ
>
>[...]


Hi SeongJae,
I understand your concerns regarding the behavioral changes and the adaptive regions mechanism. If the loop inversion is too intrusive for the current DAMOS semantics, I agree it's better not to force it.

Your suggestion to optimize the first loop by pre-filtering active schemes into a temporary list/array is interesting. It would achieve the O(Targets * Regions) skip for inactive schemes without changing the application order for active ones.

I'll take a look at whether that can be implemented simply enough to justify the overhead of managing the temporary list. If it looks too complex, I'll move on to other areas as you suggested.
Thanks for the detailed feedback and the guidance on DAMON's scaling philosophy.


V/R


Josh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() by inverting scheme and region loops
  2026-03-22 22:28       ` [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() " SeongJae Park
  2026-03-22 22:39         ` Josh Law
@ 2026-03-22 22:44         ` Josh Law
  1 sibling, 0 replies; 13+ messages in thread
From: Josh Law @ 2026-03-22 22:44 UTC (permalink / raw)
  To: SeongJae Park; +Cc: akpm, damon, linux-mm, linux-kernel



On 22 March 2026 22:28:44 GMT, SeongJae Park <sj@kernel.org> wrote:
>On Sun, 22 Mar 2026 21:59:45 +0000 Josh Law <objecting@objecting.org> wrote:
>
>> 
>> 
>> On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@kernel.org> wrote:
>> >Hello Josh,
>> >
>> >On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@objecting.org> wrote:
>> >
>> >> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> >> regions, and finally calls damon_do_apply_schemes() which iterates over
>> >> all schemes. This nested structure causes scheme-level invariants (such as
>> >> time intervals, activation status, and quota limits) to be evaluated inside
>> >> the innermost loop for every single region.
>> >> 
>> >> If a scheme is inactive, has not reached its apply interval, or has already
>> >> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> >> needlessly iterates through thousands of regions only to repeatedly
>> >> evaluate these same scheme-level conditions and continue.
>> >> 
>> >> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> >> and inverts the loop ordering. It now iterates over schemes on the outside,
>> >> and targets/regions on the inside.
>> >> 
>> >> This allows the code to evaluate scheme-level limits once per scheme.
>> >> If a scheme's quota is met or it is inactive, we completely bypass the
>> >> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> >> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> >> hot path.
>> >
>> >That makes sense in high level.  But, this will make a kind of behavioral
>> >difference that could be user-visible.  I am failing at finding a clear use
>> >case that really depends on the old behavior.  But, still it feels like not a
>> >small change to me.
>> >
>> >So, I'd like to be conservative to this change, unless there are good evidences
>> >showing very clear and impactful real world benefits.  Can you share such
>> >evidences if you have?
>> >
>> >
>> >Thanks,
>> >SJ
>> >
>> >[...]
>> 
>> 
>> My last email:
>> 
>> Hi SeongJae,
>> 
>> I've looked into this further and ran some extra benchmarks on the kdamond hot path to see if the gains were actually meaningful.
>> 
>> The main issue right now is that kdamond spends a lot of time "spinning" through regions even when there's no work to do. For example, if a user has 10,000 regions and a few schemes that have already hit their quotas or are disabled by watermarks, the current code still iterates through every single region just to check those same flags 10,000 times.
>> 
>> In my tests:
>> 
>> Typical setup (10 schemes, 2k regions): ~3.4x faster.
>> 
>> Large scale (10k regions, hitting quotas): ~7x faster.
>> 
>> Idle schemes (watermarks off): ~7x faster.
>
>Thank you for sharing these.  This seems like not a real world workload test
>but some micro-benchmarks for only the code path, though.
>
>In real world DAMOS usages, I think most of time will be spent on applying
>DAMOS action.  Compared to that, I think the time spent for the unnecessary
>iteration will be quite small.
>
>> 
>> 
>> It's also a cache locality win. Right now the CPU has to bounce between different scheme metadata inside the innermost loop for every region. Inverting the loops lets us process one scheme completely, which keeps the hot data in L1/L2 and gives about a 10% gain even when everything is active.
>> 
>> The goal isn't just to shave cycles, but to make DAMON scale better on high-memory systems (512GB+) where the region count is high. This keeps the background "CPU floor" much lower when DAMON is supposed to be idle or throttled.
>
>DAMON does adaptive regions adjustment for such large memory system
>scalability.  I understand some users might dislike the adaptive mechanism and
>stick to a fixed granular monitoring, though.
>
>So I'm not yet convinced to this change as is.
>
>Meanwhile, I'm thinking about a way to make similar optimization without
>changing the behavior.
>
>We already have the first loop of kdamond_apply_schemes() to minimize some of
>the inefficiency that this patch is aiming to optimize out.  Maybe we can
>further optimize the first loop.  For example, modifying the first loop to
>build a list or array that contains schemes that passed the next_apply_sis and
>wmarks.activated test, and make damon_do_apply_schemes() to use the test-passed
>schemes instead of the all schemes in the context.
>
>This will keep the behavior but have a performance gain that similar to what
>this patch is aiming to.  If this can be done with a fairly simple way that can
>justify the maintenance burden, I think that's a way path forward.  But, from
>this point, I realize I want it to be *very* simple, and I have no idea about
>the simple way.
>
>So I wanted to help making this be merged.  But I fail at finding a good path
>forward on my own.
>
>In my humble and frank opinion, finding other place to work on insted of this
>specific code path optimization might be a better use of the time.
>
>
>Thanks,
>SJ
>
>[...]


Also, V2 is out for the other patch you liked


V/R


Josh Law


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() by inverting scheme and region loops
  2026-03-22 22:39         ` Josh Law
@ 2026-03-23 14:01           ` SeongJae Park
  0 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2026-03-23 14:01 UTC (permalink / raw)
  To: Josh Law; +Cc: SeongJae Park, akpm, damon, linux-mm, linux-kernel

On Sun, 22 Mar 2026 22:39:11 +0000 Josh Law <objecting@objecting.org> wrote:
[...]
> Hi SeongJae,
> I understand your concerns regarding the behavioral changes and the adaptive regions mechanism. If the loop inversion is too intrusive for the current DAMOS semantics, I agree it's better not to force it.
> 
> Your suggestion to optimize the first loop by pre-filtering active schemes into a temporary list/array is interesting. It would achieve the O(Targets * Regions) skip for inactive schemes without changing the application order for active ones.
> 
> I'll take a look at whether that can be implemented simply enough to justify the overhead of managing the temporary list. If it looks too complex, I'll move on to other areas as you suggested.

Sounds good!

> Thanks for the detailed feedback and the guidance on DAMON's scaling philosophy.

No worries, I'm glad to help.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-23 14:01 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-22 18:46 [PATCH 0/2] mm/damon/core: performance optimizations for kdamond hot path Josh Law
2026-03-22 18:46 ` [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops Josh Law
2026-03-22 21:44   ` SeongJae Park
2026-03-22 21:47     ` Josh Law
2026-03-22 21:53     ` Josh Law
2026-03-22 21:59     ` Josh Law
2026-03-22 22:28       ` [PATCH 1/2] mm/damon/core: optimize kdamond_ap ply_schemes() " SeongJae Park
2026-03-22 22:39         ` Josh Law
2026-03-23 14:01           ` SeongJae Park
2026-03-22 22:44         ` Josh Law
2026-03-22 18:46 ` [PATCH 2/2] mm/damon/core: eliminate hot-path integer division in damon_max_nr_accesses() Josh Law
2026-03-22 21:30   ` SeongJae Park
2026-03-22 21:32     ` Josh Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox