* [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG
@ 2026-04-23 12:23 Jiayuan Chen
  2026-04-23 12:23 ` [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region Jiayuan Chen
  2026-04-24 1:36  ` [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG SeongJae Park
  0 siblings, 2 replies; 9+ messages in thread
From: Jiayuan Chen @ 2026-04-23 12:23 UTC (permalink / raw)
  To: damon
  Cc: Jiayuan Chen, Jiayuan Chen, SeongJae Park, Andrew Morton, linux-mm, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

damon_rand() on the sampling_addr hot path calls get_random_u32_below(),
which takes a local_lock_irqsave() around a per-CPU batched entropy pool
and periodically refills it with ChaCha20. On workloads with large
nr_regions (20k+), this shows up as a large fraction of kdamond CPU
time: the lock_acquire / local_lock pair plus __get_random_u32_below()
dominate perf profiles.

Introduce damon_rand_fast(), which uses a lockless lfsr113 generator
(struct rnd_state) held per damon_ctx and seeded from get_random_u64()
in damon_new_ctx(). kdamond is the sole consumer of a given ctx, so no
synchronization is required. Range mapping uses Lemire's
(u64)rnd * span >> 32 to avoid a 64-bit division; residual bias is
bounded by span / 2^32, negligible for statistical sampling.

The new helper is intended for the sampling-address hot path only.
damon_rand() is kept for call sites that run outside the kdamond loop
and/or have no ctx available (damon_split_regions_of(), kunit tests).

Convert the two hot callers:

- __damon_pa_prepare_access_check()
- __damon_va_prepare_access_check()

lfsr113 is a linear PRNG and MUST NOT be used for anything
security-sensitive. DAMON's sampling_addr is not exposed to userspace
and is only consumed as a probe point for PTE accessed-bit sampling, so
a non-cryptographic PRNG is appropriate here.

Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU
usage reduced from ~72% to ~50% of one core.
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 include/linux/damon.h | 28 ++++++++++++++++++++++++++++
 mm/damon/core.c       |  2 ++
 mm/damon/paddr.c      | 10 +++++-----
 mm/damon/vaddr.c      |  9 +++++----
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index f2cdb7c3f5e6..0afdc08119c8 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -10,6 +10,7 @@

 #include <linux/memcontrol.h>
 #include <linux/mutex.h>
+#include <linux/prandom.h>
 #include <linux/time64.h>
 #include <linux/types.h>
 #include <linux/random.h>
@@ -843,8 +844,35 @@ struct damon_ctx {

 	struct list_head adaptive_targets;
 	struct list_head schemes;
+
+	/*
+	 * Per-ctx lockless PRNG state for damon_rand_fast(). Seeded from
+	 * get_random_u64() in damon_new_ctx(). Owned exclusively by the
+	 * kdamond thread of this ctx, so no locking is required.
+	 */
+	struct rnd_state rnd_state;
 };

+/*
+ * damon_rand_fast - per-ctx PRNG variant of damon_rand() for hot paths.
+ *
+ * Uses the lockless lfsr113 state kept in @ctx->rnd_state. Safe because
+ * kdamond is the single consumer of a given ctx, so no synchronization
+ * is required. Quality is sufficient for statistical sampling; do NOT
+ * use for any security-sensitive randomness.
+ *
+ * Range mapping uses Lemire's (u64)rnd * span >> 32 to avoid a division;
+ * bias is bounded by span / 2^32, negligible for DAMON.
+ */
+static inline unsigned long damon_rand_fast(struct damon_ctx *ctx,
+		unsigned long l, unsigned long r)
+{
+	u32 rnd = prandom_u32_state(&ctx->rnd_state);
+	u32 span = (u32)(r - l);
+
+	return l + (unsigned long)(((u64)rnd * span) >> 32);
+}
+
 static inline struct damon_region *damon_next_region(struct damon_region *r)
 {
 	return container_of(r->list.next, struct damon_region, list);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 3dbbbfdeff71..c3779c674601 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -607,6 +607,8 @@ struct damon_ctx *damon_new_ctx(void)
 	INIT_LIST_HEAD(&ctx->adaptive_targets);
 	INIT_LIST_HEAD(&ctx->schemes);

+	prandom_seed_state(&ctx->rnd_state, get_random_u64());
+
 	return ctx;
 }

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 5cdcc5037cbc..b5e1197f2ba2 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -48,12 +48,12 @@ static void damon_pa_mkold(phys_addr_t paddr)
 	folio_put(folio);
 }

-static void __damon_pa_prepare_access_check(struct damon_region *r,
-		unsigned long addr_unit)
+static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
+		struct damon_region *r)
 {
-	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);

-	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, addr_unit));
+	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, ctx->addr_unit));
 }

 static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
@@ -63,7 +63,7 @@ static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)

 	damon_for_each_target(t, ctx) {
 		damon_for_each_region(r, t)
-			__damon_pa_prepare_access_check(r, ctx->addr_unit);
+			__damon_pa_prepare_access_check(ctx, r);
 	}
 }

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b069dbc7e3d2..6cf06ffdf880 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -332,10 +332,11 @@ static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
  * Functions for the access checking of the regions
  */

-static void __damon_va_prepare_access_check(struct mm_struct *mm,
-		struct damon_region *r)
+static void __damon_va_prepare_access_check(struct damon_ctx *ctx,
+		struct mm_struct *mm,
+		struct damon_region *r)
 {
-	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);

 	damon_va_mkold(mm, r->sampling_addr);
 }
@@ -351,7 +352,7 @@ static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
 		if (!mm)
 			continue;
 		damon_for_each_region(r, t)
-			__damon_va_prepare_access_check(mm, r);
+			__damon_va_prepare_access_check(ctx, mm, r);
 		mmput(mm);
 	}
 }
--
2.43.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region
  2026-04-23 12:23 [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG Jiayuan Chen
@ 2026-04-23 12:23 ` Jiayuan Chen
  2026-04-24 1:42   ` SeongJae Park
  2026-04-24 1:36  ` [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG SeongJae Park
  1 sibling, 1 reply; 9+ messages in thread
From: Jiayuan Chen @ 2026-04-23 12:23 UTC (permalink / raw)
  To: damon
  Cc: Jiayuan Chen, Jiayuan Chen, SeongJae Park, Andrew Morton, linux-mm, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

In paddr mode with large nr_regions (20k+), damon_get_folio() dominates
kdamond CPU time. perf annotate shows ~98% of its samples on a single
load: reading page->compound_head for page_folio(). DAMON samples a
random PFN per region, so the access pattern scatters across the
vmemmap with no locality that hardware prefetchers can exploit; every
compound_head load misses to DRAM (~250 cycles).

Issue a software prefetch for the next region's struct page one
iteration before it is needed. In __damon_pa_prepare_access_check(),
the next region's sampling_addr is picked eagerly (one iteration ahead
of where its mkold will run) so the prefetched line is usable, and the
first region of each epoch is handled with a one-off branch since it
has no predecessor to have pre-picked its addr. damon_pa_check_accesses()
just prefetches based on sampling_addr already populated by the
preceding prepare pass.

Two details worth calling out:

- prefetchw() rather than prefetch(): compound_head (+0x8) and
  _refcount (+0x34) share a 64B cacheline. The subsequent
  folio_try_get() / folio_put() atomics write _refcount, so bringing
  the line in in exclusive state avoids a later S->M coherence upgrade.

- pfn_to_page() rather than pfn_to_online_page(): on
  CONFIG_SPARSEMEM_VMEMMAP the former is pure arithmetic
  (vmemmap + pfn), while the latter walks the mem_section table. The
  mem_section lookup itself incurs a DRAM miss for random PFNs, which
  would serialize what is supposed to be a non-blocking hint - an
  earlier attempt that used pfn_to_online_page() saw the prefetch
  path's internal stall dominate perf (~91% skid on the converge
  point after the call). Prefetching an unmapped vmemmap entry is
  safe: the hint is dropped without faulting.

Concurrency: list_next_entry() and &list == &head comparisons are safe
without locking because kdamond is the sole mutator of the region list
of its ctx; external threads must go through damon_call(), which defers
execution to the same kdamond thread between sampling iterations
(see documentation on struct damon_ctx and damon_call()).

Tested with paddr monitoring, max_nr_regions=20000, and stress-ng-vm
consuming ~90% of memory across 8 workers: kdamond CPU further reduced
from ~50% to ~40% of one core on top of the earlier damon_rand_fast()
change.

Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/damon/paddr.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index b5e1197f2ba2..99bdf2b88cf1 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -10,6 +10,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/pagemap.h>
+#include <linux/prefetch.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/memory-tiers.h>
@@ -48,10 +49,29 @@ static void damon_pa_mkold(phys_addr_t paddr)
 	folio_put(folio);
 }

+static void damon_pa_prefetch_page(unsigned long sampling_addr,
+		unsigned long addr_unit)
+{
+	phys_addr_t paddr = damon_pa_phys_addr(sampling_addr, addr_unit);
+
+	prefetchw(pfn_to_page(PHYS_PFN(paddr)));
+}
+
 static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
+		struct damon_target *t,
 		struct damon_region *r)
 {
-	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
+	struct damon_region *next = list_next_entry(r, list);
+
+	/* First region has no predecessor to have pre-picked its addr. */
+	if (r->list.prev == &t->regions_list)
+		r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
+
+	if (&next->list != &t->regions_list) {
+		next->sampling_addr = damon_rand_fast(ctx, next->ar.start,
+				next->ar.end);
+		damon_pa_prefetch_page(next->sampling_addr, ctx->addr_unit);
+	}

 	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, ctx->addr_unit));
 }
@@ -63,7 +83,7 @@ static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)

 	damon_for_each_target(t, ctx) {
 		damon_for_each_region(r, t)
-			__damon_pa_prepare_access_check(ctx, r);
+			__damon_pa_prepare_access_check(ctx, t, r);
 	}
 }

@@ -106,11 +126,16 @@ static void __damon_pa_check_access(struct damon_region *r,
 static unsigned int damon_pa_check_accesses(struct damon_ctx *ctx)
 {
 	struct damon_target *t;
-	struct damon_region *r;
+	struct damon_region *r, *next;
 	unsigned int max_nr_accesses = 0;

 	damon_for_each_target(t, ctx) {
 		damon_for_each_region(r, t) {
+			next = list_next_entry(r, list);
+			if (&next->list != &t->regions_list)
+				damon_pa_prefetch_page(next->sampling_addr,
+						ctx->addr_unit);
+
 			__damon_pa_check_access(
 					r, &ctx->attrs, ctx->addr_unit);
 			max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
--
2.43.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region
  2026-04-23 12:23 ` [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region Jiayuan Chen
@ 2026-04-24 1:42   ` SeongJae Park
  0 siblings, 0 replies; 9+ messages in thread
From: SeongJae Park @ 2026-04-24 1:42 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: SeongJae Park, damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel

On Thu, 23 Apr 2026 20:23:37 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>
> In paddr mode with large nr_regions (20k+), damon_get_folio() dominates
> kdamond CPU time. perf annotate shows ~98% of its samples on a single
> load: reading page->compound_head for page_folio(). DAMON samples a
> random PFN per region, so the access pattern scatters across the
> vmemmap with no locality that hardware prefetchers can exploit; every
> compound_head load misses to DRAM (~250 cycles).
>
> Issue a software prefetch for the next region's struct page one
> iteration before it is needed. In __damon_pa_prepare_access_check(),
> the next region's sampling_addr is picked eagerly (one iteration ahead
> of where its mkold will run) so the prefetched line is usable, and the
> first region of each epoch is handled with a one-off branch since it
> has no predecessor to have pre-picked its addr. damon_pa_check_accesses()
> just prefetches based on sampling_addr already populated by the
> preceding prepare pass.
>
> Two details worth calling out:
>
> - prefetchw() rather than prefetch(): compound_head (+0x8) and
>   _refcount (+0x34) share a 64B cacheline. The subsequent
>   folio_try_get() / folio_put() atomics write _refcount, so bringing
>   the line in in exclusive state avoids a later S->M coherence upgrade.
>
> - pfn_to_page() rather than pfn_to_online_page(): on
>   CONFIG_SPARSEMEM_VMEMMAP the former is pure arithmetic
>   (vmemmap + pfn), while the latter walks the mem_section table. The
>   mem_section lookup itself incurs a DRAM miss for random PFNs, which
>   would serialize what is supposed to be a non-blocking hint - an
>   earlier attempt that used pfn_to_online_page() saw the prefetch
>   path's internal stall dominate perf (~91% skid on the converge
>   point after the call). Prefetching an unmapped vmemmap entry is
>   safe: the hint is dropped without faulting.

Sashiko raises [1] a concern about this. Could you please reply?

>
> Concurrency: list_next_entry() and &list == &head comparisons are safe
> without locking because kdamond is the sole mutator of the region list
> of its ctx; external threads must go through damon_call(), which defers
> execution to the same kdamond thread between sampling iterations
> (see documentation on struct damon_ctx and damon_call()).
>
> Tested with paddr monitoring, max_nr_regions=20000, and stress-ng-vm
> consuming ~90% of memory across 8 workers: kdamond CPU further reduced
> from ~50% to ~40% of one core on top of the earlier damon_rand_fast()
> change.

This patch feels a bit complicated. I'm not sure if the performance gain
is really big enough to compensate for the added complexity. As I also
asked on the previous patch, could you please share more details about
your use case?

I will review the code in detail after the above high level question and
concern are addressed.

[1] https://lore.kernel.org/20260423192534.300CEC2BCB2@smtp.kernel.org

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG
  2026-04-23 12:23 [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG Jiayuan Chen
  2026-04-23 12:23 ` [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region Jiayuan Chen
@ 2026-04-24 1:36  ` SeongJae Park
  2026-04-24 2:29    ` Jiayuan Chen
  1 sibling, 1 reply; 9+ messages in thread
From: SeongJae Park @ 2026-04-24 1:36 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: SeongJae Park, damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel

Hello Jiayuan,

Thank you for sharing this patch with us!

On Thu, 23 Apr 2026 20:23:36 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>
> damon_rand() on the sampling_addr hot path calls get_random_u32_below(),
> which takes a local_lock_irqsave() around a per-CPU batched entropy pool
> and periodically refills it with ChaCha20. On workloads with large
> nr_regions (20k+), this shows up as a large fraction of kdamond CPU
> time: the lock_acquire / local_lock pair plus __get_random_u32_below()
> dominate perf profiles.

Could you please share more details about the use case? I'm particularly
curious how you ended up setting 'nr_regions' that high, while the upper
limit of nr_regions is set to 10,000 by default.

I know some people worry that the limit is too low and could result in
poor monitoring accuracy. Therefore we developed [1] monitoring
intervals auto-tuning. From multiple tests on real environments it
showed somewhat convincing results, and therefore I nowadays suggest
DAMON users try it if they haven't.

I'm a bit concerned that this may be over-engineering. It would be
helpful to know whether it is, if you could share the more detailed use
case.

>
> Introduce damon_rand_fast(), which uses a lockless lfsr113 generator
> (struct rnd_state) held per damon_ctx and seeded from get_random_u64()
> in damon_new_ctx(). kdamond is the sole consumer of a given ctx, so no
> synchronization is required. Range mapping uses Lemire's
> (u64)rnd * span >> 32 to avoid a 64-bit division; residual bias is
> bounded by span / 2^32, negligible for statistical sampling.
>
> The new helper is intended for the sampling-address hot path only.
> damon_rand() is kept for call sites that run outside the kdamond loop
> and/or have no ctx available (damon_split_regions_of(), kunit tests).
>
> Convert the two hot callers:
>
> - __damon_pa_prepare_access_check()
> - __damon_va_prepare_access_check()
>
> lfsr113 is a linear PRNG and MUST NOT be used for anything
> security-sensitive. DAMON's sampling_addr is not exposed to userspace
> and is only consumed as a probe point for PTE accessed-bit sampling, so
> a non-cryptographic PRNG is appropriate here.
>
> Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU
> usage reduced from ~72% to ~50% of one core.
>
> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> ---
>  include/linux/damon.h | 28 ++++++++++++++++++++++++++++
>  mm/damon/core.c       |  2 ++
>  mm/damon/paddr.c      | 10 +++++-----
>  mm/damon/vaddr.c      |  9 +++++----
>  4 files changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index f2cdb7c3f5e6..0afdc08119c8 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -10,6 +10,7 @@
>
>  #include <linux/memcontrol.h>
>  #include <linux/mutex.h>
> +#include <linux/prandom.h>
>  #include <linux/time64.h>
>  #include <linux/types.h>
>  #include <linux/random.h>
> @@ -843,8 +844,35 @@ struct damon_ctx {
>
>  	struct list_head adaptive_targets;
>  	struct list_head schemes;
> +
> +	/*
> +	 * Per-ctx lockless PRNG state for damon_rand_fast(). Seeded from
> +	 * get_random_u64() in damon_new_ctx(). Owned exclusively by the
> +	 * kdamond thread of this ctx, so no locking is required.
> +	 */
> +	struct rnd_state rnd_state;
>  };
>
> +/*
> + * damon_rand_fast - per-ctx PRNG variant of damon_rand() for hot paths.
> + *
> + * Uses the lockless lfsr113 state kept in @ctx->rnd_state. Safe because
> + * kdamond is the single consumer of a given ctx, so no synchronization
> + * is required. Quality is sufficient for statistical sampling; do NOT
> + * use for any security-sensitive randomness.
> + *
> + * Range mapping uses Lemire's (u64)rnd * span >> 32 to avoid a division;
> + * bias is bounded by span / 2^32, negligible for DAMON.
> + */
> +static inline unsigned long damon_rand_fast(struct damon_ctx *ctx,
> +		unsigned long l, unsigned long r)
> +{
> +	u32 rnd = prandom_u32_state(&ctx->rnd_state);
> +	u32 span = (u32)(r - l);
> +
> +	return l + (unsigned long)(((u64)rnd * span) >> 32);
> +}

As Sashiko pointed out [2], we may better return 'unsigned long' from
this function. Can this algorithm be extended for that?

> +
>  static inline struct damon_region *damon_next_region(struct damon_region *r)
>  {
>  	return container_of(r->list.next, struct damon_region, list);
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 3dbbbfdeff71..c3779c674601 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -607,6 +607,8 @@ struct damon_ctx *damon_new_ctx(void)
>  	INIT_LIST_HEAD(&ctx->adaptive_targets);
>  	INIT_LIST_HEAD(&ctx->schemes);
>
> +	prandom_seed_state(&ctx->rnd_state, get_random_u64());
> +
>  	return ctx;
>  }
>
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 5cdcc5037cbc..b5e1197f2ba2 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -48,12 +48,12 @@ static void damon_pa_mkold(phys_addr_t paddr)
>  	folio_put(folio);
>  }
>
> -static void __damon_pa_prepare_access_check(struct damon_region *r,
> -		unsigned long addr_unit)
> +static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
> +		struct damon_region *r)

Let's keep 'r' on the first line, and update the second line without
indent change.

>  {
> -	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
> +	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
>
> -	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, addr_unit));
> +	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, ctx->addr_unit));
>  }
>
>  static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
> @@ -63,7 +63,7 @@ static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
>
>  	damon_for_each_target(t, ctx) {
>  		damon_for_each_region(r, t)
> -			__damon_pa_prepare_access_check(r, ctx->addr_unit);
> +			__damon_pa_prepare_access_check(ctx, r);
>  	}
>  }
>
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index b069dbc7e3d2..6cf06ffdf880 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -332,10 +332,11 @@ static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
>   * Functions for the access checking of the regions
>   */
>
> -static void __damon_va_prepare_access_check(struct mm_struct *mm,
> -		struct damon_region *r)
> +static void __damon_va_prepare_access_check(struct damon_ctx *ctx,
> +		struct mm_struct *mm,
> +		struct damon_region *r)

Let's keep the first line and the indentation as they were, and add the
'ctx' argument to the end.

>  {
> -	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
> +	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
>
>  	damon_va_mkold(mm, r->sampling_addr);
>  }
> @@ -351,7 +352,7 @@ static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
>  		if (!mm)
>  			continue;
>  		damon_for_each_region(r, t)
> -			__damon_va_prepare_access_check(mm, r);
> +			__damon_va_prepare_access_check(ctx, mm, r);
>  		mmput(mm);
>  	}
>  }
> --
> 2.43.0
>

[1] https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org
[2] https://lore.kernel.org/20260423190841.821E4C2BCAF@smtp.kernel.org

Thanks,
SJ

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG
  2026-04-24 1:36  ` [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG SeongJae Park
@ 2026-04-24 2:29    ` Jiayuan Chen
  2026-04-24 15:11     ` SeongJae Park
  0 siblings, 1 reply; 9+ messages in thread
From: Jiayuan Chen @ 2026-04-24 2:29 UTC (permalink / raw)
  To: SeongJae Park; +Cc: damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel

Hello SJ,

Thank you for the review.

On 4/24/26 9:36 AM, SeongJae Park wrote:
> Hello Jiayuan,
>
> Thank you for sharing this patch with us!
>
> On Thu, 23 Apr 2026 20:23:36 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>>
>> damon_rand() on the sampling_addr hot path calls get_random_u32_below(),
>> which takes a local_lock_irqsave() around a per-CPU batched entropy pool
>> and periodically refills it with ChaCha20. On workloads with large
>> nr_regions (20k+), this shows up as a large fraction of kdamond CPU
>> time: the lock_acquire / local_lock pair plus __get_random_u32_below()
>> dominate perf profiles.
> Could you please share more details about the use case? I'm particularly
> curious how you ended up setting 'nr_regions' that high, while the upper
> limit of nr_regions is set to 10,000 by default.

We use DAMON paddr on a 2 TiB host for per-cgroup hot/cold page
classification. Target cgroups can be as small as 1-2% of total
memory (tens of GiB).

With the default max_nr_regions=1000, each region covers ~2 GiB on
average, which is often larger than the entire target cgroup.
Adaptive split cannot carve out cgroup-homogeneous regions in that
regime: each region's nr_accesses averages the (small) cgroup
fraction with the surrounding non-cgroup pages, and the cgroup's
access signal gets washed out.

Raising max_nr_regions to 10k-20k gives the adaptive split enough
spatial resolution that cgroup-majority regions can form at cgroup
boundaries (allocations have enough physical locality in practice for
this to work -- THP, per-node allocation, etc.). At that region
count, damon_rand() starts showing up at the top of kdamond profiles,
which is what motivated this patch.

> I know some people worry that the limit is too low and could result in
> poor monitoring accuracy. Therefore we developed [1] monitoring
> intervals auto-tuning. From multiple tests on real environments it
> showed somewhat convincing results, and therefore I nowadays suggest
> DAMON users try it if they haven't.
>
> I'm a bit concerned that this may be over-engineering. It would be
> helpful to know whether it is, if you could share the more detailed use
> case.

Thanks for pointing to intervals auto-tuning - we do use it. But it
trades sampling frequency against monitoring overhead; it cannot
change the spatial resolution. With N=1000 regions on a 2 TiB host, a
20 GiB cgroup cannot be resolved no matter how we tune
sample_us / aggr_us, because the region boundary itself averages
cgroup and non-cgroup pages together. So raising max_nr_regions and
making the per-region overhead cheap are complementary to interval
auto-tuning, not redundant with it.

>> Introduce damon_rand_fast(), which uses a lockless lfsr113 generator
>> (struct rnd_state) held per damon_ctx and seeded from get_random_u64()
>> in damon_new_ctx(). kdamond is the sole consumer of a given ctx, so no
>> synchronization is required. Range mapping uses Lemire's
>> (u64)rnd * span >> 32 to avoid a 64-bit division; residual bias is
>> bounded by span / 2^32, negligible for statistical sampling.
>>
>> The new helper is intended for the sampling-address hot path only.
>> damon_rand() is kept for call sites that run outside the kdamond loop
>> and/or have no ctx available (damon_split_regions_of(), kunit tests).
>>
>> Convert the two hot callers:
>>
>> - __damon_pa_prepare_access_check()
>> - __damon_va_prepare_access_check()
>>
>> lfsr113 is a linear PRNG and MUST NOT be used for anything
>> security-sensitive. DAMON's sampling_addr is not exposed to userspace
>> and is only consumed as a probe point for PTE accessed-bit sampling, so
>> a non-cryptographic PRNG is appropriate here.
>>
>> Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU
>> usage reduced from ~72% to ~50% of one core.
>>
>> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
>> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
>> ---
>>  include/linux/damon.h | 28 ++++++++++++++++++++++++++++
>>  mm/damon/core.c       |  2 ++
>>  mm/damon/paddr.c      | 10 +++++-----
>>  mm/damon/vaddr.c      |  9 +++++----
>>  4 files changed, 40 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/damon.h b/include/linux/damon.h
>> index f2cdb7c3f5e6..0afdc08119c8 100644
>> --- a/include/linux/damon.h
>> +++ b/include/linux/damon.h
>> @@ -10,6 +10,7 @@
>>
>>  #include <linux/memcontrol.h>
>>  #include <linux/mutex.h>
>> +#include <linux/prandom.h>
>>  #include <linux/time64.h>
>>  #include <linux/types.h>
>>  #include <linux/random.h>
>> @@ -843,8 +844,35 @@ struct damon_ctx {
>>
>>  	struct list_head adaptive_targets;
>>  	struct list_head schemes;
>> +
>> +	/*
>> +	 * Per-ctx lockless PRNG state for damon_rand_fast(). Seeded from
>> +	 * get_random_u64() in damon_new_ctx(). Owned exclusively by the
>> +	 * kdamond thread of this ctx, so no locking is required.
>> +	 */
>> +	struct rnd_state rnd_state;
>>  };
>>
>> +/*
>> + * damon_rand_fast - per-ctx PRNG variant of damon_rand() for hot paths.
>> + *
>> + * Uses the lockless lfsr113 state kept in @ctx->rnd_state. Safe because
>> + * kdamond is the single consumer of a given ctx, so no synchronization
>> + * is required. Quality is sufficient for statistical sampling; do NOT
>> + * use for any security-sensitive randomness.
>> + *
>> + * Range mapping uses Lemire's (u64)rnd * span >> 32 to avoid a division;
>> + * bias is bounded by span / 2^32, negligible for DAMON.
>> + */
>> +static inline unsigned long damon_rand_fast(struct damon_ctx *ctx,
>> +		unsigned long l, unsigned long r)
>> +{
>> +	u32 rnd = prandom_u32_state(&ctx->rnd_state);
>> +	u32 span = (u32)(r - l);
>> +
>> +	return l + (unsigned long)(((u64)rnd * span) >> 32);
>> +}
> As Sashiko pointed out [2], we may better return 'unsigned long' from
> this function. Can this algorithm be extended for that?

Sashiko is correct that (u32)(r - l) truncates spans greater than
4 GiB. This is a pre-existing limitation of damon_rand() itself, which
passes r - l to get_random_u32_below() (a u32 parameter) and truncates
the same way. My patch makes that truncation explicit but does not
introduce a new bug.

We can either restrict the region size via a config, or just allow
spans greater than 4 GiB:

static inline unsigned long damon_rand_fast(struct damon_ctx *ctx,
		unsigned long l, unsigned long r)
{
	unsigned long span = r - l;
	u64 rnd;

	if (likely(span <= U32_MAX)) {
		rnd = prandom_u32_state(&ctx->rnd_state);
		return l + (unsigned long)((rnd * span) >> 32);
	}

	rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) |
	      prandom_u32_state(&ctx->rnd_state);
	/* the 64-bit analogue of the 32-bit mapping above */
	return l + mul_u64_u64_shr(rnd, span, 64);
}

>> +
>>  static inline struct damon_region *damon_next_region(struct damon_region *r)
>>  {
>>  	return container_of(r->list.next, struct damon_region, list);
>> diff --git a/mm/damon/core.c b/mm/damon/core.c
>> index 3dbbbfdeff71..c3779c674601 100644
>> --- a/mm/damon/core.c
>> +++ b/mm/damon/core.c
>> @@ -607,6 +607,8 @@ struct damon_ctx *damon_new_ctx(void)
>>  	INIT_LIST_HEAD(&ctx->adaptive_targets);
>>  	INIT_LIST_HEAD(&ctx->schemes);
>>
>> +	prandom_seed_state(&ctx->rnd_state, get_random_u64());
>> +
>>  	return ctx;
>>  }
>>
>> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
>> index 5cdcc5037cbc..b5e1197f2ba2 100644
>> --- a/mm/damon/paddr.c
>> +++ b/mm/damon/paddr.c
>> @@ -48,12 +48,12 @@ static void damon_pa_mkold(phys_addr_t paddr)
>>  	folio_put(folio);
>>  }
>>
>> -static void __damon_pa_prepare_access_check(struct damon_region *r,
>> -		unsigned long addr_unit)
>> +static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
>> +		struct damon_region *r)
> Let's keep 'r' on the first line, and update the second line without
> indent change.
>
Thanks

>> {
>> -	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
>> +	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
>>
>> -	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, addr_unit));
>> +	damon_pa_mkold(damon_pa_phys_addr(r->sampling_addr, ctx->addr_unit));
>>  }
>>
>>  static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
>> @@ -63,7 +63,7 @@ static void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
>>
>>  	damon_for_each_target(t, ctx) {
>>  		damon_for_each_region(r, t)
>> -			__damon_pa_prepare_access_check(r, ctx->addr_unit);
>> +			__damon_pa_prepare_access_check(ctx, r);
>>  	}
>>  }
>>
>> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
>> index b069dbc7e3d2..6cf06ffdf880 100644
>> --- a/mm/damon/vaddr.c
>> +++ b/mm/damon/vaddr.c
>> @@ -332,10 +332,11 @@ static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
>>   * Functions for the access checking of the regions
>>   */
>>
>> -static void __damon_va_prepare_access_check(struct mm_struct *mm,
>> -		struct damon_region *r)
>> +static void __damon_va_prepare_access_check(struct damon_ctx *ctx,
>> +		struct mm_struct *mm,
>> +		struct damon_region *r)
> Let's keep the first line and the indentation as they were, and add the
> 'ctx' argument to the end.
>
>> {
>> -	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
>> +	r->sampling_addr = damon_rand_fast(ctx, r->ar.start, r->ar.end);
>>
>>  	damon_va_mkold(mm, r->sampling_addr);
>>  }
>> @@ -351,7 +352,7 @@ static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
>>  		if (!mm)
>>  			continue;
>>  		damon_for_each_region(r, t)
>> -			__damon_va_prepare_access_check(mm, r);
>> +			__damon_va_prepare_access_check(ctx, mm, r);
>>  		mmput(mm);
>>  	}
>>  }
>> --
>> 2.43.0
>>
>
> [1] https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org
> [2] https://lore.kernel.org/20260423190841.821E4C2BCAF@smtp.kernel.org
>
> Thanks,
> SJ

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG 2026-04-24 2:29 ` Jiayuan Chen @ 2026-04-24 15:11 ` SeongJae Park 2026-04-25 3:36 ` Jiayuan Chen 0 siblings, 1 reply; 9+ messages in thread From: SeongJae Park @ 2026-04-24 15:11 UTC (permalink / raw) To: Jiayuan Chen Cc: SeongJae Park, damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel On Fri, 24 Apr 2026 10:29:58 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > Hello SJ, > > Thank you for the review. > > On 4/24/26 9:36 AM, SeongJae Park wrote: > > Hello Jiayuan, > > > > > > Thank you for sharing this patch with us! > > > > On Thu, 23 Apr 2026 20:23:36 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > >> From: Jiayuan Chen <jiayuan.chen@shopee.com> > >> > >> damon_rand() on the sampling_addr hot path calls get_random_u32_below(), > >> which takes a local_lock_irqsave() around a per-CPU batched entropy pool > >> and periodically refills it with ChaCha20. On workloads with large > >> nr_regions (20k+), this shows up as a large fraction of kdamond CPU > >> time: the lock_acquire / local_lock pair plus __get_random_u32_below() > >> dominate perf profiles. > > Could you please share more details about the use case? I'm particularly > > curious how you ended up setting 'nr_regions' that high, while the upper limit > > of nr_regions is set to 1,000 by default. > > We use DAMON paddr on a 2 TiB host for per-cgroup hot/cold page > classification. Target cgroups can be as small as 1-2% of total > memory (tens of GiB). Thank you for sharing this detail! Could you further share what you do with the hot/cold information? We found some people want to use DAMON for proactive cold pages reclamation but they wanted to use it for only file-backed pages, and therefore developed a page level DAMOS filter.
> > With the default max_nr_regions=1000, each region covers ~2 GiB on > average, which is often larger than the entire target cgroup. > Adaptive split cannot carve out cgroup-homogeneous regions in that > regime: each region's nr_accesses averages the (small) cgroup > fraction with the surrounding non-cgroup pages, and the cgroup's > access signal gets washed out. But, pages of cgroups would be scattered around on the physical address space. So even if DAMON finds a hot or cold region on physical address space that having a size smaller than a cgroup's memory, it is hard to say if it is for a specific cgroup. How do you know if the region is for which cgroup? Do you have a way to make sure pages of same cgroup are gathered on the physical address space? If not, I'm not sure how ensuring size of region to be equal or smaller than each cgroup size helps you. We are supporting page level monitoring [1] for such cgroup-aware use case. Are you using it? But again, if you use it, I'm not sure why you need to ensure DAMON region size smaller than the cgroup size. Fyi, I'm also working [2] on making the cgroup-aware monitoring much more lightweight. > > Raising max_nr_regions to 10k-20k gives the adaptive split enough > spatial resolution that cgroup-majority regions can form at cgroup > boundaries (allocations have enough physical locality in practice for > this to work -- THP, per-node allocation, etc.). At that region > count, damon_rand() starts showing up at the top of kdamond profiles, > which is what motivated this patch. Thank you for sharing how this patch has developed. I'm still curious if there is yet more ways to make DAMON better for your use case. But this patch makes sense in general to me. > > > > I know some people worry if the limit is too low and it could result in poor > > monitoring accuracy. Therefore we developed [1] monitoring intervals > > auto-tuning. 
From multiple tests on real environment it showed somewhat > > convincing results, and therefore I nowadays suggest DAMON users to try that if > > they didn't try. > > > > I'm bit concerned if this is over-engineering. It would be helpful to know if > > it is, if you could share the more detailed use case. > > > Thanks for pointing to intervals auto-tuning - we do use it. But it > > trades sampling frequency against monitoring overhead; it cannot > change the spatial resolution. With N=1000 regions on a 2 TiB host, > a 20 GiB cgroup cannot be resolved no matter how we tune sample_us / > aggr_us, because the region boundary itself averages cgroup and > > non-cgroup pages together. > > So raising max_nr_regions and making the per-region overhead cheap > are complementary to interval auto-tuning, not redundant with it. Makes sense. I'm still not sure how it helps for cgroup. But, if you do want it and if it helps you, you are of course ok to use it in your way, and I will help you. :) > > >> Introduce damon_rand_fast(), which uses a lockless lfsr113 generator > >> (struct rnd_state) held per damon_ctx and seeded from get_random_u64() > >> in damon_new_ctx(). kdamond is the sole consumer of a given ctx, so no > >> synchronization is required. Range mapping uses Lemire's > >> (u64)rnd * span >> 32 to avoid a 64-bit division; residual bias is > >> bounded by span / 2^32, negligible for statistical sampling. > >> > >> The new helper is intended for the sampling-address hot path only. > >> damon_rand() is kept for call sites that run outside the kdamond loop > >> and/or have no ctx available (damon_split_regions_of(), kunit tests). > >> > >> Convert the two hot callers: > >> > >> - __damon_pa_prepare_access_check() > >> - __damon_va_prepare_access_check() > >> > >> lfsr113 is a linear PRNG and MUST NOT be used for anything > >> security-sensitive. 
DAMON's sampling_addr is not exposed to userspace > >> and is only consumed as a probe point for PTE accessed-bit sampling, so > >> a non-cryptographic PRNG is appropriate here. > >> > >> Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU > >> usage reduced from ~72% to ~50% of one core. > >> > >> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> > >> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> > >> --- > >> include/linux/damon.h | 28 ++++++++++++++++++++++++++++ > >> mm/damon/core.c | 2 ++ > >> mm/damon/paddr.c | 10 +++++----- > >> mm/damon/vaddr.c | 9 +++++---- > >> 4 files changed, 40 insertions(+), 9 deletions(-) > >> > >> diff --git a/include/linux/damon.h b/include/linux/damon.h > >> index f2cdb7c3f5e6..0afdc08119c8 100644 > >> --- a/include/linux/damon.h > >> +++ b/include/linux/damon.h > >> @@ -10,6 +10,7 @@ > >> > >> #include <linux/memcontrol.h> > >> #include <linux/mutex.h> > >> +#include <linux/prandom.h> > >> #include <linux/time64.h> > >> #include <linux/types.h> > >> #include <linux/random.h> > >> @@ -843,8 +844,35 @@ struct damon_ctx { > >> > >> struct list_head adaptive_targets; > >> struct list_head schemes; > >> + > >> + /* > >> + * Per-ctx lockless PRNG state for damon_rand_fast(). Seeded from > >> + * get_random_u64() in damon_new_ctx(). Owned exclusively by the > >> + * kdamond thread of this ctx, so no locking is required. > >> + */ > >> + struct rnd_state rnd_state; > >> }; > >> > >> +/* > >> + * damon_rand_fast - per-ctx PRNG variant of damon_rand() for hot paths. > >> + * > >> + * Uses the lockless lfsr113 state kept in @ctx->rnd_state. Safe because > >> + * kdamond is the single consumer of a given ctx, so no synchronization > >> + * is required. Quality is sufficient for statistical sampling; do NOT > >> + * use for any security-sensitive randomness. > >> + * > >> + * Range mapping uses Lemire's (u64)rnd * span >> 32 to avoid a division; > >> + * bias is bounded by span / 2^32, negligible for DAMON. 
> >> + */ > >> +static inline unsigned long damon_rand_fast(struct damon_ctx *ctx, > >> + unsigned long l, unsigned long r) > >> +{ > >> + u32 rnd = prandom_u32_state(&ctx->rnd_state); > >> + u32 span = (u32)(r - l); > >> + > >> + return l + (unsigned long)(((u64)rnd * span) >> 32); > >> +} > > As Sashiko pointed out [2], we may better to return 'unsigned long' from this > > function. Can this algorithm be extended for that? > > Sashiko is correct that (u32)(r - l) truncates spans greater than 4 GiB. > > This is a pre-existing limitation of damon_rand() itself, which > passes r - l to get_random_u32_below() (a u32 parameter) and > truncates the same way. My patch makes that truncation > explicit but does not introduce a new bug. You are 100% correct and I agree. Thank you for making this clear. I should also mentioned and underlined this in my previous reply, but I forgot it. So I think this patch is ok to be merged as is (after addressing my nit trivial comments about coding styles), but we may still want to fix it in future. So I was wondering if such future fix for this new, shiny and efficient function is easy. > > We can restrict the region size when config or just allow spans greater > than 4G. > > static inline unsigned long damon_rand_fast(struct damon_ctx *ctx, > unsigned long l, unsigned long r) > { > unsigned long span = r - l; > u64 rnd; > > if (likely(span <= U32_MAX)) { I think we could drop 'likely' and hope compiler or hardware branch prediction to be smart enough. > rnd = prandom_u32_state(&ctx->rnd_state); > return l + (unsigned long)((rnd * span) >> 32); > } > rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) | > prandom_u32_state(&ctx->rnd_state); > > // same as 'l + (unsigned long)((rnd * span) >> 32);' > return l + mul_u64_u64_shr(rnd, span, 64); Makes sense to me. 
> } > > > >> + > >> static inline struct damon_region *damon_next_region(struct damon_region *r) > >> { > >> return container_of(r->list.next, struct damon_region, list); > >> diff --git a/mm/damon/core.c b/mm/damon/core.c > >> index 3dbbbfdeff71..c3779c674601 100644 > >> --- a/mm/damon/core.c > >> +++ b/mm/damon/core.c > >> @@ -607,6 +607,8 @@ struct damon_ctx *damon_new_ctx(void) > >> INIT_LIST_HEAD(&ctx->adaptive_targets); > >> INIT_LIST_HEAD(&ctx->schemes); > >> > >> + prandom_seed_state(&ctx->rnd_state, get_random_u64()); > >> + > >> return ctx; > >> } > >> > >> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c > >> index 5cdcc5037cbc..b5e1197f2ba2 100644 > >> --- a/mm/damon/paddr.c > >> +++ b/mm/damon/paddr.c > >> @@ -48,12 +48,12 @@ static void damon_pa_mkold(phys_addr_t paddr) > >> folio_put(folio); > >> } > >> > >> -static void __damon_pa_prepare_access_check(struct damon_region *r, > >> - unsigned long addr_unit) > >> +static void __damon_pa_prepare_access_check(struct damon_ctx *ctx, > >> + struct damon_region *r) > > Let's keep 'r' on the first line, and update the second line without indent > > change. > > > Thanks You're welcome :) So I'm still unsure if this is the only single and best way for your use case. I'd like to further discuss your use case and find more way to help it if you are also interested. So, I'd like to continue the conversation. Regardless of it, though, I think this patch is good enough to be merged for general use case, though it require above trivial style changes. I wouldn't strongly request you to solve the >4GiB region issue together, but it would be nice to be added from the beginning. It's up to your choice. I'm also waiting for your answer to the second patch of this series. I'm still bit skeptical about it, though. To my simple brain, it feels introducing too much complexity. 
I feel like making this damon_rand() optimization patch merged first and keep discussing your use case and the second patch later might be a path forward. If you agree, please feel free to post a new version of this patch. [1] https://github.com/damonitor/damo/blob/next/USAGE.md#damo-report-access-page-level-properties-based-access-monitoring [2] https://lore.kernel.org/linux-mm/20260307210250.204245-1-sj@kernel.org/ Thanks, SJ [...] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG 2026-04-24 15:11 ` SeongJae Park @ 2026-04-25 3:36 ` Jiayuan Chen 2026-04-25 15:59 ` SeongJae Park 0 siblings, 1 reply; 9+ messages in thread From: Jiayuan Chen @ 2026-04-25 3:36 UTC (permalink / raw) To: SeongJae Park; +Cc: damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel On 4/24/26 11:11 PM, SeongJae Park wrote: > On Fri, 24 Apr 2026 10:29:58 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > >> Hello SJ, >> >> Thank you for the review. >> >> On 4/24/26 9:36 AM, SeongJae Park wrote: >>> Hello Jiayuan, >>> >>> >>> Thank you for sharing this patch with us! >>> >>> On Thu, 23 Apr 2026 20:23:36 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: >>> >>>> From: Jiayuan Chen <jiayuan.chen@shopee.com> >>>> >>>> damon_rand() on the sampling_addr hot path calls get_random_u32_below(), >>>> which takes a local_lock_irqsave() around a per-CPU batched entropy pool >>>> and periodically refills it with ChaCha20. On workloads with large >>>> nr_regions (20k+), this shows up as a large fraction of kdamond CPU >>>> time: the lock_acquire / local_lock pair plus __get_random_u32_below() >>>> dominate perf profiles. >>> Could you please share more details about the use case? I'm particularly >>> curious how you ended up setting 'nr_regiions' that high, while the upper limit >>> of nr_regions is set to 1,0000 by default. >> We use DAMON paddr on a 2 TiB host for per-cgroup hot/cold page >> classification. Target cgroups can be as small as 1-2% of total >> memory (tens of GiB). > Thank you for sharing this detail! > > Could you further share me what do you do with the hot/cold information? We > found some people want to use DAMON for proactive cold pages reclamation but > they watned to use it for only file-backed pages, and therefore developed page > level DAMOS filter. 
In a similar way, if we know your detailed use case, we > might be able to serve you better by finding not well advertised features, or > developing a new features. > Hello SJ, Thanks. Use case: mixed-workload Kubernetes host (latency-sensitive + batch pods on the same machine). We need continuous per-pod cold/hot signal for two things -- k8s scheduling decisions (which pod can give up memory under host pressure) and sizing the eventual reclamation through memory.reclaim. Both anon and file pages matter (swap is configured); reclamation itself goes through mainline. What we add is the per-pod cold attribution feeding the scheduler. DAMON_RECLAIM alone doesn't fit -- it is reclaim-only with aggregate stats, while we need continuous per-pod visibility even when reclamation isn't firing. Let me re-frame the region-size point, since I don't think I explained it well last time. We are NOT trying to make region size match cgroup size; you're right that pages are scattered and that mapping doesn't work. The high nr_regions is so that DAMON's adaptive split has enough resolution to identify cold sub-areas of physical memory at all -- regardless of cgroup. With ~2 GiB default regions on a 2 TiB host, a small pod's pages are averaged with thousands of non-pod pages in the same region, and the region never reaches nr_accesses=0 even when the pod is genuinely idle. The cold signal is gone before any cgroup attribution happens. Cgroup attribution itself is done at sample granularity (folio_memcg per sampled page), not at region granularity -- the regions just need to be fine enough that there *is* a cold signal to attribute. >> With the default max_nr_regions=1000, each region covers ~2 GiB on >> average, which is often larger than the entire target cgroup. 
>> Adaptive split cannot carve out cgroup-homogeneous regions in that >> regime: each region's nr_accesses averages the (small) cgroup >> fraction with the surrounding non-cgroup pages, and the cgroup's >> access signal gets washed out. > But, pages of cgroups would be scattered around on the physical address space. > So even if DAMON finds a hot or cold region on physical address space that > having a size smaller than a cgroup's memory, it is hard to say if it is for a > specific cgroup. How do you know if the region is for which cgroup? Do you > have a way to make sure pages of same cgroup are gathered on the physical > address space? If not, I'm not sure how ensuring size of region to be equal or > smaller than each cgroup size helps you. > > We are supporting page level monitoring [1] for such cgroup-aware use case. > Are you using it? But again, if you use it, I'm not sure why you need to > ensure DAMON region size smaller than the cgroup size. This is also why DAMOS memcg filters do not bypass the nr_regions constraint -- the scheme's access-pattern matcher operates on the already-averaged region, so by the time any filter sees pages the signal is already lost. Whatever attribution mechanism we use afterwards (custom callback, memcg filter, page-level reporting), the region count needs to be high enough first. > Fyi, I'm also working [2] on making the cgroup-aware monitoring much more > lightweight. Looking forward to it -- hopefully it'll cover our case. 😁 >> Raising max_nr_regions to 10k-20k gives the adaptive split enough >> spatial resolution that cgroup-majority regions can form at cgroup >> boundaries (allocations have enough physical locality in practice for >> this to work -- THP, per-node allocation, etc.). At that region >> count, damon_rand() starts showing up at the top of kdamond profiles, >> which is what motivated this patch. > Thank you for sharing how this patch has developed. 
I'm still curious if there > is yet more ways to make DAMON better for your use case. But this patch makes > sense in general to me. > >> >>> I know some people worry if the limit is too low and it could result in poor >>> monitoring accuracy. Therefore we developed [1] monitoring intervals >>> auto-tuning. From multiple tests on real environment it showed somewhat >>> convincing results, and therefore I nowadays suggest DAMON users to try that if >>> they didn't try. >>> >>> I'm bit concerned if this is over-engineering. It would be helpful to know if >>> it is, if you could share the more detailed use case. >> >> Thanks for pointing to intervals auto-tuning - we do use it. But it >> >> trades sampling frequency against monitoring overhead; it cannot >> change the spatial resolution. With N=1000 regions on a 2 TiB host, >> a 20 GiB cgroup cannot be resolved no matter how we tune sample_us / >> aggr_us, because the region boundary itself averages cgroup and >> >> non-cgroup pages together. >> >> So raising max_nr_regions and making the per-region overhead cheap >> are complementary to interval auto-tuning, not redundant with it. > Makes sense. I'm still not sure how it helps for cgroup. But, if you do want > it and if it helps you, you are of course ok to use it in your way, and I will > help you. :) > [...] >> Sashiko is correct that (u32)(r - l) truncates spans greater than 4 GiB. >> >> This is a pre-existing limitation of damon_rand() itself, which >> passes r - l to get_random_u32_below() (a u32 parameter) and >> truncates the same way. My patch makes that truncation >> explicit but does not introduce a new bug. > You are 100% correct and I agree. Thank you for making this clear. I should > also mentioned and underlined this in my previous reply, but I forgot it. > > So I think this patch is ok to be merged as is (after addressing my nit trivial > comments about coding styles), but we may still want to fix it in future. 
> So I was wondering if such future fix for this new, shiny and efficient function is easy. damon_rand() is now only called by damon_split_regions_of() with the constant range (1, 10); maybe we can rename it to damon_rand_u32() to make the u32 constraint explicit in the API name. That closes out the truncation concern at the legacy helper without needing a separate series. Will do. v2 is a single patch with the style fixes (drop 'likely', 'r' on the first line for paddr.c, ctx at the end for vaddr.c) and 64-bit handling: static inline unsigned long damon_rand_fast(struct damon_ctx *ctx, unsigned long l, unsigned long r) { unsigned long span = r - l; u64 rnd; if (span <= U32_MAX) { rnd = prandom_u32_state(&ctx->rnd_state); return l + (unsigned long)((rnd * span) >> 32); } rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) | prandom_u32_state(&ctx->rnd_state); return l + mul_u64_u64_shr(rnd, span, 64); } I will also drop the prefetch patch. >> We can restrict the region size when config or just allow spans greater >> than 4G. >> >> static inline unsigned long damon_rand_fast(struct damon_ctx *ctx, >> unsigned long l, unsigned long r) >> { >> unsigned long span = r - l; >> u64 rnd; >> >> if (likely(span <= U32_MAX)) { > I think we could drop 'likely' and hope compiler or hardware branch prediction > to be smart enough. > >> rnd = prandom_u32_state(&ctx->rnd_state); >> return l + (unsigned long)((rnd * span) >> 32); >> } >> rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) | >> prandom_u32_state(&ctx->rnd_state); >> >> // same as 'l + (unsigned long)((rnd * span) >> 32);' >> return l + mul_u64_u64_shr(rnd, span, 64); > Makes sense to me. 
> >> } >> >> >>>> + >>>> static inline struct damon_region *damon_next_region(struct damon_region *r) >>>> { >>>> return container_of(r->list.next, struct damon_region, list); >>>> diff --git a/mm/damon/core.c b/mm/damon/core.c >>>> index 3dbbbfdeff71..c3779c674601 100644 >>>> --- a/mm/damon/core.c >>>> +++ b/mm/damon/core.c >>>> @@ -607,6 +607,8 @@ struct damon_ctx *damon_new_ctx(void) >>>> INIT_LIST_HEAD(&ctx->adaptive_targets); >>>> INIT_LIST_HEAD(&ctx->schemes); >>>> >>>> + prandom_seed_state(&ctx->rnd_state, get_random_u64()); >>>> + >>>> return ctx; >>>> } >>>> >>>> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c >>>> index 5cdcc5037cbc..b5e1197f2ba2 100644 >>>> --- a/mm/damon/paddr.c >>>> +++ b/mm/damon/paddr.c >>>> @@ -48,12 +48,12 @@ static void damon_pa_mkold(phys_addr_t paddr) >>>> folio_put(folio); >>>> } >>>> >>>> -static void __damon_pa_prepare_access_check(struct damon_region *r, >>>> - unsigned long addr_unit) >>>> +static void __damon_pa_prepare_access_check(struct damon_ctx *ctx, >>>> + struct damon_region *r) >>> Let's keep 'r' on the first line, and update the second line without indent >>> change. >>> >> Thanks > You're welcome :) > > So I'm still unsure if this is the only single and best way for your use case. > I'd like to further discuss your use case and find more way to help it if you > are also interested. So, I'd like to continue the conversation. > > Regardless of it, though, I think this patch is good enough to be merged for > general use case, though it require above trivial style changes. I wouldn't > strongly request you to solve the >4GiB region issue together, but it would be > nice to be added from the beginning. It's up to your choice. > > I'm also waiting for your answer to the second patch of this series. I'm still > bit skeptical about it, though. To my simple brain, it feels introducing too > much complexity. 
I feel like making this damon_rand() optimization patch > merged first and keep discussing your use case and the second patch later might > be a path forward. If you agree, please feel free to post a new version of > this patch. > > > [1] https://github.com/damonitor/damo/blob/next/USAGE.md#damo-report-access-page-level-properties-based-access-monitoring > [2] https://lore.kernel.org/linux-mm/20260307210250.204245-1-sj@kernel.org/ > > > Thanks, > SJ > > [...] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG 2026-04-25 3:36 ` Jiayuan Chen @ 2026-04-25 15:59 ` SeongJae Park 2026-04-26 5:50 ` Jiayuan Chen 0 siblings, 1 reply; 9+ messages in thread From: SeongJae Park @ 2026-04-25 15:59 UTC (permalink / raw) To: Jiayuan Chen Cc: SeongJae Park, damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel On Sat, 25 Apr 2026 11:36:02 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > On 4/24/26 11:11 PM, SeongJae Park wrote: > > On Fri, 24 Apr 2026 10:29:58 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > >> Hello SJ, > >> > >> Thank you for the review. > >> > >> On 4/24/26 9:36 AM, SeongJae Park wrote: > >>> Hello Jiayuan, > >>> > >>> > >>> Thank you for sharing this patch with us! > >>> > >>> On Thu, 23 Apr 2026 20:23:36 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > >>> > >>>> From: Jiayuan Chen <jiayuan.chen@shopee.com> > >>>> > >>>> damon_rand() on the sampling_addr hot path calls get_random_u32_below(), > >>>> which takes a local_lock_irqsave() around a per-CPU batched entropy pool > >>>> and periodically refills it with ChaCha20. On workloads with large > >>>> nr_regions (20k+), this shows up as a large fraction of kdamond CPU > >>>> time: the lock_acquire / local_lock pair plus __get_random_u32_below() > >>>> dominate perf profiles. > >>> Could you please share more details about the use case? I'm particularly > >>> curious how you ended up setting 'nr_regiions' that high, while the upper limit > >>> of nr_regions is set to 1,0000 by default. > >> We use DAMON paddr on a 2 TiB host for per-cgroup hot/cold page > >> classification. Target cgroups can be as small as 1-2% of total > >> memory (tens of GiB). > > Thank you for sharing this detail! > > > > Could you further share me what do you do with the hot/cold information? 
We > > found some people want to use DAMON for proactive cold pages reclamation but > > they wanted to use it for only file-backed pages, and therefore developed a page > > level DAMOS filter. In a similar way, if we know your detailed use case, we > > might be able to serve you better by finding not well advertised features, or > > developing new features. > > > > Hello SJ, > > Thanks. > > Use case: mixed-workload Kubernetes host (latency-sensitive + batch > pods on the same machine). We need continuous per-pod cold/hot > signal for two things -- k8s scheduling decisions (which pod can > give up memory under host pressure) and sizing the eventual > reclamation through memory.reclaim. Both anon and file pages > matter (swap is configured); reclamation itself goes through > mainline. What we add is the per-pod cold attribution feeding the > scheduler. Thank you for sharing this detail! > > DAMON_RECLAIM alone doesn't fit -- it is reclaim-only with > aggregate stats, while we need continuous per-pod visibility even > when reclamation isn't firing. > > Let me re-frame the region-size point, since I don't think I > explained it well last time. We are NOT trying to make region > size match cgroup size; you're right that pages are scattered and > that mapping doesn't work. The high nr_regions is so that DAMON's > adaptive split has enough resolution to identify cold sub-areas of > physical memory at all -- regardless of cgroup. Thank you, this helps me better understand your concern. > With ~2 GiB > default regions on a 2 TiB host, a small pod's pages are averaged > with thousands of non-pod pages in the same region, and the > region never reaches nr_accesses=0 even when the pod is genuinely > idle. But, the adaptive regions adjustment mechanism dynamically changes the size of regions, down to 4 KiB. If the small pod's page is really cold while its surrounding pages are not, DAMON should down-size the region to capture only the page and show you nr_accesses=0. 
> The cold signal is gone before any cgroup attribution > happens. Cgroup attribution itself is done at sample granularity > (folio_memcg per sampled page), not at region granularity -- the > regions just need to be fine enough that there *is* a cold signal > to attribute. Could you please share more details about what the cgroup attribution is and how it is done? I guess that is the way to map DAMON's monitoring regions to each cgroup to determine if each cgroup is hot or cold. I'm unsure how it is really done. > > >> With the default max_nr_regions=1000, each region covers ~2 GiB on > >> average, which is often larger than the entire target cgroup. > >> Adaptive split cannot carve out cgroup-homogeneous regions in that > >> regime: each region's nr_accesses averages the (small) cgroup > >> fraction with the surrounding non-cgroup pages, and the cgroup's > >> access signal gets washed out. > > But, pages of cgroups would be scattered around on the physical address space. > > So even if DAMON finds a hot or cold region on physical address space that > > having a size smaller than a cgroup's memory, it is hard to say if it is for a > > specific cgroup. How do you know if the region is for which cgroup? Do you > > have a way to make sure pages of same cgroup are gathered on the physical > > address space? If not, I'm not sure how ensuring size of region to be equal or > > smaller than each cgroup size helps you. > > > > We are supporting page level monitoring [1] for such cgroup-aware use case. > > Are you using it? But again, if you use it, I'm not sure why you need to > > ensure DAMON region size smaller than the cgroup size. > This is also why DAMOS memcg filters do not bypass the > nr_regions constraint -- the scheme's access-pattern matcher > operates on the already-averaged region, so by the time any > filter sees pages the signal is already lost. 
Whatever > attribution mechanism we use afterwards (custom callback, memcg > filter, page-level reporting), the region count needs to be high > enough first. But, I'm not understanding why the regions adjustment mechanism cannot work here. Your answer to my above question will be helpful. > > Fyi, I'm also working [2] on making the cgroup-aware monitoring much more > > lightweight. > > > Looking forward to it -- hopefully it'll cover our case. 😁 I hope so, too! :) > > > >> Raising max_nr_regions to 10k-20k gives the adaptive split enough > >> spatial resolution that cgroup-majority regions can form at cgroup > >> boundaries (allocations have enough physical locality in practice for > >> this to work -- THP, per-node allocation, etc.). At that region > >> count, damon_rand() starts showing up at the top of kdamond profiles, > >> which is what motivated this patch. > > Thank you for sharing how this patch has developed. I'm still curious if there > > is yet more ways to make DAMON better for your use case. But this patch makes > > sense in general to me. > > > >> > >>> I know some people worry if the limit is too low and it could result in poor > >>> monitoring accuracy. Therefore we developed [1] monitoring intervals > >>> auto-tuning. From multiple tests on real environment it showed somewhat > >>> convincing results, and therefore I nowadays suggest DAMON users to try that if > >>> they didn't try. > >>> > >>> I'm bit concerned if this is over-engineering. It would be helpful to know if > >>> it is, if you could share the more detailed use case. > >> > >> Thanks for pointing to intervals auto-tuning - we do use it. But it > >> > >> trades sampling frequency against monitoring overhead; it cannot > >> change the spatial resolution. With N=1000 regions on a 2 TiB host, > >> a 20 GiB cgroup cannot be resolved no matter how we tune sample_us / > >> aggr_us, because the region boundary itself averages cgroup and > >> > >> non-cgroup pages together. 
> >> > >> So raising max_nr_regions and making the per-region overhead cheap > >> are complementary to interval auto-tuning, not redundant with it. > > Makes sense. I'm still not sure how it helps for cgroup. But, if you do want > > it and if it helps you, you are of course ok to use it in your way, and I will > > help you. :) > > > [...] > >> Sashiko is correct that (u32)(r - l) truncates spans greater than 4 GiB. > >> > >> This is a pre-existing limitation of damon_rand() itself, which > >> passes r - l to get_random_u32_below() (a u32 parameter) and > >> truncates the same way. My patch makes that truncation > >> explicit but does not introduce a new bug. > > You are 100% correct and I agree. Thank you for making this clear. I should > > also mentioned and underlined this in my previous reply, but I forgot it. > > > > So I think this patch is ok to be merged as is (after addressing my nit trivial > > comments about coding styles), but we may still want to fix it in future. So I > damon_rand() is now only called by damon_split_regions_of() with > the constant range (1, 10). may by we can rename it to > damon_rand_u32() to make the u32 constraint explicit in the API > name; that closes out the truncation concern at the legacy helper > without needing a separate series. Good point. I'm wondering if we have a reason to keep using damon_rand() at all. I find no such reason. If you also find no real reason, how about simply removing existing damon_rand() and renaming damon_rand_fast() to damon_rand()? > > was wondering if such future fix for this new, shiny and efficient function is > > easy. > > > Will do. 
>
> v2 is a single patch with the style fixes (drop 'likely', 'r' on the
> first line for paddr.c, ctx at the end for vaddr.c) and 64-bit
> handling:
>
> static inline unsigned long damon_rand_fast(struct damon_ctx *ctx,
> 		unsigned long l, unsigned long r)
> {
> 	unsigned long span = r - l;
> 	u64 rnd;
>
> 	if (span <= U32_MAX) {
> 		rnd = prandom_u32_state(&ctx->rnd_state);
> 		return l + (unsigned long)((rnd * span) >> 32);
> 	}
> 	rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) |
> 			prandom_u32_state(&ctx->rnd_state);
> 	return l + mul_u64_u64_shr(rnd, span, 64);
> }
>
> I will also drop the prefetch patch.

Sounds good, looking forward to v2!

Thanks,
SJ

[...]
* Re: [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG 2026-04-25 15:59 ` SeongJae Park @ 2026-04-26 5:50 ` Jiayuan Chen 0 siblings, 0 replies; 9+ messages in thread From: Jiayuan Chen @ 2026-04-26 5:50 UTC (permalink / raw) To: SeongJae Park; +Cc: damon, Jiayuan Chen, Andrew Morton, linux-mm, linux-kernel

Hello SJ,

Thanks.

On 4/25/26 11:59 PM, SeongJae Park wrote:
> On Sat, 25 Apr 2026 11:36:02 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> On 4/24/26 11:11 PM, SeongJae Park wrote:
[...]
>> With ~2 GiB default regions on a 2 TiB host, a small pod's pages are
>> averaged with thousands of non-pod pages in the same region, and the
>> region never reaches nr_accesses=0 even when the pod is genuinely
>> idle.
>
> But, the adaptive regions adjustment mechanism dynamically changes the
> size of regions, down to 4 KiB.  If the small pod's page is really
> cold while its surrounding pages are not, DAMON should down-size the
> region to capture only the page and show you nr_accesses=0.
>

Looking at kdamond_split_regions() / kdamond_merge_regions(): the
per-region floor is 4 KiB, but the *total* count is hard-capped at
max_nr_regions, and split itself is blind -- it picks the cut position
via damon_rand(1, 10) without looking at access patterns.  What keeps a
split in place is the next merge cycle finding the two halves have
visibly different nr_accesses; if a small cgroup's signal is averaged
with surrounding host pages on both sides of the random cut, the halves
look identical and merge folds them back.

For a 1% cgroup on a 2 TiB host with max_nr_regions=1000, the
split-merge loop never converges to cgroup-aligned regions -- random
cuts almost always land in "99% host + 1% cgroup" mixtures.  Raising
the cap gives the random splitter enough attempts that some cuts
happen to land on physically-clustered cgroup pages (THP, NUMA-local
allocations) and stick.  That's why the cap matters in practice, not
the 4 KiB floor.
>> The cold signal is gone before any cgroup attribution happens.
>> Cgroup attribution itself is done at sample granularity (folio_memcg
>> per sampled page), not at region granularity -- the regions just
>> need to be fine enough that there *is* a cold signal to attribute.
>
> Could you please share more details about what the cgroup attribution
> is, and how it is done?  I guess that is the way to map DAMON's
> monitoring regions to each cgroup to determine if each cgroup is hot
> or cold.  I'm unsure how it is really done.

We sample physical pages within DAMON regions, look up folio_memcg()
per sampled page to find the owning memcg, and accumulate cold bytes
per memcg.  Userspace reads the per-cgroup result and sizes
memory.reclaim per pod.  Conceptually similar to the page-level
monitoring you pointed me at -- we'll evaluate whether [1] / [2] can
replace this path.

>>>> With the default max_nr_regions=1000, each region covers ~2 GiB on
>>>> average, which is often larger than the entire target cgroup.
>>>> Adaptive split cannot carve out cgroup-homogeneous regions in that
>>>> regime: each region's nr_accesses averages the (small) cgroup
>>>> fraction with the surrounding non-cgroup pages, and the cgroup's
>>>> access signal gets washed out.
>>>
>>> But, pages of cgroups would be scattered around on the physical
>>> address space.  So even if DAMON finds a hot or cold region on the
>>> physical address space that has a size smaller than a cgroup's
>>> memory, it is hard to say if it is for a specific cgroup.  How do
>>> you know which cgroup the region is for?  Do you have a way to make
>>> sure pages of the same cgroup are gathered on the physical address
>>> space?  If not, I'm not sure how ensuring the size of a region to
>>> be equal to or smaller than each cgroup's size helps you.
>>>
>>> We are supporting page level monitoring [1] for such cgroup-aware
>>> use case.
>>> But again, if you use it, I'm not sure why you need to ensure the
>>> DAMON region size is smaller than the cgroup size.
>>
>> This is also why DAMOS memcg filters do not bypass the nr_regions
>> constraint -- the scheme's access-pattern matcher operates on the
>> already-averaged region, so by the time any filter sees pages the
>> signal is already lost.  Whatever attribution mechanism we use
>> afterwards (custom callback, memcg filter, page-level reporting),
>> the region count needs to be high enough first.
>
> But, I'm not understanding why the regions adjustment mechanism
> cannot work here.  Your answer to my above question will be helpful.
>
>>> Fyi, I'm also working [2] on making the cgroup-aware monitoring
>>> much more lightweight.
>>
>> Looking forward to it -- hopefully it'll cover our case. 😁
>
> I hope so, too! :)
>
>>>> Raising max_nr_regions to 10k-20k gives the adaptive split enough
>>>> spatial resolution that cgroup-majority regions can form at cgroup
>>>> boundaries (allocations have enough physical locality in practice
>>>> for this to work -- THP, per-node allocation, etc.).  At that
>>>> region count, damon_rand() starts showing up at the top of kdamond
>>>> profiles, which is what motivated this patch.
>>>
>>> Thank you for sharing how this patch has developed.  I'm still
>>> curious if there are yet more ways to make DAMON better for your
>>> use case.  But this patch makes sense in general to me.
>>>
>>>>> I know some people worry if the limit is too low and it could
>>>>> result in poor monitoring accuracy.  Therefore we developed [1]
>>>>> monitoring intervals auto-tuning.  From multiple tests on real
>>>>> environments it showed somewhat convincing results, and therefore
>>>>> I nowadays suggest DAMON users try that if they haven't.
>>>>>
>>>>> I'm a bit concerned this is over-engineering.  It would be
>>>>> helpful to know if it is, if you could share the more detailed
>>>>> use case.
>>>>
>>>> Thanks for pointing to intervals auto-tuning - we do use it.
>>>> But it trades sampling frequency against monitoring overhead; it
>>>> cannot change the spatial resolution.  With N=1000 regions on a
>>>> 2 TiB host, a 20 GiB cgroup cannot be resolved no matter how we
>>>> tune sample_us / aggr_us, because the region boundary itself
>>>> averages cgroup and non-cgroup pages together.
>>>>
>>>> So raising max_nr_regions and making the per-region overhead cheap
>>>> are complementary to interval auto-tuning, not redundant with it.
>>>
>>> Makes sense.  I'm still not sure how it helps for cgroup.  But, if
>>> you do want it and if it helps you, you are of course ok to use it
>>> in your way, and I will help you. :)
>>>
>> [...]
>>>> Sashiko is correct that (u32)(r - l) truncates spans greater than
>>>> 4 GiB.  This is a pre-existing limitation of damon_rand() itself,
>>>> which passes r - l to get_random_u32_below() (a u32 parameter) and
>>>> truncates the same way.  My patch makes that truncation explicit
>>>> but does not introduce a new bug.
>>>
>>> You are 100% correct and I agree.  Thank you for making this clear.
>>> I should have also mentioned and underlined this in my previous
>>> reply, but I forgot it.
>>>
>>> So I think this patch is ok to be merged as is (after addressing my
>>> trivial nit comments about coding styles), but we may still want to
>>> fix it in future.  So I
>>
>> damon_rand() is now only called by damon_split_regions_of() with the
>> constant range (1, 10).  Maybe we can rename it to damon_rand_u32()
>> to make the u32 constraint explicit in the API name; that closes out
>> the truncation concern at the legacy helper without needing a
>> separate series.
>
> Good point.  I'm wondering if we have a reason to keep using
> damon_rand() at all.  I find no such reason.  If you also find no
> real reason, how about simply removing existing damon_rand() and
> renaming damon_rand_fast() to damon_rand()?
>

Good idea.
v2 will remove the legacy helper, rename damon_rand_fast() to
damon_rand(), and plumb ctx into damon_split_regions_of() for the new
signature.

>>> was wondering if such future fix for this new, shiny and efficient
>>> function is easy.
>>
[...]

Thanks!
Thread overview: 9+ messages (end of thread, newest: 2026-04-26 5:50 UTC)

2026-04-23 12:23 [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG Jiayuan Chen
2026-04-23 12:23 ` [PATCH 2/2] mm/damon/paddr: prefetch struct page of the next region Jiayuan Chen
2026-04-24  1:42   ` SeongJae Park
2026-04-24  1:36 ` [PATCH 1/2] mm/damon: introduce damon_rand_fast() for per-ctx PRNG SeongJae Park
2026-04-24  2:29   ` Jiayuan Chen
2026-04-24 15:11     ` SeongJae Park
2026-04-25  3:36       ` Jiayuan Chen
2026-04-25 15:59         ` SeongJae Park
2026-04-26  5:50           ` Jiayuan Chen