From: Jiayuan Chen
Date: Thu, 7 May 2026 07:41:11 +0800
Subject: Re: [PATCH v2] mm/damon: replace damon_rand() with a per-ctx lockless PRNG
To: SeongJae Park
Cc: damon@lists.linux.dev, Jiayuan Chen, Andrew Morton, Shu Anzai, Quanmin Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260506163701.6288-1-sj@kernel.org>
References: <20260506163701.6288-1-sj@kernel.org>

Hello SJ,

On 5/7/26 12:37 AM, SeongJae Park wrote:
> Hello Jiayuan,
>
> Thank you for posting this second version!
>
> On Tue, 5 May 2026 22:52:06 +0800 Jiayuan Chen wrote:
>
>> From: Jiayuan Chen
>>
>> damon_rand() on the sampling_addr hot path called
>> get_random_u32_below(), which takes a local_lock_irqsave() around a
>> per-CPU batched entropy pool and periodically refills it with
>> ChaCha20. At elevated nr_regions counts (20k+), the lock_acquire /
>> local_lock pair plus __get_random_u32_below() dominate kdamond perf
>> profiles.
>>
>> Replace the helper with a lockless lfsr113 generator (struct rnd_state)
>> held per damon_ctx and seeded from get_random_u64() in damon_new_ctx().
>> kdamond is the single consumer of a given ctx, so no synchronization
>> is required. Range mapping uses Lemire's (u64)rnd * span >> 32 on the
>> fast path;
>
> Could we add some links to the algorithm?
> I found a blog [1] from Google and
> guessing that is what you're saying, but I'm not really sure since I failed at
> finding time to thoroughly read it.

The link covers a related family of techniques, but I don't think we need an
external reference. The pattern is just the multiply-shift fast path that the
kernel already uses in __get_random_u32_below() (drivers/char/random.c) --
(u64)rnd * span >> 32 -- which the existing comment there calls "traditional
reciprocal multiplication". Of course I can add a simple comment like
"traditional reciprocal multiplication, similar to get_random_u32_below()".

>> for spans larger than U32_MAX (only reachable on 64-bit) the
>> slow path combines two u32 outputs and uses mul_u64_u64_shr() at 64-bit
>> width. On 32-bit the slow path is dead code and gets eliminated by
>> the compiler.
>>
>> The new helper takes a ctx parameter; damon_split_regions_of() and
>> the kunit tests that call it directly are updated accordingly.
>>
>> lfsr113 is a linear PRNG and MUST NOT be used for anything
>> security-sensitive. DAMON's sampling_addr is not exposed to userspace
>> and is only consumed as a probe point for PTE accessed-bit sampling,
>> so a non-cryptographic PRNG is appropriate here.
>>
>> Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU
>> usage reduced from ~72% to ~50% of one core.
>>
>> Link: https://lore.kernel.org/damon/20260426173346.86238-1-sj@kernel.org/T/#m4f1fd74112728f83a41511e394e8c3fef703039c
>
> Why are you adding the above link? It would be good to add description of the
> link on the commit message.
>
> Other than that, commit message looks good to me.
Will move it to the patch changelog (after ---) instead of the commit message.

>> Cc: Jiayuan Chen
>> Signed-off-by: Jiayuan Chen
>> ---
>>  include/linux/damon.h       | 27 +++++++++++++++++++++------
>>  mm/damon/core.c             | 12 ++++++++----
>>  mm/damon/paddr.c            |  8 ++++----
>>  mm/damon/tests/core-kunit.h | 28 ++++++++++++++++++++++------
>>  mm/damon/vaddr.c            |  7 ++++---
>>  5 files changed, 59 insertions(+), 23 deletions(-)
>>
>> diff --git a/include/linux/damon.h b/include/linux/damon.h
>> index f2cdb7c3f5e6..e16012a7f41a 100644
>> --- a/include/linux/damon.h
>> +++ b/include/linux/damon.h
>> @@ -8,8 +8,10 @@
>>  #ifndef _DAMON_H_
>>  #define _DAMON_H_
>>
>> +#include
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>
> Maybe we can remove random.h?

Right!

>> @@ -19,12 +21,6 @@
>>  /* Max priority score for DAMON-based operation schemes */
>>  #define DAMOS_MAX_SCORE (99)
>>
>> -/* Get a random number in [l, r) */
>> -static inline unsigned long damon_rand(unsigned long l, unsigned long r)
>> -{
>> -	return l + get_random_u32_below(r - l);
>> -}
>> -
>>  /**
>>   * struct damon_addr_range - Represents an address region of [@start, @end).
>>   * @start:	Start address of the region (inclusive).
>> @@ -843,8 +839,27 @@ struct damon_ctx {
>>
>>  	struct list_head adaptive_targets;
>>  	struct list_head schemes;
>> +
>> +	/* Per-ctx PRNG state for damon_rand(); kdamond is the sole consumer. */
>> +	struct rnd_state rnd_state;
>>  };
>>
>> +/* Get a random number in [@l, @r) using @ctx's lockless PRNG. */
>> +static inline unsigned long damon_rand(struct damon_ctx *ctx,
>> +		unsigned long l, unsigned long r)
>> +{
>> +	unsigned long span = r - l;
>> +	u64 rnd;
>> +
>> +	if (span <= U32_MAX) {
>> +		rnd = prandom_u32_state(&ctx->rnd_state);
>> +		return l + (unsigned long)((rnd * span) >> 32);
>> +	}
>> +	rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) |
>> +	      prandom_u32_state(&ctx->rnd_state);
>> +	return l + mul_u64_u64_shr(rnd, span, 64);
>> +}
>> +
>
> I was unable to find time to thoroughly read the link I found, or understand
> the algorithm with only the code, so tested by myself like below.
>
> '''
> $ cat ./foo.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <stdint.h>
> #include <inttypes.h>
>
> int main(int argc, char *argv[])
> {
> 	unsigned long span;
> 	int i;
>
> 	if (argc != 2) {
> 		printf("Usge: %s \n", argv[0]);
> 		return -1;
> 	}
> 	span = atoi(argv[1]);
>
> 	for (i = 0; i < 1000; i++) {
> 		uint64_t rnd = rand();
>
> 		printf("%lu\n", (uint64_t)((rnd * span) >> 32));
> 	}
> 	return 0;
> }
> $ gcc ./foo.c
> $ ./a.out 10 | sort -n | uniq --count
>     195 0
>     194 1
>     188 2
>     223 3
>     200 4
> '''
>
> So I expected the program to generate number 0-9 randomly with similar
> proportions. But it is generating only numbers 0-4. I confirmed doubling
> 'span' value makes it work in my expected way. Should we do that here, too?

I wrote a kmod to verify.
The helper is a verbatim copy of damon_rand() from the patch; the includes,
SPAN/ITERATIONS defines, and module boilerplate are added here so the snippet
builds standalone:

  #include <linux/module.h>
  #include <linux/prandom.h>
  #include <linux/random.h>
  #include <linux/math64.h>

  #define SPAN 10UL
  #define ITERATIONS 1000UL

  static inline unsigned long damon_rand_test(struct rnd_state *state,
					      unsigned long l, unsigned long r)
  {
  	unsigned long span = r - l;
  	u64 rnd;

  	if (span <= U32_MAX) {
  		rnd = prandom_u32_state(state);
  		return l + (unsigned long)((rnd * span) >> 32);
  	}
  	rnd = ((u64)prandom_u32_state(state) << 32) |
  	      prandom_u32_state(state);
  	return l + mul_u64_u64_shr(rnd, span, 64);
  }

  static int __init damon_rand_test_init(void)
  {
  	struct rnd_state state;
  	unsigned long counts[SPAN] = {0};
  	unsigned long i, v;
  	int j;

  	prandom_seed_state(&state, get_random_u64());
  	for (i = 0; i < ITERATIONS; i++) {
  		v = damon_rand_test(&state, 0, SPAN);
  		if (v >= SPAN) {
  			pr_err("out of range: %lu\n", v);
  			return -EINVAL;
  		}
  		counts[v]++;
  	}
  	pr_info("damon_rand_test: span=%lu iterations=%lu\n",
  		SPAN, ITERATIONS);
  	for (j = 0; j < SPAN; j++)
  		pr_info("damon_rand_test:   %d -> %lu\n", j, counts[j]);
  	return 0;
  }

  module_init(damon_rand_test_init);
  MODULE_LICENSE("GPL");

dmesg output (span=10, 1000 iterations):

  damon_rand_test: span=10 iterations=1000
  damon_rand_test:   0 -> 99
  damon_rand_test:   1 -> 107
  damon_rand_test:   2 -> 79
  damon_rand_test:   3 -> 94
  damon_rand_test:   4 -> 97
  damon_rand_test:   5 -> 97
  damon_rand_test:   6 -> 94
  damon_rand_test:   7 -> 100
  damon_rand_test:   8 -> 114
  damon_rand_test:   9 -> 119

All 10 buckets get hits, roughly uniform.

Looking into the userspace test program, the issue is that libc rand()
returns a value in [0, RAND_MAX].
On glibc, RAND_MAX is 2^31 - 1 (not 2^32 - 1), so the top bit of the input is
always zero and the multiply-shift can only reach the lower half of the
output range. prandom_u32_state() returns a full u32 in [0, 2^32), so the
in-kernel formula behaves correctly without doubling.

> [...]
>
> Other than above, code changes look good to me.
>
> [1] https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/
>
> Thanks,
> SJ