* [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
From: Jiayuan Chen @ 2026-05-21 4:52 UTC
To: damon
Cc: Jiayuan Chen, SeongJae Park, Andrew Morton, Shu Anzai,
Jiayuan Chen, Quanmin Yan, linux-mm, linux-kernel

kdamond_split_regions() bails out early when nr_regions is already
above max_nr_regions / 2. A large region that picks up new internal
variation after that point never gets split, so we lose visibility
into its hot/cold structure.

We hit this with damon-paddr on hugepage workloads and damon-vaddr
on processes that mmap a large anonymous range.

On our production tree we added a current_nr_regions counter (no
good upstream home for it yet, so it's not in this series). We saw
nr_regions never getting close to max_nr_regions, and the picture of
the access pattern was too coarse.

Example with max_nr_regions == 1500. A target ends up with 799
small hot/cold regions plus one big region (an earlier merge
collapsed a uniformly-accessed range into a single piece):

  H:hot
  C:cold

    r1     r2     r3                  r800
  HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|

  nr_regions = 800 > max_nr_regions / 2 = 750

Now a cold subarea shows up inside r800:

    r1     r2     r3                  r800
  HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|

The small regions can't merge with each other (their access counts
differ), so budget never frees up. r800 can't be split because
nr_regions > max_nr_regions / 2 returns early. The cold subarea
stays invisible.
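
The stall in this example comes down to a single integer comparison.
As a minimal userspace sketch (not kernel code; the helper name
old_split_allowed() is made up for illustration), the pre-patch gate
can be modeled roughly like this:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the pre-patch early return in
 * kdamond_split_regions(). old_split_allowed() is a hypothetical
 * helper used only for illustration; it is not a DAMON function.
 */
static bool old_split_allowed(unsigned long nr_regions,
			      unsigned long max_nr_regions)
{
	assert(max_nr_regions > 0);
	/* Splitting is skipped entirely once nr_regions > max / 2. */
	return nr_regions <= max_nr_regions / 2;
}
```

With the numbers above, old_split_allowed(800, 1500) is false, so r800
is never split, while old_split_allowed(750, 1500) would still be true.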

Patch 1 lets this path still split regions that just changed
(age == 0), up to whatever budget is left under max_nr_regions.
If a split turns out useless, the next merge cycle undoes it.

Patch 2 adds a KUnit test for the case where nr_regions is already
above max_nr_regions / 2.
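
As a rough illustration of the selection rule in patch 1, here is a
simplified userspace model. struct region_model and
count_zero_age_splits() are hypothetical names for this sketch only.
It assumes the caller already knows nr_regions is above
max_nr_regions / 2, and it leaves out the random split point and the
alignment check that the real code performs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct damon_region. */
struct region_model {
	unsigned long sz;	/* region size in bytes */
	unsigned int age;	/* 0 means the access pattern just changed */
};

/*
 * Count how many regions would be split in two on the new path.
 * Only age == 0 regions that are at least 2 * min_region_sz big are
 * eligible, and the number of splits is capped by the remaining
 * budget (max_nr_regions - nr_regions).
 */
static unsigned long count_zero_age_splits(const struct region_model *regions,
					   size_t nr_regions,
					   unsigned long max_nr_regions,
					   unsigned long min_region_sz)
{
	unsigned long budget, nr_splits = 0;
	size_t i;

	assert(min_region_sz > 0);
	if (nr_regions >= max_nr_regions)
		return 0;	/* no budget left at all */
	budget = max_nr_regions - nr_regions;

	for (i = 0; i < nr_regions && nr_splits < budget; i++) {
		if (regions[i].age != 0)
			continue;	/* stable region: leave it alone */
		if (regions[i].sz < 2 * min_region_sz)
			continue;	/* too small to split */
		nr_splits++;
	}
	return nr_splits;
}
```

In the 800-region example, only r800 has age == 0, so this model
reports a single split even though the remaining budget is 700.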

Jiayuan Chen (2):
  mm/damon/core: split age==0 regions when nr_regions exceeds max/2
  mm/damon/tests/core-kunit: test split above max_nr_regions/2

 mm/damon/core.c             | 68 ++++++++++++++++++++++++++++-------
 mm/damon/tests/core-kunit.h | 70 +++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+), 12 deletions(-)

--
2.43.0

* [PATCH 1/2] mm/damon/core: split age==0 regions when nr_regions exceeds max/2
From: Jiayuan Chen @ 2026-05-21 4:52 UTC
To: damon
Cc: Jiayuan Chen, Jiayuan Chen, SeongJae Park, Andrew Morton, Shu Anzai,
    Quanmin Yan, linux-mm, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

kdamond_split_regions() returns early when nr_regions is above
max_nr_regions / 2, leaving internal access variation inside a large
region undetected. Such a layout is common with damon-paddr on hugepage
workloads or damon-vaddr on processes with a large anonymous mmap.

For example, with max_nr_regions == 1500, a target may end up with 799
small alternating-temperature regions plus one large region that
absorbed a uniformly-accessed range during an earlier merge:

  H:hot
  C:cold

    r1     r2     r3                  r800
  HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|

  nr_regions = 800 > max_nr_regions / 2 = 750

If a cold subarea later emerges inside r800:

    r1     r2     r3                  r800
  HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|

The small regions cannot merge with each other (different access
counts), so the budget stays full. r800 cannot be split because
nr_regions > max_nr_regions / 2 causes an early return. The cold
subarea is never discovered.

Split regions whose access pattern has just changed (age == 0) on this
path, up to the remaining budget against max_nr_regions. An unnecessary
split is reverted by the next kdamond_merge_regions().

Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/damon/core.c | 68 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 12 deletions(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 6b8af7f956b7..442a6c323aeb 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -3452,37 +3452,81 @@ static void damon_split_regions_of(struct damon_ctx *ctx,
 }
 
 /*
- * Split every target region into randomly-sized small regions
+ * Split each region whose access pattern has just changed (age == 0)
+ * into two, until @budget new regions have been produced or no eligible
+ * region remains.
+ */
+static void damon_split_zero_age_regions(struct damon_ctx *ctx,
+		unsigned long budget)
+{
+	struct damon_target *t;
+	struct damon_region *r, *next;
+
+	damon_for_each_target(t, ctx) {
+		damon_for_each_region_safe(r, next, t) {
+			unsigned long sz_region, sz_sub;
+
+			if (!budget)
+				return;
+			if (r->age != 0)
+				continue;
+			sz_region = damon_sz_region(r);
+			if (sz_region < 2 * ctx->min_region_sz)
+				continue;
+
+			sz_sub = ALIGN_DOWN(damon_rand(ctx, 1, 10) *
+					sz_region / 10, ctx->min_region_sz);
+			/* Do not allow blank region */
+			if (sz_sub == 0 || sz_sub >= sz_region)
+				continue;
+
+			damon_split_region_at(t, r, sz_sub);
+			budget--;
+		}
+	}
+}
+
+/*
+ * Split target regions to refine the monitoring resolution under
+ * dynamically changing access patterns.
  *
- * This function splits every target region into random-sized small regions if
- * current total number of the regions is equal or smaller than half of the
- * user-specified maximum number of regions. This is for maximizing the
- * monitoring accuracy under the dynamically changeable access patterns. If a
- * split was unnecessarily made, later 'kdamond_merge_regions()' will revert
- * it.
+ * When the total region count leaves room for a blanket doubling
+ * (nr_regions <= max_nr_regions / 2), every region is randomly split.
+ * Otherwise, only regions whose access pattern has just changed
+ * (age == 0) are split, up to the remaining budget against
+ * max_nr_regions.
+ *
+ * Unnecessary splits are reverted by a later kdamond_merge_regions().
  */
 static void kdamond_split_regions(struct damon_ctx *ctx)
 {
 	struct damon_target *t;
-	unsigned int nr_regions = 0;
-	static unsigned int last_nr_regions;
+	unsigned long nr_regions = 0;
+	unsigned long max_nr_regions = ctx->attrs.max_nr_regions;
+	static unsigned long last_nr_regions;
 	int nr_subregions = 2;
 
 	damon_for_each_target(t, ctx)
 		nr_regions += damon_nr_regions(t);
 
-	if (nr_regions > ctx->attrs.max_nr_regions / 2)
-		return;
+	if (nr_regions >= max_nr_regions)
+		goto done;
+
+	if (nr_regions > max_nr_regions / 2) {
+		damon_split_zero_age_regions(ctx, max_nr_regions - nr_regions);
+		goto done;
+	}
 
 	/* Maybe the middle of the region has different access frequency */
 	if (last_nr_regions == nr_regions &&
-			nr_regions < ctx->attrs.max_nr_regions / 3)
+			nr_regions < max_nr_regions / 3)
 		nr_subregions = 3;
 
 	damon_for_each_target(t, ctx)
 		damon_split_regions_of(ctx, t, nr_subregions,
 				       ctx->min_region_sz);
 
+done:
 	last_nr_regions = nr_regions;
 }
 
-- 
2.43.0

* [PATCH 2/2] mm/damon/tests/core-kunit: test split above max_nr_regions/2
From: Jiayuan Chen @ 2026-05-21 4:52 UTC
To: damon
Cc: Jiayuan Chen, Jiayuan Chen, SeongJae Park, Andrew Morton, Shu Anzai,
    Quanmin Yan, linux-mm, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

Add a test that exercises kdamond_split_regions() when the total region
count is already above max_nr_regions / 2, asserting that the function
can still produce new regions and does not overshoot the limit.

All tests pass:

  damon: pass:29 fail:0 skip:0 total:29
  Totals: pass:29 fail:0 skip:0 total:29

Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/damon/tests/core-kunit.h | 70 +++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h
index 1cfb8c176b87..6b2439670049 100644
--- a/mm/damon/tests/core-kunit.h
+++ b/mm/damon/tests/core-kunit.h
@@ -339,6 +339,75 @@ static void damon_test_split_regions_of(struct kunit *test)
 	damon_destroy_ctx(c);
 }
 
+/*
+ * kdamond_split_regions() must still be able to make progress when the
+ * total region count is above max_nr_regions / 2, as long as there is
+ * unused budget and at least one region whose access pattern has just
+ * changed.
+ */
+static void damon_test_split_above_half_progresses(struct kunit *test)
+{
+	struct damon_ctx *c;
+	struct damon_target *t;
+	struct damon_region *r, *big;
+	unsigned long start;
+	unsigned int nr_before, nr_after, i;
+	const unsigned int nr_small = 799;
+	const unsigned long small_sz = 10;
+	const unsigned long big_sz = 1000000;
+
+	c = damon_new_ctx();
+	if (!c)
+		kunit_skip(test, "ctx alloc fail");
+
+	c->attrs.min_nr_regions = 10;
+	c->attrs.max_nr_regions = 1500;
+
+	t = damon_new_target();
+	if (!t) {
+		damon_destroy_ctx(c);
+		kunit_skip(test, "target alloc fail");
+	}
+
+	for (i = 0; i < nr_small; i++) {
+		start = i * small_sz;
+		r = damon_new_region(start, start + small_sz);
+		if (!r) {
+			damon_free_target(t);
+			damon_destroy_ctx(c);
+			kunit_skip(test, "region alloc fail");
+		}
+		r->nr_accesses = (i & 1) ? 0 : 100;
+		r->age = 5;
+		damon_add_region(r, t);
+	}
+
+	start = nr_small * small_sz;
+	big = damon_new_region(start, start + big_sz);
+	if (!big) {
+		damon_free_target(t);
+		damon_destroy_ctx(c);
+		kunit_skip(test, "big region alloc fail");
+	}
+	big->nr_accesses = 50;
+	damon_add_region(big, t);
+
+	damon_add_target(c, t);
+
+	nr_before = damon_nr_regions(t);
+	KUNIT_EXPECT_GT(test, (unsigned long)nr_before,
+			c->attrs.max_nr_regions / 2);
+
+	kdamond_split_regions(c);
+
+	nr_after = damon_nr_regions(t);
+	KUNIT_EXPECT_GT(test, nr_after, nr_before);
+	KUNIT_EXPECT_LE(test, (unsigned long)nr_after,
+			c->attrs.max_nr_regions);
+
+	damon_destroy_ctx(c);
+}
+
 static void damon_test_ops_registration(struct kunit *test)
 {
 	struct damon_ctx *c = damon_new_ctx();
@@ -1468,6 +1537,7 @@ static struct kunit_case damon_test_cases[] = {
 	KUNIT_CASE(damon_test_merge_two),
 	KUNIT_CASE(damon_test_merge_regions_of),
 	KUNIT_CASE(damon_test_split_regions_of),
+	KUNIT_CASE(damon_test_split_above_half_progresses),
 	KUNIT_CASE(damon_test_ops_registration),
 	KUNIT_CASE(damon_test_set_regions),
 	KUNIT_CASE(damon_test_nr_accesses_to_accesses_bp),
-- 
2.43.0

* Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
From: SeongJae Park @ 2026-05-21 14:30 UTC
To: Jiayuan Chen
Cc: SeongJae Park, damon, Andrew Morton, Shu Anzai, Jiayuan Chen,
    Quanmin Yan, linux-mm, linux-kernel

Hello Jiayuan,

On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> kdamond_split_regions() bails out early when nr_regions is already
> above max_nr_regions / 2. A large region that picks up new internal
> variation after that point never gets split, so we lose visibility
> into its hot/cold structure.
>
> We hit this with damon-paddr on hugepage workloads and damon-vaddr
> on processes that mmap a large anonymous range.
>
> On our production tree we added a current_nr_regions counter (no
> good upstream home for it yet, so it's not in this series). We saw
> nr_regions never getting close to max_nr_regions, and the picture of
> the access pattern was too coarse.

Does 'current_nr_regions' show the number of DAMON regions? If so,
you could also get the information from the nr_regions field of the
damon_aggregated tracepoint. I'm wondering if you considered using that
but found a problem that made you have to implement the internal change.

I will be happy to help remove such downstream changes.

>
> Example with max_nr_regions == 1500. A target ends up with 799
> small hot/cold regions plus one big region (an earlier merge
> collapsed a uniformly-accessed range into a single piece):
>
>   H:hot
>   C:cold
>
>     r1     r2     r3                  r800
>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|
>
>   nr_regions = 800 > max_nr_regions / 2 = 750
>
> Now a cold subarea shows up inside r800:
>
>     r1     r2     r3                  r800
>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|
>
> The small regions can't merge with each other (their access counts
> differ), so budget never frees up. r800 can't be split because
> nr_regions > max_nr_regions / 2 returns early. The cold subarea
> stays invisible.

I agree this corner case could theoretically happen. But would the small
regions keep the current pattern forever? On real-world systems with dynamic
access patterns, I guess those small regions may not keep the shape forever,
and would give the large region a chance to be split. Am I missing something?

My theory also implies that this kind of situation could happen at least
sometimes, for temporary periods. In other words, it could happen often
enough and for long enough to be problematic. But in that case, maybe the
user could mitigate the issue by increasing max_nr_regions. I'm curious if
you considered that direction and found a problem that I don't expect for now.

>
> Patch 1 lets this path still split regions that just changed
> (age == 0),

Why does 'age == 0' mean a region is a good candidate to split? Because it
means its access frequency is unstable anyway? Or are there other reasons?
More clarification would be helpful.

> up to whatever budget is left under max_nr_regions.
> If a split turns out useless, the next merge cycle undoes it.

I'm again curious why the user cannot just increase max_nr_regions.

>
> Patch 2 adds a KUnit test for the case where nr_regions is already
> above max_nr_regions / 2.

Adding tests for new features is always nice, thank you!

I will review each patch in detail after the above high level questions are
answered.


Thanks,
SJ

[...]

* Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
From: Jiayuan Chen @ 2026-05-21 15:07 UTC
To: SeongJae Park
Cc: damon, Andrew Morton, Shu Anzai, Jiayuan Chen, Quanmin Yan,
    linux-mm, linux-kernel

Hi SJ,

Thanks for taking a look. Quick replies inline.

On 5/21/26 10:30 PM, SeongJae Park wrote:
> Hello Jiayuan,
>
> On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> kdamond_split_regions() bails out early when nr_regions is already
>> above max_nr_regions / 2. A large region that picks up new internal
>> variation after that point never gets split, so we lose visibility
>> into its hot/cold structure.
>>
>> We hit this with damon-paddr on hugepage workloads and damon-vaddr
>> on processes that mmap a large anonymous range.
>>
>> On our production tree we added a current_nr_regions counter (no
>> good upstream home for it yet, so it's not in this series). We saw
>> nr_regions never getting close to max_nr_regions, and the picture of
>> the access pattern was too coarse.
> Does 'current_nr_regions' show the number of DAMON regions? If so,
> you could also get the information from the nr_regions field of the
> damon_aggregated tracepoint. I'm wondering if you considered using that
> but found a problem that made you have to implement the internal change.
>
> I will be happy to help remove such downstream changes.

Yes, same data as the nr_regions field in damon_aggregated. The downstream
counter was just for convenience -- easier to cat a sysfs file than to wire
up tracing. Even though the tracepoint covers it, it costs too much for
Grafana to collect a single metric through a tracepoint.

>> Example with max_nr_regions == 1500. A target ends up with 799
>> small hot/cold regions plus one big region (an earlier merge
>> collapsed a uniformly-accessed range into a single piece):
>>
>>   H:hot
>>   C:cold
>>
>>     r1     r2     r3                  r800
>>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|
>>
>>   nr_regions = 800 > max_nr_regions / 2 = 750
>>
>> Now a cold subarea shows up inside r800:
>>
>>     r1     r2     r3                  r800
>>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|
>>
>> The small regions can't merge with each other (their access counts
>> differ), so budget never frees up. r800 can't be split because
>> nr_regions > max_nr_regions / 2 returns early. The cold subarea
>> stays invisible.
> I agree this corner case could theoretically happen. But would the small
> regions keep the current pattern forever? On real-world systems with dynamic

I agree with the point that this is a corner case. But it's not
transient for us.

On a production setup with max_nr_regions = 20000, nr_regions sits at
11k-12k for extended periods. There are occasional bursts (e.g. from
offline pods), then things settle back without ever reclaiming the budget.

> access patterns, I guess those small regions may not keep the shape forever,
> and would give the large region a chance to be split. Am I missing something?
>
> My theory also implies that this kind of situation could happen at least
> sometimes, for temporary periods. In other words, it could happen often
> enough and for long enough to be problematic. But in that case, maybe the
> user could mitigate the issue by increasing max_nr_regions. I'm curious if
> you considered that direction and found a problem that I don't expect for now.
>
>> Patch 1 lets this path still split regions that just changed
>> (age == 0),
> Why does 'age == 0' mean a region is a good candidate to split? Because it
> means its access frequency is unstable anyway? Or are there other reasons?
> More clarification would be helpful.

Yes, age == 0 means the region's access count drifted past the merge
threshold in the last aggregation -- the strongest signal it just changed
internally. Regions with age > 0 are stable; splitting them tends to
oscillate (the next merge cycle pulls the halves back together and we
waste the budget).

>
>> up to whatever budget is left under max_nr_regions.
>> If a split turns out useless, the next merge cycle undoes it.
> I'm again curious why the user cannot just increase max_nr_regions.

It works as a workaround, but it isn't free: higher max means more sampling
work and more memory, and 20000 is the ceiling we actually want to live
with. Bumping to 30000 just so the splitter has room to make progress
between max/2 and max is wasteful -- we don't actually want to spend the
resources for 30000 regions.

The real issue isn't budget waste, it's that once nr_regions crosses max/2
the splitter has no recovery path -- it returns immediately even when
there's variation worth refining, and merges don't help because the small
regions have different access counts. nr_regions just sits between max/2
and max, and new variation inside a large region goes undetected. The patch
gives that path a way to keep refining within whatever budget remains,
instead of asking users to over-provision max.

>> Patch 2 adds a KUnit test for the case where nr_regions is already
>> above max_nr_regions / 2.
> Adding tests for new features is always nice, thank you!
>
> I will review each patch in detail after the above high level questions are
> answered.
>
>
> Thanks,
> SJ
>
> [...]

* Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
From: SeongJae Park @ 2026-05-22 2:42 UTC
To: Jiayuan Chen
Cc: SeongJae Park, damon, Andrew Morton, Shu Anzai, Jiayuan Chen,
    Quanmin Yan, linux-mm, linux-kernel

On Thu, 21 May 2026 23:07:11 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> Hi SJ,
>
> Thanks for taking a look. Quick replies inline.
>
> On 5/21/26 10:30 PM, SeongJae Park wrote:
> > Hello Jiayuan,
> >
> > On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> >
> >> kdamond_split_regions() bails out early when nr_regions is already
> >> above max_nr_regions / 2. A large region that picks up new internal
> >> variation after that point never gets split, so we lose visibility
> >> into its hot/cold structure.
> >>
> >> We hit this with damon-paddr on hugepage workloads and damon-vaddr
> >> on processes that mmap a large anonymous range.
> >>
> >> On our production tree we added a current_nr_regions counter (no
> >> good upstream home for it yet, so it's not in this series). We saw
> >> nr_regions never getting close to max_nr_regions, and the picture of
> >> the access pattern was too coarse.
> > Does 'current_nr_regions' show the number of DAMON regions? If so,
> > you could also get the information from the nr_regions field of the
> > damon_aggregated tracepoint. I'm wondering if you considered using that
> > but found a problem that made you have to implement the internal change.
> >
> > I will be happy to help remove such downstream changes.
>
> Yes, same data as the nr_regions field in damon_aggregated. The downstream
> counter was just for convenience -- easier to cat a sysfs file than to wire
> up tracing. Even though the tracepoint covers it, it costs too much for
> Grafana to collect a single metric through a tracepoint.

Makes sense. And I think this deserves to be upstreamed. Some minor
modifications might be needed to your current implementation, though. Please
feel free to send a patch to start the discussion, if you want.

>
> >> Example with max_nr_regions == 1500. A target ends up with 799
> >> small hot/cold regions plus one big region (an earlier merge
> >> collapsed a uniformly-accessed range into a single piece):
> >>
> >>   H:hot
> >>   C:cold
> >>
> >>     r1     r2     r3                  r800
> >>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|
> >>
> >>   nr_regions = 800 > max_nr_regions / 2 = 750
> >>
> >> Now a cold subarea shows up inside r800:
> >>
> >>     r1     r2     r3                  r800
> >>   HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|
> >>
> >> The small regions can't merge with each other (their access counts
> >> differ), so budget never frees up. r800 can't be split because
> >> nr_regions > max_nr_regions / 2 returns early. The cold subarea
> >> stays invisible.
> > I agree this corner case could theoretically happen. But would the small
> > regions keep the current pattern forever? On real-world systems with dynamic
>
> I agree with the point that this is a corner case. But it's not
> transient for us.

Thank you for sharing this nice information.

>
> On a production setup with max_nr_regions = 20000, nr_regions sits at
> 11k-12k for extended periods. There are occasional bursts (e.g. from
> offline pods), then things settle back without ever reclaiming the budget.

Could you please clarify a little bit more? What are the occasional bursts,
and how do offline pods contribute to them? What does "reclaiming the budget"
mean?

Also, do you have some measurements that show this problem and how much of it
is removed by this series?

>
> > access patterns, I guess those small regions may not keep the shape forever,
> > and would give the large region a chance to be split. Am I missing something?
> >
> > My theory also implies that this kind of situation could happen at least
> > sometimes, for temporary periods. In other words, it could happen often
> > enough and for long enough to be problematic. But in that case, maybe the
> > user could mitigate the issue by increasing max_nr_regions. I'm curious if
> > you considered that direction and found a problem that I don't expect for now.
> >
> >> Patch 1 lets this path still split regions that just changed
> >> (age == 0),
> > Why does 'age == 0' mean a region is a good candidate to split? Because it
> > means its access frequency is unstable anyway? Or are there other reasons?
> > More clarification would be helpful.
>
> Yes, age == 0 means the region's access count drifted past the merge
> threshold in the last aggregation -- the strongest signal it just changed
> internally. Regions with age > 0 are stable; splitting them tends to
> oscillate (the next merge cycle pulls the halves back together and we
> waste the budget).

Thank you for confirming this. Yes, that sounds like a good approach to me.
But because this is a core behavior, I'd like to be more careful than usual.
I will spend more time thinking about whether I'm missing something, and
whether this is the best approach. If you have the measurements that I asked
for above and can share them, that will also be helpful.

>
> >
> >> up to whatever budget is left under max_nr_regions.
> >> If a split turns out useless, the next merge cycle undoes it.
> > I'm again curious why the user cannot just increase max_nr_regions.
>
> It works as a workaround, but it isn't free: higher max means more sampling
> work and more memory,

It would depend on the real number of distinct access patterns. I understand
the number is really high in your use case. Again, if you have measurements
and could share them, that will be very helpful.

> and 20000 is the ceiling we actually want to live
> with. Bumping to 30000 just so the splitter has room to make progress
> between max/2 and max is wasteful -- we don't actually want to spend the
> resources for 30000 regions.

Makes sense.

>
> The real issue isn't budget waste, it's that once nr_regions crosses max/2
> the splitter has no recovery path -- it returns immediately even when
> there's variation worth refining, and merges don't help because the small
> regions have different access counts. nr_regions just sits between max/2
> and max, and new variation inside a large region goes undetected. The patch
> gives that path a way to keep refining within whatever budget remains,
> instead of asking users to over-provision max.

Yes, I agree. Nonetheless, as I mentioned above a couple of times,
measurements showing how big the problem is and how much of it this change
can solve would be very helpful, if you have them and can share them.


Thanks,
SJ

[...]

* Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2 2026-05-22 2:42 ` SeongJae Park @ 2026-05-22 15:11 ` Jiayuan Chen 2026-05-23 1:43 ` SeongJae Park 0 siblings, 1 reply; 8+ messages in thread From: Jiayuan Chen @ 2026-05-22 15:11 UTC (permalink / raw) To: SeongJae Park Cc: damon, Andrew Morton, Shu Anzai, Jiayuan Chen, Quanmin Yan, linux-mm, linux-kernel Hi, SJ On 5/22/26 10:42 AM, SeongJae Park wrote: > On Thu, 21 May 2026 23:07:11 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > >> Hi SJ, >> >> Thanks for taking a look. Quick replies inline. >> >> >> On 5/21/26 10:30 PM, SeongJae Park wrote: >>> Hello Jiayuan, >>> >>> On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote: >>> >>>> kdamond_split_regions() bails out early when nr_regions is already >>>> above max_nr_regions / 2. A large region that picks up new internal >>>> variation after that point never gets split, so we lose visibility >>>> into its hot/cold structure. >>>> >>>> We hit this with damon-paddr on hugepage workloads and damon-vaddr >>>> on processes that mmap a large anonymous range. >>>> >>>> On our production tree we added a current_nr_regions counter (no >>>> good upstream home for it yet, so it's not in this series). We saw >>>> nr_regions never getting close to max_nr_regions, and the picture of >>>> the access pattern was too coarse. >>> Is 'current_nr_regions' somewhat showing the number of DAMON regions? If so, >>> you could also get the information from nr_regions field of damon_aggregated >>> tracepoint. I'm wondering if you considered using that but found a problem >>> that made you have to implement the internal change. >>> >>> I will be happy to help removing such downstream changes. >> >> Yes, same data as the nr_regions field in damon_aggregated. The downstream >> >> counter was just for convenience -- easier to cat a sysfs file than to wire >> >> up tracing. 
Even the tracepoint covers it, It's cost to much for >> Grafana to just get >> >> a metrics by tracepoint. > Makes sense. And I think this deserves to be upstreamed. Some minor > modifications might be needed to your current implementation, though. Please > feel free to send a patch to start the discussion, if you want. On the sysfs counter -- agreed, same data as the tracepoint. I'll look into a suitable location. >> >>>> Example with max_nr_regions == 1500. A target ends up with 799 >>>> small hot/cold regions plus one big region (an earlier merge >>>> collapsed a uniformly-accessed range into a single piece): >>>> >>>> H:hot >>>> C:cold >>>> >>>> r1 r2 r3 r800 >>>> HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................| >>>> >>>> nr_regions = 800 > max_nr_regions / 2 = 750 >>>> >>>> Now a cold subarea shows up inside r800: >>>> >>>> r1 r2 r3 r800 >>>> HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............| >>>> >>>> The small regions can't merge with each other (their access counts >>>> differ), so budget never frees up. r800 can't be split because >>>> nr_regions > max_nr_regions / 2 returns early. The cold subarea >>>> stays invisible. >>> I agree this corner case could theoretically happen. But, would the small >>> regions have the current pattern forever? On real world systems having dynamic >> >> I agree with the point that this is a corner case. But it's not >> transient for us. > Thank you for sharing this nice information. > >> On a production setup with max_nr_regions = 20000, nr_regions sits at >> 11k-12k >> >> for extended periods. There are occasional bursts (e.g. from offline >> pods), then things settle >> >> back without ever reclaiming the budget. > Could you please clarify a little bit more? What is the occasional bursts, and > how offline pods contribute to that? What "reclaiming the budget" means? > > Also, do you have some measurements that shows this problem and how much of it > is removed by this series? 
> >> >>> access pattern, I guess those small regions may not keep the shape forever, and >>> give chance for the large region to be split. Am I missing something? >>> >>> My theory also implies that this kind of situation could happen at least >>> sometimes for temporal periods. In other words, it could happens too >>> frequently and too long to be problematic. But, in the case, maybe the user >>> could mitigate the issue by increasing the max_nr_regions. I'm curious if you >>> considered that direction and found a problem that I don't expect for now. >>> >>>> Patch 1 lets this path still split regions that just changed >>>> (age == 0), >>> Why 'age == 0' means it is a good candidate to split? Because it means its >>> access frequency is anyway unstable? Or are there other reasons? More >>> clarification would be helpful. >> >> Yes, age == 0 means the region's access count drifted past the merge >> threshold in >> the last aggregation -- the strongest signal it just changed internally. >> Regions with age > 0 are stable; splitting them tends to oscillate (the next >> merge cycle pulls the halves back together and we waste the budget). > Thank you for confirming this. Yes, that sounds good approach to me. But > because this is a core behavior, I'd like to be careful more than usual. I > will spend more time at thinking if I'm missing something, and if this is the > best approach. If you have measurements that I asked above and can share, that > will also be helpful. We considered selecting regions randomly past max/2 (which is what our downstream tree does). Random selection converges to higher nr_regions faster. We picked age == 0 for upstream because: - It's DAMON's own signal that the region's nr_accesses just crossed the merge threshold -- i.e. the access pattern is currently unstable. 
  Splitting an unstable region is more likely
  to reveal new internal structure than splitting a stable region

- It's selective by design, so it leans conservative on a core
  code path. In our tests it still reaches the effective
  refinement we need (e.g. 160-180 at max_nr_regions = 200), just
  more gradually than random selection would.

We thought a selective, signal-based filte.

>>>> up to whatever budget is left under max_nr_regions.
>>>> If a split turns out useless, the next merge cycle undoes it.
>>> I'm again curious why the user cannot just increase max_nr_regions.
>> It works as a workaround, but it isn't free: higher max means more sampling
>> work and more memory,
> It would depend on the real number of distinct access patterns. I understand
> the number is really high on your use case. Again, if you have measurements
> and could share, that will be very helpful.
>
>> and 20000 is the ceiling we actually want to live
>> with. Bumping to 30000 just so the splitter has room to make progress
>> between max/2 and max is wasteful -- we don't actually want to spend the
>> resources for 30000 regions.
> Makes sense.
>
>> The real issue isn't budget waste, it's that once nr_regions crosses max/2
>> the splitter has no recovery path -- it returns immediately even when
>> there's
>> variation worth refining, and merges don't help because the small regions
>> have different access counts. nr_regions just sits between max/2 and max,
>> and new variation inside a large region goes undetected. The patch gives
>> that path a way to keep refining within whatever budget remains, instead of
>> asking users to over-provision max.
> Yes, I agree. Nonetheless, as I mentioned above a couple of times, if you have
> and could share measurements that showing how big the problem is and how much
> of it this change can solve will be very helpful.
>

Our downstream paddr has per-cgroup tweaks, so I don't think those
numbers would be that meaningful for upstream review.
Here's a clean upstream-paddr reproducer instead.

paddr config:

```shell
ADMIN=/sys/kernel/mm/damon/admin
echo 1 > $ADMIN/kdamonds/nr_kdamonds
echo 1 > $ADMIN/kdamonds/0/contexts/nr_contexts
CTX=$ADMIN/kdamonds/0/contexts/0
echo paddr > $CTX/operations

# Using stress-ng for hot memory. Walking a 256M chunk takes around
# sample=50ms, aggr=1000ms, update=1s
echo 50000 > $CTX/monitoring_attrs/intervals/sample_us
echo 1000000 > $CTX/monitoring_attrs/intervals/aggr_us
echo 1000000 > $CTX/monitoring_attrs/intervals/update_us

# Without any cap nr_regions usually settles around 300+ on this
# workload, so max=200 makes the corner case easy to hit.
echo 10 > $CTX/monitoring_attrs/nr_regions/min
echo 200 > $CTX/monitoring_attrs/nr_regions/max

echo 1 > $CTX/targets/nr_targets
echo 1 > $CTX/targets/0/regions/nr_regions
echo 0 > $CTX/targets/0/regions/0/start
# 32C 16G machine
echo $((16 * 1024 * 1024 * 1024)) > $CTX/targets/0/regions/0/end

echo 0 > $CTX/schemes/nr_schemes
echo on > $ADMIN/kdamonds/0/state
```

Workload -- cold producer first, then a few hot producers right after,
so cold and hot pages get interleaved across physical memory:

```shell
# Cold: 4 GiB mmap, touch every page once, then sleep
python3 -c '
import mmap, time
size = 4 * 1024**3
m = mmap.mmap(-1, size, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
for i in range(0, size, 4096):
    m[i] = 1
print("cold allocated, sleeping")
time.sleep(86400)
' &

# Hot: 7 stress-ng instances, different vm-methods so the hot
# regions don't all look identical and merge into one
for m in walk-0a walk-1a walk-0d walk-1d incdec rand-set zero-one; do
    stress-ng --vm 4 --vm-bytes 256M --vm-method $m --vm-keep --timeout 0 &
done
```

After running for an hour:
1. Without this series: nr_regions stays at ~100 (max/2), doesn't recover
2. With this series: nr_regions stays at 160-180

In real production this is actually pretty common.
Workloads keep
changing state and creating new access patterns, so nr_regions
naturally tends to live above max/2 most of the time -- which is
exactly where the corner case kicks in. On our production box with
max_nr_regions = 20000, nr_regions sits at 11k-13k for long stretches
without ever clearing.

Without this series the effective ceiling is just max/2. Set max=200,
you cap at ~100. Set max=400, you cap at ~200.

The 1-hour reproducer above is admittedly a bit of a toy -- I set
max=200 to force the corner case without having to scale up the
workload -- but it shows the same pattern: once nr_regions crosses
max/2 it just stays there.

The offline-pod example I mentioned earlier is just one workload that
hits this. The mechanism isn't specific to that workload: any new
access pattern that shows up inside an existing region after
nr_regions crosses max/2 will stay invisible until something else
lowers nr_regions, which may never happen.

Thanks,
Jiayuan

> Thanks,
> SJ
>
> [...]
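To make the numbers in this message concrete, here is a minimal Python sketch of the gate being discussed. It is only the arithmetic, not the kernel code: `split_is_skipped()` and `remaining_budget()` are hypothetical helper names, and the only rule modeled is the early return when nr_regions is above max_nr_regions / 2 (integer division, as in C).

```python
def split_is_skipped(nr_regions, max_nr_regions):
    """True if the split pass would return early, per the condition
    described in this thread (nr_regions > max_nr_regions / 2)."""
    return nr_regions > max_nr_regions // 2


def remaining_budget(nr_regions, max_nr_regions):
    """Regions that could still be added before reaching max_nr_regions."""
    return max(max_nr_regions - nr_regions, 0)


# (nr_regions, max_nr_regions) pairs taken from the thread:
# the cover letter example, the production setup, and the reproducer.
for nr, cap in [(800, 1500), (12000, 20000), (100, 200)]:
    print(nr, cap, split_is_skipped(nr, cap), remaining_budget(nr, cap))
```

For the cover letter example (800 regions, max 1500) the split is skipped even though 700 regions of budget are still unused, which is the gap this series targets.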
* Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
2026-05-22 15:11 ` Jiayuan Chen
@ 2026-05-23 1:43 ` SeongJae Park
0 siblings, 0 replies; 8+ messages in thread
From: SeongJae Park @ 2026-05-23 1:43 UTC (permalink / raw)
To: Jiayuan Chen
Cc: SeongJae Park, damon, Andrew Morton, Shu Anzai, Jiayuan Chen,
Quanmin Yan, linux-mm, linux-kernel

On Fri, 22 May 2026 23:11:47 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> Hi, SJ
>
> On 5/22/26 10:42 AM, SeongJae Park wrote:
> > On Thu, 21 May 2026 23:07:11 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> >
> >> Hi SJ,
> >>
> >> Thanks for taking a look. Quick replies inline.
> >>
> >>
> >> On 5/21/26 10:30 PM, SeongJae Park wrote:
> >>> Hello Jiayuan,
> >>>
> >>> On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
[...]
> >> counter was just for convenience -- easier to cat a sysfs file than to wire
> >>
> >> up tracing. Even the tracepoint covers it, It's cost to much for
> >> Grafana to just get
> >>
> >> a metrics by tracepoint.

Out of the scope of this patch series, but I'm interested in how you connect
DAMON outputs to Grafana. I believe that could be useful for many people who
are willing to get some fleet-wide access pattern. Maybe it is worth presenting
to wider audiences, like the System monitoring microconf [1] at LPC?

> > Makes sense. And I think this deserves to be upstreamed. Some minor
> > modifications might be needed to your current implementation, though. Please
> > feel free to send a patch to start the discussion, if you want.
>
>
> On the sysfs counter -- agreed, same data as the tracepoint. I'll
> look into a suitable location.

Maybe /sys/.../schemes/<S>/tried_regions/nr_regions ?

[...]
> >> Yes, age == 0 means the region's access count drifted past the merge
> >> threshold in
> >> the last aggregation -- the strongest signal it just changed internally.
> >> Regions with age > 0 are stable; splitting them tends to oscillate (the next
> >> merge cycle pulls the halves back together and we waste the budget).
> > Thank you for confirming this. Yes, that sounds good approach to me. But
> > because this is a core behavior, I'd like to be careful more than usual. I
> > will spend more time at thinking if I'm missing something, and if this is the
> > best approach. If you have measurements that I asked above and can share, that
> > will also be helpful.
>
>
> We considered selecting regions randomly past max/2 (which is what our
> downstream tree does).

Interesting. Actually I was thinking of something like this as a suggestion.

And I understand that you had to develop and carry your downstream patches
because DAMON was not supporting your use case. I know carrying downstream
patches is painful. Sorry for the inconvenience, and thank you for making this
voice. I'm here for users, and I will be happy to help you remove the
downstream change.

> Random selection converges to higher
> nr_regions faster. We picked age == 0 for upstream because:
>
> - It's DAMON's own signal that the region's nr_accesses just
> crossed the merge threshold -- i.e. the access pattern is
> currently unstable. Splitting an unstable region is more likely
> to reveal new internal structure than splitting a stable region
>
> - It's selective by design, so it leans conservative on a core
> code path. In our tests it still reaches the effective
> refinement we need (e.g. 160-180 at max_nr_regions = 200), just
> more gradually than random selection would.
>
> We thought a selective, signal-based filte.

I understand that you are concerned about the increased number of regions,
which would make the overhead greater? I think the concern and your filtering
approach make sense. But the age threshold value feels like a heuristic that
may not be good for someone. I also think age != 0 might not always be a good
signal for distinguishing the regions.
I feel tempted to keep using the power of the chaos (randomness) in the
regions adjustment. So I was thinking of the below as a suggestion. The basic
idea is choosing the number of regions to split based on the remaining budget
(max_nr_regions - nr_regions). I'd prefer making this simple and lightweight.
So I'm suggesting something like below.

    void kdamond_split_regions()
    {
            static unsigned char rndseed;

            budget = max_nr_regions - current_nr_regions();
            if (budget > max_nr_regions / 2)
                    split_step = 1;
            else if (budget > max_nr_regions / 3)
                    split_step = 2;
            ...
            idx = rndseed++ % split_step;
            for (; idx < current_nr_regions(); idx += split_step)
                    split_region(nth_region(idx));
    }

I think this might be similar to your downstream change, but what do you think,
Jiayuan?

I'm also a bit concerned about the fact that it would increase the number of
regions. However, DAMON never promised the usual number of regions will be
around max_nr_regions / 2. More technically speaking, the current behavior is
that once the number of regions exceeds max_nr_regions / 2, it only slowly
decreases. Anyway, it is not a documented behavior.

Yes, maybe some users rely on the current behavior and changing that could make
them sad. But I haven't heard any voice from such users. Meanwhile Jiayuan and
their friends are apparently suffering from the behavior and making this voice.
And we have repeatedly said that DAMON does its random evolution based on
"selfish voices" from users. So I think we should move based on Jiayuan's
"selfish voice" here.

If it really makes someone sad and if they make their different "selfish
voice", that's when we can discuss a different direction. Those users could
simply reduce max_nr_regions, or work together to make another knob for making
the new behavior opt-in or opt-out, depending on the loudness of their voice.

If you rely on the current behavior, this is the best time to make your voice.
I hope this doesn't make people get us wrong. We care about quiet users.
Nonetheless, in this case, the behavior is somewhat not documented.

[...]
> Our downstream paddr has per-cgroup tweaks,

Interesting! Please consider sharing that at some conferences and/or
upstreaming that for the community and yourself! No push, though.

> so I don't think those
> numbers would be that meaningful for upstream review. Here's a clean
> upstream-paddr reproducer instead.
[...]
> After running for an hour:
> 1.Without this series: nr_regions stays at ~100 (max/2), doesn't recover
> 2.With this series: nr_regions stays at 160-180

Data from the real workload would be really interesting. But these artificial
test results are also helpful. Thank you for conducting the test and sharing
these.

>
> In real production this is actually pretty common. Workloads keep
> changing state and creating new access patterns, so nr_regions
> naturally tends to live above max/2 most of the time -- which is
> exactly where the corner case kicks in. On our production box with
> max_nr_regions = 20000, nr_regions sits at 11k-13k for long stretches
> without ever clearing.

Thanks for sharing these, I believe you.

>
> Without this series the effective ceiling is just max/2. Set max=200,
> you cap at ~100. Set max=400, you cap at ~200.
>
>
> The 1-hour reproducer above is admittedly a bit of a toy -- I set
> max=200 to force the corner case without having to scale up the
> workload -- but it shows the same pattern: once nr_regions crosses
> max/2 it just stays there.
>
>
> The offline-pod example I mentioned earlier is just one workload that
> hits this. The mechanism isn't specific to that workload: any new
> access pattern that shows up inside an existing region after
> nr_regions crosses max/2 will stay invisible until something else
> lowers nr_regions, which may never happen.

Yes, makes sense.

[1] https://lpc.events/event/20/contributions/2327/


Thanks,
SJ

[...]
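The budget-based pseudocode in this message leaves the rest of its if/else chain elided. As an illustration only, the Python sketch below assumes the chain keeps going the same way (step k is used once budget > max_nr_regions / (k + 1)); that continuation is an assumption, not something stated in the thread. `split_step()` and `regions_to_split()` are hypothetical names, and each picked region is assumed to be split into exactly two.

```python
def split_step(nr_regions, max_nr_regions):
    """Stride between regions to split, or 0 when no budget is left.

    Mirrors the chain from the pseudocode: budget > max/2 -> 1,
    budget > max/3 -> 2, and so on (the continuation is assumed).
    """
    budget = max_nr_regions - nr_regions
    if budget <= 0:
        return 0
    step = 1
    while budget <= max_nr_regions // (step + 1):
        step += 1
    return step


def regions_to_split(nr_regions, max_nr_regions, rndseed):
    """Indices of the regions that would be split in this pass."""
    step = split_step(nr_regions, max_nr_regions)
    if step == 0:
        return []
    # rndseed rotates the starting offset so different regions get
    # picked on successive passes.
    return list(range(rndseed % step, nr_regions, step))


# Example with the reproducer's cap of 200 regions.
for nr in (99, 100, 150, 199):
    picked = regions_to_split(nr, 200, rndseed=0)
    print(nr, split_step(nr, 200), len(picked))
```

Under these assumptions the number of picked regions never exceeds the remaining budget, so one pass cannot push nr_regions past max_nr_regions. A real implementation would still have to respect DAMON's minimum region size, which this sketch ignores.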
end of thread, other threads:[~2026-05-23 1:43 UTC | newest]

Thread overview: 8+ messages
2026-05-21 4:52 [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2 Jiayuan Chen
2026-05-21 4:52 ` [PATCH 1/2] mm/damon/core: split age==0 regions when nr_regions exceeds max/2 Jiayuan Chen
2026-05-21 4:52 ` [PATCH 2/2] mm/damon/tests/core-kunit: test split above max_nr_regions/2 Jiayuan Chen
2026-05-21 14:30 ` [PATCH 0/2] mm/damon/core: detect internal variation " SeongJae Park
2026-05-21 15:07   ` Jiayuan Chen
2026-05-22 2:42     ` SeongJae Park
2026-05-22 15:11       ` Jiayuan Chen
2026-05-23 1:43         ` SeongJae Park