Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: SeongJae Park <sj@kernel.org>
To: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: SeongJae Park <sj@kernel.org>,
	damon@lists.linux.dev, Andrew Morton <akpm@linux-foundation.org>,
	Shu Anzai <shu17az@gmail.com>,
	Jiayuan Chen <jiayuan.chen@shopee.com>,
	Quanmin Yan <yanquanmin1@huawei.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2
Date: Thu, 21 May 2026 19:42:26 -0700	[thread overview]
Message-ID: <20260522024228.87328-1-sj@kernel.org> (raw)
In-Reply-To: <26baec2e-bc11-4042-8f61-0b3f7fe1c3b1@linux.dev>

On Thu, 21 May 2026 23:07:11 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> Hi SJ,
> 
> Thanks for taking a look.  Quick replies inline.
> 
> 
> On 5/21/26 10:30 PM, SeongJae Park wrote:
> > Hello Jiayuan,
> >
> > On Thu, 21 May 2026 12:52:22 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> >
> >> kdamond_split_regions() bails out early when nr_regions is already
> >> above max_nr_regions / 2.  A large region that picks up new internal
> >> variation after that point never gets split, so we lose visibility
> >> into its hot/cold structure.
> >>
> >> We hit this with damon-paddr on hugepage workloads and damon-vaddr
> >> on processes that mmap a large anonymous range.
> >>
> >> On our production tree we added a current_nr_regions counter (no
> >> good upstream home for it yet, so it's not in this series).  We saw
> >> nr_regions never getting close to max_nr_regions, and the picture of
> >> the access pattern was too coarse.
> > Is 'current_nr_regions' somewhat showing the number of DAMON regions?  If so,
> > you could also get the information from nr_regions field of damon_aggregated
> > tracepoint.  I'm wondering if you considered using that but found a problem
> > that made you have to implement the internal change.
> >
> > I will be happy to help removing such downstream changes.
> 
> 
> Yes, same data as the nr_regions field in damon_aggregated.  The downstream
> 
> counter was just for convenience -- easier to cat a sysfs file than to wire
> 
> up tracing.  Even the tracepoint covers it, It's cost to much for 
> Grafana to just get
> 
> a metrics by tracepoint.

Makes sense.  And I think this deserves to be upstreamed.  Some minor
modifications might be needed to your current implementation, though.  Please
feel free to send a patch to start the discussion, if you want.

> 
> 
> >> Example with max_nr_regions == 1500.  A target ends up with 799
> >> small hot/cold regions plus one big region (an earlier merge
> >> collapsed a uniformly-accessed range into a single piece):
> >>
> >> H:hot
> >> C:cold
> >>
> >>        r1     r2     r3                 r800
> >>      HHHHHH|CCCCCC|HHHHHH|...|HHHHHH..........................|
> >>
> >>      nr_regions = 800  >  max_nr_regions / 2 = 750
> >>
> >> Now a cold subarea shows up inside r800:
> >>
> >>        r1     r2     r3                 r800
> >>      HHHHHH|CCCCCC|HHHHHH|...|HHHHHH........CCCCCC.............|
> >>
> >> The small regions can't merge with each other (their access counts
> >> differ), so budget never frees up.  r800 can't be split because
> >> nr_regions > max_nr_regions / 2 returns early.  The cold subarea
> >> stays invisible.
> > I agree this corner case could theoretically happen.  But, would the small
> > regions have the current pattern forever?  On real world systems having dynamic
> 
> 
> I agree with the point that this is a corner case. But it's not 
> transient for us.

Thank you for sharing this nice information.

> 
> On a production setup with max_nr_regions = 20000, nr_regions sits at 
> 11k-12k
> 
> for extended periods. There are occasional bursts (e.g. from offline 
> pods), then things settle
> 
> back without ever reclaiming the budget.

Could you please clarify a little bit more?  What is the occasional bursts, and
how offline pods contribute to that?  What "reclaiming the budget" means?

Also, do you have some measurements that shows this problem and how much of it
is removed by this series?

> 
> 
> > access pattern, I guess those small regions may not keep the shape forever, and
> > give chance for the large region to be split.  Am I missing something?
> >
> > My theory also implies that this kind of situation could happen at least
> > sometimes for temporal periods.  In other words, it could happens too
> > frequently and too long to be problematic.  But, in the case, maybe the user
> > could mitigate the issue by increasing the max_nr_regions.  I'm curious if you
> > considered that direction and found a problem that I don't expect for now.
> >
> >> Patch 1 lets this path still split regions that just changed
> >> (age == 0),
> > Why 'age == 0' means it is a good candidate to split?  Because it means its
> > access frequency is anyway unstable?  Or are there other reasons?  More
> > clarification would be helpful.
> 
> 
> Yes, age == 0 means the region's access count drifted past the merge 
> threshold in
> the last aggregation -- the strongest signal it just changed internally.
> Regions with age > 0 are stable; splitting them tends to oscillate (the next
> merge cycle pulls the halves back together and we waste the budget).

Thank you for confirming this.  Yes, that sounds good approach to me.  But
because this is a core behavior, I'd like to be careful more than usual.  I
will spend more time at thinking if I'm missing something, and if this is the
best approach.  If you have measurements that I asked above and can share, that
will also be helpful.

> 
> >
> >> up to whatever budget is left under max_nr_regions.
> >> If a split turns out useless, the next merge cycle undoes it.
> > I'm again curious why the user cannot just increase max_nr_regions.
> 
> It works as a workaround, but it isn't free: higher max means more sampling
> work and more memory,

It would depend on the real number of distinct access patterns.  I understand
the number is really high on your use case.  Again, if you have measurements
and could share, that will be very helpful.

> and 20000 is the ceiling we actually want to live
> with.  Bumping to 30000 just so the splitter has room to make progress
> between max/2 and max is wasteful -- we don't actually want to spend the
> resources for 30000 regions.

Makes sense.

> 
> The real issue isn't budget waste, it's that once nr_regions crosses max/2
> the splitter has no recovery path -- it returns immediately even when 
> there's
> variation worth refining, and merges don't help because the small regions
> have different access counts.  nr_regions just sits between max/2 and max,
> and new variation inside a large region goes undetected.  The patch gives
> that path a way to keep refining within whatever budget remains, instead of
> asking users to over-provision max.

Yes, I agree.  Nonetheless, as I mentioned above a couple of times, if you have
and could share measurements that showing how big the problem is and how much
of it this change can solve will be very helpful.


Thanks,
SJ

[...]


  reply	other threads:[~2026-05-22  2:42 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-21  4:52 [PATCH 0/2] mm/damon/core: detect internal variation above max_nr_regions/2 Jiayuan Chen
2026-05-21  4:52 ` [PATCH 1/2] mm/damon/core: split age==0 regions when nr_regions exceeds max/2 Jiayuan Chen
2026-05-21  4:52 ` [PATCH 2/2] mm/damon/tests/core-kunit: test split above max_nr_regions/2 Jiayuan Chen
2026-05-21 14:30 ` [PATCH 0/2] mm/damon/core: detect internal variation " SeongJae Park
2026-05-21 15:07   ` Jiayuan Chen
2026-05-22  2:42     ` SeongJae Park [this message]
2026-05-22 15:11       ` Jiayuan Chen
2026-05-23  1:43         ` SeongJae Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260522024228.87328-1-sj@kernel.org \
    --to=sj@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=damon@lists.linux.dev \
    --cc=jiayuan.chen@linux.dev \
    --cc=jiayuan.chen@shopee.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=shu17az@gmail.com \
    --cc=yanquanmin1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox