Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?


From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: SeongJae Park <sj@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>,
	Bharata B Rao <bharata@amd.com>,
	<lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>,
	Michal Hocko <mhocko@suse.com>,
	Dan Williams <dan.j.williams@intel.com>, <linuxarm@huawei.com>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Date: Fri, 21 Mar 2025 15:30:44 +0000
Message-ID: <20250321153044.000017aa@huawei.com>
In-Reply-To: <20250319235029.54378-1-sj@kernel.org>


> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source			Subsystem	Consumption		Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults	NUMAB		NUMAB=1 locality based	While task is running,
> > via process pgtable			balancing		rate varies on observed
> > walk					NUMAB=2 hot page	locality and sysctl knobs.
> > 					promotion
> > ==========================================================================================
> > folio_mark_accessed()	FS/filemap/GUP	LRU list activation	On cache access and unmap
> > ==========================================================================================
> > PTE A bit via		Reclaim:LRU	LRU list activation,	During memory pressure
> > rmap walk				deactivation/demotion
> > ==========================================================================================
> > PTE A bit via		Reclaim:MGLRU	LRU list activation,	- During memory pressure
> > rmap walk and process			deactivation/demotion	- Continuous sampling (configurable)
> > pgtable walk							  for workingset reporting
> > ==========================================================================================
> > PTE A bit via		DAMON		LRU activation,
> > rmap walk				hot page promotion,
> > 					demotion etc  
> 
> For virtual address spaces monitoring mode, DAMON uses PTE A bit via pgtable
> walk.
> 
> Its activation and frequency are basically set as the user requests.  Activation
> can be set to be reactive to memory-pressure-like events (using watermarks).
> Frequency can be auto-tuned to pursue a target access-events-per-snapshot ratio.

Thanks.  I've added that (in very brief form) to the table in my slides.


> > SJ has proposed perhaps extending DAMON as a possible interface layer. I am
> > yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example use case is a hypervisor wanting to migrate data under unaware,
> > cheap VMs.  After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.
> 
> Similar concerns about DAMON's region abstraction were raised for physical
> address space monitoring, because there is no conscious effort to gather hot
> pages together (i.e., to create locality).
> 
> I'd argue there is no conscious effort to spread temperature out, though.
> As a result, we can expect a level of unintentional bias, and that matches my
> experience from DAMON use cases in product environments so far.

Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs showing that, for at least some use cases, even if we have some
contiguity of hotness in the VA space, it looks like noise in PA.  So
I think this is a case of 'mileage may vary'.  DAMON works great sometimes, but
sometimes the spread of access statistics happens to be wrong.

> 
> Also, in practice, DAMON regions are used in combination with other
> information.  For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether
> to reclaim it or not, like MADV_PAGEOUT does.

Makes sense.  The MADV_PAGEOUT case was one of the motivators for the
suggestion of mixing methods.  Here it's kind of DAMON + dense A bit checking
(on candidate pages).

> 
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find a perfect answer to some questions, such as finding the N-th hottest
> single page.  And it has much room to improve.  Nevertheless, even the DAMON
> of today can give good enough best-effort answers to questions that are
> practical for some cases, such as finding regions that may contain the N
> hottest/coldest pages, while keeping the monitoring overhead fixed as users ask.
> 
> Also, please note that there is no reason to restrict DAMON to always using
> the regions abstraction.  For different use cases and situations, DAMON will be
> open to being extended to use new abstractions.  DAMON aims to be not a
> subsystem for the DAMON regions concept but one for data access monitoring with
> practical efficiency, and to continue its random evolution for given environments.

Absolutely understood. In my current thinking DAMON sits at a particular layer
in the stack and there may be one more abstraction on top of it (e.g. a list
of hot/cold pages). It is equally possible that the layers may fuse and it
becomes an aspect of DAMON.

> 
> > 
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> > 
> > - Stability of hotness matters and is hard to establish.
> >   Predict a page will remain hot - various heuristics.
> > 	a) It is hot, probably stays so? (super hot!)
> > 	   Sometimes enough to be detected as hot once,
> > 	   often not.
> > 	b) It has been hot a while, probably stays so.
> > 	   Check this hot list against previous hot list,
> > 	   entries in both needed to promote.
> > 	   This has a problem if hotlist is small compared to
> > 	   total count of hot pages.  Say list is 1%, 20% actually
> > 	   hot, low chance of repeats even in hot pages.
> > 	c) It is hot, let's monitor a while before doing anything.
> > 	   Measurement technique may change. Maybe cheaper
> > 	   to monitor 'candidate' pages than all pages
> > 	   e.g. CXL HMU gives 1000 pages, then we use access bit
> > 	        sampling to check they are at least accessed N times
> > 		in next second.
> > 	d) It was hot, we moved it. Did it stay hot?
> > 	   More useful to identify when we are thrashing and should
> > 	   just stop doing anything.  Too late to fix this one!
> 
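
To make the arithmetic in b) concrete, here is a small Python sketch.  It
assumes, purely for illustration, that each hot list is an independent uniform
random sample of the hot set and that the hot set itself does not change
between the two lists.  Under that assumption a hot page lands on any one list
with probability list_fraction / hot_fraction, so with a 1% list and 20% of
pages hot only about 5% of entries repeat.  The function names and the
simulation size are hypothetical.

```python
import random


def repeat_probability(list_fraction, hot_fraction):
    """Chance that an entry of one hot list also appears on another one,
    assuming both lists are uniform random samples of the same hot set."""
    return list_fraction / hot_fraction


def simulate_overlap(total_pages, list_fraction, hot_fraction, seed=0):
    """Draw two hot lists from the same hot set and measure their overlap."""
    rng = random.Random(seed)
    hot_pages = range(round(total_pages * hot_fraction))
    list_len = round(total_pages * list_fraction)
    previous = set(rng.sample(hot_pages, list_len))
    current = set(rng.sample(hot_pages, list_len))
    return len(previous & current) / list_len


p = repeat_probability(0.01, 0.20)
overlap = simulate_overlap(100_000, 0.01, 0.20)
print(f"predicted repeat rate: {p:.3f}, simulated: {overlap:.3f}")
```

Real hot lists are not uniform samples (the hottest pages are more likely to be
reported), so this is a pessimistic bound rather than a prediction, but it
shows why requiring entries in both lists can starve promotion when the list
is much smaller than the hot set.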
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
> 
> > - Some data should be considered hot even when not in use (e.g. stack)  
> 
> DAMOS filters are for this kind of exception, and the DAMON kernel API is
> flexible enough to let callers directly manipulate the regions information based
> on their special knowledge.  We can further optimize the interface for easier
> use, of course.

Nice.

> 
> > - Usecases interfere. So it can't just be a broadcast mode
> >   where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> > 	a) Detecting bad decisions. Enough bad decisions, better to
> > 	   do nothing?
> >  	b) Metadata beyond the counts is useful
> > 	   https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > 	   Promotion algorithms can need aggregate statistics for a memory 
> > 	   device to decide how much to move.  
> 
> The DAMOS quota goals feature is a sort of answer to this question.  It allows
> users to set a target metric and value, and tunes the aggressiveness.  For
> promotions and demotions, I suggested using upper tier utilization and free
> ratio as such possible goal metrics, and I am going to post an implementation
> for that soon.

Those are certainly good metrics to consider, but I think we definitely also
need a metric for how beneficial the moves being made are.

That matters more on the promotion path, because promotion interrupts access to
hot data and so will cause a temporary drop in performance / latency spike.
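
One possible shape for such a metric, tying it back to heuristic d) above, is
the fraction of promoted pages that are still observed hot in a follow-up
window.  The Python sketch below is only an illustration: the function names,
the 0.5 threshold and the three-window average are assumptions made for the
example, not anything that exists in DAMON or elsewhere in the kernel.

```python
def promotion_benefit(promoted, still_hot):
    """Fraction of promoted pages found hot again in the follow-up window.

    Both arguments are sets of page identifiers.  Returns None when nothing
    was promoted, so an idle window does not count as a bad decision.
    """
    if not promoted:
        return None
    return len(promoted & still_hot) / len(promoted)


def should_keep_promoting(history, min_benefit=0.5, window=3):
    """Back off when the recent average benefit drops below min_benefit.

    history is a list of promotion_benefit() results, oldest first.
    """
    recent = [b for b in history[-window:] if b is not None]
    if not recent:
        return True  # no evidence yet, keep going
    return sum(recent) / len(recent) >= min_benefit


benefit = promotion_benefit({1, 2, 3, 4}, {2, 4, 9})
keep_going = should_keep_promoting([0.9, 0.2, 0.1, 0.3])
print(benefit, keep_going)
```

A count like this only says whether the moved pages stayed hot.  It says
nothing about the latency cost of the move itself, which would need to be
weighed separately.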

> 
> > 
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew.  This is hard!  
> 
> Indeed.  Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discussing in more depth!
>

I'm not sure we'll succeed.  This may well be a wild west situation for a while
yet, but hopefully we can slowly converge or at least build some common
parts.

Jonathan

p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
Montreal.
 
> 
> Thanks,
> SJ
> 
> > 
> > Jonathan

Thread overview: 5+ messages
2025-03-19 12:47 [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
2025-03-19 23:50 ` SeongJae Park
2025-03-21 15:30   ` Jonathan Cameron [this message]
2025-03-21 17:36     ` SeongJae Park
2025-04-04 10:39 ` Jonathan Cameron
