public inbox for linux-mm@kvack.org
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Raghavendra K T <raghavendra.kt@amd.com>,
	Bharata B Rao <bharata@amd.com>, SeongJae Park <sj@kernel.org>,
	<lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>
Cc: Michal Hocko <mhocko@suse.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Date: Fri, 4 Apr 2025 11:39:12 +0100
Message-ID: <20250404113912.00002606@huawei.com> (raw)
In-Reply-To: <20250319124552.0000344a@huawei.com>

On Wed, 19 Mar 2025 12:47:53 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:


https://drive.google.com/file/d/1o9g-Bggg7jJwrkLa90ZyLEW6xPdp2D2a/view?usp=drivesdk

Slides as presented at LSF-MM.

> Prior to LSFMM, this is an update on where the discussion has gone on list
> since the original proposal back in January (which was buried in the
> thread for Ragha's proposal focused on PTE A bit scanning)
> 
> v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/
> 
> Note that this is combining comments and discussion from many people and I may
> well have summarized things badly + missed key details. If time allows
> I'll update with a v3 when people have ripped up this straw man.
> 
> Bharata has posted code for one approach and discussion is ongoing:
> https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
> This proposal overlaps with parts of several other proposals (DAMON, access
> bit tracking etc.) but the focus is intended to be more general.
> 
> Abstract:
> 
> We have:
> 1) A range of different technologies tracking what may be loosely defined
> as the hotness of regions of memory.
> 2) A set of use cases that care about this data.
> 
> Question:
> 
> Is it useful or feasible to aggregate the data from the sources (1) to some
> layer before providing answers to (2)?  What should that layer look like?
> What services and abstractions should it provide? Is there commonality in
> what those use cases need?
> 
> By aggregate I'm not necessarily implying multiple techniques in use at
> once, but more that we want one interface driven by whatever solution
> is the right balance on a particular system. That balance can be affected
> by hardware availability or characteristics of the system or workload.
> 
> Note that many of the hotness driven actions are painful (e.g. migration
> of hot pages) and for those we need to be very sure it is a good idea
> to do anything at all!
> 
> My assumption is that in at least some cases the problem will be too hard
> to solve in the kernel, but let's consider what we can do.
> 
> On to the details:
> ------------------
> 
> Note: I'm ignoring the low level implementation details of each method
> and how they avoid resource exhaustion, tune sampling timing (epoch length)
> and what is sampled (scanning, random sampling etc.) as in at least some
> cases that's a problem for the lowest, technique-specific level.
> 
> Enumerating the cases (thanks to Bharata, Johannes, SJ and others for input
> on this!)  Much of this is quoted directly from this thread:
> https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> (particularly Bharata's reply to my original questions)
> 
> Here is a compilation of available temperature sources and how the 
> hot/access data is consumed by different subsystems:
> 
> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
> 
> ==================================================
> Temperature		PA	VA	AA	NA
> source
> ==================================================
> PROT_NONE faults	Y	Y	Y	Y
> --------------------------------------------------
> folio_mark_accessed()	Y		Y	Y
> --------------------------------------------------
> PTE A bit		Y	Y	N*	N
> --------------------------------------------------
> Platform hints		Y	Y	Y	Y
> (AMD IBS)
> --------------------------------------------------
> Device hints		Y	N	N	N
> (CXL HMU)
> ==================================================
> * Some information available from scanning timing.
>   In all cases other methods can be applied to fill in the missing data
>   (rmap etc)
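As a purely illustrative aside (my addition, all names hypothetical, not from any posted code): one way to picture what a common aggregation layer could pass upward is a normalized per-observation record where fields a given source cannot supply, per the table above, are flagged invalid rather than guessed:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical normalized record for one hotness observation.
 * Each temperature source fills in only what it actually knows
 * (see the PA/VA/AA/NA table above); consumers check the flags.
 */
struct hotness_sample {
	uint64_t pfn;		/* physical frame number, if PA available */
	uint64_t vaddr;		/* virtual address, if VA available */
	uint64_t access_ns;	/* access timestamp, if AA available */
	int accessing_node;	/* NUMA node of the accessor, if NA available */
	bool pa_valid, va_valid, aa_valid, na_valid;
};

/* e.g. a CXL HMU style source can only supply the physical address */
static inline struct hotness_sample cxl_hmu_sample(uint64_t pfn)
{
	struct hotness_sample s = { .pfn = pfn, .pa_valid = true };

	return s;
}
```

A VA-capable source (e.g. PROT_NONE faults) would set all four flags; the later stages then know which gaps need rmap walks to fill in.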
> 
> And here is an attempt to compile how different subsystems
> use the above data:
> ==========================================================================================
> Source			Subsystem	Consumption		Activation/Frequency
> ==========================================================================================
> PROT_NONE faults	NUMAB		NUMAB=1 locality based	While task is running,
> via process pgtable			balancing		rate varies on observed
> walk					NUMAB=2 hot page	locality and sysctl knobs.
> 					promotion
> ==========================================================================================
> folio_mark_accessed()	FS/filemap/GUP	LRU list activation	On cache access and unmap
> ==========================================================================================
> PTE A bit via		Reclaim:LRU	LRU list activation,	During memory pressure
> rmap walk				deactivation/demotion
> ==========================================================================================
> PTE A bit via		Reclaim:MGLRU	LRU list activation,	- During memory pressure
> rmap walk and process			deactivation/demotion	- Continuous sampling (configurable)
> pgtable walk							  for workingset reporting
> ==========================================================================================
> PTE A bit via		DAMON		LRU activation,
> rmap walk				hot page promotion,
> 					demotion etc
> ==========================================================================================
> Platform hints		NUMAB		NUMAB=1 Locality based
> (e.g. AMD IBS)				balancing and
> 					NUMAB=2 hot page
> 					promotion
> ==========================================================================================
> Device hints		NUMAB		NUMAB=2 hot page
> (e.g. CXL HMU)				promotion
> ==========================================================================================
> PG_young / PG_idle ?
> ==========================================================================================
> 
> Technique trade offs:
> 
> Why not just use one method?
> 
> - Cost of capture, cost of use.
>   * Run all the time - aggregate data for stability of hotness.
>   * Run occasionally to minimize cost.
> 
> - Different availability. e.g. IBS might be needed for other things,
>   hardware monitors may not be available.
> 
> Straw man (based in part on the IBS proposal linked above)
> ----------------------------------------------------------
> 
> Multiple sources become similar at different levels.
> 
> Taking just tiering promotion as an example and keeping in mind the golden
> rule of tiered memory: Put data in the right place to start with if you
> can.  So this is about when you can't: application unaware, changing memory
> pressure and workload mix etc.
> 
>    _____________________     __________________
>   | Sampling techniques |   | Hardware units  |
>   | - Access counter,   |   | CXL HMU etc     |
>   | - Trace based       |   |_________________|
>   |_____________________|           |
>              |                  Hot page
>            Events                   |
>              |                      |
>    __________v___________           |
>   |  Events to counts    |          |
>   |  - hashtable, sketch |          |
>   |    etc               |          |
>   |______________________|          |
>              |                      |
>           Hot page                  |
>              |                      |
>   ___________V______________________V_________
>  |  Hot list - responsible for stability?     |
>  |____________________________________________|
>              |
>         Timely hotlist data        
>              |               Additional data (process newness, stack location...?)
>    __________v__________________|___
>   |  Promotion Daemon               |
>   |_________________________________|
> 
> For all paths where data flows down we probably need control parameters
> flowing back the other way, and if we have multiple users of the data stream
> we need to satisfy each of their constraints.
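To make the "events to counts" box above concrete: a count-min sketch is one bounded-memory option for folding a high-rate event stream into approximate per-page counts. This is strictly my own userspace illustration, with the depth, width and hash mix chosen arbitrarily:

```c
#include <stdint.h>

/*
 * Minimal count-min sketch for the "events to counts" stage:
 * memory use is fixed regardless of how many distinct pages
 * appear in the event stream. Estimates never undercount but
 * may overcount due to hash collisions.
 */
#define CMS_DEPTH 4
#define CMS_WIDTH 1024		/* power of two, so we can mask */

struct cms {
	uint32_t cnt[CMS_DEPTH][CMS_WIDTH];
};

static uint64_t cms_hash(uint64_t pfn, int row)
{
	/* splitmix64-style mixer, salted per row */
	uint64_t x = pfn + 0x9e3779b97f4a7c15ULL * (uint64_t)(row + 1);

	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

static void cms_record(struct cms *c, uint64_t pfn)
{
	for (int i = 0; i < CMS_DEPTH; i++)
		c->cnt[i][cms_hash(pfn, i) & (CMS_WIDTH - 1)]++;
}

/* Estimate is the minimum over rows; a collision-free row is exact. */
static uint32_t cms_estimate(const struct cms *c, uint64_t pfn)
{
	uint32_t min = UINT32_MAX;

	for (int i = 0; i < CMS_DEPTH; i++) {
		uint32_t v = c->cnt[i][cms_hash(pfn, i) & (CMS_WIDTH - 1)];

		if (v < min)
			min = v;
	}
	return min;
}
```

The appeal over a plain hashtable is that resource exhaustion is impossible by construction; the cost is that "hot" can only ever be answered approximately, and the sketch has to be aged or reset per epoch.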
> 
> SJ has proposed perhaps extending DAMON as a possible interface layer. I have
> yet to understand how that works in cases where regions do not provide
> a compact representation due to lack of contiguity in the hotness.
> An example use case is a hypervisor wanting to migrate data under unaware,
> cheap VMs.  After a system has been running for a while (particularly with
> hot pages being migrated, swapped etc.) the hotness map looks much like noise.
> 
> Now for the "here be monsters" bit...
> ---------------------------------------
> 
> - Stability of hotness matters and is hard to establish.
>   Predict a page will remain hot - various heuristics.
> 	a) It is hot, probably stays so? (super hot!)
> 	   Sometimes enough to be detected as hot once,
> 	   often not.
> 	b) It has been hot a while, probably stays so.
> 	   Check this hot list against previous hot list,
> 	   entries in both needed to promote.
> 	   This has a problem if the hot list is small compared to the
> 	   total count of hot pages.  Say the list covers 1% of pages but
> 	   20% are actually hot: low chance of repeats even for hot pages.
> 	c) It is hot, let's monitor a while before doing anything.
> 	   Measurement technique may change. Maybe cheaper
> 	   to monitor 'candidate' pages than all pages
> 	   e.g. CXL HMU gives 1000 pages, then we use access bit
> 	        sampling to check they are at least accessed N times
> 		in next second.
> 	d) It was hot, we moved it. Did it stay hot?
> 	   More useful for identifying when we are thrashing and should
> 	   just stop doing anything.  Too late to fix this one!
> - Some data should be considered hot even when not in use (e.g. stack)
> - Usecases interfere. So it can't just be a broadcast mode
>   where hotness information is sent to all users.
> - When to stop, start migration / tracking?
> 	a) Detecting bad decisions. Enough bad decisions, better to
> 	   do nothing?
>  	b) Metadata beyond the counts is useful
> 	   https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> 	   Promotion algorithms can need aggregate statistics for a memory 
> 	   device to decide how much to move.
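Heuristic (b) above boils down to intersecting consecutive epochs' hot lists and only promoting the survivors. A minimal sketch of that step (again my own illustration, not proposed code):

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Heuristic (b): keep only pages reported hot in both the previous
 * and the current epoch's hot list. Both input arrays are sorted in
 * place; survivors are written to out[] (sized >= min(nprev, ncur))
 * and the survivor count is returned.
 */
static size_t hotlist_intersect(uint64_t *prev, size_t nprev,
				uint64_t *cur, size_t ncur,
				uint64_t *out)
{
	size_t i = 0, j = 0, n = 0;

	qsort(prev, nprev, sizeof(*prev), cmp_u64);
	qsort(cur, ncur, sizeof(*cur), cmp_u64);
	while (i < nprev && j < ncur) {
		if (prev[i] < cur[j]) {
			i++;
		} else if (prev[i] > cur[j]) {
			j++;
		} else {
			out[n++] = cur[j];
			i++;
			j++;
		}
	}
	return n;
}
```

This also makes the sampling problem noted in (b) visible: if each list is a small random sample of a much larger hot set, the intersection is tiny even when hotness is perfectly stable, so survivor count alone cannot distinguish "nothing is stably hot" from "our lists are too short".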
> 
> As noted above, this may well overlap with other sessions.
> One outcome of the discussion so far is to highlight what I think many
> already knew.  This is hard!
> 
> Jonathan
> 




Thread overview: 5+ messages
2025-03-19 12:47 [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
2025-03-19 23:50 ` SeongJae Park
2025-03-21 15:30   ` Jonathan Cameron
2025-03-21 17:36     ` SeongJae Park
2025-04-04 10:39 ` Jonathan Cameron [this message]
