From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Raghavendra K T <raghavendra.kt@amd.com>,
Bharata B Rao <bharata@amd.com>, SeongJae Park <sj@kernel.org>,
<lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>
Cc: Michal Hocko <mhocko@suse.com>,
Dan Williams <dan.j.williams@intel.com>,
Matthew Wilcox <willy@infradead.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Date: Fri, 4 Apr 2025 11:39:12 +0100
Message-ID: <20250404113912.00002606@huawei.com>
In-Reply-To: <20250319124552.0000344a@huawei.com>
On Wed, 19 Mar 2025 12:47:53 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
https://drive.google.com/file/d/1o9g-Bggg7jJwrkLa90ZyLEW6xPdp2D2a/view?usp=drivesdk
Slides as presented at LSF-MM.
> Prior to LSFMM, this is an update on where the discussion has gone on list
> since the original proposal back in January (which was buried in the
> thread for Ragha's proposal focused on PTE A bit scanning)
>
> v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/
>
> Note that this is combining comments and discussion from many people and I may
> well have summarized things badly + missed key details. If time allows
> I'll update with a v3 when people have ripped up this straw man.
>
> Bharata has posted code for one approach and discussion is ongoing:
> https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
> This proposal overlaps with parts of several other proposals (DAMON, access
> bit tracking etc.), but the focus is intended to be more general.
>
> Abstract:
>
> We have:
> 1) A range of different technologies tracking what may be loosely defined
> as the hotness of regions of memory.
> 2) A set of use cases that care about this data.
>
> Question:
>
> Is it useful or feasible to aggregate the data from the sources (1) to some
> layer before providing answers to (2)? What should that layer look like?
> What services and abstractions should it provide? Is there commonality in
> what those use cases need?
>
> By aggregate I'm not necessarily implying multiple techniques in use at
> once, but more that we want one interface driven by whatever solution
> is the right balance on a particular system. That balance can be affected
> by hardware availability or characteristics of the system or workload.
>
> Note that many of the hotness driven actions are painful (e.g. migration
> of hot pages) and for those we need to be very sure it is a good idea
> to do anything at all!
>
> My assumption is that in at least some cases the problem will be too hard
> to solve in kernel, but let's consider what we can do.
>
> On to the details:
> ------------------
>
> Note: I'm ignoring the low level implementation details of each method
> and how they avoid resource exhaustion, tune sampling timing (epoch length)
> and what is sampled (scanning, random etc.), as in at least some cases that's
> a problem for the lowest, technique-specific level.
>
> Enumerating the cases (thanks to Bharata, Johannes, SJ and others for inputs
> on this!). Much of this is quoted directly from this thread:
> https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> (particularly Bharata's reply to my original questions)
>
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:
>
> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
>
> ==================================================
> Temperature             PA      VA      AA      NA
> source
> ==================================================
> PROT_NONE faults        Y       Y       Y       Y
> --------------------------------------------------
> folio_mark_accessed()   Y               Y       Y
> --------------------------------------------------
> PTE A bit               Y       Y       N*      N
> --------------------------------------------------
> Platform hints          Y       Y       Y       Y
> (AMD IBS)
> --------------------------------------------------
> Device hints            Y       N       N       N
> (CXL HMU)
> ==================================================
> * Some information available from scanning timing.
> In all cases other methods can be applied to fill in the missing data
> (rmap etc)
>
> And here is an attempt to compile how different subsystems
> use the above data:
> ==========================================================================================
> Source                  Subsystem       Consumption             Activation/Frequency
> ==========================================================================================
> PROT_NONE faults        NUMAB           NUMAB=1 locality based  While task is running,
> via process pgtable                     balancing               rate varies on observed
> walk                                    NUMAB=2 hot page        locality and sysctl knobs.
>                                         promotion
> ==========================================================================================
> folio_mark_accessed()   FS/filemap/GUP  LRU list activation     On cache access and unmap
> ==========================================================================================
> PTE A bit via           Reclaim:LRU     LRU list activation,    During memory pressure
> rmap walk                               deactivation/demotion
> ==========================================================================================
> PTE A bit via           Reclaim:MGLRU   LRU list activation,    - During memory pressure
> rmap walk and process                   deactivation/demotion   - Continuous sampling (configurable)
> pgtable walk                                                      for workingset reporting
> ==========================================================================================
> PTE A bit via           DAMON           LRU activation,
> rmap walk                               hot page promotion,
>                                         demotion etc
> ==========================================================================================
> Platform hints          NUMAB           NUMAB=1 Locality based
> (e.g. AMD IBS)                          balancing and
>                                         NUMAB=2 hot page
>                                         promotion
> ==========================================================================================
> Device hints            NUMAB           NUMAB=2 hot page
> (e.g. CXL HMU)                          promotion
> ==========================================================================================
> PG_young / PG_idle      ?
> ==========================================================================================
>
> Technique trade-offs:
>
> Why not just use one method?
>
> - Cost of capture, cost of use.
> * Run all the time - aggregate data for stability of hotness.
> * Run occasionally to minimize cost.
>
> - Different availability. e.g. IBS might be needed for other things,
> hardware monitors may not be available.
>
> Straw man (based in part on the IBS proposal linked above)
> ----------------------------------------------------------
>
> Multiple sources become similar at different levels.
>
> Taking just tiering promotion as an example and keeping in mind the golden
> rule of tiered memory: Put data in the right place to start with if you
> can. So this is about when you can't: application unaware, changing memory
> pressure and workload mix etc.
>
>   _____________________       __________________
>  | Sampling techniques |     | Hardware units   |
>  | - Access counter,   |     | CXL HMU etc      |
>  | - Trace based       |     |__________________|
>  |_____________________|              |
>             |                     Hot page
>          Events                       |
>             |                         |
>   __________v___________              |
>  | Events to counts     |             |
>  | - hashtable, sketch  |             |
>  |   etc                |             |
>  |______________________|             |
>             |                         |
>         Hot page                      |
>             |                         |
>   __________v_________________________v_______
>  | Hot list - responsible for stability?      |
>  |____________________________________________|
>                       |
>              Timely hotlist data
>                       |       Additional data (process newness, stack location...?)
>   ____________________v________|___
>  | Promotion Daemon                |
>  |_________________________________|
>
> For all paths where data is flowing down we probably need control parameters
> flowing back the other way + if we have multiple users of the datastream
> we need to satisfy each of their constraints.
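
To make the "Events to counts" box in the diagram above a little more concrete, here is a minimal userspace sketch of the "sketch" option: a count-min sketch that turns a stream of sampled page frame numbers into approximate per-page counts in fixed memory. This is only an illustration under assumptions of mine. The class and function names, the table dimensions and the use of blake2b as the hash are all hypothetical and not taken from any of the posted patches.

```python
import hashlib


class CountMinSketch:
    """Approximate per-page access counts in fixed memory (hypothetical helper)."""

    def __init__(self, width=1024, depth=4):
        # width and depth are illustrative values, not tuned for any workload.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, pfn):
        # One hash per row, obtained by salting blake2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                pfn.to_bytes(8, "little"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def record(self, pfn):
        """Account one access event for this page frame number."""
        for row, col in self._slots(pfn):
            self.rows[row][col] += 1

    def estimate(self, pfn):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.rows[row][col] for row, col in self._slots(pfn))


def hot_pages(events, threshold, sketch=None):
    """Return the pages whose estimated count reaches the threshold."""
    if sketch is None:
        sketch = CountMinSketch()
    for pfn in events:
        sketch.record(pfn)
    return sorted({pfn for pfn in events if sketch.estimate(pfn) >= threshold})


events = [0x1000] * 5 + [0x2000] + [0x3000] * 3
print([hex(pfn) for pfn in hot_pages(events, threshold=3)])
```

The trade-off is the usual one for this structure: memory is bounded regardless of how many pages are touched, at the cost of occasionally overestimating a cold page that collides with hot ones in every row.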
>
> SJ has proposed perhaps extending DAMON as a possible interface layer. I am
> yet to understand how that works in cases where regions do not provide
> a compact representation due to lack of contiguity in the hotness.
> An example use case is a hypervisor wanting to migrate data under unaware,
> cheap VMs. After a system has been running for a while (particularly with hot
> pages being migrated, swap etc) the hotness map looks much like noise.
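
As a rough illustration of the compactness concern above, the sketch below counts how many contiguous regions are needed to describe a hotness map exactly, first when the hot pages form one block and then when the same number of hot pages is scattered at random. The page counts, the fixed seed and the helper name are made up for the example, and it says nothing about how any real region-based monitor would merge or approximate such a map.

```python
import random


def count_regions(hot_flags):
    """Number of maximal runs of equal hotness, i.e. the fewest
    (start, end, hot?) regions that describe the map exactly."""
    if not hot_flags:
        return 0
    return 1 + sum(1 for a, b in zip(hot_flags, hot_flags[1:]) if a != b)


n_pages = 10_000  # illustrative address space size, in pages
n_hot = 2_000     # illustrative number of hot pages

# Contiguous hotness: a single hot block inside a cold range.
contiguous = [1000 <= i < 1000 + n_hot for i in range(n_pages)]

# Scattered hotness: the same number of hot pages, placed at random.
rng = random.Random(0)
hot_set = set(rng.sample(range(n_pages), n_hot))
scattered = [i in hot_set for i in range(n_pages)]

print("contiguous:", count_regions(contiguous))  # cold, hot, cold -> 3
print("scattered: ", count_regions(scattered))
```

With hotness that looks like noise, the region count grows to the same order as the number of hot pages, so a region list stops being a compact summary unless it is allowed to lose detail.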
>
> Now for the "there be monsters" bit...
> ---------------------------------------
>
> - Stability of hotness matters and is hard to establish.
> Predict a page will remain hot - various heuristics.
> a) It is hot, probably stays so? (super hot!)
> Sometimes enough to be detected as hot once,
> often not.
> b) It has been hot a while, probably stays so.
> Check this hot list against previous hot list,
> entries in both needed to promote.
> This has a problem if hotlist is small compared to
> total count of hot pages. Say list is 1%, 20% actually
> hot, low chance of repeats even in hot pages.
> c) It is hot, let's monitor a while before doing anything.
> Measurement technique may change. Maybe cheaper
> to monitor 'candidate' pages than all pages
> e.g. CXL HMU gives 1000 pages, then we use access bit
> sampling to check they are at least accessed N times
> in next second.
> d) It was hot, we moved it. Did it stay hot?
> More useful to identify when we are thrashing and should
> just stop doing anything. Too late to fix this one!
> - Some data should be considered hot even when not in use (e.g. stack)
> - Use cases interfere, so it can't just be a broadcast mode
> where hotness information is sent to all users.
> - When to stop, start migration / tracking?
> a) Detecting bad decisions. Enough bad decisions, better to
> do nothing?
> b) Metadata beyond the counts is useful
> https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> Promotion algorithms can need aggregate statistics for a memory
> device to decide how much to move.
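
The small-hotlist problem in heuristic (b) above can be put into numbers. The sketch below assumes, purely for illustration, that each epoch's hot list is a uniform random sample of the truly hot pages. Real sources will be biased towards the hottest pages, so this is a pessimistic toy model rather than a description of any actual tracker. The function names and page counts are hypothetical.

```python
import random


def expected_overlap(n_hot, list_size):
    """Expected number of entries common to two independent hot lists,
    each a uniform random sample of list_size pages out of n_hot."""
    p_listed = list_size / n_hot  # chance a given hot page makes one list
    return list_size * p_listed


def simulate_overlap(n_hot, list_size, seed=0):
    """Draw two hot lists and count the pages present in both."""
    rng = random.Random(seed)
    previous = set(rng.sample(range(n_hot), list_size))
    current = set(rng.sample(range(n_hot), list_size))
    return len(previous & current)


# 100,000 pages in total: 20% actually hot, hot list holds 1% of all pages.
n_hot, list_size = 20_000, 1_000
print(expected_overlap(n_hot, list_size))  # 50.0
print(simulate_overlap(n_hot, list_size))
```

Under that assumption a hot page has a 5% chance of appearing in any one list, so only about 50 of the 1000 entries are expected to repeat in the next list. Requiring presence in both lists then filters out most genuinely hot pages.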
>
> As noted above, this may well overlap with other sessions.
> One outcome of the discussion so far is to highlight what I think many
> already knew. This is hard!
>
> Jonathan
>