From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	 Gregory Price <gourry@gourry.net>,
	 Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	 Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	 "Rao, Bharata Bhasker" <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>,  Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	 Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from September 11, 2025
Date: Sun, 14 Sep 2025 18:37:20 -0700 (PDT)
Message-ID: <d18661f5-ba27-35fb-f2ee-a4cbe865b6c7@google.com>

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, September 11.  Thanks to everybody who was 
involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Bharata provided an update on the status of his patch series, including 
NUMAB=2 support, ratelimiting, and dynamic thresholding.  The latest 
series was posted with three sources of hotness information, all 
experimental in nature, and includes basic testing.
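
For readers catching up: NUMAB=2 refers to NUMA balancing's memory 
tiering mode, selected through the kernel.numa_balancing sysctl.  As a 
minimal sketch, it can be enabled from userspace like this (requires 
root; equivalent to `sysctl kernel.numa_balancing=2`):

    /* Enable NUMA balancing memory tiering mode (NUMAB=2). */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

            if (!f) {
                    perror("numa_balancing");
                    return 1;
            }
            fputs("2", f);  /* 0 = off, 1 = normal balancing, 2 = tiering */
            fclose(f);
            return 0;
    }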

Bharata noted that he had been testing on Zen4 based systems where the 
access latency from a top tier node to a CXL node is high, regressing 
by 60-90%.  So if there are two top tier nodes, nodes 0 and 1, and a 
CXL node 2, the access latency from 0->2 regresses by ~90% compared to 
0->1 (access latency from 1->2 regresses by only ~60% because node 1 
is closer to the CXL card).  Compare that to a Zen5 based system, 
where latencies have improved a lot: the latency from 0->2 regresses 
by only ~7% compared to 0->1 (access latency from 1->2 regresses by 
~40%).
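
To make those percentages concrete, here is a small worked example; 
the 100ns baseline is purely hypothetical, only the regression 
percentages come from the discussion above:

    /* Worked example: the regression percentages applied to a
     * made-up 100ns 0->1 baseline latency. */
    #include <stdio.h>

    int main(void)
    {
            double base = 100.0;    /* hypothetical 0->1 latency, ns */

            printf("Zen4 0->2: %.0f ns (+90%%)\n", base * 1.90);
            printf("Zen4 1->2: %.0f ns (+60%%)\n", base * 1.60);
            printf("Zen5 0->2: %.0f ns (+7%%)\n",  base * 1.07);
            printf("Zen5 1->2: %.0f ns (+40%%)\n", base * 1.40);
            return 0;
    }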

He asked two questions:

 - do we still need to provision CXL memory as a separate NUMA node or is
   traditional NUMA Balancing sufficient for this?

 - question to Jonathan: is this considered a step forward based on
   previous discussions at LSF/MM/BPF?  And where are we with CHMU?

----->o-----
Wei Xu noted there are additional use cases: memory expansion, 
bandwidth expansion, and memory tiering itself, all depending on the 
CXL hardware.  There will be use cases where we want to put cheaper 
memory behind CXL to improve overall TCO, and there may be additional 
features behind the CXL controller, such as inline memory compression.  
Memory tiering is likely not the only use case for CXL memory.  
Yiannis agreed with the point about handling inline memory compression 
since that is his focus as well.

Wei suggested that the choice of data structure is key to these 
discussions in order to minimize complexity.  The LRU is likely a 
sufficient signal for demotion but not for promotion; a separate data 
structure is needed for promotion, but its complexity should be 
minimized.
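
As a purely illustrative sketch of what a separate promotion-side 
structure could look like (this is not any posted design, and every 
name below is hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-page promotion record, kept outside the LRU. */
    struct hot_page {
            uint64_t pfn;           /* page frame number */
            uint32_t accesses;      /* decayed access count */
            uint32_t last_seen;     /* scan epoch of the last access */
    };

    /* Promote only pages whose decayed count crosses a threshold;
     * the count is halved for each epoch of inactivity. */
    static int promotion_candidate(const struct hot_page *hp,
                                   uint32_t now, uint32_t threshold)
    {
            uint32_t age = now - hp->last_seen;
            uint32_t decayed = age < 32 ? hp->accesses >> age : 0;

            return decayed >= threshold;
    }

    int main(void)
    {
            struct hot_page hp = { .pfn = 0x1234, .accesses = 64,
                                   .last_seen = 10 };

            /* epoch 12: count decays from 64 to 16, still >= 8 */
            printf("promote? %d\n", promotion_candidate(&hp, 12, 8));
            return 0;
    }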

----->o-----
Jonathan Cameron suggested some folks may not yet have the shiny 
inline memory compression devices, but also brought up much larger 
topologies if the latencies are this good: people may start using 
switch fabrics to get wider fan-out and plug even more RAM into the 
system.  He strongly agreed there was a case for all of this.

He also noted there is infrastructure that can gather data on 
application behavior to optimize memory placement.  This is the focus 
of the CHMU (CXL Hotness Monitoring Unit) for now, until actual 
hardware is available.  It was also noted that the CHMU specification 
allows a lot of flexibility, enough to build very bad hotness monitors 
if we choose.  It's very early days.
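
For a rough feel of the model, a CHMU-style unit tracks accesses and 
exposes a list of hot regions for software to drain.  The sketch below 
is entirely hypothetical: every structure and field name is invented 
for illustration, and the real programming interface is the one 
defined in the CXL specification:

    #include <stdint.h>
    #include <stdio.h>

    struct hotlist_entry {            /* hypothetical layout */
            uint64_t unit_addr;       /* hot region, device address space */
            uint16_t count;           /* saturating access counter */
    };

    int main(void)
    {
            struct hotlist_entry list[] = { { 0x100000, 12 },
                                            { 0x200000,  7 } };

            /* Hand each hot region to (hypothetical) promotion code. */
            for (size_t i = 0; i < sizeof(list) / sizeof(list[0]); i++)
                    printf("promote 0x%llx (count %u)\n",
                           (unsigned long long)list[i].unit_addr,
                           list[i].count);
            return 0;
    }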

----->o-----
Raghu gave an update on his PTE Accessed bit scanning series.  Instead 
of the idle page tracking APIs, he preferred to rely on PTE scanning, 
and he wants to integrate his series with MGLRU; he discussed a 
mechanism for that integration.  Bharata noted that kscand is based on 
PTE Accessed bit scanning information and klruscand is based on the 
same approach, so there should be commonality between the two that can 
leverage the heuristics from kscand.  The goal is to get the best of 
both worlds between the two approaches.  Wei strongly agreed and 
suggested klruscand was a proof of concept.
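
For context, the core of Accessed-bit scanning looks roughly like the 
kernel-style fragment below.  This is a simplified sketch, not Raghu's 
actual patches: it uses the real ptep_test_and_clear_young() helper 
but elides locking, TLB flushing, and the page table walk itself, so 
it is not standalone code:

    /* Test and clear the Accessed bit for one PTE, bumping a
     * hotness counter if the page was touched since the last scan. */
    static void scan_pte(struct vm_area_struct *vma, unsigned long addr,
                         pte_t *ptep, unsigned int *hotness)
    {
            if (ptep_test_and_clear_young(vma, addr, ptep))
                    (*hotness)++;
    }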

The suggested division of work was for Kinsey Ho to provide an MGLRU 
API that covers scanning for these use cases (access and flush), while 
Raghu focuses on the kernel daemon, including its heuristics.
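
No such API has been posted yet; purely as a hypothetical illustration 
of that split, the MGLRU side might export something along these 
lines, with the daemon supplying the callback and the heuristics 
(every name here is invented):

    /* Hypothetical MGLRU scan entry point; not a real interface. */
    struct mglru_scan_ctl {
            int nid;                        /* node to scan */
            bool flush;                     /* clear A-bits and flush TLBs */
            void (*report)(unsigned long pfn, int accesses);
    };

    int mglru_scan_node(struct mglru_scan_ctl *ctl);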

----->o-----
Next meeting will be on Thursday, September 25 at 8:30am PDT (UTC-7),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - updates on latest patch series from Bharata and consolidating memory
   hotness information, including ratelimiting and dynamic thresholds
 - update on Raghu's patch series for PTE Accessed bit scanning and its
   integration into the above, as well as with klruscand
 - how to provide data to the community both on access latency for type 3
   memory expansion devices as well as hotness information
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace
 - discuss proactive demotion interface as an extension to memory.reclaim
 - discuss overall testing and benchmarking methodology for various
   approaches as we go along

Please let me know if you'd like to propose additional topics for
discussion, thank you!

