From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from September 11, 2025
Date: Sun, 14 Sep 2025 18:37:20 -0700 (PDT)
Message-ID: <d18661f5-ba27-35fb-f2ee-a4cbe865b6c7@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, September 11. Thanks to everybody who was
involved!
These notes are intended to bring up to speed those who could not attend
the call, as well as to keep the conversation going between meetings.
----->o-----
Bharata provided an update on the status of his patch series, including
NUMAB=2, ratelimiting, and dynamic thresholding. The latest patch series
was posted with three sources of hotness information, all experimental in
nature. It also includes basic testing.
Bharata noted that he had been testing on Zen4 based systems where access
latency between a remote node and a CXL node is high, regressing by
60-90%. So if there are two top-tier nodes, nodes 0 and 1, and a CXL
node 2, access latency from 0->2 regresses by ~90% compared to 0->1
(access latency from 1->2 regresses by ~60% because node 1 is closer to
the CXL card). Compare that to a Zen5 based system, where latencies have
improved considerably: latency from 0->2 is only a ~7% regression
compared to 0->1 (access latency from 1->2 regresses by ~40%).
He asked two questions:
- do we still need to provision CXL memory as a separate NUMA node or is
traditional NUMA Balancing sufficient for this?
- question to Jonathan: is this considered a step forward based on
previous discussions at LSF/MM/BPF? And where are we with CHMU?
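For reference, NUMAB=2 refers to the memory tiering mode of NUMA
Balancing. A minimal configuration sketch of the existing upstream knobs
(the additional tunables in Bharata's series may differ):

```shell
# Enable NUMA Balancing in memory tiering mode (NUMAB=2): candidate hot
# pages on slow-tier (e.g. CXL) nodes are promoted to top-tier nodes.
echo 2 > /proc/sys/kernel/numa_balancing    # 0=off, 1=classic, 2=tiering
# Let reclaim demote cold top-tier pages to the lower tier instead of
# swapping or discarding them.
echo 1 > /sys/kernel/mm/numa/demotion_enabled
```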
----->o-----
Wei Xu noted there are additional use cases beyond tiering: memory
expansion, bandwidth expansion, and memory tiering itself, all depending
on the CXL hardware. There will be use cases where cheaper memory is put
behind CXL to improve overall TCO. Additionally, there may be features
behind the CXL controller such as inline memory compression. Memory
tiering is likely not the only use case for CXL memory. Yiannis agreed
with the point about handling inline memory compression, since that is
his focus as well.
Wei suggested that the choice of data structure is key to these
discussions. LRU is likely a sufficient signal for demotion but not for
promotion; a separate data structure for promotion is needed, but its
complexity should be minimized.
----->o-----
Jonathan Cameron suggested some folks may not yet have the shiny inline
memory compression devices, but also brought up much larger topologies if
latencies are this good: people may start using switch fabrics to get
wider fan-out and plug even more RAM into the system. He strongly agreed
there was a case for all of this.
He also noted there is infrastructure that can gather data on application
behavior to optimize memory placement. This is the focus of the CHMU
(CXL Hotness Monitoring Unit) for now, until actual hardware is
available. It was also noted that the CHMU specification allows a lot of
flexibility, enough to build very bad hotness monitors if we choose.
It's very early days.
----->o-----
Raghu gave an update on his PTE Accessed bit scanning series. Rather
than the idle page tracking APIs, he prefers to rely on PTE scanning and
MGLRU, and he discussed a mechanism for integrating his series with
MGLRU. Bharata noted that kscand is based on PTE Accessed bit scanning
information and klruscand is based on the same approach; there should be
commonality between the two that can leverage the heuristics from
kscand. The goal is to get the best of both worlds from the two
approaches. Wei strongly agreed and suggested klruscand was a proof of
concept.
The suggested division of work was for Kinsey Ho to provide an API in
MGLRU that performs the scanning for these use cases (access and flush)
while Raghu focuses on the kernel daemon, including its heuristics.
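For context, on kernels built with CONFIG_LRU_GEN, MGLRU already exposes
a debugfs interface that drives the kind of Accessed-bit walk discussed
here, and klruscand builds on the same machinery. A rough sketch of that
interface (the memcg/node/generation values below are placeholders):

```shell
# Inspect per-memcg, per-node generation state: each generation lists
# page counts by tier, a coarse recency signal.
cat /sys/kernel/debug/lru_gen
# Age memcg 1 on node 0 up to generation 7, walking page tables and
# clearing PTE Accessed bits along the way (can_swap=1, force_scan=1).
echo '+ 1 0 7 1 1' > /sys/kernel/debug/lru_gen
```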
----->o-----
Next meeting will be on Thursday, September 25 at 8:30am PDT (UTC-7),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on latest patch series from Bharata and consolidating memory
hotness information, including ratelimiting and dynamic thresholds
- update on Raghu's patch series for PTE Accessed bit scanning and its
integration into the above, as well as with klruscand
- how to provide data to the community both on access latency for type 3
memory expansion devices as well as hotness information
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace
- discuss proactive demotion interface as an extension to memory.reclaim
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!