[Linux Memory Hotness and Promotion] Notes from June 5, 2025


* [Linux Memory Hotness and Promotion] Notes from June 5, 2025
From: David Rientjes @ 2025-06-18  3:42 UTC
  To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
	Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker, SeongJae Park,
	Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan
  Cc: linux-mm

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, June 5.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
I recapped the previous instance and discussion around asynchronous page
promotion driven through a kthread.  I also discussed the previous chat
about trade-offs around isolated folio lists vs pfn based tracking of
memory to migrate.

Bharata said that tracking pfns was much easier since they are
stateless.  We don't need to worry about refcounts or about isolating
too many pages.  In the latest iteration of the patch series, he
converted this to pfn based tracking because we don't want to keep
memory isolated for too long.  Previous attempts isolated these folios
at the time of page fault and then batch migrated them later.

Now, during page fault, we grab the pfn of the folio that has been
misplaced and push that into a migrator subsystem -- a very simple
subsystem that stores pfns pushed to it (NUMA Balancing is currently
the first/only source) in a per-node list for the target node.  This is
coupled with a per-node kthread that routinely scans the list and
migrates the folios to the target node.  Work was underway to address
a contention issue that arises if there are multiple sources of these
pfns to migrate (moving away from a mutex).
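
As a rough illustration of the flow described above, here is a minimal
userspace sketch in C.  It is not code from the patch series: all names
(migrator_push, migrator_drain, MAX_NODES, QUEUE_SLOTS) are hypothetical,
a fixed-size ring stands in for the per-node list, and locking is left
out entirely even though that is where the real contention work lies.

```c
/* Userspace model only; every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES   4
#define QUEUE_SLOTS 8

/* One queue of pending pfns per target node. */
struct pfn_queue {
	unsigned long pfns[QUEUE_SLOTS];
	size_t head;	/* next slot to drain */
	size_t count;	/* number of queued pfns */
};

static struct pfn_queue node_queue[MAX_NODES];

/* Fault path: record only the pfn; nothing is isolated or pinned. */
static bool migrator_push(int target_nid, unsigned long pfn)
{
	assert(target_nid >= 0 && target_nid < MAX_NODES);
	struct pfn_queue *q = &node_queue[target_nid];

	if (q->count == QUEUE_SLOTS)
		return false;	/* queue full: drop this hint */
	q->pfns[(q->head + q->count) % QUEUE_SLOTS] = pfn;
	q->count++;
	return true;
}

/*
 * Per-node worker: move up to max queued pfns into batch, oldest first.
 * Because a pfn carries no reference, a real worker would have to
 * re-validate each one before trying to migrate it.
 */
static size_t migrator_drain(int target_nid, unsigned long *batch, size_t max)
{
	assert(target_nid >= 0 && target_nid < MAX_NODES);
	struct pfn_queue *q = &node_queue[target_nid];
	size_t n = 0;

	while (n < max && q->count > 0) {
		batch[n++] = q->pfns[q->head];
		q->head = (q->head + 1) % QUEUE_SLOTS;
		q->count--;
	}
	return n;
}
```

In this model the fault path would call migrator_push() with the target
node and the pfn, and each node's worker would periodically call
migrator_drain() and hand the resulting batch to migration.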

----->o-----
Wei Xu asked about the pfn based tracking and how this would handle
multiple sources of memory hotness with additional information without a
lot of overhead.  Bharata noted that for locality based migration there
was no need for additional information.  For hot page tracking, such as
with page table scanning or with CHMU, we'd need more information later.

Wei suggested generalizing this now so that it is easily extensible later
while still acknowledging that the metadata associated with the previous
kpromoted patch series was very large.  He suggested a virtual map,
indexed by pfn, that could be sparsely populated.  Bharata acked that
the information stored here should be concise and precise.  This would
be a future extension, however.
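
One way to picture a sparsely populated map indexed by pfn is a
two-level table whose second-level chunks are allocated only when a pfn
in that range is first recorded.  The sketch below is a userspace
illustration under that assumption, not what Wei proposed in detail or
what any series implements; the struct fields, sizes, and function names
are all made up for the example.

```c
/* Userspace model only; every name and size here is hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES   4
#define CHUNK_SHIFT 9				/* 512 entries per chunk */
#define CHUNK_SIZE  (1UL << CHUNK_SHIFT)
#define NR_CHUNKS   1024			/* covers pfns below 2^19 */

/* Deliberately small per-pfn record ("concise and precise"). */
struct hot_info {
	uint8_t nr_accesses;	/* saturating access count */
	uint8_t target_nid;	/* node the page should move to */
};

/* Top level: one pointer per chunk, NULL until first use. */
static struct hot_info *chunks[NR_CHUNKS];

/* Look up a pfn without allocating; NULL if its chunk is unpopulated. */
static struct hot_info *hot_lookup(unsigned long pfn)
{
	unsigned long idx = pfn >> CHUNK_SHIFT;

	if (idx >= NR_CHUNKS || !chunks[idx])
		return NULL;
	return &chunks[idx][pfn & (CHUNK_SIZE - 1)];
}

/* Record one access, allocating the covering chunk on demand. */
static int hot_record(unsigned long pfn, uint8_t nid)
{
	unsigned long idx = pfn >> CHUNK_SHIFT;
	struct hot_info *info;

	assert(nid < MAX_NODES);
	if (idx >= NR_CHUNKS)
		return -1;	/* pfn outside the modelled range */
	if (!chunks[idx]) {
		chunks[idx] = calloc(CHUNK_SIZE, sizeof(*chunks[idx]));
		if (!chunks[idx])
			return -1;
	}
	info = &chunks[idx][pfn & (CHUNK_SIZE - 1)];
	if (info->nr_accesses < UINT8_MAX)
		info->nr_accesses++;
	info->target_nid = nid;
	return 0;
}
```

Memory cost then scales with the number of pfn ranges that have ever
been reported hot rather than with total memory, at the price of one
extra pointer dereference per lookup.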

Raghavendra brought up the PTE based method that he is pursuing, which
stores a per-mm list of folios, and said that this could be converted to
use pfn based tracking as well.  Starting with the mm, he is storing a
simple list of folios to be migrated which can also use batch migration.

Wei asked about the per-mm tracking.  He suggested standardizing on the
tracking for folios that need to be migrated, whether this is per node or
per mm.  Raghavendra said that scanning is done in a separate thread from
the migration thread.  He said that the per mm information being stored
includes a timestamp for the whole mm so the next scan can determine if
it is still hot.  Bharata suggested one use case for storing a per-mm
list of folios is to ensure that everything is tracked on a per process
list, so if the task exits you can simply purge it very easily.
Raghavendra noted that this also prevents system-wide lock contention.
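
The per-mm bookkeeping described here can be modelled roughly as below.
This is a userspace sketch under assumptions, not Raghavendra's code: it
stores pfns rather than folios, treats the whole-mm timestamp as "time
of the last scan that queued something", and uses invented names
(mm_hot_list, mm_hot_add, mm_still_hot, mm_hot_purge).

```c
/* Userspace model only; every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Pending migrations for one mm, plus one timestamp for the whole mm. */
struct mm_hot_list {
	unsigned long *pfns;	/* pages queued for migration */
	size_t nr, cap;
	unsigned long last_hot;	/* time of the last scan that queued a page */
};

/* Scan thread: queue a page and refresh the mm-wide timestamp. */
static int mm_hot_add(struct mm_hot_list *l, unsigned long pfn,
		      unsigned long now)
{
	assert(l != NULL);
	if (l->nr == l->cap) {
		size_t cap = l->cap ? l->cap * 2 : 16;
		unsigned long *p = realloc(l->pfns, cap * sizeof(*p));

		if (!p)
			return -1;
		l->pfns = p;
		l->cap = cap;
	}
	l->pfns[l->nr++] = pfn;
	l->last_hot = now;
	return 0;
}

/* Next scan: the mm counts as hot if it queued pages within the window. */
static bool mm_still_hot(const struct mm_hot_list *l, unsigned long now,
			 unsigned long window)
{
	assert(l != NULL);
	return l->nr > 0 && now - l->last_hot <= window;
}

/* Task exit: everything lives on this one list, so purging is trivial. */
static void mm_hot_purge(struct mm_hot_list *l)
{
	assert(l != NULL);
	free(l->pfns);
	l->pfns = NULL;
	l->nr = l->cap = 0;
}
```

Because each mm owns its list, a lock protecting it would be contended
only by that mm's scanner and migrator, which is the property
Raghavendra pointed to.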

----->o-----
I asked, as I normally do (:D), about timelines for the next version of
the patch series to be proposed.  Bharata said this would be sent out to
the mailing list the following week.  He wanted to optimize the locking
first.  Raghavendra said this would be on track for his series as well.

----->o-----
I pivoted the discussion to the testing story for this approach,
including on systems that are not memory tiered.  I asked about workloads
of interest such as redis, mysql, specint, etc, as well as metrics of
interest to collect.

Yiannis Nikolakopoulos asked about the baseline demotion strategies that
we could rely on.  I noted that Google is primarily looking at proactive
demotion using working set extensions on top of MGLRU.  It would be
interesting to discuss with the group extensions to memory.reclaim
to support proactive demotion triggered by userspace.  Yiannis noted it
would be useful to establish the baseline for the demotion strategy
because that would be key to the testing infrastructure.

I asked if there were specific workloads that were interesting for our
evaluation of these approaches.  Bharata noted that his primary
evaluation was being done with microbenchmarks and redis.

----->o-----
Next meeting will be on Thursday, July 17 at 8:30am PDT (UTC-7),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - update on latest series that leverages pfn tracked folios with a
   per-node kpromoted thread
   + including optimizations that avoid system-wide mutex contention
   + reconcile how this will overlap and interact with pte based scanning
 - discuss proactive demotion interface as an extension to memory.reclaim
   + possibly leveraging working set extensions on top of MGLRU
 - discuss overall testing and benchmarking methodology for various
   approaches as we go along
   + minimal viable infrastructure, testing workloads, and metrics of
     interest to collect
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace

Please let me know if you'd like to propose additional topics for
discussion, thank you!


* Re: [Linux Memory Hotness and Promotion] Notes from June 5, 2025
From: Bharata B Rao @ 2025-06-18  3:49 UTC
  To: David Rientjes, Davidlohr Bueso, Fan Ni, Gregory Price,
	Jonathan Cameron, Joshua Hahn, Raghavendra K T, SeongJae Park,
	Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan
  Cc: linux-mm

Hi David,

Thanks for the detailed summary.

On 18-Jun-25 9:12 AM, David Rientjes wrote:
> Hi everybody,
> 
> Here are the notes from the last Linux Memory Hotness and Promotion call
> that happened on Thursday, June 5.  Thanks to everybody who was involved!
> 
> These notes are intended to bring people up to speed who could not attend
> the call as well as keep the conversation going in between meetings.
> 
> ----->o-----
> I recapped the previous instance and discussion around asynchronous page
> promotion driven through a kthread.  I also discussed the previous chat
> about trade-offs around isolated folio lists vs pfn based tracking of
> memory to migrate.
> 
> Bharata said that tracking pfns was much easier since they are
> stateless.  We don't need to worry about refcounts or about isolating
> too many pages.  In the latest iteration of the patch series, he
> converted this to pfn based tracking because we don't want to keep
> memory isolated for too long.  Previous attempts isolated these folios
> at the time of page fault and then batch migrated them later.
> 
> Now, during page fault, we grab the pfn of the folio that has been
> misplaced and push that into a migrator subsystem -- a very simple
> subsystem that stores pfns pushed to it (NUMA Balancing is currently
> the first/only source) in a per-node list for the target node.  This is
> coupled with a per-node kthread that routinely scans the list and
> migrates the folios to the target node.  Work was underway to address
> a contention issue that arises if there are multiple sources of these
> pfns to migrate (moving away from a mutex).
> 
> ----->o-----
> Wei Xu asked about the pfn based tracking and how this would handle
> multiple sources of memory hotness with additional information without a
> lot of overhead.  Bharata noted that for locality based migration there
> was no need for additional information.  For hot page tracking, such as
> with page table scanning or with CHMU, we'd need more information later.
> 
> Wei suggested generalizing this now so that it is easily extensible later
> while still acknowledging that the metadata associated with the previous
> kpromoted patch series was very large.  He suggested a virtual map,
> indexed by pfn, that could be sparsely populated.  Bharata acked that
> the information stored here should be concise and precise.  This would
> be a future extension, however.

I have posted the next version which maintains per-PFN/page information 
as part of extended page flags:

https://lore.kernel.org/linux-mm/20250616133931.206626-1-bharata@amd.com/

Regards,
Bharata.


