Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025


From: SeongJae Park <sj@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: SeongJae Park <sj@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	Gregory Price <gourry@gourry.net>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org, damon@lists.linux.dev,
	Honggyu Kim <honggyu.kim@sk.com>,
	Yunjeong Mun <yunjeong.mun@sk.com>
Subject: Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
Date: Thu, 13 Nov 2025 17:42:54 -0800
Message-ID: <20251114014255.72884-1-sj@kernel.org>
In-Reply-To: <d952a84f-332e-8f7a-4816-2c1cbd8f5b00@google.com>

Cc-ing the HMSDK developers and the DAMON mailing list.

On Sun, 2 Nov 2025 16:41:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> Hi everybody,
> 
> Here are the notes from the last Linux Memory Hotness and Promotion call
> that happened on Thursday, October 9.  Thanks to everybody who was 
> involved!
> 
> These notes are intended to bring people up to speed who could not attend 
> the call as well as keep the conversation going in between meetings.

I was unable to join the call due to a conflict.  This note is very helpful.
Thank you for taking and sharing this note, David!

> 
> ----->o-----
> Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with 
> Bijan Tabatabai, discussing the current approach of promoting all hot 
> pages into DRAM tier and demoting all cold pages.  If the bandwidth 
> utilization is high, it will saturate the top tier even though there is 
> bandwidth available on the lower tier.  The preference was to demote cold 
> pages when under-utilizing memory in the top tier and then interleave hot 
> pages to maximize bandwidth utilization.  For Ravi's experimentation, this 
> has been 3/4 of maximum write bandwidth for the top tier.  If this 
> threshold is not reached, memory is demoted.

I was grateful to have a chance to discuss the above in more detail with Ravi.
Sharing my detailed thoughts here, too.

I agree with the concern.  I have also heard similar concerns about general
latency-aware memory tiering approaches from multiple people in the past.

The memory capacity extension solution of HMSDK [1], which is developed by SK
Hynix, is one good example.  To my understanding (please correct me if I'm
wrong), HMSDK provides separate solutions for bandwidth expansion and capacity
expansion.  Users should first understand whether their workload is
bandwidth-hungry or capacity-hungry, and then select the proper solution.  I
suspect the concern that Ravi raised was one of the reasons.

I also recently developed a DAMON-based memory tiering approach [2] that
implements the main idea of TPP [3]: promoting hot pages and demoting cold
pages, aiming for a target level of the faster node's space utilization.  I
didn't see the bandwidth issue in my simple tests of it, but I think the very
same problem can apply to both the DAMON-based approach and the original TPP
implementation.

> 
> Ravi suggested adaptive interleaving of memory to optimize both bandwidth 
> and capacity utilization.  He suggested an approach of a migrator in 
> kernel space and a calibrator in userspace.  The calibrator would monitor 
> system bandwidth utilization and, using different weights, determine the 
> optimal weights for interleaving the hot pages for the highest bandwidth.  
> If bandwidth saturation is not hit, only cold pages get demoted.  The 
> migrator reads the target interleave ratio and rearrange the hot pages 
> from the calibrator and demotes cold pages to the target node.  Currently 
> this uses DAMOS policies, Migrate_hot and Migrate_cold.

This implementation makes sense to me, especially if the intended use case is
specific virtual address spaces.  Nevertheless, if a physical address space
based version is also an option, I think there could be yet another way to
achieve the goal (optimizing both bandwidth and capacity).

My idea is to tweak the TPP idea a little bit: migrate pages among NUMA nodes
aiming for target levels of both space and bandwidth utilization of the faster
(e.g., DRAM) node.  In more detail, do the hot page promotion and cold page
demotion for the target level of faster node space utilization, same as the
original TPP idea.  But stop the hot page promotion if the memory bandwidth
consumption of the faster node exceeds a limit.  In that case, instead, start
demoting _hot_ pages until the memory bandwidth consumption of the faster node
decreases below the limit.
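
The policy described above can be sketched as a small decision function.  This
is only an illustration under stated assumptions: the function name, the action
names, and the use of a single space target are hypothetical, and nothing here
is an existing DAMON or damo interface.  All utilization values are in basis
points (10000 means 100%).

```shell
# Hypothetical sketch of the policy; not part of DAMON or damo.
# active_schemes <space_used_bp> <space_target_bp> <bw_used_bp> <bw_limit_bp>
# Prints the migration actions that would be active for the faster node.
active_schemes() {
    local space_used="$1" space_target="$2" bw_used="$3" bw_limit="$4"
    local schemes=""

    # Space utilization is over the target: demote cold pages to make room.
    if [ "$space_used" -gt "$space_target" ]; then
        schemes="$schemes demote_cold"
    fi

    if [ "$bw_used" -gt "$bw_limit" ]; then
        # Bandwidth is over the limit: promotion stops and hot pages are
        # demoted until the faster node's bandwidth use falls below it.
        schemes="$schemes demote_hot"
    elif [ "$space_used" -lt "$space_target" ]; then
        # Both space and bandwidth are available: promote hot pages.
        schemes="$schemes promote_hot"
    fi

    # Strip the leading separator before printing.
    echo "${schemes# }"
}
```

For example, `active_schemes 5000 9900 9800 9500` prints `demote_hot`: the
faster node still has free space, but its bandwidth use (98%) is above the
assumed 95% limit, so promotion is replaced by hot page demotion.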

I think this idea could easily be prototyped by extending the DAMON-based TPP
implementation [2].  Let me briefly explain the prototyping idea, assuming the
readers are familiar with the DAMON-based TPP implementation.  If you are not
familiar with it, please feel free to ask me questions, or refer to the cover
letter of the patch series [2].

First, add another DAMOS quota goal to the hot page promotion scheme.  The
goal will aim to achieve a high level of memory bandwidth consumption on the
faster node.  The target level will be reasonably high, but not so high that no
headroom remains.  So the hot page promotion scheme will be active at the
beginning, promoting hot pages and increasing the faster node's space and
bandwidth utilization.  But if the memory bandwidth consumption of the faster
node surpasses the target level as a result of the hot page promotion or a
change in the workload's access pattern, the hot page promotion scheme will
become less aggressive and eventually stop.

Second, add another DAMOS scheme to the DAMON context that monitors accesses to
the faster node.  The new scheme does hot page demotion with a quota goal that
aims to keep the unused (free, or available) memory bandwidth of the faster
node at a headroom level.  This scheme will do nothing at the beginning, since
the faster node may have more available (unused) memory bandwidth than the
headroom level.  This scheme will start the hot page demotion once the faster
node's available memory bandwidth becomes less than the desired headroom level,
due to increased load or the hot page promotion.  And once the unused memory
bandwidth of the faster node becomes higher than the headroom level as a result
of the hot page demotion or an access pattern change, the hot page demotion will
be deactivated again.

For example, a change like below can be made to the simple DAMON-based TPP
implementation [4].

diff --git a/scripts/mem_tier.sh b/scripts/mem_tier.sh
index 9e685751..83757fa9 100644
--- a/scripts/mem_tier.sh
+++ b/scripts/mem_tier.sh
@@ -30,16 +30,25 @@ fi
 "$damo_bin" module stat write enabled N
 "$damo_bin" start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
+               `# demote cold pages for faster node headroom space` \
                --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
                --damos_apply_interval 1s \
                --damos_quota_interval 1s --damos_quota_space 200MB \
                --damos_quota_goal node_mem_free_bp 0.5% 0 \
                --damos_filter reject young \
+               `# demote hot pages for faster node headroom bandwidth` \
+               --damos_action migrate_hot 1 --damos_access_rate 5% max \
+                       --damos_apply_interval 1s \
+                       --damos_quota_interval 1s --damos_quota_space 200MB \
+                       --damos_quota_goal node_membw_free_bp 5% 0 \
+                       --damos_filter allow young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
+               `# promote hot pages for faster node space/bandwidth high utilization` \
                --damos_action migrate_hot 0 --damos_access_rate 5% max \
                --damos_apply_interval 1s \
                --damos_quota_interval 1s --damos_quota_space 200MB \
                --damos_quota_goal node_mem_used_bp 99.7% 0 \
+               --damos_quota_goal node_membw_used_bp 95% 0 \
                --damos_filter allow young \
-               --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
-       --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
+               --damos_nr_quota_goals 1 1 2 --damos_nr_filters 1 1 1 \
+       --nr_targets 1 1 --nr_schemes 2 1 --nr_ctxs 1 1

"node_membw_free_bp" and "node_membw_used_bp" are _imaginary_ DAMOS quota goal
metrics representing the available (unused) or consumed level of memory
bandwidth of a given NUMA node.  They are not supported by DAMON as of today.
If this idea makes sense, we may develop support for the metrics.

But even before the metrics are implemented, we could prototype this for an
early proof of concept by setting the DAMOS quota goals with the user_input goal
metric [5] and running a user-space program that measures the memory bandwidth
of the faster node and feeds it to DAMON through the DAMON sysfs interface.
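
A minimal sketch of such a feeder follows.  It assumes the DAMON sysfs quota
goal files (target_metric, target_value, current_value) and the
commit_schemes_quota_goals state keyword as described in the DAMON usage
document.  The kdamond/context/scheme/goal indexes in the comments, the helper
names, and measure_membw_mbps are hypothetical placeholders; DAMON does not
provide a bandwidth measurement command.

```shell
# Sketch only; helper names are illustrative assumptions.

# to_bp <value> <maximum>: print value/maximum in basis points (10000 = 100%).
to_bp() {
    echo $(( $1 * 10000 / $2 ))
}

# feed_goal <goal_dir> <target_bp> <current_bp>: write a user_input goal.
# <goal_dir> is assumed to be a DAMON sysfs quota goal directory such as
# /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals/0
feed_goal() {
    local goal_dir="$1"
    echo user_input > "$goal_dir/target_metric"
    echo "$2" > "$goal_dir/target_value"
    echo "$3" > "$goal_dir/current_value"
}

# Example loop (commented out; needs root and a running kdamond).
# measure_membw_mbps and max_mbps stand for a measurement source and the
# faster node's peak bandwidth, both of which the user has to supply.
#
# while sleep 1; do
#     feed_goal "$goal_dir" 9500 "$(to_bp "$(measure_membw_mbps)" "$max_mbps")"
#     echo commit_schemes_quota_goals > \
#         /sys/kernel/mm/damon/admin/kdamonds/0/state
# done
```

With a target of 9500 and a measured 150 out of 200 (in any consistent unit),
the fed current value is 7500, so the goal is under-achieved and DAMOS would be
expected to raise the scheme's effective quota.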

Implementing both the memory bandwidth/space utilization monitoring and the
quota auto-tuning logic in user space, and directly adjusting the quotas of the
DAMOS schemes instead of using the quota goals, could also be an option.
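
For the user-space tuning option, the quota adjustment could be as simple as a
proportional feedback step.  The sketch below is an assumption made for
illustration and is not the auto-tuning algorithm that DAMOS implements in the
kernel; next_quota and its parameters are hypothetical names.

```shell
# Hypothetical proportional tuner; not the in-kernel DAMOS algorithm.
# next_quota <last_quota_bytes> <current_bp> <target_bp> <min_quota_bytes>
# Grows the quota while the metric is below the target and shrinks it when
# the metric overshoots, never going below the given minimum.
next_quota() {
    local last="$1" current="$2" target="$3" min="$4"
    local next=$(( last + last * (target - current) / target ))

    if [ "$next" -lt "$min" ]; then
        next="$min"
    fi
    echo "$next"
}

# The result would then be written to the scheme's size quota, for example
# .../schemes/0/quotas/bytes in the DAMON sysfs interface, and committed.
```

For example, `next_quota 1000 5000 10000 100` prints 1500 (metric at half of
the target, so the quota grows by half), while `next_quota 1000 15000 10000 100`
prints 500.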

I have no plan to implement the "node_membw_{free,used}_bp" quota goal metrics
or do the user_input based prototyping at the moment.  But, as always, if you
want the features and/or are willing to step up for their development, I will be
happy to help.

[...]
> Ravi suggested hotness information need not be used exclusively for 
> promotion and that there is an advantage seen in rearranging hot pages 
> based on weights.  He also suggested a standard subsystem that can provide 
> bandwidth information would be very useful (including sources such as IBS, 
> PEBS, and PMU sources).

If we decide to implement the above per-node memory bandwidth based DAMOS quota
goal metrics, I think this standard subsystem could also be useful for the
implementation.

FYI, users can _estimate_ memory bandwidth of the system or workloads from
DAMON's monitoring results snapshot.  For example, if DAMON is seeing a 1 GiB
memory region that is consistently being accessed about 10 times per second, we
can estimate it is consuming 10 GiB/s memory bandwidth.
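
The arithmetic behind this estimate is region size multiplied by access
frequency, summed over the regions of a snapshot.  The helpers below are a
hypothetical illustration of that arithmetic only; they do not parse real damo
or DAMON_STAT output, and the "bytes accesses-per-second" input format is an
assumption.

```shell
# Illustrative helpers; the input format is assumed, not damo's.

# estimate_bw <region_bytes> <accesses_per_sec>: bytes per second for a region.
estimate_bw() {
    echo $(( $1 * $2 ))
}

# estimate_total_bw: read "<region_bytes> <accesses_per_sec>" lines from
# stdin and print the summed estimate in bytes per second.
estimate_total_bw() {
    local total=0 size rate
    while read -r size rate; do
        total=$(( total + size * rate ))
    done
    echo "$total"
}

GIB=$(( 1024 * 1024 * 1024 ))
estimate_bw "$GIB" 10    # prints 10737418240, i.e. 10 GiB/s
```

This is a rough estimate: it treats each observed access as touching the whole
region once.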

The DAMON user-space tool provides this estimated bandwidth per monitoring
results snapshot with the 'damo report access' command.  The DAMON_STAT module,
which was recently developed to provide system-wide, high-level data access
patterns in an easy way, also provides this _estimated_ memory bandwidth usage.

[1] https://events.linuxfoundation.org/open-source-summit-korea/program/schedule/
[2] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org/
[3] https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
[4] https://github.com/damonitor/damo/blob/v3.0.4/scripts/mem_tier.sh
[5] https://docs.kernel.org/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning

Thanks,
SJ

[...]
