Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
	<gourry@gourry.net>, <mgorman@techsingularity.net>,
	<mingo@redhat.com>, <peterz@infradead.org>,
	<raghavendra.kt@amd.com>, <riel@surriel.com>,
	<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
	<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
	<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
	<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
	<akpm@linux-foundation.org>, <david@redhat.com>,
	<byungchul@sk.com>, <kinseyho@google.com>,
	<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
	<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
	<shivankg@amd.com>
Subject: Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure
Date: Mon, 23 Mar 2026 15:26:36 +0530	[thread overview]
Message-ID: <868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com> (raw)
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>

Microbenchmark results

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the patched case.

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory at 4K granularity
repetitively and randomly. The number of accesses per thread and the randomness
pattern for each thread are fixed beforehand. The accesses are divided into
stores and loads in the ratio of 50:50.

Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 2 before the accesses start.

Repetitive accesses results in lowertier pages becoming hot and kmigrated
detecting and migrating them. The benchmark score is the time taken to finish
the accesses in microseconds. The sooner it finishes the better it is. All the
numbers shown below are average of 3 runs.

Default mode - Time taken (microseconds, lower is better)
---------------------------------------------------------
Source          Base            Pghot
---------------------------------------------------------
NUMAB0          119,658,562     118,037,791
NUMAB2          104,205,571     102,705,330
---------------------------------------------------------

Default mode - Pages migrated (pgpromote_success)
---------------------------------------------------------
Source          Base            Pghot
---------------------------------------------------------
NUMAB0          0               0
NUMAB2          2097152         2097152
---------------------------------------------------------

Precision mode - Time taken (microseconds, lower is better)
-----------------------------------------------------------
Source          Base            Pghot
-----------------------------------------------------------
NUMAB0          119,658,562     115,173,151
NUMAB2          104,205,571     102,194,435
-----------------------------------------------------------

Precision mode - Pages migrated (pgpromote_success)
---------------------------------------------------
Source          Base            Pghot
---------------------------------------------------
NUMAB0          0               0
NUMAB2          2097152         2097152
---------------------------------------------------

Rate of migration (pgpromote_success)
-----------------------------------------
Time(s)         Base            Pghot
-----------------------------------------
0               0               0
28              0               0
32              262144          262144
36              524288          469012
40              786432          720896
44              1048576         983040
48              1310720         1245184
52              1572864         1507328
56              1835008         1769472
60              2097152         2031616
64              2097152         2097152
-----------------------------------------

==============================================================
Scenario 2 - Toptier memory overcommited, promotion + demotion
==============================================================
Single threaded application that allocates memory on both DRAM and CXL nodes
using mmap(MAP_POPULATE). Every 1G region of allocated memory on CXL node is
accessed at 4K granularity randomly and repetitively to build up the notion
of hotness in the 1GB region that is under access. This should drive promotion.
For promotion to work successfully, the DRAM memory that has been provisioned
(and not being accessed) should be demoted first. There is enough free memory
in the CXL node to for demotions.

In summary, this benchmark creates a memory pressure on DRAM node and does
CXL memory accesses to drive both demotion and promotion.

The number of accesses are fixed and hence, the quicker the accessed pages
get promoted to DRAM, the sooner the benchmark is expected to finish.
All the numbers shown below are average of 3 runs.

DRAM-node                       = 1
CXL-node                        = 2
Initial DRAM alloc ratio        = 75%
Allocation-size                 = 171798691840
Initial DRAM Alloc-size         = 128849018880
Initial CXL Alloc-size          = 42949672960
Hot-region-size                 = 1073741824
Nr-regions                      = 160
Nr-regions DRAM                 = 120 (provisioned but not accessed)
Nr-hot-regions CXL              = 40
Access pattern                  = random
Access granularity              = 4096
Delay b/n accesses              = 0
Load/store ratio                = 50l50s
THP used                        = no
Nr accesses                     = 42949672960
Nr repetitions                  = 1024

Default mode - Time taken (microseconds, lower is better)
------------------------------------------------------
Source          Base            Pghot
------------------------------------------------------
NUMAB0          61,028,534      59,432,137
NUMAB2          63,070,998      61,375,763
------------------------------------------------------

Default mode - Pages migrated (pgpromote_success)
-------------------------------------------------
Source          Base            Pghot
-------------------------------------------------
NUMAB0          0               0
NUMAB2          26546           1070842 (High R2R variation in Base)
---------------------------------------

Precision mode - Time taken (microseconds, lower is better)
------------------------------------------------------
Source          Base            Pghot
------------------------------------------------------
NUMAB0          61,028,534      60,354,547
NUMAB2          63,070,998      60,199,147
------------------------------------------------------

Precision mode - Pages migrated (pgpromote_success)
---------------------------------------------------
Source          Base            Pghot
---------------------------------------------------
NUMAB0          0               0
NUMAB2          26546           1088621 (High R2R variation in Base)
---------------------------------------------------

- The base case itself doesn't show any improvement in benchmark numbers due
  to hot page promotion. The same pattern is seen in pghot case with all
  the sources except hwhints. The benchmark itself may need tuning so that
  promotion helps.
- There is a high run to run variation in the number of pages promoted in
  base case.
- Most promotion attempts in base case fail because the NUMA hint fault
  latency is found to exceed the threshold value (default threshold
  is 1000ms) in majority of the promotion attempts.
- Unlike base NUMAB2 where the hint fault latency is the difference between the
  PTE update time (during scanning) and the access time (hint fault), pghot uses
  a single latency threshold (3000ms in pghot-default and 5000ms in
  pghot-precise) for two purposes.
        1. If the time difference between successive accesses are within the
           threshold, the page is marked as hot.
        2. Later when kmigrated picks up the page for migration, it will migrate
           only if the difference between the current time and the time when the
          page was marked hot is with the threshold.

next prev parent reply	other threads:[~2026-03-23  9:56 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-23  9:50 [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 1/5] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 2/5] mm: migrate: Add migrate_misplaced_folios_batch() Bharata B Rao
2026-03-26  5:50   ` Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot Bharata B Rao
2026-03-26 10:41   ` Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
2026-03-23  9:56 ` Bharata B Rao [this message]
2026-03-23  9:58 ` [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23  9:59 ` Bharata B Rao
2026-03-23 10:01 ` Bharata B Rao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com \
    --to=bharata@amd.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=alok.rathore@samsung.com \
    --cc=balbirs@nvidia.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=gourry@gourry.net \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kinseyho@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=nifan.cxl@gmail.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=shivankg@amd.com \
    --cc=sj@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=xuezhengchu@huawei.com \
    --cc=yiannis@zptcorp.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox