From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@kernel.org>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>, <donettom@linux.ibm.com>, <bharata@amd.com>
Subject: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Date: Mon, 4 May 2026 11:39:17 +0530
Message-ID: <20260504060924.344313-1-bharata@amd.com>
Hi,
This is v7 of pghot, a hot-page tracking and promotion subsystem. The
main change in this version is support for the IBS Memory Profiler
as a page hotness source (PGHOT_HWHINTS). The IBS Memory Profiler is a
facility that will be present in future AMD processors. It provides memory
access information and is independent of the existing IBS instance that
is primarily used by the perf subsystem.
This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:
- Unify hot page detection from multiple sources such as hint faults,
page table scans and hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic in a per-lower-tier-node kmigrated kernel
thread.
- Move promotion rate-limiting and related logic used by
numa_balancing=2 (NUMAB2, the current NUMA balancing-based promotion)
from the scheduler to pghot for broader reuse.
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:
- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.
Here is a brief summary of how this subsystem works:
- Tracks access frequency and last access time.
- In precision mode, the accessing NUMA node ID (NID) is also tracked
for each recorded access.
- These hotness parameters are maintained in a per-PFN hotness record
within the existing mem_section data structure.
- In default mode, one byte (u8) is used for each hotness record. 5 bits
store the time, and a bucketing scheme lets them represent a total
access time of up to 4s with HZ=1000. The default toptier NID (0) is used
as the promotion target, which can be changed via a debugfs tunable.
- In precision mode, 4 bytes (u32) are used for each hotness record.
14 bits store the time, which can represent around 16s
with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
ready bit. Both modes use the MSB of the hotness record as the ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
lower-tier nodes, checking for the migration-ready bit to perform
batched migrations. The interval between successive scans and the batch
size are configurable via debugfs tunables.
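To make the flow above concrete, here is a minimal userspace sketch of the record, classify and scan steps for a default-mode record. It is not the code from the patches: the names (record_access(), scan_ready(), freq_threshold) and the threshold value are assumptions made for illustration, and the last-access-time handling is left out to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

#define READY_BIT 0x80u /* MSB of the u8 record: migration ready */
#define FREQ_MASK 0x03u /* bits 0-1: access frequency */
#define FREQ_MAX  3u    /* largest value the 2-bit field can hold */

/* Hypothetical stand-in for a configurable hotness threshold. */
static unsigned int freq_threshold = 2;

/* Record one access; set the ready bit once the threshold is reached. */
static void record_access(uint8_t *rec)
{
	unsigned int freq = *rec & FREQ_MASK;

	if (freq < FREQ_MAX)
		freq++; /* saturate instead of wrapping around */
	*rec = (uint8_t)((*rec & ~FREQ_MASK) | freq);
	if (freq >= freq_threshold)
		*rec |= READY_BIT;
}

/*
 * One scan pass over nr records: collect up to batch PFNs whose ready
 * bit is set and reset their records. Returns the number collected.
 */
static size_t scan_ready(uint8_t *recs, size_t nr, size_t *out, size_t batch)
{
	size_t n = 0;

	for (size_t pfn = 0; pfn < nr && n < batch; pfn++) {
		if (recs[pfn] & READY_BIT) {
			out[n++] = pfn;
			recs[pfn] = 0;
		}
	}
	return n;
}
```

In this sketch the reporting side only touches the per-PFN record and the scanner only looks at the ready bit, which is the decoupling of detection from migration that the series is aiming for.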
Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For 1TB of lower-tier memory,
this amounts to 256MB of overhead (assuming 4K pages).
Precision mode: 4 bytes per lower-tier PFN. For 1TB of lower-tier memory,
this amounts to 1GB of overhead (assuming 4K pages).
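The overhead figures follow directly from the number of PFNs in the lower tier. A small sketch of the arithmetic, with pghot_overhead_bytes() being a hypothetical helper name used only here:

```c
#include <stdint.h>

/*
 * Bytes of hotness-record storage needed for a lower tier of
 * tier_bytes, given the page size and the per-PFN record size.
 */
static uint64_t pghot_overhead_bytes(uint64_t tier_bytes, uint64_t page_size,
				     uint64_t record_bytes)
{
	uint64_t nr_pfns = tier_bytes / page_size;

	return nr_pfns * record_bytes;
}
```

With 1TB (2^40 bytes) and 4K pages there are 2^28 PFNs, so a 1-byte record gives 2^28 bytes (256MB) and a 4-byte record gives 2^30 bytes (1GB).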
Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2 bits, 4 access samples)
- Bits 2-6: Bucketed time (5 bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit
Precision mode
- Bits 0-9: Target NID (10 bits)
- Bits 10-12: Frequency (3 bits, 8 access samples)
- Bits 13-26: Time (14 bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit
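As an illustration of the two layouts, the following sketch packs and unpacks the fields with plain shifts and masks. The macro and function names are hypothetical and are not taken from the patches; only the bit positions and widths come from the layout above.

```c
#include <stdint.h>

/* Default mode: one u8 per PFN. */
#define DEF_FREQ_SHIFT   0
#define DEF_FREQ_BITS    2
#define DEF_TIME_SHIFT   2
#define DEF_TIME_BITS    5
#define DEF_READY_SHIFT  7

/* Precision mode: one u32 per PFN (bits 27-30 reserved). */
#define PREC_NID_SHIFT   0
#define PREC_NID_BITS    10
#define PREC_FREQ_SHIFT  10
#define PREC_FREQ_BITS   3
#define PREC_TIME_SHIFT  13
#define PREC_TIME_BITS   14 /* 2^14 jiffies = 16384 ms with HZ=1000 */
#define PREC_READY_SHIFT 31

#define FIELD_MASK(bits) ((1u << (bits)) - 1u)

/* Extract a field of the given width starting at shift. */
static inline uint32_t field_get(uint32_t rec, unsigned int shift,
				 unsigned int bits)
{
	return (rec >> shift) & FIELD_MASK(bits);
}

/* Build a default-mode record; out-of-range values are truncated. */
static inline uint8_t def_pack(uint32_t freq, uint32_t time_bucket, int ready)
{
	uint32_t rec = 0;

	rec |= (freq & FIELD_MASK(DEF_FREQ_BITS)) << DEF_FREQ_SHIFT;
	rec |= (time_bucket & FIELD_MASK(DEF_TIME_BITS)) << DEF_TIME_SHIFT;
	rec |= (ready ? 1u : 0u) << DEF_READY_SHIFT;
	return (uint8_t)rec;
}

/* Build a precision-mode record; reserved bits 27-30 stay zero. */
static inline uint32_t prec_pack(uint32_t nid, uint32_t freq, uint32_t time,
				 int ready)
{
	uint32_t rec = 0;

	rec |= (nid & FIELD_MASK(PREC_NID_BITS)) << PREC_NID_SHIFT;
	rec |= (freq & FIELD_MASK(PREC_FREQ_BITS)) << PREC_FREQ_SHIFT;
	rec |= (time & FIELD_MASK(PREC_TIME_BITS)) << PREC_TIME_SHIFT;
	rec |= (ready ? 1u : 0u) << PREC_READY_SHIFT;
	return rec;
}
```

For example, a precision-mode record with every field at its maximum and the ready bit set comes out as 0x87ffffff, with the four reserved bits left clear.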
Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode) - included in this patchset.
2. AMD IBS Memory Profiler: HW based access profiler - included in this
patchset.
3. klruscand - PTE A-bit scanning built on MGLRU's walk helpers - was
showcased in previous versions but is not part of this patchset.
4. folio_mark_accessed() - Page cache access tracking (unmapped
page cache pages) - was showcased in previous versions but is not part
of this patchset.
Changes in v7
-------------
- Added AMD IBS Memory Profiler as page hotness source.
- Addressed review comments from v6 (Thanks to Shashiko AI, Gregory and Donet)
- Early exit from batched migration routine if input
list is empty
- Changed the name of batched migration routine to indicate
that it handles "promotion" of batched "memcg" folios.
- Debug code in batched migration routine to check if all
the folios in the input list belong to the same memcg.
- Kconfig dependency cleanups.
- Fix off-by-one regression in nid check in pghot-precise.
- More checks to validate nid in pghot-precise.
- Early check to not call kmigrated_run() for lower tier nodes.
- Handling PTE writable and ignore_writable conditions correctly
in hint fault handler.
- Using unsigned int instead of unsigned long for representing
time in ms.
- Misc cleanups.
Results
=======
Posted as replies to this mail thread.
This v7 patchset applies on top of upstream commit c1f49dea2b8f and
can be fetched from:
https://github.com/AMDESE/linux-mm/tree/bharata/pghot-v7
v6: https://lore.kernel.org/linux-mm/20260323095104.238982-1-bharata@amd.com/
v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
Bharata B Rao (6):
mm: migrate: Allow misplaced migration without VMA
mm: Hot page tracking and promotion - pghot
mm: pghot: Precision mode for pghot
mm: sched: move NUMA balancing tiering promotion to pghot
x86/ibs: Move IBS caps definitions into its own header
x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler
Gregory Price (1):
mm: migrate: Add promote_misplaced_memcg_folios()
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/pghot.rst | 80 ++++
arch/x86/Kconfig | 16 +
arch/x86/include/asm/ibs-caps.h | 93 ++++
arch/x86/include/asm/ibs-mprof.h | 46 ++
arch/x86/include/asm/msr-index.h | 8 +
arch/x86/include/asm/perf_event.h | 81 +---
arch/x86/mm/Makefile | 1 +
arch/x86/mm/ibs-mprof.c | 308 ++++++++++++
include/linux/cpuhotplug.h | 1 +
include/linux/migrate.h | 9 +-
include/linux/mm.h | 35 +-
include/linux/mmzone.h | 24 +-
include/linux/pghot.h | 113 +++++
include/linux/vm_event_item.h | 11 +
init/Kconfig | 13 +
kernel/sched/core.c | 7 +
kernel/sched/debug.c | 1 -
kernel/sched/fair.c | 177 +------
kernel/sched/sched.h | 1 -
mm/Kconfig | 34 ++
mm/Makefile | 6 +
mm/huge_memory.c | 24 +-
mm/memcontrol.c | 6 +-
mm/memory-tiers.c | 15 +-
mm/memory.c | 28 +-
mm/mempolicy.c | 3 -
mm/migrate.c | 98 +++-
mm/mm_init.c | 10 +
mm/pghot-default.c | 79 +++
mm/pghot-precise.c | 81 ++++
mm/pghot-tunables.c | 182 +++++++
mm/pghot.c | 633 +++++++++++++++++++++++++
mm/vmstat.c | 13 +-
34 files changed, 1922 insertions(+), 316 deletions(-)
create mode 100644 Documentation/admin-guide/mm/pghot.rst
create mode 100644 arch/x86/include/asm/ibs-caps.h
create mode 100644 arch/x86/include/asm/ibs-mprof.h
create mode 100644 arch/x86/mm/ibs-mprof.c
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot-default.c
create mode 100644 mm/pghot-precise.c
create mode 100644 mm/pghot-tunables.c
create mode 100644 mm/pghot.c
base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428
The IBS Memory Profiler driver in this patchset depends on the
patchset that increases the number of APIC EILVT registers:
https://lore.kernel.org/lkml/cover.1775019269.git.naveen@kernel.org/
--
2.34.1