public inbox for linux-kernel@vger.kernel.org
* [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
@ 2026-05-04  6:09 Bharata B Rao
From: Bharata B Rao @ 2026-05-04  6:09 UTC
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Hi,

This is v7 of pghot, a hot-page tracking and promotion subsystem. The
main change in this version is support for the IBS Memory Profiler as a
page hotness source (PGHOT_HWHINTS). The IBS Memory Profiler is a
facility that will be present in future AMD processors. It provides memory
access information and is independent of the existing IBS instance that
is primarily used by the perf subsystem.

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:

- Unify hot page detection from multiple sources such as hint faults,
  page table scans, and hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic via a per-lower-tier-node kmigrated kernel
  thread.
- Move promotion rate-limiting and related logic used by
  numa_balancing=2 (NUMAB2, the current NUMA balancing-based promotion)
  from the scheduler to pghot for broader reuse.
  
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:

- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks the access frequency and last access time of each page.
- In precision mode, the accessing NUMA node ID (NID) of each recorded
  access is tracked as well.
- These hotness parameters are maintained in a per-PFN hotness record
  within the existing mem_section data structure.
  - In default mode, one byte (u8) is used for each hotness record. 5 bits
    store the time, and a bucketing scheme is used to represent a total
    access time of up to 4s with HZ=1000. The default toptier NID (0) is
    used as the promotion target; it can be changed via a debugfs tunable.
  - In precision mode, 4 bytes (u32) are used for each hotness record.
    14 bits store the time, which can represent around 16s with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
  ready bit. Both modes use the MSB of the hotness record as the ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
  lower-tier nodes, checking for the migration-ready bit to perform
  batched migrations. The interval between successive scans and the
  batching value are configurable via debugfs tunables.
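
The scan step in the last bullet can be pictured with a small userspace
sketch. This is not the code from the patchset: scan_ready_batch() and
READY_BIT are hypothetical names, the records are modelled as a plain
array of default-mode u8 values, and clearing the ready bit once a PFN
is picked up is an assumption made only to keep the example simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define READY_BIT 0x80u	/* MSB of a default-mode (u8) hotness record */

/*
 * Hypothetical helper: walk records[start..nr) and collect up to 'batch'
 * indexes whose ready bit is set. The bit is cleared as each index is
 * collected (an assumption of this sketch). Returns the index at which
 * the next scan should resume; *found holds the number collected.
 */
static size_t scan_ready_batch(uint8_t *records, size_t nr, size_t start,
			       size_t *out, size_t batch, size_t *found)
{
	size_t i = start;

	assert(records && out && found);
	*found = 0;
	while (i < nr && *found < batch) {
		if (records[i] & READY_BIT) {
			records[i] &= (uint8_t)~READY_BIT;
			out[(*found)++] = i;
		}
		i++;
	}
	return i;
}
```

A caller would invoke this repeatedly with the returned index until it
reaches nr, handing each collected batch to the migration step.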

Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 256MB of overhead (assuming 4K pages).

Precision mode: 4 bytes per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 1GB of overhead (assuming 4K pages).
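
The figures above follow from (memory size / page size) * record size.
A minimal sketch of that arithmetic, with pghot_overhead_bytes() being
a hypothetical helper name used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of hotness records needed to cover
 * 'mem_bytes' of lower-tier memory, with one record per page.
 */
static uint64_t pghot_overhead_bytes(uint64_t mem_bytes, uint64_t page_size,
				     uint64_t record_size)
{
	assert(page_size != 0);
	return (mem_bytes / page_size) * record_size;
}
```

With 1TB (1ULL << 40) and 4K pages there are 2^28 PFNs, so a 1-byte
record gives 2^28 bytes (256MB) and a 4-byte record gives 2^30 bytes
(1GB).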

Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2 bits, 4 access samples)
- Bits 2-6: Bucketed time (5 bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit
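
As an illustration of the default-mode layout, here is a userspace
sketch of packing and unpacking the u8 record. All names are
hypothetical. The uniform 128ms bucket width (4096ms / 32 buckets) and
the saturation of out-of-range values are assumptions of this sketch;
the actual bucketing scheme in the patchset may differ.

```c
#include <assert.h>
#include <stdint.h>

#define DEF_FREQ_SHIFT	0
#define DEF_FREQ_MASK	0x03u	/* bits 0-1 */
#define DEF_TIME_SHIFT	2
#define DEF_TIME_MASK	0x1fu	/* bits 2-6 */
#define DEF_READY_SHIFT	7	/* bit 7 */
#define DEF_BUCKET_MS	128u	/* assumed: 4096ms / 32 buckets */

/* Pack frequency, time (in ms) and ready flag into one byte. */
static uint8_t def_pack(unsigned int freq, unsigned int time_ms, int ready)
{
	unsigned int bucket = time_ms / DEF_BUCKET_MS;

	if (freq > DEF_FREQ_MASK)	/* saturate (assumption) */
		freq = DEF_FREQ_MASK;
	if (bucket > DEF_TIME_MASK)	/* saturate (assumption) */
		bucket = DEF_TIME_MASK;

	return (uint8_t)((freq << DEF_FREQ_SHIFT) |
			 (bucket << DEF_TIME_SHIFT) |
			 ((ready ? 1u : 0u) << DEF_READY_SHIFT));
}

static unsigned int def_freq(uint8_t rec)
{
	return (rec >> DEF_FREQ_SHIFT) & DEF_FREQ_MASK;
}

/* Returns the start of the bucket, so precision is one bucket width. */
static unsigned int def_time_ms(uint8_t rec)
{
	return ((rec >> DEF_TIME_SHIFT) & DEF_TIME_MASK) * DEF_BUCKET_MS;
}

static int def_ready(uint8_t rec)
{
	return (rec >> DEF_READY_SHIFT) & 1u;
}
```

Under these assumptions a time of 1000ms lands in bucket 7 and reads
back as 896ms, which shows the precision given up for the 1-byte
record.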

Precision mode
- Bits 0-9: Target NID (10 bits)
- Bits 10-12: Frequency (3 bits, 8 access samples)
- Bits 13-26: Time (14 bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit
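
The precision-mode layout can be sketched the same way in userspace.
The macro and function names below are hypothetical, and storing the
low 14 bits of a millisecond timestamp (so the value wraps at 16384ms)
is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

#define PRC_MASK(bits)	((1u << (bits)) - 1u)

#define PRC_NID_SHIFT	0
#define PRC_NID_BITS	10	/* bits 0-9 */
#define PRC_FREQ_SHIFT	10
#define PRC_FREQ_BITS	3	/* bits 10-12 */
#define PRC_TIME_SHIFT	13
#define PRC_TIME_BITS	14	/* bits 13-26 */
#define PRC_READY_SHIFT	31	/* bit 31; bits 27-30 stay reserved (0) */

/* Pack the fields into a u32; out-of-range inputs are masked. */
static uint32_t prc_pack(unsigned int nid, unsigned int freq,
			 unsigned int time_ms, int ready)
{
	return ((uint32_t)(nid & PRC_MASK(PRC_NID_BITS)) << PRC_NID_SHIFT) |
	       ((uint32_t)(freq & PRC_MASK(PRC_FREQ_BITS)) << PRC_FREQ_SHIFT) |
	       ((uint32_t)(time_ms & PRC_MASK(PRC_TIME_BITS)) << PRC_TIME_SHIFT) |
	       ((uint32_t)(ready ? 1u : 0u) << PRC_READY_SHIFT);
}

static unsigned int prc_nid(uint32_t rec)
{
	return (rec >> PRC_NID_SHIFT) & PRC_MASK(PRC_NID_BITS);
}

static unsigned int prc_freq(uint32_t rec)
{
	return (rec >> PRC_FREQ_SHIFT) & PRC_MASK(PRC_FREQ_BITS);
}

static unsigned int prc_time_ms(uint32_t rec)
{
	return (rec >> PRC_TIME_SHIFT) & PRC_MASK(PRC_TIME_BITS);
}

static int prc_ready(uint32_t rec)
{
	return (rec >> PRC_READY_SHIFT) & 1u;
}
```

The field widths add up to 10 + 3 + 14 + 4 (reserved) + 1 = 32 bits,
matching the u32 record size.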

Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode) - included in this patchset.
2. AMD IBS Memory Profiler: HW based access profiler - included in this
   patchset.
3. klruscand - PTE-A bit scanning built on MGLRU's walk helpers - was
   showcased in previous versions but is not part of this version.
4. folio_mark_accessed() - Page cache access tracking (unmapped
   page cache pages) - was showcased in previous versions but is not part
   of this patchset.

Changes in v7
-------------
- Added AMD IBS Memory Profiler as page hotness source.
- Addressed review comments from v6 (Thanks to Shashiko AI, Gregory and Donet)
  - Early exit from batched migration routine if input
    list is empty
  - Changed the name of batched migration routine to indicate
    that it handles "promotion" of batched "memcg" folios.
  - Debug code in batched migration routine to check if all
    the folios in the input list belong to the same memcg.
  - Kconfig dependency cleanups.
  - Fix off-by-one regression in nid check in pghot-precise.
  - More checks to validate nid in pghot-precise.
  - Early check to not call kmigrated_run() for lower tier nodes.
  - Handling PTE writable and ignore_writable conditions correctly
    in hint fault handler.
  - Using unsigned int instead of unsigned long for representing
    time in ms.
  - Misc cleanups.

Results
=======
Posted as replies to this mail thread.

This v7 patchset applies on top of upstream commit c1f49dea2b8f and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/pghot-v7

v6: https://lore.kernel.org/linux-mm/20260323095104.238982-1-bharata@amd.com/
v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

Bharata B Rao (6):
  mm: migrate: Allow misplaced migration without VMA
  mm: Hot page tracking and promotion - pghot
  mm: pghot: Precision mode for pghot
  mm: sched: move NUMA balancing tiering promotion to pghot
  x86/ibs: Move IBS caps definitions into its own header
  x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler

Gregory Price (1):
  mm: migrate: Add promote_misplaced_memcg_folios()

 Documentation/admin-guide/mm/index.rst |   1 +
 Documentation/admin-guide/mm/pghot.rst |  80 ++++
 arch/x86/Kconfig                       |  16 +
 arch/x86/include/asm/ibs-caps.h        |  93 ++++
 arch/x86/include/asm/ibs-mprof.h       |  46 ++
 arch/x86/include/asm/msr-index.h       |   8 +
 arch/x86/include/asm/perf_event.h      |  81 +---
 arch/x86/mm/Makefile                   |   1 +
 arch/x86/mm/ibs-mprof.c                | 308 ++++++++++++
 include/linux/cpuhotplug.h             |   1 +
 include/linux/migrate.h                |   9 +-
 include/linux/mm.h                     |  35 +-
 include/linux/mmzone.h                 |  24 +-
 include/linux/pghot.h                  | 113 +++++
 include/linux/vm_event_item.h          |  11 +
 init/Kconfig                           |  13 +
 kernel/sched/core.c                    |   7 +
 kernel/sched/debug.c                   |   1 -
 kernel/sched/fair.c                    | 177 +------
 kernel/sched/sched.h                   |   1 -
 mm/Kconfig                             |  34 ++
 mm/Makefile                            |   6 +
 mm/huge_memory.c                       |  24 +-
 mm/memcontrol.c                        |   6 +-
 mm/memory-tiers.c                      |  15 +-
 mm/memory.c                            |  28 +-
 mm/mempolicy.c                         |   3 -
 mm/migrate.c                           |  98 +++-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 +++
 mm/pghot-precise.c                     |  81 ++++
 mm/pghot-tunables.c                    | 182 +++++++
 mm/pghot.c                             | 633 +++++++++++++++++++++++++
 mm/vmstat.c                            |  13 +-
 34 files changed, 1922 insertions(+), 316 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.rst
 create mode 100644 arch/x86/include/asm/ibs-caps.h
 create mode 100644 arch/x86/include/asm/ibs-mprof.h
 create mode 100644 arch/x86/mm/ibs-mprof.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-precise.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428

The IBS Memory Profiler driver part of this patchset depends on the
patchset that increases the number of APIC EILVT registers:
https://lore.kernel.org/lkml/cover.1775019269.git.naveen@kernel.org/
-- 
2.34.1



Thread overview: 12+ messages
2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
2026-05-04 18:14   ` Donet Tom
2026-05-04  6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
2026-05-04 18:41   ` Donet Tom
2026-05-04  6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
2026-05-04  6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
2026-05-04  6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
2026-05-04  6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04 20:36 ` Matthew Wilcox
