[PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
@ 2026-05-04  6:09 Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Hi,

This is v7 of pghot, a hot-page tracking and promotion subsystem. The
main change in this version is to add support for IBS Memory Profiler
as page hotness source(PGHOT_HWHINTS). IBS Memory Profiler is a
facility that will be present in future AMD processors. It provides memory
access information and is independent of the existing IBS instance that
is primarily used by the perf subsystem.

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:

- Unify hot page detection from multiple sources like hint faults,
  page table scans, hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic via per-lower-tier-node kmigrated kernel
  thread.
- Move promotion rate‑limiting and related logic used by
  numa_balancing=2 (NUMAB2, the current NUMA balancing–based promotion)
  from the scheduler to pghot for broader reuse.
  
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:

- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks frequency and last access time.
- Additionally, the accessing NUMA node ID (NID) for each recorded
  access is also tracked in the precision mode.
- These hotness parameters are maintained in a per-PFN hotness record
  within the existing mem_section data structure.
  - In default mode, one byte (u8) is used for hotness record. 5 bits are
    used to store time and bucketing scheme is used to represent a total
    access time up to 4s with HZ=1000. Default toptier NID (0) is used as
    the target for promotion which can be changed via debugfs tunable.
  - In precision mode, 4 bytes (u32) are used for each hotness record.
    14 bits are used to store time which can represent around 16s
    with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
  ready bit. Both modes use MSB of the hotness record as ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
  lower-tier nodes, checking for the migration-ready bit to perform
  batched migrations. Interval between successive scans and batching
  value are configurable via debugfs tunables.

Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For a 1TB lower-tier memory
this amounts to 256MB overhead (assuming 4K pages)

Precision mode: 4 bytes per lower-tier PFN. For a 1TB of lower memory
this amounts to 1G overhead.

Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2bits, 4 access samples)
- Bits 2-6: Bucketed time (5bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit

Precision mode
- Bits 0-9: Target NID (10bits)
- Bits 10-12: Frequency (3bits, 8 access samples)
- Bits 13-26: Time (14bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit

Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode) - included in this patchset.
2. AMD IBS Memory Profiler: HW based access profiler - included in this
   patchset.
3. klruscand - PTE‑A bit scanning built on MGLRU’s walk helpers - was
   showcased in previous versions but not part of this version.
4. folio_mark_accessed() - Page cache access tracking (unmapped
   page cache pages) - was showcased in previous versions but not part
   of this patchset.

Changes in v7
-------------
- Added AMD IBS Memory Profiler as page hotness source.
- Addressed review comments from v6 (Thanks to Shashiko AI, Gregory and Donet)
  - Early exit from batched migration routine if input
    list is empty
  - Changed the name of batched migration routine to indicate
    that it handles "promotion" of batched "memcg" folios.
  - Debug code in batched migration routine to check if all
    the folios in the input list belong to the same memcg.
  - Kconfig dependency cleanups.
  - Fix one-off-regression in nid check in pghot-precise.
  - More checks to validate nid in pghot-precise.
  - Early check to not call kmigrated_run() for lower tier nodes.
  - Handling PTE writable and ignore_writable conditions correctly
    in hint fault handler.
  - Using unsigned int instead of unsigned long for representing
    time in ms.
  - Misc cleanups.

Results
=======
Posted as replies to this mail thread.

This v7 patchset applies on top of upstream commit c1f49dea2b8f and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/pghot-v7

v6: https://lore.kernel.org/linux-mm/20260323095104.238982-1-bharata@amd.com/
v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

Bharata B Rao (6):
  mm: migrate: Allow misplaced migration without VMA
  mm: Hot page tracking and promotion - pghot
  mm: pghot: Precision mode for pghot
  mm: sched: move NUMA balancing tiering promotion to pghot
  x86/ibs: Move IBS caps definitions into its own header
  x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler

Gregory Price (1):
  mm: migrate: Add promote_misplaced_memcg_folios()

 Documentation/admin-guide/mm/index.rst |   1 +
 Documentation/admin-guide/mm/pghot.rst |  80 ++++
 arch/x86/Kconfig                       |  16 +
 arch/x86/include/asm/ibs-caps.h        |  93 ++++
 arch/x86/include/asm/ibs-mprof.h       |  46 ++
 arch/x86/include/asm/msr-index.h       |   8 +
 arch/x86/include/asm/perf_event.h      |  81 +---
 arch/x86/mm/Makefile                   |   1 +
 arch/x86/mm/ibs-mprof.c                | 308 ++++++++++++
 include/linux/cpuhotplug.h             |   1 +
 include/linux/migrate.h                |   9 +-
 include/linux/mm.h                     |  35 +-
 include/linux/mmzone.h                 |  24 +-
 include/linux/pghot.h                  | 113 +++++
 include/linux/vm_event_item.h          |  11 +
 init/Kconfig                           |  13 +
 kernel/sched/core.c                    |   7 +
 kernel/sched/debug.c                   |   1 -
 kernel/sched/fair.c                    | 177 +------
 kernel/sched/sched.h                   |   1 -
 mm/Kconfig                             |  34 ++
 mm/Makefile                            |   6 +
 mm/huge_memory.c                       |  24 +-
 mm/memcontrol.c                        |   6 +-
 mm/memory-tiers.c                      |  15 +-
 mm/memory.c                            |  28 +-
 mm/mempolicy.c                         |   3 -
 mm/migrate.c                           |  98 +++-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 +++
 mm/pghot-precise.c                     |  81 ++++
 mm/pghot-tunables.c                    | 182 +++++++
 mm/pghot.c                             | 633 +++++++++++++++++++++++++
 mm/vmstat.c                            |  13 +-
 34 files changed, 1922 insertions(+), 316 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.rst
 create mode 100644 arch/x86/include/asm/ibs-caps.h
 create mode 100644 arch/x86/include/asm/ibs-mprof.h
 create mode 100644 arch/x86/mm/ibs-mprof.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-precise.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428

IBS Memory Profiler driver part of this patchset depends on the
patchset that increases the number of APIC EILVT registers -
https://lore.kernel.org/lkml/cover.1775019269.git.naveen@kernel.org/
-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

We want isolation of misplaced folios to work in contexts
where VMA isn't available, typically when performing migrations
from a kernel thread context. In order to prepare for that,
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.

When migrate_misplaced_folio_prepare() is called with non-NULL
VMA, it will check if the folio is mapped shared and that requires
holding PTL lock. This path isn't taken when the function is
invoked with NULL VMA (migration outside of process context).
Therefore, when VMA == NULL, migrate_misplaced_folio_prepare()
does not require the caller to hold the PTL.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/migrate.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8a64291ab5b4..eb21a02fade0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2671,7 +2671,12 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
+ *
+ * When called with a NULL vma (e.g., kernel thread initiated migration),
+ * migrate_misplaced_folio_prepare() will allow shared executable folios
+ * to be migrated.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2688,7 +2693,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_maybe_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+		if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 			return -EACCES;
 
 		/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios()
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04 18:14   ` Donet Tom
  2026-05-04  6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

From: Gregory Price <gourry@gourry.net>

Tiered memory systems often require migrating multiple folios at once.
Currently, migrate_misplaced_folio() handles only one folio per call,
which is inefficient for batch operations. This patch introduces
promote_misplaced_memcg_folios(), a batch variant that leverages
migrate_pages() internally for improved performance.

The caller must isolate folios beforehand using
migrate_misplaced_folio_prepare(). Additionally all the folios in the
isolated list must belong to the same memcg. On return, the folio list
will be empty regardless of success or failure.

This function will be used by pghot kmigrated thread.

Signed-off-by: Gregory Price <gourry@gourry.net>
[Rewrote commit description, memcg awareness]
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/migrate.h |  5 ++++
 mm/migrate.c            | 57 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d5af2b7f577b..d136612eef9d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -111,6 +111,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -121,6 +122,10 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index eb21a02fade0..747277aadf19 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2770,4 +2770,61 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/**
+ * promote_misplaced_memcg_folios() - Batch variant of migrate_misplaced_folio
+ * Attempts to promote a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-promoted.
+ * @node: The NUMA node ID to where the folios should be promoted.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folios. All the isolated folios
+ * in the list must belong to the same memcg so that NUMA_PAGE_MIGRATE
+ * stat can be attributed correctly to the memcg.
+ *
+ * This function will un-isolate the folios, drop the elevated reference
+ * and remove them from the list before returning. This should be called
+ * only for batched promotion of hot pages from lower tier nodes.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial promotion.
+ *         On return, @folio_list will be empty regardless of success/failure.
+ */
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_succeeded = 0;
+	struct folio *first;
+	int nr_remaining;
+
+	if (list_empty(folio_list))
+		return 0;
+
+	first = list_first_entry(folio_list, struct folio, lru);
+#ifdef CONFIG_DEBUG_VM
+	{
+		struct folio *f;
+		list_for_each_entry(f, folio_list, lru)
+		VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first));
+	}
+#endif
+	memcg = get_mem_cgroup_from_folio(first);
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
+				 PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+
+	mem_cgroup_put(memcg);
+	WARN_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios()
  2026-05-04  6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
@ 2026-05-04 18:14   ` Donet Tom
  0 siblings, 0 replies; 12+ messages in thread
From: Donet Tom @ 2026-05-04 18:14 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg

Hi Bharata

On 5/4/26 11:39 AM, Bharata B Rao wrote:
> +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
> +{
> +	struct mem_cgroup *memcg = NULL;
> +	unsigned int nr_succeeded = 0;
> +	struct folio *first;
> +	int nr_remaining;
> +
> +	if (list_empty(folio_list))
> +		return 0;
> +
> +	first = list_first_entry(folio_list, struct folio, lru);
> +#ifdef CONFIG_DEBUG_VM
> +	{
> +		struct folio *f;
> +		list_for_each_entry(f, folio_list, lru)
> +		VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first));


It looks like the indentation might be off here.


> +	}
> +#endif
> +	memcg = get_mem_cgroup_from_folio(first);
> +
> +	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> +				     NULL, node, MIGRATE_ASYNC,
> +				     MR_NUMA_MISPLACED, &nr_succeeded);
> +	if (nr_remaining)
> +		putback_movable_pages(folio_list);
> +
> +	if (nr_succeeded) {
> +		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> +		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
> +		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
> +				 PGPROMOTE_SUCCESS, nr_succeeded);
> +	}
> +
> +	mem_cgroup_put(memcg);
> +	WARN_ON(!list_empty(folio_list));
> +	return nr_remaining ? -EAGAIN : 0;
> +}
>   #endif /* CONFIG_NUMA_BALANCING */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04  6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

pghot is a subsystem that collects memory access information from
multiple sources, classifies hot pages resident in lower-tier memory,
and promotes them to faster tiers. It stores per-PFN hotness metadata
and performs asynchronous, batched promotion via a per-lower-tier-node
kernel thread (kmigrated).

This change introduces the default (compact) mode of pghot:

- Per-PFN hotness record (phi_t = u8) embedded via mem_section:
  - 2 bits: access frequency (4 levels)
  - 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
  - 1 bit : migration-ready flag (MSB)
  The LSB of mem_section->hot_map pointer is used as a per-section
  "hot" flag to gate scanning.

- Event recording API:
  int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
  @pfn: The PFN of the memory accessed
  @nid: The accessing NUMA node ID
  @src: The temperature source (subsystem) that generated the
        access info
  @time: The access time in jiffies
  - Sources (e.g., NUMA hint faults, HW hints) call this to report
    accesses.
  - In default mode, the nid is not stored/used for targeting;
    promotion goes to a configurable toptier node (pghot_target_nid).

- Promotion engine:
  - One kmigrated thread per lower-tier node.
  - Scans only sections whose "hot" flag was raised, iterates PFNs,
    and batches candidates by destination node.
  - Uses migrate_misplaced_folios_batch() to move batched folios.

- Tunables & stats:
  - debugfs: enabled_sources, target_nid, freq_threshold,
             kmigrated_sleep_ms, kmigrated_batch_nr
  - sysctl : vm.pghot_promote_freq_window_ms
  - vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
             pghot_recorded_hwhints

Memory overhead
---------------
Default mode uses 1 byte of hotness metadata per PFN on lower-tier
nodes.

Behavior & policy
-----------------
- Default mode promotion target:
  The nid passed by sources is not stored; hot pages promote to
  pghot_target_nid (toptier). Precision mode (added later in the
  series) changes this.

- Record consumption:
  kmigrated consumes (clears) the "migration-ready" bit before
  attempting isolation. Additionally the hotness record is reset.
  If isolation/migration fails, the folio is not re-queued automatically;
  subsequent accesses will re-arm it. This avoids retry storms and
  keeps batching stable.

- Wakeups:
  kmigrated wakeups are intentionally timeout-driven. We set
  the per-pgdat "activate" flag on access, and kmigrated checks this
  flag on its next sleep interval. This keeps the first cut simple
  and avoids potential wake storms; active wakeups can be considered
  in a follow-up.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 Documentation/admin-guide/mm/index.rst |   1 +
 Documentation/admin-guide/mm/pghot.rst |  80 ++++
 include/linux/migrate.h                |   4 +-
 include/linux/mmzone.h                 |  20 +
 include/linux/pghot.h                  |  82 ++++
 include/linux/vm_event_item.h          |   5 +
 mm/Kconfig                             |  14 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           |  16 +-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 ++++
 mm/pghot-tunables.c                    | 182 +++++++++
 mm/pghot.c                             | 494 +++++++++++++++++++++++++
 mm/vmstat.c                            |   5 +
 14 files changed, 986 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.rst
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index bbb563cba5d2..4d6810b02365 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -43,3 +43,4 @@ the Linux memory management.
    userfaultfd
    zswap
    kho
+   pghot
diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-guide/mm/pghot.rst
new file mode 100644
index 000000000000..5f51dd1d4d45
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes tunables available via **debugfs** and **sysctl** for
+PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+   - Bitmask to enable/disable hotness sources.
+   - Bits:
+     - 0: Hint faults (value 0x1)
+     - 1: Hardware hints (value 0x2)
+   - Default: 0 (disabled)
+   - Example:
+     # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
+     Enables all sources.
+
+2. **target_nid**
+   - Toptier NUMA node ID to which hot pages should be promoted when source
+     does not provide nid. Used when hotness source can't provide accessing
+     NID or when the tracking mode is default.
+   - Default: 0
+   - Example:
+     # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+   - Minimum access frequency before a page is marked ready for promotion.
+   - Range: 1 to 3
+   - Default: 2
+   - Example:
+     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+   - Sleep interval (ms) for kmigrated thread between scans.
+   - Default: 100
+
+5. **kmigrated_batch_nr**
+   - Maximum number of folios migrated in one batch.
+   - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+  considered hot only when **freq_threshold** number of accesses occur with
+  this time period.
+- Default: 3000 (3 seconds)
+- Example:
+  # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+Following vmstat counters provide some stats about pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+   - Number of total hot page accesses recorded by pghot.
+
+2. **pghot_recorded_hintfaults**
+   - Number of recorded accesses reported by NUMA Balancing based
+     hotness source.
+
+3. **pghot_recorded_hwhints**
+   - Number of recorded accesses reported by hwhints source.
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d136612eef9d..53bae80d11ae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 
 #endif /* CONFIG_MIGRATION */
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
@@ -126,7 +126,7 @@ static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, i
 {
 	return -EAGAIN; /* can't migrate now */
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
 
 #ifdef CONFIG_MIGRATION
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..eb08431dc9fb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1155,6 +1155,7 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
 };
 
 enum zone_flags {
@@ -1609,6 +1610,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_PGHOT
+	struct task_struct *kmigrated;
+	wait_queue_head_t kmigrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -2019,12 +2024,27 @@ struct mem_section {
 	unsigned long section_mem_map;
 
 	struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+	/*
+	 * Per-PFN hotness data for this section.
+	 * Array of phi_t (u8 in default mode).
+	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
+	 */
+	void *hot_map;
+#endif
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
 	 * section. (see page_ext.h about this.)
 	 */
 	struct page_ext *page_ext;
+#endif
+	/*
+	 * Padding to maintain consistent mem_section size when exactly
+	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+	 * optimal alignment regardless of configuration.
+	 */
+#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
 	unsigned long pad;
 #endif
 	/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..525d4dd28fc1
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HINTFAULTS = 0,
+	PGHOT_HWHINTS,
+	PGHOT_SRC_MAX
+};
+
+#ifdef CONFIG_PGHOT
+#include <linux/static_key.h>
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+
+#define PGHOT_HINTFAULTS_ENABLED	BIT(PGHOT_HINTFAULTS)
+#define PGHOT_HWHINTS_ENABLED		BIT(PGHOT_HWHINTS)
+#define PGHOT_SRC_ENABLED_MASK		GENMASK(PGHOT_SRC_MAX - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD	2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS	100
+#define KMIGRATED_DEFAULT_BATCH_NR	512
+
+#define PGHOT_DEFAULT_NODE		0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY		7
+
+#define PGHOT_FREQ_WIDTH		2
+/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_SHIFT	7
+#define PGHOT_TIME_WIDTH		5
+#define PGHOT_NID_WIDTH			10
+
+#define PGHOT_FREQ_SHIFT		0
+#define PGHOT_TIME_SHIFT		(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK			GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK			GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK		(PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
+
+#define PGHOT_NID_MAX			((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX			((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX			((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE		sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT		0
+#define PGHOT_SECTION_HOT_MASK		BIT(PGHOT_SECTION_HOT_BIT)
+
+bool pghot_nid_valid(int nid);
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..58d510711bd4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -175,6 +175,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+		PGHOT_RECORDED_ACCESSES,
+		PGHOT_RECORDED_HINTFAULTS,
+		PGHOT_RECORDED_HWHINTS,
+#endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 0a43bb80df4f..ebfa149d8123 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1469,6 +1469,20 @@ config LAZY_MMU_MODE_KUNIT_TEST
 
 	  If unsure, say N.
 
+config PGHOT
+	bool "Hot page tracking and promotion"
+	default n
+	depends on NUMA_MIGRATION && SPARSEMEM
+	help
+	  A sub-system to track page accesses in lower tier memory and
+	  maintain hot page information. Promotes hot pages from lower
+	  tiers to top tier by using the memory access information provided
+	  by various sources. Asynchronous promotion is done by per-node
+	  kernel threads.
+
+	  This adds 1 byte of metadata overhead per page in lower-tier
+	  memory nodes.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..33014de43acc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 747277aadf19..726d27b61a46 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2625,7 +2625,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
 }
 #endif /* CONFIG_NUMA_MIGRATION */
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is crude.
@@ -2745,12 +2745,10 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
  */
 int migrate_misplaced_folio(struct folio *folio, int node)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
@@ -2759,12 +2757,18 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	if (nr_remaining && !list_empty(&migratepages))
 		putback_movable_pages(&migratepages);
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 		    && !node_is_toptier(folio_nid(folio))
-		    && node_is_toptier(node))
+		    && node_is_toptier(node)) {
+			pg_data_t *pgdat = NODE_DATA(node);
+			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
 			mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
+		}
+#endif
 	}
 	mem_cgroup_put(memcg);
 	BUG_ON(!list_empty(&migratepages));
@@ -2817,14 +2821,16 @@ int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
 		putback_movable_pages(folio_list);
 
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
 				 PGPROMOTE_SUCCESS, nr_succeeded);
+#endif
 	}
 
 	mem_cgroup_put(memcg);
 	WARN_ON(!list_empty(folio_list));
 	return nr_remaining ? -EAGAIN : 0;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..2396c42028ae 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1384,6 +1384,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1393,6 +1402,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kmigrated(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot-default.c b/mm/pghot-default.c
new file mode 100644
index 000000000000..e610062345e4
--- /dev/null
+++ b/mm/pghot-default.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Default mode
+ *
+ * 1 byte hotness record per PFN.
+ * Bucketed time and frequency tracked as part of the record.
+ * Promotion to @pghot_target_nid by default.
+ */
+
+#include <linux/pghot.h>
+#include <linux/jiffies.h>
+
+/* pghot-default doesn't store and hence no NID validation is required */
+bool pghot_nid_valid(int nid)
+{
+	return true;
+}
+
+/*
+ * @time is regular time, @old_time is bucketed time.
+ */
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+	time &= PGHOT_TIME_BUCKETS_MASK;
+	old_time <<= PGHOT_TIME_BUCKETS_SHIFT;
+
+	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+	phi_t freq, old_freq, hotness, old_hotness, old_time;
+	phi_t time = now >> PGHOT_TIME_BUCKETS_SHIFT;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		bool new_window = false;
+
+		hotness = old_hotness;
+		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+		if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window)
+			new_window = true;
+
+		if (new_window)
+			freq = 1;
+		else if (old_freq < PGHOT_FREQ_MAX)
+			freq = old_freq + 1;
+		else
+			freq = old_freq;
+
+		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+		if (freq >= pghot_freq_threshold)
+			hotness |= BIT(PGHOT_MIGRATE_READY);
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+	phi_t old_hotness, hotness = 0;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+			return -EINVAL;
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+	*nid = pghot_target_nid;
+	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+	return 0;
+}
diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
new file mode 100644
index 000000000000..f04e2137309e
--- /dev/null
+++ b/mm/pghot-tunables.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot tunables in debugfs
+ */
+#include <linux/pghot.h>
+#include <linux/memory-tiers.h>
+#include <linux/debugfs.h>
+
+static struct dentry *debugfs_pghot;
+static DEFINE_MUTEX(pghot_tunables_lock);
+
+static ssize_t pghot_freq_th_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int freq;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &freq))
+		return -EINVAL;
+
+	if (!freq || freq > PGHOT_FREQ_MAX)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_freq_threshold = freq;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_freq_th_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_freq_threshold);
+	return 0;
+}
+
+static int pghot_freq_th_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_freq_th_show, NULL);
+}
+
+static const struct file_operations pghot_freq_th_fops = {
+	.open		= pghot_freq_th_open,
+	.write		= pghot_freq_th_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static ssize_t pghot_target_nid_write(struct file *filp, const char __user *ubuf,
+				      size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int nid;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &nid))
+		return -EINVAL;
+
+	if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid))
+		return -EINVAL;
+	mutex_lock(&pghot_tunables_lock);
+	pghot_target_nid = nid;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_target_nid_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_target_nid);
+	return 0;
+}
+
+static int pghot_target_nid_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_target_nid_show, NULL);
+}
+
+static const struct file_operations pghot_target_nid_fops = {
+	.open		= pghot_target_nid_open,
+	.write		= pghot_target_nid_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static void pghot_src_enabled_update(unsigned int enabled)
+{
+	unsigned int changed = pghot_src_enabled ^ enabled;
+
+	if (changed & PGHOT_HINTFAULTS_ENABLED) {
+		if (enabled & PGHOT_HINTFAULTS_ENABLED)
+			static_branch_enable(&pghot_src_hintfaults);
+		else
+			static_branch_disable(&pghot_src_hintfaults);
+	}
+
+	if (changed & PGHOT_HWHINTS_ENABLED) {
+		if (enabled & PGHOT_HWHINTS_ENABLED)
+			static_branch_enable(&pghot_src_hwhints);
+		else
+			static_branch_disable(&pghot_src_hwhints);
+	}
+}
+
+static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
+					   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int enabled;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 0, &enabled))
+		return -EINVAL;
+
+	if (enabled & ~PGHOT_SRC_ENABLED_MASK)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_src_enabled_update(enabled);
+	pghot_src_enabled = enabled;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_src_enabled_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%u\n", pghot_src_enabled);
+	return 0;
+}
+
+static int pghot_src_enabled_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_src_enabled_show, NULL);
+}
+
+static const struct file_operations pghot_src_enabled_fops = {
+	.open		= pghot_src_enabled_open,
+	.write		= pghot_src_enabled_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+void pghot_debug_init(void)
+{
+	debugfs_pghot = debugfs_create_dir("pghot", NULL);
+	debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL,
+			    &pghot_src_enabled_fops);
+	debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL,
+			    &pghot_target_nid_fops);
+	debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL,
+			    &pghot_freq_th_fops);
+	debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot,
+			    &kmigrated_sleep_ms);
+	debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot,
+			    &kmigrated_batch_nr);
+}
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..02e6959b647a
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,494 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Per-PFN hotness information is stored for lower tier nodes in
+ * mem_section.
+ *
+ * In the default mode, a single byte (u8) is used to store
+ * the frequency of access and last access time. Promotions are done
+ * to a default toptier NID.
+ *
+ * A kernel thread named kmigrated is provided to migrate or promote
+ * the hot pages. kmigrated runs for each lower tier node. It iterates
+ * over the node's PFNs and  migrates pages marked for migration into
+ * their targeted nodes.
+ */
+#include <linux/mm.h>
+#include <linux/migrate.h>
+#include <linux/memory.h>
+#include <linux/memory-tiers.h>
+#include <linux/pghot.h>
+
+unsigned int pghot_target_nid = PGHOT_DEFAULT_NODE;
+unsigned int pghot_src_enabled;
+unsigned int pghot_freq_threshold = PGHOT_DEFAULT_FREQ_THRESHOLD;
+unsigned int kmigrated_sleep_ms = KMIGRATED_DEFAULT_SLEEP_MS;
+unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
+
+unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
+
+DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
+DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table pghot_sysctls[] = {
+	{
+		.procname       = "pghot_promote_freq_window_ms",
+		.data           = &sysctl_pghot_freq_window,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1         = SYSCTL_ZERO,
+	},
+};
+#endif
+
+static bool kmigrated_started __ro_after_init;
+
+/**
+ * pghot_record_access() - Record page accesses from lower tier memory
+ * for the purpose of tracking page hotness and subsequent promotion.
+ *
+ * @pfn: PFN of the page
+ * @nid: Unused
+ * @src: The identifier of the sub-system that reports the access
+ * @now: Access time in jiffies
+ *
+ * Updates the frequency and time of access and marks the page as
+ * ready for migration if the frequency crosses a threshold. The pages
+ * marked for migration are migrated by kmigrated kernel thread.
+ *
+ * Return: 0 on success and -EINVAL on failure to record the access.
+ */
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	struct mem_section *ms;
+	struct folio *folio;
+	phi_t *phi, *hot_map;
+	struct page *page;
+	int src_nid;
+
+	if (!kmigrated_started)
+		return 0;
+
+	if (!pghot_nid_valid(nid))
+		return -EINVAL;
+
+	switch (src) {
+	case PGHOT_HINTFAULTS:
+		if (!static_branch_unlikely(&pghot_src_hintfaults))
+			return 0;
+		count_vm_event(PGHOT_RECORDED_HINTFAULTS);
+		break;
+	case PGHOT_HWHINTS:
+		if (!static_branch_unlikely(&pghot_src_hwhints))
+			return 0;
+		count_vm_event(PGHOT_RECORDED_HWHINTS);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	src_nid = pfn_to_nid(pfn);
+	if (src_nid == nid)
+		return 0;
+
+	/*
+	 * Record only accesses from lower tiers.
+	 */
+	if (node_is_toptier(src_nid))
+		return 0;
+
+	/*
+	 * Reject the non-migratable pages right away.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (!page || is_zone_device_page(page))
+		return 0;
+
+	folio = page_folio(page);
+	if (!folio_try_get(folio))
+		return 0;
+
+	if (unlikely(page_folio(page) != folio))
+		goto out;
+
+	if (!folio_test_lru(folio))
+		goto out;
+
+	/* Get the hotness slot corresponding to the 1st PFN of the folio */
+	pfn = folio_pfn(folio);
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->hot_map)
+		goto out;
+
+	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+	phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+	count_vm_event(PGHOT_RECORDED_ACCESSES);
+
+	/*
+	 * Update the hotness parameters.
+	 */
+	if (pghot_update_record(phi, nid, now)) {
+		set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map);
+		set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
+	}
+out:
+	folio_put(folio);
+	return 0;
+}
+
+static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
+			     unsigned long *time)
+{
+	phi_t *phi, *hot_map;
+	struct mem_section *ms;
+
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->hot_map)
+		return -EINVAL;
+
+	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+	phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+	return pghot_get_record(phi, nid, freq, time);
+}
+
+/*
+ * Walks the PFNs of the zone, isolates and migrates them in batches.
+ */
+static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
+				int src_nid)
+{
+	struct mem_cgroup *cur_memcg = NULL;
+	int cur_nid = NUMA_NO_NODE;
+	LIST_HEAD(migrate_list);
+	int batch_count = 0;
+	struct folio *folio;
+	struct page *page;
+	unsigned long pfn;
+
+	pfn = start_pfn;
+	do {
+		int nid = NUMA_NO_NODE, nr = 1;
+		struct mem_cgroup *memcg;
+		unsigned long time = 0;
+		int freq = 0;
+
+		if (!pfn_valid(pfn))
+			goto out_next;
+
+		page = pfn_to_online_page(pfn);
+		if (!page)
+			goto out_next;
+
+		folio = page_folio(page);
+		if (!folio_try_get(folio))
+			goto out_next;
+
+		if (unlikely(page_folio(page) != folio)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		nr = folio_nr_pages(folio);
+		if (folio_nid(folio) != src_nid) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		if (!folio_test_lru(folio)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		if (pghot_get_hotness(pfn, &nid, &freq, &time)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		if (nid == NUMA_NO_NODE)
+			nid = pghot_target_nid;
+
+		if (folio_nid(folio) == nid) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
+		memcg = folio_memcg(folio);
+		if (cur_nid == NUMA_NO_NODE) {
+			cur_nid = nid;
+			cur_memcg = memcg;
+		}
+
+		/* If NID or memcg changed, flush the previous batch first */
+		if (cur_nid != nid || cur_memcg != memcg) {
+			if (!list_empty(&migrate_list))
+				promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+			cur_nid = nid;
+			cur_memcg = memcg;
+			batch_count = 0;
+			cond_resched();
+		}
+
+		list_add(&folio->lru, &migrate_list);
+		folio_put(folio);
+
+		if (++batch_count > kmigrated_batch_nr) {
+			promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+			batch_count = 0;
+			cond_resched();
+		}
+out_next:
+		pfn += nr;
+	} while (pfn < end_pfn);
+	if (!list_empty(&migrate_list))
+		promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+}
+
+static void kmigrated_do_work(pg_data_t *pgdat)
+{
+	unsigned long section_nr, s_begin, start_pfn;
+	struct mem_section *ms;
+	int nid;
+
+	clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		start_pfn = section_nr_to_pfn(section_nr);
+		ms = __nr_to_section(section_nr);
+
+		if (!pfn_valid(start_pfn))
+			continue;
+
+		nid = pfn_to_nid(start_pfn);
+		if (node_is_toptier(nid) || nid != pgdat->node_id)
+			continue;
+
+		if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map))
+			continue;
+
+		kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION,
+				    pgdat->node_id);
+	}
+}
+
+static inline bool kmigrated_work_requested(pg_data_t *pgdat)
+{
+	return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+}
+
+/*
+ * Per-node kthread that iterates over its PFNs and migrates the
+ * pages that have been marked for migration.
+ */
+static int kmigrated(void *p)
+{
+	pg_data_t *pgdat = p;
+
+	while (!kthread_should_stop()) {
+		long timeout = msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms));
+
+		if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(pgdat),
+				       timeout))
+			kmigrated_do_work(pgdat);
+	}
+	return 0;
+}
+
+static int kmigrated_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret;
+
+	if (!pgdat->kmigrated) {
+		pgdat->kmigrated = kthread_create_on_node(kmigrated, pgdat, nid,
+							  "kmigrated%d", nid);
+		if (IS_ERR(pgdat->kmigrated)) {
+			ret = PTR_ERR(pgdat->kmigrated);
+			pgdat->kmigrated = NULL;
+			pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret);
+			return ret;
+		}
+		pr_info("pghot: Started kmigrated thread for node %d\n", nid);
+	}
+	wake_up_process(pgdat->kmigrated);
+	return 0;
+}
+
+static void pghot_free_hot_map(struct mem_section *ms)
+{
+	kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK));
+	ms->hot_map = NULL;
+}
+
+static int pghot_alloc_hot_map(struct mem_section *ms, int nid)
+{
+	ms->hot_map = kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KERNEL,
+				   nid);
+	if (!ms->hot_map)
+		return -ENOMEM;
+	return 0;
+}
+
+static void pghot_offline_sec_hotmap(unsigned long start_pfn,
+				     unsigned long nr_pages)
+{
+	unsigned long start, end, pfn;
+	struct mem_section *ms;
+
+	start = SECTION_ALIGN_DOWN(start_pfn);
+	end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+		if (!ms || !ms->hot_map)
+			continue;
+
+		pghot_free_hot_map(ms);
+	}
+}
+
+static int pghot_online_sec_hotmap(unsigned long start_pfn,
+				   unsigned long nr_pages)
+{
+	int nid = pfn_to_nid(start_pfn);
+	unsigned long start, end, pfn;
+	struct mem_section *ms;
+	int fail = 0;
+
+	start = SECTION_ALIGN_DOWN(start_pfn);
+	end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+		if (!ms || ms->hot_map)
+			continue;
+
+		fail = pghot_alloc_hot_map(ms, nid);
+	}
+
+	if (!fail)
+		return 0;
+
+	/* rollback */
+	end = pfn - PAGES_PER_SECTION;
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+		if (ms && ms->hot_map)
+			pghot_free_hot_map(ms);
+	}
+	return -ENOMEM;
+}
+
+static int pghot_memhp_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	struct memory_notify *mn = arg;
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages);
+		break;
+	}
+
+	return notifier_from_errno(ret);
+}
+
+static struct notifier_block pghot_mem_notifier = {
+	.notifier_call = pghot_memhp_callback,
+	.priority = DEFAULT_CALLBACK_PRI,
+};
+
+static void pghot_destroy_hot_map(void)
+{
+	unsigned long section_nr, s_begin;
+	struct mem_section *ms;
+
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		ms = __nr_to_section(section_nr);
+		pghot_free_hot_map(ms);
+	}
+
+	unregister_memory_notifier(&pghot_mem_notifier);
+}
+
+static int pghot_setup_hot_map(void)
+{
+	unsigned long section_nr, s_begin, start_pfn;
+	struct mem_section *ms;
+	int nid, ret;
+
+	ret = register_memory_notifier(&pghot_mem_notifier);
+	if (ret)
+		return ret;
+
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		ms = __nr_to_section(section_nr);
+		start_pfn = section_nr_to_pfn(section_nr);
+		nid = pfn_to_nid(start_pfn);
+
+		if (node_is_toptier(nid) || !pfn_valid(start_pfn))
+			continue;
+
+		if (pghot_alloc_hot_map(ms, nid))
+			goto out_free_hot_map;
+	}
+	return 0;
+
+out_free_hot_map:
+	pghot_destroy_hot_map();
+	return -ENOMEM;
+}
+
+static int __init pghot_init(void)
+{
+	pg_data_t *pgdat;
+	int nid, ret;
+
+	ret = pghot_setup_hot_map();
+	if (ret)
+		return ret;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (node_is_toptier(nid))
+			continue;
+
+		ret = kmigrated_run(nid);
+		if (ret)
+			goto out_stop_kthread;
+	}
+	register_sysctl_init("vm", pghot_sysctls);
+	pghot_debug_init();
+
+	kmigrated_started = true;
+	return 0;
+
+out_stop_kthread:
+	for_each_node_state(nid, N_MEMORY) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kmigrated) {
+			kthread_stop(pgdat->kmigrated);
+			pgdat->kmigrated = NULL;
+		}
+	}
+	pghot_destroy_hot_map();
+	return ret;
+}
+
+late_initcall_sync(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..4064ead568cc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1489,6 +1489,11 @@ const char * const vmstat_text[] = {
 	[I(KSTACK_REST)]			= "kstack_rest",
 #endif
 #endif
+#ifdef CONFIG_PGHOT
+	[I(PGHOT_RECORDED_ACCESSES)]		= "pghot_recorded_accesses",
+	[I(PGHOT_RECORDED_HINTFAULTS)]		= "pghot_recorded_hintfaults",
+	[I(PGHOT_RECORDED_HWHINTS)]		= "pghot_recorded_hwhints",
+#endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 4/7] mm: pghot: Precision mode for pghot
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (2 preceding siblings ...)
  2026-05-04  6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04 18:41   ` Donet Tom
  2026-05-04  6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Default pghot stores hotness in a 1‑byte record per PFN, limiting
frequency to 2 bits, time to a 5‑bit bucket, and preventing storage
of per‑PFN toptier NID. This restricts time granularity and forces
all promotions to use the global pghot_target_nid.

This patch adds an optional precision mode (CONFIG_PGHOT_PRECISE)
that expands the hotness record to 4 bytes (u32) and provides:

- 10‑bit NID field for per‑PFN promotion target,
- 3‑bit frequency field (freq_threshold range 1–7),
- 14‑bit time field offering finer recency tracking,
- MSB migrate‑ready bit.

Precision mode improves placement accuracy on systems with multiple
toptier nodes and provides higher‑resolution hotness tracking, at
the cost of increasing metadata to 4 bytes per PFN.

Documentation, tunables, and the record layout are updated accordingly.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 Documentation/admin-guide/mm/pghot.rst |  4 +-
 include/linux/mmzone.h                 |  2 +-
 include/linux/pghot.h                  | 31 ++++++++++
 mm/Kconfig                             | 11 ++++
 mm/Makefile                            |  7 ++-
 mm/pghot-precise.c                     | 81 ++++++++++++++++++++++++++
 mm/pghot.c                             | 13 +++--
 7 files changed, 141 insertions(+), 8 deletions(-)
 create mode 100644 mm/pghot-precise.c

diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-guide/mm/pghot.rst
index 5f51dd1d4d45..7b84e911afe7 100644
--- a/Documentation/admin-guide/mm/pghot.rst
+++ b/Documentation/admin-guide/mm/pghot.rst
@@ -37,7 +37,7 @@ Path: /sys/kernel/debug/pghot/
 
 3. **freq_threshold**
    - Minimum access frequency before a page is marked ready for promotion.
-   - Range: 1 to 3
+   - Range: 1 to 3 in default mode, 1 to 7 in precision mode.
    - Default: 2
    - Example:
      # echo 3 > /sys/kernel/debug/pghot/freq_threshold
@@ -59,7 +59,7 @@ Path: /proc/sys/vm/pghot_promote_freq_window_ms
 - Controls the time window (in ms) for counting access frequency. A page is
   considered hot only when **freq_threshold** number of accesses occur with
   this time period.
-- Default: 3000 (3 seconds)
+- Default: 3000 (3 seconds) in default mode and 5000 (5s) in precision mode.
 - Example:
   # sysctl vm.pghot_promote_freq_window_ms=3000
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eb08431dc9fb..9577bdc575d9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2027,7 +2027,7 @@ struct mem_section {
 #ifdef CONFIG_PGHOT
 	/*
 	 * Per-PFN hotness data for this section.
-	 * Array of phi_t (u8 in default mode).
+	 * Array of phi_t (u8 in default mode, u32 in precision mode).
 	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
 	 */
 	void *hot_map;
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 525d4dd28fc1..2e1742b8caee 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -35,6 +35,36 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
 
 #define PGHOT_DEFAULT_NODE		0
 
+#if defined(CONFIG_PGHOT_PRECISE)
+#define PGHOT_DEFAULT_FREQ_WINDOW	(5 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-26 are used to store nid, frequency and time.
+ * Bits 27-30 are unused now.
+ * Bit 31 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY		31
+
+#define PGHOT_NID_WIDTH			10
+#define PGHOT_FREQ_WIDTH		3
+/* time is stored in 14 bits which can represent up to 16s with HZ=1000 */
+#define PGHOT_TIME_WIDTH		14
+
+#define PGHOT_NID_SHIFT			0
+#define PGHOT_FREQ_SHIFT		(PGHOT_NID_SHIFT + PGHOT_NID_WIDTH)
+#define PGHOT_TIME_SHIFT		(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_NID_MASK			GENMASK(PGHOT_NID_WIDTH - 1, 0)
+#define PGHOT_FREQ_MASK			GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK			GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+
+#define PGHOT_NID_MAX			((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX			((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX			((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u32 phi_t;
+
+#else	/* !CONFIG_PGHOT_PRECISE */
 #define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
 
 /*
@@ -61,6 +91,7 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
 #define PGHOT_TIME_MAX			((1 << PGHOT_TIME_WIDTH) - 1)
 
 typedef u8 phi_t;
+#endif /* CONFIG_PGHOT_PRECISE */
 
 #define PGHOT_RECORD_SIZE		sizeof(phi_t)
 
diff --git a/mm/Kconfig b/mm/Kconfig
index ebfa149d8123..cc4b5685ecd4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1483,6 +1483,17 @@ config PGHOT
 	  This adds 1 byte of metadata overhead per page in lower-tier
 	  memory nodes.
 
+config PGHOT_PRECISE
+	bool "Hot page tracking precision mode"
+	default n
+	depends on PGHOT
+	help
+	  Enables precision mode for tracking hot pages with pghot sub-system.
+	  Adds fine-grained access time tracking and explicit toptier target
+	  NID tracking. Precise hot page tracking comes at the cost of using
+	  4 bytes per page against the default one byte per page. Preferable
+	  to enable this on systems with multiple nodes in toptier.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 33014de43acc..dc61f4d955f8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,4 +150,9 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
-obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o
+ifdef CONFIG_PGHOT_PRECISE
+obj-$(CONFIG_PGHOT) += pghot-precise.o
+else
+obj-$(CONFIG_PGHOT) += pghot-default.o
+endif
diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c
new file mode 100644
index 000000000000..8e571988b4ce
--- /dev/null
+++ b/mm/pghot-precise.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Precision mode
+ *
+ * 4 byte hotness record per PFN (u32)
+ * NID, time and frequency tracked as part of the record.
+ */
+
+#include <linux/pghot.h>
+#include <linux/jiffies.h>
+#include <linux/memory-tiers.h>
+
+bool pghot_nid_valid(int nid)
+{
+	if (nid != NUMA_NO_NODE &&
+	    (!numa_valid_node(nid) || nid > PGHOT_NID_MAX ||
+	     !node_online(nid) || !node_is_toptier(nid)))
+		return false;
+
+	return true;
+}
+
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+	phi_t freq, old_freq, hotness, old_hotness, old_time;
+	phi_t time = now & PGHOT_TIME_MASK;
+
+	nid = (nid == NUMA_NO_NODE) ? pghot_target_nid : nid;
+	old_hotness = READ_ONCE(*phi);
+
+	do {
+		bool new_window = false;
+
+		hotness = old_hotness;
+		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+		if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window)
+			new_window = true;
+
+		if (new_window)
+			freq = 1;
+		else if (old_freq < PGHOT_FREQ_MAX)
+			freq = old_freq + 1;
+		else
+			freq = old_freq;
+
+		hotness &= ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT);
+		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+		hotness |= (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT;
+		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+		if (freq >= pghot_freq_threshold)
+			hotness |= BIT(PGHOT_MIGRATE_READY);
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+	phi_t old_hotness, hotness = 0;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+			return -EINVAL;
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+	*nid = (old_hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK;
+	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+	return 0;
+}
diff --git a/mm/pghot.c b/mm/pghot.c
index 02e6959b647a..0b31d5917833 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -10,6 +10,9 @@
  * the frequency of access and last access time. Promotions are done
  * to a default toptier NID.
  *
+ * In the precision mode, 4 bytes are used to store the frequency
+ * of access, last access time and the accessing NID.
+ *
  * A kernel thread named kmigrated is provided to migrate or promote
  * the hot pages. kmigrated runs for each lower tier node. It iterates
  * over the node's PFNs and  migrates pages marked for migration into
@@ -52,13 +55,15 @@ static bool kmigrated_started __ro_after_init;
  * for the purpose of tracking page hotness and subsequent promotion.
  *
  * @pfn: PFN of the page
- * @nid: Unused
+ * @nid: Target NID to where the page needs to be migrated in precision
+ *       mode but unused in default mode
  * @src: The identifier of the sub-system that reports the access
  * @now: Access time in jiffies
  *
- * Updates the frequency and time of access and marks the page as
- * ready for migration if the frequency crosses a threshold. The pages
- * marked for migration are migrated by kmigrated kernel thread.
+ * Updates the NID (in precision mode only), frequency and time of access
+ * and marks the page as ready for migration if the frequency crosses a
+ * threshold. The pages marked for migration are migrated by kmigrated
+ * kernel thread.
  *
  * Return: 0 on success and -EINVAL on failure to record the access.
  */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot
  2026-05-04  6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
@ 2026-05-04 18:41   ` Donet Tom
  0 siblings, 0 replies; 12+ messages in thread
From: Donet Tom @ 2026-05-04 18:41 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg

Hi Bharata

On 5/4/26 11:39 AM, Bharata B Rao wrote:
> +#include <linux/pghot.h>
> +#include <linux/jiffies.h>
> +#include <linux/memory-tiers.h>
> +
> +bool pghot_nid_valid(int nid)

I might be missing something, but since pghot_nid_valid() exists in both 
pghot-default.c and pghot-precise.c, would it make sense to move it to a 
header file as a static inline function?

-Donet


> +{
> +	if (nid != NUMA_NO_NODE &&
> +	    (!numa_valid_node(nid) || nid > PGHOT_NID_MAX ||
> +	     !node_online(nid) || !node_is_toptier(nid)))
> +		return false;
> +
> +	return true;
> +}
> +
> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
> +{
> +	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
> +}
> +
> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
> +{

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (3 preceding siblings ...)
  2026-05-04  6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04  6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
mode of NUMA Balancing) does hot page detection (via hint faults),
hot page classification and eventual promotion, all by itself and
sits within the scheduler.

With pghot, the new hot page tracking and promotion mechanism
being available, NUMA Balancing can limit itself to detection
of hot pages (via hint faults) and off-load rest of the
functionality to pghot.

To achieve this, pghot_record_access(PGHOT_HINTFAULTS) API
is used to feed the hot page info to pghot. In addition, the
migration rate limiting and dynamic threshold logic are moved to
kmigrated so that the same can be used for hot pages reported by
other sources too. Hence it becomes necessary to introduce a
new config option CONFIG_NUMA_BALANCING_TIERING to control
the hint faults souce for hot page promotion. This option
controls the NUMA_BALANCING_MEMORY_TIERING mode of
kernel.numa_balancing

This movement of hot page promotion to pghot results in the following
changes to the behaviour of hint faults based hot page promotion:

1. Promotion is no longer done in the fault path but instead is
   deferred to kmigrated and happens in batches.
2. NUMA_BALANCING_MEMORY_TIERING mode used to promote on first
   access. Pghot by default, promotes on second access though this
   can be changed by setting /sys/kernel/debug/pghot/freq_threshold.
   hot_threshold_ms debugfs tunable now gets replaced by pghot's
   freq_threshold.
3. In NUMA_BALANCING_MEMORY_TIERING mode, hint fault latency is the
   difference between the PTE update time (during scanning) and the
   access time (hint fault). However with pghot, a single latency
   threshold is used for two purposes:
   a) If the time difference between successive accesses are within
      the threshold, the page is marked as hot.
   b) Later when kmigrated picks up the page for migration, it will
      migrate only if the difference between the current time and
      the time when the page was marked hot is with the threshold.
4. Batch migration of misplaced folios is done from non-process
   context where VMA info is not readily available. Without VMA
   and the exec check on that, it will not be possible to filter
   out exec pages during migration prep stage. Hence shared
   executable pages also will be subjected to misplaced migration.
5. The max scan period which is used in dynamic threshold logic
   was a debugfs tunable. However this has been converted to a
   scalar metric in pghot.
6. In the uncommon case of using NUMA_BALANCING_NORMAL mode
   to balance between lower and higher tier nodes, we end up
   waking the kswapd when there is no headroom in the toptier.

Key code changes due to this movement are detailed below to help
easy understanding of the restructuring.

1. Scanning and access times are no longer tracked in last_cpupid
   field of folio flags. Hence all code related to this (like
   folio_xchg_access_time(), cpupid_valid()) are removed.
2. The misplaced migration routines become conditional to
   CONFIG_PGHOT in addition to CONFIG_NUMA_BALANCING.
3. The promotion related stats (like PGPROMOTE_SUCCESS etc) are
   now moved to under CONFIG_PGHOT as these stats are part of
   promotion engine which will be used for other hotness sources
   as well.
4. Routines that are responsibile for migration rate limiting
   dynamic thresholding, pgdat balancing during promotion etc
   are moved to pghot with appropriate renaming.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/mm.h     |  35 ++------
 include/linux/mmzone.h |   4 +-
 init/Kconfig           |  13 +++
 kernel/sched/core.c    |   7 ++
 kernel/sched/debug.c   |   1 -
 kernel/sched/fair.c    | 177 ++---------------------------------------
 kernel/sched/sched.h   |   1 -
 mm/huge_memory.c       |  24 +++++-
 mm/memcontrol.c        |   6 +-
 mm/memory-tiers.c      |  15 ++--
 mm/memory.c            |  28 +++++--
 mm/mempolicy.c         |   3 -
 mm/migrate.c           |  16 +++-
 mm/pghot.c             | 134 +++++++++++++++++++++++++++++++
 mm/vmstat.c            |   2 +-
 15 files changed, 239 insertions(+), 227 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b776907152e..3b237946b322 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2271,17 +2271,6 @@ static inline int folio_nid(const struct folio *folio)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* page access time bits needs to hold at least 4 seconds */
-#define PAGE_ACCESS_TIME_MIN_BITS	12
-#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
-#define PAGE_ACCESS_TIME_BUCKETS				\
-	(PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
-#else
-#define PAGE_ACCESS_TIME_BUCKETS	0
-#endif
-
-#define PAGE_ACCESS_TIME_MASK				\
-	(LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
 
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -2347,15 +2336,6 @@ static inline void page_cpupid_reset_last(struct page *page)
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
-	int last_time;
-
-	last_time = folio_xchg_last_cpupid(folio,
-					   time >> PAGE_ACCESS_TIME_BUCKETS);
-	return last_time << PAGE_ACCESS_TIME_BUCKETS;
-}
-
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
@@ -2366,18 +2346,12 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 	}
 }
 
-bool folio_use_access_time(struct folio *folio);
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
 {
 	return folio_nid(folio); /* XXX */
 }
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
-	return 0;
-}
-
 static inline int folio_last_cpupid(struct folio *folio)
 {
 	return folio_nid(folio); /* XXX */
@@ -2420,11 +2394,16 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 }
-static inline bool folio_use_access_time(struct folio *folio)
+#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_NUMA_BALANCING_TIERING
+bool folio_is_promo_candidate(struct folio *folio);
+#else
+static inline bool folio_is_promo_candidate(struct folio *folio)
 {
 	return false;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING_TIERING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9577bdc575d9..b29d06168826 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -287,7 +287,7 @@ enum node_stat_item {
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	/**
 	 * Candidate pages for promotion based on hint fault latency.  This
@@ -1566,7 +1566,7 @@ typedef struct pglist_data {
 	struct deferred_split deferred_split_queue;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	/* start time in ms of current promote rate limit period */
 	unsigned int nbp_rl_start;
 	/* number of promote candidate pages at start time of current rate limit period */
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..7624be1c739a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1027,6 +1027,19 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.
 
+config NUMA_BALANCING_TIERING
+	bool "NUMA balancing memory tiering promotion"
+	depends on NUMA_BALANCING && PGHOT
+	help
+	  Enable NUMA balancing mode 2 (memory tiering). This allows
+	  automatic promotion of hot pages from slower memory tiers to
+	  faster tiers using the pghot subsystem.
+
+	  This requires CONFIG_PGHOT for the hot page tracking engine.
+	  This option is required for kernel.numa_balancing=2.
+
+	  If unsure, say N.
+
 config SLAB_OBJ_EXT
 	bool
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..46ce75f00b40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4523,6 +4523,7 @@ void set_numabalancing_state(bool enabled)
 }
 
 #ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 static void reset_memory_tiering(void)
 {
 	struct pglist_data *pgdat;
@@ -4533,6 +4534,7 @@ static void reset_memory_tiering(void)
 		pgdat->nbp_th_start = jiffies_to_msecs(jiffies);
 	}
 }
+#endif
 
 static int sysctl_numa_balancing(const struct ctl_table *table, int write,
 			  void *buffer, size_t *lenp, loff_t *ppos)
@@ -4550,9 +4552,14 @@ static int sysctl_numa_balancing(const struct ctl_table *table, int write,
 	if (err < 0)
 		return err;
 	if (write) {
+		if ((state & NUMA_BALANCING_MEMORY_TIERING) &&
+		    !IS_ENABLED(CONFIG_NUMA_BALANCING_TIERING))
+			return -EOPNOTSUPP;
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
 		    (state & NUMA_BALANCING_MEMORY_TIERING))
 			reset_memory_tiering();
+#endif
 		sysctl_numa_balancing_mode = state;
 		__set_numabalancing_state(state);
 	}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74c1617cf652..abf53f3071ea 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,7 +623,6 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
 	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
-	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353a..f1da4fa95598 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
 static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_fair_sysctls[] = {
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
 		.extra1         = SYSCTL_ONE,
 	},
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	{
-		.procname	= "numa_balancing_promote_rate_limit_MBps",
-		.data		= &sysctl_numa_balancing_promote_rate_limit,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= SYSCTL_ZERO,
-	},
-#endif /* CONFIG_NUMA_BALANCING */
 };
 
 static int __init sched_fair_sysctl_init(void)
@@ -1612,9 +1597,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
-
 struct numa_group {
 	refcount_t refcount;
 
@@ -1957,120 +1939,6 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
-/*
- * If memory tiering mode is enabled, cpupid of slow memory page is
- * used to record scan time instead of CPU and PID.  When tiering mode
- * is disabled at run time, the scan time (in cpupid) will be
- * interpreted as CPU and PID.  So CPU needs to be checked to avoid to
- * access out of array bound.
- */
-static inline bool cpupid_valid(int cpupid)
-{
-	return cpupid_to_cpu(cpupid) < nr_cpu_ids;
-}
-
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
-	int z;
-	unsigned long enough_wmark;
-
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
-	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		if (zone_watermark_ok(zone, 0,
-				      promo_wmark_pages(zone) + enough_wmark,
-				      ZONE_MOVABLE, 0))
-			return true;
-	}
-	return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page.  So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- *	hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
-	int last_time, time;
-
-	time = jiffies_to_msecs(jiffies);
-	last_time = folio_xchg_access_time(folio, time);
-
-	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency.  So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
-				      unsigned long rate_limit, int nr)
-{
-	unsigned long nr_cand;
-	unsigned int now, start;
-
-	now = jiffies_to_msecs(jiffies);
-	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
-	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-	start = pgdat->nbp_rl_start;
-	if (now - start > MSEC_PER_SEC &&
-	    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
-		pgdat->nbp_rl_nr_cand = nr_cand;
-	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
-		return true;
-	return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS	16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
-					    unsigned long rate_limit,
-					    unsigned int ref_th)
-{
-	unsigned int now, start, th_period, unit_th, th;
-	unsigned long nr_cand, ref_cand, diff_cand;
-
-	now = jiffies_to_msecs(jiffies);
-	th_period = sysctl_numa_balancing_scan_period_max;
-	start = pgdat->nbp_th_start;
-	if (now - start > th_period &&
-	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
-		ref_cand = rate_limit *
-			sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
-		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
-		unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
-		th = pgdat->nbp_threshold ? : ref_th;
-		if (diff_cand > ref_cand * 11 / 10)
-			th = max(th - unit_th, unit_th);
-		else if (diff_cand < ref_cand * 9 / 10)
-			th = min(th + unit_th, ref_th * 2);
-		pgdat->nbp_th_nr_cand = nr_cand;
-		pgdat->nbp_threshold = th;
-	}
-}
-
 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 				int src_nid, int dst_cpu)
 {
@@ -2086,41 +1954,15 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 
 	/*
 	 * The pages in slow memory node should be migrated according
-	 * to hot/cold instead of private/shared.
-	 */
-	if (folio_use_access_time(folio)) {
-		struct pglist_data *pgdat;
-		unsigned long rate_limit;
-		unsigned int latency, th, def_th;
-		long nr = folio_nr_pages(folio);
-
-		pgdat = NODE_DATA(dst_nid);
-		if (pgdat_free_space_enough(pgdat)) {
-			/* workload changed, reset hot threshold */
-			pgdat->nbp_threshold = 0;
-			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
-			return true;
-		}
-
-		def_th = sysctl_numa_balancing_hot_threshold;
-		rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
-		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
-		th = pgdat->nbp_threshold ? : def_th;
-		latency = numa_hint_fault_latency(folio);
-		if (latency >= th)
-			return false;
-
-		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
-	}
+	 * to hot/cold instead of private/shared. Also the migration
+	 * of such pages are handled by kmigrated.
+	 */
+	if (folio_is_promo_candidate(folio))
+		return true;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
 
-	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
-	    !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
-		return false;
-
 	/*
 	 * Allow first faults or private faults to migrate immediately early in
 	 * the lifetime of a task. The magic number 4 is based on waiting for
@@ -3330,15 +3172,6 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	if (!p->mm)
 		return;
 
-	/*
-	 * NUMA faults statistics are unnecessary for the slow memory
-	 * node for memory tiering mode.
-	 */
-	if (!node_is_toptier(mem_node) &&
-	    (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
-	     !cpupid_valid(last_cpupid)))
-		return;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..f176643516b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3066,7 +3066,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
-extern unsigned int sysctl_numa_balancing_hot_threshold;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..1890b1e534a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
 #include <linux/pgalloc.h>
 #include <linux/pgalloc_tag.h>
 #include <linux/pagewalk.h>
+#include <linux/pghot.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2267,7 +2268,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	int nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid;
 	pmd_t pmd, old_pmd;
-	bool writable = false;
+	bool writable = false, needs_promotion = false;
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -2294,11 +2295,23 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		goto out_map;
 
 	nid = folio_nid(folio);
+	needs_promotion = folio_is_promo_candidate(folio);
 
 	target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
 					&last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
+
+	if (needs_promotion) {
+		/*
+		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+		 * Isolation and migration are handled by pghot.
+		 */
+		nid = target_nid;
+		goto out_map;
+	}
+
+	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
@@ -2330,8 +2343,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	if (nid != NUMA_NO_NODE) {
+		if (needs_promotion)
+			pghot_record_access(folio_pfn(folio), nid,
+					    PGHOT_HINTFAULTS, jiffies);
+		else
+			task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..033b80ad248e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,7 +400,7 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	PGPROMOTE_SUCCESS,
 #endif
 	PGDEMOTE_KSWAPD,
@@ -1594,7 +1594,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "pgscan_khugepaged",		PGSCAN_KHUGEPAGED	},
 	{ "pgscan_proactive",		PGSCAN_PROACTIVE	},
 	{ "pgrefill",			PGREFILL		},
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	{ "pgpromote_success",		PGPROMOTE_SUCCESS	},
 #endif
 };
@@ -1646,7 +1646,7 @@ static int memcg_page_state_output_unit(int item)
 	case PGSCAN_KHUGEPAGED:
 	case PGSCAN_PROACTIVE:
 	case PGREFILL:
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	case PGPROMOTE_SUCCESS:
 #endif
 		return 1;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 54851d8a195b..be134a32f5bf 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,18 +51,19 @@ static const struct bus_type memory_tier_subsys = {
 	.dev_name = "memory_tier",
 };
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 /**
- * folio_use_access_time - check if a folio reuses cpupid for page access time
+ * folio_is_promo_candidate - check if the folio qualifies for promotion
+ *
  * @folio: folio to check
  *
- * folio's _last_cpupid field is repurposed by memory tiering. In memory
- * tiering mode, cpupid of slow memory folio (not toptier memory) is used to
- * record page access time.
+ * Checks if NUMA Balancing tiering mode is set and the folio belongs
+ * to lower tier. If so, it qualifies for promotion to toptier when
+ * it is categorized as hot.
  *
- * Return: the folio _last_cpupid is used to record page access time
+ * Return: True if the above condition is met, else False.
  */
-bool folio_use_access_time(struct folio *folio)
+bool folio_is_promo_candidate(struct folio *folio)
 {
 	return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
 	       !node_is_toptier(folio_nid(folio));
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..17ea31750573 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/pghot.h>
 #include <linux/sched/sysctl.h>
 #include <linux/pgalloc.h>
 #include <linux/uaccess.h>
@@ -6062,10 +6063,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 	if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
 		*flags |= TNF_SHARED;
 	/*
-	 * For memory tiering mode, cpupid of slow memory page is used
-	 * to record page access time.  So use default value.
+	 * For memory tiering mode, last_cpupid is unused. So use default value.
 	 */
-	if (folio_use_access_time(folio))
+	if (folio_is_promo_candidate(folio))
 		*last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		*last_cpupid = folio_last_cpupid(folio);
@@ -6146,6 +6146,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int nid = NUMA_NO_NODE;
 	bool writable = false, ignore_writable = false;
 	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+	bool needs_promotion = false;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
@@ -6180,12 +6181,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out_map;
 
 	nid = folio_nid(folio);
+	needs_promotion = folio_is_promo_candidate(folio);
 	nr_pages = folio_nr_pages(folio);
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
+
+	if (needs_promotion) {
+		/*
+		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+		 * Isolation and migration are handled by pghot.
+		 */
+		nid = target_nid;
+		goto out_map;
+	}
+
+	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
@@ -6225,8 +6238,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 					    writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	if (nid != NUMA_NO_NODE) {
+		if (needs_promotion)
+			pghot_record_access(folio_pfn(folio), nid,
+					    PGHOT_HINTFAULTS, jiffies);
+		else
+			task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..aef9bb8a6cd4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -872,9 +872,6 @@ bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma,
 	    node_is_toptier(nid))
 		return false;
 
-	if (folio_use_access_time(folio))
-		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
-
 	return true;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 726d27b61a46..a468fa4f7963 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2709,8 +2709,18 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
 		int z;
 
-		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
+		/*
+		 * Kswapd wakeup for creating headroom in toptier is done only
+		 * for hot page promotion case and not for misplaced migrations
+		 * between toptier nodes.
+		 *
+		 * In the uncommon case of using NUMA_BALANCING_NORMAL mode
+		 * to balance between lower and higher tier nodes, we end up
+		 * waking the kswapd.
+		 */
+		if (node_is_toptier(folio_nid(folio)))
 			return -EAGAIN;
+
 		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 			if (managed_zone(pgdat->node_zones + z))
 				break;
@@ -2760,6 +2770,8 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 #ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_PGHOT
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 		    && !node_is_toptier(folio_nid(folio))
 		    && node_is_toptier(node)) {
@@ -2824,6 +2836,8 @@ int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
 #ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_PGHOT
 		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
 				 PGPROMOTE_SUCCESS, nr_succeeded);
 #endif
diff --git a/mm/pghot.c b/mm/pghot.c
index 0b31d5917833..1f204a8613eb 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -17,6 +17,9 @@
  * the hot pages. kmigrated runs for each lower tier node. It iterates
  * over the node's PFNs and  migrates pages marked for migration into
  * their targeted nodes.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
  */
 #include <linux/mm.h>
 #include <linux/migrate.h>
@@ -32,6 +35,12 @@ unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
 
 unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
 
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
+#define KMIGRATED_MIGRATION_ADJUST_STEPS	16
+#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW	60000
+
 DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
 DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
 
@@ -45,6 +54,22 @@ static const struct ctl_table pghot_sysctls[] = {
 		.proc_handler   = proc_dointvec_minmax,
 		.extra1         = SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pghot_promote_rate_limit_MBps",
+		.data		= &sysctl_pghot_promote_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
+	{
+		.procname	= "numa_balancing_promote_rate_limit_MBps",
+		.data		= &sysctl_pghot_promote_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 };
 #endif
 
@@ -146,6 +171,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
 	return 0;
 }
 
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long enough_wmark;
+
+	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+			   pgdat->node_present_pages >> 4);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      promo_wmark_pages(zone) + enough_wmark,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency.  So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit,
+					   int nr, unsigned int now_ms)
+{
+	unsigned long nr_cand;
+	unsigned int start;
+
+	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+	start = pgdat->nbp_rl_start;
+	if (now_ms - start > MSEC_PER_SEC &&
+	    cmpxchg(&pgdat->nbp_rl_start, start, now_ms) == start)
+		pgdat->nbp_rl_nr_cand = nr_cand;
+	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+		return true;
+	return false;
+}
+
+static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat,
+						 unsigned long rate_limit, unsigned int ref_th,
+						 unsigned int now_ms)
+{
+	unsigned int start, th_period, unit_th, th;
+	unsigned long nr_cand, ref_cand, diff_cand;
+
+	th_period = KMIGRATED_PROMOTION_THRESHOLD_WINDOW;
+	start = pgdat->nbp_th_start;
+	if (now_ms - start > th_period &&
+	    cmpxchg(&pgdat->nbp_th_start, start, now_ms) == start) {
+		ref_cand = rate_limit *
+			KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+		unit_th = ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS;
+		th = pgdat->nbp_threshold ? : ref_th;
+		if (diff_cand > ref_cand * 11 / 10)
+			th = max(th - unit_th, unit_th);
+		else if (diff_cand < ref_cand * 9 / 10)
+			th = min(th + unit_th, ref_th * 2);
+		pgdat->nbp_th_nr_cand = nr_cand;
+		pgdat->nbp_threshold = th;
+	}
+}
+
+static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
+					    unsigned long time)
+{
+	struct pglist_data *pgdat;
+	unsigned long rate_limit;
+	unsigned int th, def_th;
+	unsigned int now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
+	unsigned long now = jiffies;
+
+	pgdat = NODE_DATA(nid);
+	if (pgdat_free_space_enough(pgdat)) {
+		/* workload changed, reset hot threshold */
+		pgdat->nbp_threshold = 0;
+		mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages);
+		return true;
+	}
+
+	def_th = sysctl_pghot_freq_window;
+	rate_limit = MB_TO_PAGES(sysctl_pghot_promote_rate_limit);
+	kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms);
+
+	th = pgdat->nbp_threshold ? : def_th;
+	if (pghot_access_latency(time, now) >= th)
+		return false;
+
+	return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
+}
+
 static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
 			     unsigned long *time)
 {
@@ -223,6 +352,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
 			goto out_next;
 		}
 
+		if (!kmigrated_should_migrate_memory(nr, nid, time)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
 		if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
 			folio_put(folio);
 			goto out_next;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4064ead568cc..da668ff05032 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1268,7 +1268,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_SWAP
 	[I(NR_SWAPCACHE)]			= "nr_swapcached",
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	[I(PGPROMOTE_SUCCESS)]			= "pgpromote_success",
 	[I(PGPROMOTE_CANDIDATE)]		= "pgpromote_candidate",
 	[I(PGPROMOTE_CANDIDATE_NRL)]		= "pgpromote_candidate_nrl",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (4 preceding siblings ...)
  2026-05-04  6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04  6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Subsequent patch adds IBS Memory Profiler driver that is
independent of the perf subsystem but needs the CPUID
0x8000001B capability bits. Hence move those bit definitions
out of asm/perf_event.h into a dedicated header so the new
driver can consume them without pulling in perf.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/include/asm/ibs-caps.h   | 85 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 81 +----------------------------
 2 files changed, 86 insertions(+), 80 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs-caps.h

diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
new file mode 100644
index 000000000000..ddf6c512c8f9
--- /dev/null
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_CAPS_H
+#define _ASM_X86_IBS_CAPS_H
+
+/*
+ * IBS cpuid feature detection
+ */
+
+#define IBS_CPUID_FEATURES		0x8000001b
+
+/*
+ * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
+ * bit 0 is used to indicate the existence of IBS.
+ */
+#define IBS_CAPS_AVAIL			(1U<<0)
+#define IBS_CAPS_FETCHSAM		(1U<<1)
+#define IBS_CAPS_OPSAM			(1U<<2)
+#define IBS_CAPS_RDWROPCNT		(1U<<3)
+#define IBS_CAPS_OPCNT			(1U<<4)
+#define IBS_CAPS_BRNTRGT		(1U<<5)
+#define IBS_CAPS_OPCNTEXT		(1U<<6)
+#define IBS_CAPS_RIPINVALIDCHK		(1U<<7)
+#define IBS_CAPS_OPBRNFUSE		(1U<<8)
+#define IBS_CAPS_FETCHCTLEXTD		(1U<<9)
+#define IBS_CAPS_OPDATA4		(1U<<10)
+#define IBS_CAPS_ZEN4			(1U<<11)
+#define IBS_CAPS_OPLDLAT		(1U<<12)
+#define IBS_CAPS_DIS			(1U<<13)
+#define IBS_CAPS_FETCHLAT		(1U<<14)
+#define IBS_CAPS_BIT63_FILTER		(1U<<15)
+#define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
+#define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
+
+#define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
+					 | IBS_CAPS_FETCHSAM	\
+					 | IBS_CAPS_OPSAM)
+
+/*
+ * IBS APIC setup
+ */
+#define IBSCTL				0x1cc
+#define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
+#define IBSCTL_LVT_OFFSET_MASK		0x0F
+
+/* IBS fetch bits/masks */
+#define IBS_FETCH_L3MISSONLY		      (1ULL << 59)
+#define IBS_FETCH_RAND_EN		      (1ULL << 57)
+#define IBS_FETCH_VAL			      (1ULL << 49)
+#define IBS_FETCH_ENABLE		      (1ULL << 48)
+#define IBS_FETCH_CNT			     0xFFFF0000ULL
+#define IBS_FETCH_MAX_CNT		     0x0000FFFFULL
+
+#define IBS_FETCH_2_DIS			      (1ULL <<  0)
+#define IBS_FETCH_2_FETCHLAT_FILTER	    (0xFULL <<  1)
+#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT	       (1)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_1	      (1ULL <<  5)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_0	      (1ULL <<  6)
+
+/*
+ * IBS op bits/masks
+ * The lower 7 bits of the current count are random bits
+ * preloaded by hardware and ignored in software
+ */
+#define IBS_OP_LDLAT_EN			      (1ULL << 63)
+#define IBS_OP_LDLAT_THRSH		    (0xFULL << 59)
+#define IBS_OP_LDLAT_THRSH_SHIFT		      (59)
+#define IBS_OP_CUR_CNT			(0xFFF80ULL << 32)
+#define IBS_OP_CUR_CNT_RAND		(0x0007FULL << 32)
+#define IBS_OP_CUR_CNT_EXT_MASK		   (0x7FULL << 52)
+#define IBS_OP_CNT_CTL			      (1ULL << 19)
+#define IBS_OP_VAL			      (1ULL << 18)
+#define IBS_OP_ENABLE			      (1ULL << 17)
+#define IBS_OP_L3MISSONLY		      (1ULL << 16)
+#define IBS_OP_MAX_CNT			     0x0000FFFFULL
+#define IBS_OP_MAX_CNT_EXT		     0x007FFFFFULL	/* not a register bit mask */
+#define IBS_OP_MAX_CNT_EXT_MASK		   (0x7FULL << 20)	/* separate upper 7 bits */
+#define IBS_RIP_INVALID			      (1ULL << 38)
+
+#define IBS_OP_2_DIS			      (1ULL <<  0)
+#define IBS_OP_2_EXCL_RIP_63_EQ_0	      (1ULL <<  1)
+#define IBS_OP_2_EXCL_RIP_63_EQ_1	      (1ULL <<  2)
+#define IBS_OP_2_STRM_ST_FILTER		      (1ULL <<  3)
+#define IBS_OP_2_STRM_ST_FILTER_SHIFT		       (3)
+
+#endif /*  _ASM_X86_IBS_CAPS_H */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 752cb319d5ea..655a54c77f4e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PERF_EVENT_H
 
 #include <linux/static_call.h>
+#include <asm/ibs-caps.h>
 
 /*
  * Performance event hw details:
@@ -620,86 +621,6 @@ struct arch_pebs_cntr_header {
  */
 #define EXT_PERFMON_DEBUG_FEATURES		0x80000022
 
-/*
- * IBS cpuid feature detection
- */
-
-#define IBS_CPUID_FEATURES		0x8000001b
-
-/*
- * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
- * bit 0 is used to indicate the existence of IBS.
- */
-#define IBS_CAPS_AVAIL			(1U<<0)
-#define IBS_CAPS_FETCHSAM		(1U<<1)
-#define IBS_CAPS_OPSAM			(1U<<2)
-#define IBS_CAPS_RDWROPCNT		(1U<<3)
-#define IBS_CAPS_OPCNT			(1U<<4)
-#define IBS_CAPS_BRNTRGT		(1U<<5)
-#define IBS_CAPS_OPCNTEXT		(1U<<6)
-#define IBS_CAPS_RIPINVALIDCHK		(1U<<7)
-#define IBS_CAPS_OPBRNFUSE		(1U<<8)
-#define IBS_CAPS_FETCHCTLEXTD		(1U<<9)
-#define IBS_CAPS_OPDATA4		(1U<<10)
-#define IBS_CAPS_ZEN4			(1U<<11)
-#define IBS_CAPS_OPLDLAT		(1U<<12)
-#define IBS_CAPS_DIS			(1U<<13)
-#define IBS_CAPS_FETCHLAT		(1U<<14)
-#define IBS_CAPS_BIT63_FILTER		(1U<<15)
-#define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
-#define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
-
-#define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
-					 | IBS_CAPS_FETCHSAM	\
-					 | IBS_CAPS_OPSAM)
-
-/*
- * IBS APIC setup
- */
-#define IBSCTL				0x1cc
-#define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
-#define IBSCTL_LVT_OFFSET_MASK		0x0F
-
-/* IBS fetch bits/masks */
-#define IBS_FETCH_L3MISSONLY		      (1ULL << 59)
-#define IBS_FETCH_RAND_EN		      (1ULL << 57)
-#define IBS_FETCH_VAL			      (1ULL << 49)
-#define IBS_FETCH_ENABLE		      (1ULL << 48)
-#define IBS_FETCH_CNT			     0xFFFF0000ULL
-#define IBS_FETCH_MAX_CNT		     0x0000FFFFULL
-
-#define IBS_FETCH_2_DIS			      (1ULL <<  0)
-#define IBS_FETCH_2_FETCHLAT_FILTER	    (0xFULL <<  1)
-#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT	       (1)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_1	      (1ULL <<  5)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_0	      (1ULL <<  6)
-
-/*
- * IBS op bits/masks
- * The lower 7 bits of the current count are random bits
- * preloaded by hardware and ignored in software
- */
-#define IBS_OP_LDLAT_EN			      (1ULL << 63)
-#define IBS_OP_LDLAT_THRSH		    (0xFULL << 59)
-#define IBS_OP_LDLAT_THRSH_SHIFT		      (59)
-#define IBS_OP_CUR_CNT			(0xFFF80ULL << 32)
-#define IBS_OP_CUR_CNT_RAND		(0x0007FULL << 32)
-#define IBS_OP_CUR_CNT_EXT_MASK		   (0x7FULL << 52)
-#define IBS_OP_CNT_CTL			      (1ULL << 19)
-#define IBS_OP_VAL			      (1ULL << 18)
-#define IBS_OP_ENABLE			      (1ULL << 17)
-#define IBS_OP_L3MISSONLY		      (1ULL << 16)
-#define IBS_OP_MAX_CNT			     0x0000FFFFULL
-#define IBS_OP_MAX_CNT_EXT		     0x007FFFFFULL	/* not a register bit mask */
-#define IBS_OP_MAX_CNT_EXT_MASK		   (0x7FULL << 20)	/* separate upper 7 bits */
-#define IBS_RIP_INVALID			      (1ULL << 38)
-
-#define IBS_OP_2_DIS			      (1ULL <<  0)
-#define IBS_OP_2_EXCL_RIP_63_EQ_0	      (1ULL <<  1)
-#define IBS_OP_2_EXCL_RIP_63_EQ_1	      (1ULL <<  2)
-#define IBS_OP_2_STRM_ST_FILTER		      (1ULL <<  3)
-#define IBS_OP_2_STRM_ST_FILTER_SHIFT		       (3)
-
 #ifdef CONFIG_X86_LOCAL_APIC
 extern u32 get_ibs_caps(void);
 extern int forward_event_to_ibs(struct perf_event *event);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (5 preceding siblings ...)
  2026-05-04  6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
@ 2026-05-04  6:09 ` Bharata B Rao
  2026-05-04  6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
  2026-05-04 20:36 ` Matthew Wilcox
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom, bharata

Use IBS (Instruction Based Sampling) Memory Profiler feature
present in AMD Zen6 processors for memory access tracking. The
access information obtained from IBS Memory Profiler is fed to
pghot sub-system for further action using
pghot_record_access(PGHOT_HWHINTS, ...) API.

IBS Memory Profiler as page hotness source is enabled by the
new config option HWMEM_PROFILER and is also gated by the
existing pghot_src_hwhints static key set via debugfs.

More details about IBS Memory Profiler can be obtained from
the AMD document titled "AMD64 Zen6 Instruction Based Sampling (IBS)
Extensions and Features".

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/Kconfig                 |  16 ++
 arch/x86/include/asm/ibs-caps.h  |   8 +
 arch/x86/include/asm/ibs-mprof.h |  46 +++++
 arch/x86/include/asm/msr-index.h |   8 +
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/ibs-mprof.c          | 308 +++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h       |   1 +
 include/linux/vm_event_item.h    |   6 +
 mm/Kconfig                       |   9 +
 mm/vmstat.c                      |   6 +
 10 files changed, 409 insertions(+)
 create mode 100644 arch/x86/include/asm/ibs-mprof.h
 create mode 100644 arch/x86/mm/ibs-mprof.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 99bb5217649a..f06c0c44ecce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1514,6 +1514,22 @@ config AMD_MEM_ENCRYPT
 	  This requires an AMD processor that supports Secure Memory
 	  Encryption (SME).
 
+config AMD_IBS_MEMPROF
+	bool "AMD IBS Memory Profiler"
+	depends on X86_64 && CPU_SUP_AMD
+	depends on PGHOT
+	select HWMEM_PROFILER
+	help
+	  Use the AMD Instruction Based Sampling (IBS) Memory Profiler
+	  facility (present on Zen6 and later AMD CPUs) to feed
+	  hardware-observed memory accesses into the pghot subsystem
+	  for hot-page detection and promotion.
+
+	  When disabled, no IBS Memory Profiler MSRs are programmed and
+	  the corresponding NMI handler is not installed.
+
+	  If unsure, say N.
+
 # Common NUMA Features
 config NUMA
 	bool "NUMA Memory Allocation and Scheduler Support"
diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
index ddf6c512c8f9..1f6c4058a0e3 100644
--- a/arch/x86/include/asm/ibs-caps.h
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -29,6 +29,7 @@
 #define IBS_CAPS_FETCHLAT		(1U<<14)
 #define IBS_CAPS_BIT63_FILTER		(1U<<15)
 #define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
+#define IBS_CAPS_MEM_PROFILER		(1U<<18)
 #define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
 
 #define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
@@ -42,6 +43,13 @@
 #define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
 #define IBSCTL_LVT_OFFSET_MASK		0x0F
 
+/*
+ * IBS Memprofiler setup
+ */
+#define IBSCTL_MPROF_LVT_OFFSET_VALID	(1ULL << 24)
+#define IBSCTL_MPROF_LVT_OFFSET_SHIFT	16
+#define IBSCTL_MPROF_LVT_OFFSET_MASK	(0xFULL << IBSCTL_MPROF_LVT_OFFSET_SHIFT)
+
 /* IBS fetch bits/masks */
 #define IBS_FETCH_L3MISSONLY		      (1ULL << 59)
 #define IBS_FETCH_RAND_EN		      (1ULL << 57)
diff --git a/arch/x86/include/asm/ibs-mprof.h b/arch/x86/include/asm/ibs-mprof.h
new file mode 100644
index 000000000000..91b1ce51d667
--- /dev/null
+++ b/arch/x86/include/asm/ibs-mprof.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_MPROF_H
+#define _ASM_X86_IBS_MPROF_H
+
+/*
+ * All bits are documented here for clarity even if the current
+ * driver doesn't use all of them.
+ */
+
+/* MSR_AMD64_IBS_MPROF_DATA2 bits */
+#define IBS_MPROF_DATA2_DATASRC_MASK		0x7
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH	0xC0
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT	0x3
+#define IBS_MPROF_DATA2_DATASRC_LCL_CCX		0x1
+#define IBS_MPROF_DATA2_DATASRC_PEER_CCX_NEAR	0x2
+#define IBS_MPROF_DATA2_DATASRC_DRAM		0x3
+#define IBS_MPROF_DATA2_DATASRC_CCX_FAR		0x5
+#define IBS_MPROF_DATA2_DATASRC_EXT_MEM		0x8
+#define IBS_MPROF_DATA2_RMT_NODE		BIT_ULL(4)
+#define IBS_MPROF_DATA2_RMT_SOCKET		BIT_ULL(9)
+
+/* MSR_AMD64_IBS_MPROF_DATA3 bits */
+#define IBS_MPROF_DATA3_LDOP		BIT_ULL(0)
+#define IBS_MPROF_DATA3_STOP		BIT_ULL(1)
+#define IBS_MPROF_DATA3_DCMISS		BIT_ULL(7)
+#define IBS_MPROF_DATA3_LADDR_VALID	BIT_ULL(17)
+#define IBS_MPROF_DATA3_PADDR_VALID	BIT_ULL(18)
+#define IBS_MPROF_DATA3_L2MISS		BIT_ULL(20)
+#define IBS_MPROF_DATA3_SW_PREFETCH	BIT_ULL(21)
+
+/* MSR_AMD64_IBS_MPROF_CTL bits */
+#define IBS_MPROF_CTL_CNT_CTL		BIT_ULL(19)
+#define IBS_MPROF_CTL_VAL		BIT_ULL(18)
+#define IBS_MPROF_CTL_ENABLE		BIT_ULL(17)
+#define IBS_MPROF_CTL_L3MISSONLY	BIT_ULL(16)
+#define IBS_MPROF_CTL_MAXCNT_MASK	0x0000FFFFULL
+#define IBS_MPROF_CTL_MAXCNT_EXT_MASK	(0x7FULL << 20)	/* separate upper 7 bits */
+
+/* MSR_AMD64_IBS_MPROF_CTL2 bits */
+#define IBS_MPROF_CTL2_DISABLE		BIT_ULL(0)
+#define IBS_MPROF_CTL2_EXCLUDE_USER	BIT_ULL(1)
+#define IBS_MPROF_CTL2_EXCLUDE_KERNEL	BIT_ULL(2)
+
+#define IBS_MPROF_SAMPLE_PERIOD	10000
+
+#endif /* _ASM_X86_IBS_MPROF_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a14a0f43e04a..c44b68940f43 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1315,4 +1315,12 @@
 						* a #GP
 						*/
 
+/* AMD IBS Memory Profiler MSRs */
+#define MSR_AMD64_IBS_MPROF_CTL		0xc0010380
+#define MSR_AMD64_IBS_MPROF_CTL2	0xc0010381
+#define MSR_AMD64_IBS_MPROF_DATA2	0xc0010382
+#define MSR_AMD64_IBS_MPROF_DATA3	0xc0010383
+#define MSR_AMD64_IBS_MPROF_LINADDR	0xc0010384
+#define MSR_AMD64_IBS_MPROF_PHYADDR	0xc0010385
+
 #endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 3a5364853eab..050a7379d9f7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_X86_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+obj-$(CONFIG_AMD_IBS_MEMPROF)	+= ibs-mprof.o
diff --git a/arch/x86/mm/ibs-mprof.c b/arch/x86/mm/ibs-mprof.c
new file mode 100644
index 000000000000..b3d59b21c8c9
--- /dev/null
+++ b/arch/x86/mm/ibs-mprof.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt)	"amd_ibs_memprof: " fmt
+
+#include <linux/init.h>
+#include <linux/pghot.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/irq_work.h>
+#include <linux/mm.h>
+#include <linux/vm_event_item.h>
+#include <linux/vmstat.h>
+#include <linux/cpuhotplug.h>
+
+#include <asm/ibs-mprof.h>
+#include <asm/ibs-caps.h>
+#include <asm/nmi.h>
+#include <asm/apic.h>
+
+#define IBS_NR_SAMPLES		150	/* Percpu sample buffer size */
+
+static DEFINE_PER_CPU(bool, mprof_work_pending);
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct mprof_sample {
+	unsigned long pfn;
+	unsigned long time;	/* jiffies when accessed */
+	int nid;		/* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to pghot sub-system for further action.
+ */
+struct mprof_sample_pcpu {
+	struct mprof_sample samples[IBS_NR_SAMPLES];
+	int head, tail;
+};
+
+static struct mprof_sample_pcpu __percpu *mprof_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to pghot sub-system.
+ */
+static DEFINE_PER_CPU(struct work_struct, mprof_work);
+static DEFINE_PER_CPU(struct irq_work, mprof_irq_work);
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static bool mprof_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+	struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+	int head = READ_ONCE(pcpu->head);
+	int tail = READ_ONCE(pcpu->tail);
+	int next = head + 1;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	if (next == tail)
+		return false;
+
+	pcpu->samples[head].pfn = pfn;
+	pcpu->samples[head].time = time;
+	pcpu->samples[head].nid = nid;
+
+	smp_store_release(&pcpu->head, next);
+	return true;
+}
+
+static bool mprof_pop_sample(struct mprof_sample *s)
+{
+	struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+	int tail = READ_ONCE(pcpu->tail);
+	int head = smp_load_acquire(&pcpu->head);
+	int next = tail + 1;
+
+	if (head == tail)
+		return false;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	*s = pcpu->samples[tail];
+
+	WRITE_ONCE(pcpu->tail, next);
+	return true;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to pghot sub-system for further action.
+ */
+static void mprof_work_handler(struct work_struct *work)
+{
+	struct mprof_sample s;
+
+	while (mprof_pop_sample(&s))
+		pghot_record_access(s.pfn, s.nid, PGHOT_HWHINTS, s.time);
+
+	this_cpu_write(mprof_work_pending, false);
+}
+
+static void mprof_irq_handler(struct irq_work *i)
+{
+	struct work_struct *w = this_cpu_ptr(&mprof_work);
+
+	/*
+	 * FIXME: pending samples on a CPU that goes offline before the
+	 * work runs may be lost or migrated to the wrong CPU's ring;
+	 * needs a teardown-time drain.
+	 */
+	schedule_work_on(smp_processor_id(), w);
+}
+
+/*
+ * L3MissOnly + Exclude kernel RIP
+ */
+static void mprof_enable_profiling(void)
+{
+	u64 mprof_config = IBS_MPROF_CTL_CNT_CTL | IBS_MPROF_CTL_ENABLE |
+			   IBS_MPROF_CTL_L3MISSONLY;
+	unsigned int period = IBS_MPROF_SAMPLE_PERIOD;
+	u64 ctl, ctl2;
+
+	/*
+	 * Assemble bits 26:20 and 19:4 of periodic op counter in ctl.
+	 * The lower 4 bits are always 0000b.
+	 */
+	ctl = (period >> 4) & IBS_MPROF_CTL_MAXCNT_MASK;
+	ctl |= (period & IBS_MPROF_CTL_MAXCNT_EXT_MASK);
+	ctl |= mprof_config;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL, ctl);
+
+	/*
+	 * Exclude samples that have bit 63 of their RIP set.
+	 */
+	ctl2 = IBS_MPROF_CTL2_EXCLUDE_KERNEL;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, ctl2);
+}
+
+static void mprof_disable_profiling(u64 mem_ctl)
+{
+	mem_ctl &= ~IBS_MPROF_CTL_ENABLE;
+	mem_ctl &= ~IBS_MPROF_CTL_VAL;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, IBS_MPROF_CTL2_DISABLE);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int mprof_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	u64 mem_ctl, mem_data3, mem_data2, paddr, data_src;
+	unsigned long pfn;
+	struct page *page;
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+	if (!(mem_ctl & IBS_MPROF_CTL_VAL))
+		return NMI_DONE;
+
+	mprof_disable_profiling(mem_ctl);
+	count_vm_event(HWHINT_TOTAL_EVENTS);
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_DATA3, mem_data3);
+	rdmsrq(MSR_AMD64_IBS_MPROF_DATA2, mem_data2);
+
+	data_src = mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK;
+	data_src |= ((mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK_HIGH) >>
+			IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT);
+
+	switch (data_src) {
+	case IBS_MPROF_DATA2_DATASRC_DRAM:
+		count_vm_event(HWHINT_DRAM_ACCESSES);
+		break;
+	case IBS_MPROF_DATA2_DATASRC_EXT_MEM:
+		count_vm_event(HWHINT_EXTMEM_ACCESSES);
+		break;
+	}
+
+	/* Is linear addr valid? */
+	if (!(mem_data3 & IBS_MPROF_DATA3_LADDR_VALID))
+		goto handled;
+
+	/* Is phys addr valid? */
+	if (!(mem_data3 & IBS_MPROF_DATA3_PADDR_VALID))
+		goto handled;
+	rdmsrq(MSR_AMD64_IBS_MPROF_PHYADDR, paddr);
+
+	pfn = PHYS_PFN(paddr);
+	page = pfn_to_online_page(pfn);
+	if (!page)
+		goto handled;
+
+	/*
+	 * Use the accessing CPU's node as the migration target. On
+	 * topologies where all CPUs reside on toptier nodes (the common
+	 * case), this is the desired behaviour. Topologies that place
+	 * CPUs on lower-tier nodes are rejected later by
+	 * pghot_record_access() via the src_nid == nid early return.
+	 */
+	if (!mprof_push_sample(pfn, numa_node_id(), jiffies))
+		goto handled;
+
+	if (!this_cpu_read(mprof_work_pending)) {
+		this_cpu_write(mprof_work_pending, true);
+		irq_work_queue(this_cpu_ptr(&mprof_irq_work));
+	}
+	count_vm_event(HWHINT_USEFUL_EVENTS);
+
+handled:
+	mprof_enable_profiling();
+	return NMI_HANDLED;
+}
+
+static int get_mprof_lvt_offset(void)
+{
+	u64 val;
+
+	rdmsrq(MSR_AMD64_IBSCTL, val);
+	if (!(val & IBSCTL_MPROF_LVT_OFFSET_VALID))
+		return -EINVAL;
+
+	return (val & IBSCTL_MPROF_LVT_OFFSET_MASK) >>
+		IBSCTL_MPROF_LVT_OFFSET_SHIFT;
+}
+
+static int x86_amd_ibs_mprof_startup(unsigned int cpu)
+{
+	int offset = get_mprof_lvt_offset();
+
+	if (offset < 0) {
+		pr_warn("offset not valid on cpu #%d\n", cpu);
+		return 0;
+	}
+
+	if (setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_NMI, 0)) {
+		pr_warn("APIC setup failed on cpu #%d\n", cpu);
+		return 0;
+	}
+
+	mprof_enable_profiling();
+	return 0;
+}
+
+static int x86_amd_ibs_mprof_teardown(unsigned int cpu)
+{
+	int offset = get_mprof_lvt_offset();
+	u64 mem_ctl;
+
+	if (offset >= 0)
+		setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_FIXED, 1);
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+	mprof_disable_profiling(mem_ctl);
+
+	return 0;
+}
+
+static int __init mprof_access_profiling_init(void)
+{
+	u32 mprof_caps = cpuid_eax(IBS_CPUID_FEATURES);
+	int cpu, ret;
+
+	if (!(mprof_caps & IBS_CAPS_MEM_PROFILER)) {
+		pr_info("capability is unavailable for access profiling\n");
+		return 0;
+	}
+
+	mprof_s = alloc_percpu_gfp(struct mprof_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
+	if (!mprof_s) {
+		pr_err("alloc_percpu_gfp failed\n");
+		return 0;
+	}
+
+	for_each_possible_cpu(cpu) {
+		INIT_WORK(per_cpu_ptr(&mprof_work, cpu), mprof_work_handler);
+		init_irq_work(per_cpu_ptr(&mprof_irq_work, cpu), mprof_irq_handler);
+	}
+
+	register_nmi_handler(NMI_LOCAL, mprof_overflow_handler, 0, "ibs-memprof");
+
+	ret = cpuhp_setup_state(CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
+				"x86/amd/ibs_mprof:starting",
+				x86_amd_ibs_mprof_startup,
+				x86_amd_ibs_mprof_teardown);
+
+	if (ret) {
+		unregister_nmi_handler(NMI_LOCAL, "ibs-memprof");
+		free_percpu(mprof_s);
+		pr_err("cpuhp_setup_state failed: %d\n", ret);
+	} else {
+		pr_info("IBS Memory Profiler setup for memory access profiling\n");
+	}
+	return 0;
+}
+
+device_initcall(mprof_access_profiling_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22ba327ec227..feaa3f571726 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -150,6 +150,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
 	CPUHP_AP_PERF_X86_STARTING,
 	CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+	CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
 	CPUHP_AP_PERF_XTENSA_STARTING,
 	CPUHP_AP_ARM_VFP_STARTING,
 	CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 58d510711bd4..a9c04a9735c6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -179,6 +179,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGHOT_RECORDED_ACCESSES,
 		PGHOT_RECORDED_HINTFAULTS,
 		PGHOT_RECORDED_HWHINTS,
+#ifdef CONFIG_HWMEM_PROFILER
+		HWHINT_TOTAL_EVENTS,
+		HWHINT_DRAM_ACCESSES,
+		HWHINT_EXTMEM_ACCESSES,
+		HWHINT_USEFUL_EVENTS,
+#endif /* CONFIG_HWMEM_PROFILER */
 #endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index cc4b5685ecd4..674cfcea7bb0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1494,6 +1494,15 @@ config PGHOT_PRECISE
 	  4 bytes per page against the default one byte per page. Preferable
 	  to enable this on systems with multiple nodes in toptier.
 
+config HWMEM_PROFILER
+	bool
+	depends on PGHOT
+	help
+	  Umbrella symbol enabled by any in-kernel driver that forwards
+	  hardware-observed memory accesses to the pghot subsystem (for
+	  example AMD_IBS_MEMPROF on x86_64). Drivers select this; users
+	  do not enable it directly.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index da668ff05032..06e7ae06519e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1493,6 +1493,12 @@ const char * const vmstat_text[] = {
 	[I(PGHOT_RECORDED_ACCESSES)]		= "pghot_recorded_accesses",
 	[I(PGHOT_RECORDED_HINTFAULTS)]		= "pghot_recorded_hintfaults",
 	[I(PGHOT_RECORDED_HWHINTS)]		= "pghot_recorded_hwhints",
+#ifdef CONFIG_HWMEM_PROFILER
+	[I(HWHINT_TOTAL_EVENTS)]		= "hwhint_total_events",
+	[I(HWHINT_DRAM_ACCESSES)]		= "hwhint_dram_accesses",
+	[I(HWHINT_EXTMEM_ACCESSES)]		= "hwhint_extmem_accesses",
+	[I(HWHINT_USEFUL_EVENTS)]		= "hwhint_useful_events",
+#endif /* CONFIG_HWMEM_PROFILER */
 #endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (6 preceding siblings ...)
  2026-05-04  6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
@ 2026-05-04  6:23 ` Bharata B Rao
  2026-05-04 20:36 ` Matthew Wilcox
  8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04  6:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom

On 04-May-26 11:39 AM, Bharata B Rao wrote:
> 
> Results
> =======
> Posted as replies to this mail thread.

Micro-benchmark numbers for IBS Memory Profiler pghot source

Test system details
-------------------
2 node AMD system with 1 regular NUMA node (0) and a CXL node (1)

$ numactl -H
available: 3 nodes (0-1)
node 0 cpus: 0-255
node 0 size: 515563 MB
node 1 cpus:
node 1 size: 258034 MB
node distances:
node distances:
node   0   1
  0:  10  50
  1:  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case
HWHINTS - IBS Memory Profiler as source for pghot

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory(8G) at
4K granularity repetitively and randomly. The number of accesses per
thread and the randomness pattern for each thread are fixed beforehand.
The accesses are divided into stores and loads in the ratio of 50:50.

Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 1 before the accesses start.

Repetitive accesses results in lowertier pages becoming hot and kmigrated
detecting and migrating them. The benchmark score is the time taken to
finish the accesses in microseconds. The sooner it finishes the better it is.
All the numbers shown below are average of 3 runs.

Time taken (microseconds, lower is better)
---------------------------------------------------------
Source          Base            Pghot-default
---------------------------------------------------------
NUMAB0          181,393,365     184,331,381
NUMAB2          42,287,528
HWHINTS         NA              50,422,862
---------------------------------------------------------

Stats comparision b/n base-NUMAB2 and pghot-default-hwhints
---------------------------------------------------------------------
                                Base-NUMAB2     Pghot-default-hwhints
---------------------------------------------------------------------
pgpromote_success               2097152         1961087
numa_hint_faults                2358069         0
pghot_recorded_accesses         NA              1962696
pghot_recorded_hintfaults       NA              0
pghot_recorded_hwhints          NA              5532979
hwhint_total_events             NA              5532979
---------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
                   ` (7 preceding siblings ...)
  2026-05-04  6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-04 20:36 ` Matthew Wilcox
  8 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2026-05-04 20:36 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
	akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom

On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> This is v7 of pghot, a hot-page tracking and promotion subsystem. The

I continue to think we should not do this.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-05-04 20:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-04  6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
2026-05-04 18:14   ` Donet Tom
2026-05-04  6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
2026-05-04  6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
2026-05-04 18:41   ` Donet Tom
2026-05-04  6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
2026-05-04  6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
2026-05-04  6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
2026-05-04  6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04 20:36 ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox