* [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
` (7 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
We want isolation of misplaced folios to work in contexts
where a VMA isn't available, typically when performing migrations
from a kernel thread context. In order to prepare for that,
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.
When migrate_misplaced_folio_prepare() is called with a non-NULL
VMA, it checks whether the folio is mapped shared, which requires
holding the PTL. This path isn't taken when the function is
invoked with a NULL VMA (migration outside of process context).
Therefore, when VMA == NULL, migrate_misplaced_folio_prepare()
does not require the caller to hold the PTL.
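The resulting contract can be sketched in a few lines of userspace C (stub types and a hypothetical may_isolate() helper, not the kernel API): the shared-executable check, which needs the PTL, is short-circuited when no VMA is passed.

```c
#include <assert.h>
#include <stddef.h>

#define VM_EXEC 0x4UL  /* illustrative flag value, not taken from any arch */

struct vm_area_struct { unsigned long vm_flags; };

/* Sketch of the guard: with a NULL vma (kthread-initiated migration),
 * the shared-executable check is skipped entirely, so the PTL is not
 * required; with a non-NULL vma the original check still applies. */
static int may_isolate(const struct vm_area_struct *vma, int folio_mapped_shared)
{
	if (vma && (vma->vm_flags & VM_EXEC) && folio_mapped_shared)
		return -13; /* -EACCES */
	return 0;
}
```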
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
mm/migrate.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 8a64291ab5b4..eb21a02fade0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2671,7 +2671,12 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
/*
* Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
+ *
+ * When called with a NULL vma (e.g., kernel thread initiated migration),
+ * migrate_misplaced_folio_prepare() will allow shared executable folios
+ * to be migrated.
*/
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -2688,7 +2693,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_maybe_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+ if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
return -EACCES;
/*
--
2.34.1
* [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios()
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04 6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 18:14 ` Donet Tom
2026-05-04 6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
` (6 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
From: Gregory Price <gourry@gourry.net>
Tiered memory systems often require migrating multiple folios at once.
Currently, migrate_misplaced_folio() handles only one folio per call,
which is inefficient for batch operations. This patch introduces
promote_misplaced_memcg_folios(), a batch variant that leverages
migrate_pages() internally for improved performance.
The caller must isolate folios beforehand using
migrate_misplaced_folio_prepare(). Additionally, all folios in the
isolated list must belong to the same memcg. On return, the folio list
will be empty regardless of success or failure.
This function will be used by the pghot kmigrated thread.
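The return and list semantics can be modelled in userspace C (a hypothetical batch_promote() over a plain linked list standing in for the kernel's folio list; -11 plays the role of -EAGAIN):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked list standing in for the isolated folio list
 * (illustrative stand-ins; the kernel uses list_head and struct folio). */
struct node { struct node *next; int migratable; };

/* Mirrors the promote_misplaced_memcg_folios() contract: attempt every
 * entry, count failures, and leave the caller's list empty either way.
 * Returns 0 on full success, -11 (-EAGAIN) on failure or partial promotion. */
static int batch_promote(struct node **list, int *nr_succeeded)
{
	int nr_remaining = 0;

	*nr_succeeded = 0;
	while (*list) {
		struct node *n = *list;

		*list = n->next;	/* list is always drained */
		n->next = NULL;
		if (n->migratable)
			(*nr_succeeded)++;
		else
			nr_remaining++;	/* would be put back to its LRU */
	}
	return nr_remaining ? -11 : 0;
}
```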
Signed-off-by: Gregory Price <gourry@gourry.net>
[Rewrote commit description, added memcg awareness]
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/migrate.h | 5 ++++
mm/migrate.c | 57 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d5af2b7f577b..d136612eef9d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -111,6 +111,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -121,6 +122,10 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
{
return -EAGAIN; /* can't migrate now */
}
+static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index eb21a02fade0..747277aadf19 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2770,4 +2770,61 @@ int migrate_misplaced_folio(struct folio *folio, int node)
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/**
+ * promote_misplaced_memcg_folios() - Batch variant of migrate_misplaced_folio
+ * Attempts to promote a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-promoted.
+ * @node: The NUMA node ID to where the folios should be promoted.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folios. All the isolated folios
+ * in the list must belong to the same memcg so that NUMA_PAGE_MIGRATE
+ * stat can be attributed correctly to the memcg.
+ *
+ * This function will un-isolate the folios, drop the elevated reference
+ * and remove them from the list before returning. This should be called
+ * only for batched promotion of hot pages from lower tier nodes.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial promotion.
+ * On return, @folio_list will be empty regardless of success/failure.
+ */
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+ struct mem_cgroup *memcg = NULL;
+ unsigned int nr_succeeded = 0;
+ struct folio *first;
+ int nr_remaining;
+
+ if (list_empty(folio_list))
+ return 0;
+
+ first = list_first_entry(folio_list, struct folio, lru);
+#ifdef CONFIG_DEBUG_VM
+ {
+ struct folio *f;
+ list_for_each_entry(f, folio_list, lru)
+ VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first));
+ }
+#endif
+ memcg = get_mem_cgroup_from_folio(first);
+
+ nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+ NULL, node, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED, &nr_succeeded);
+ if (nr_remaining)
+ putback_movable_pages(folio_list);
+
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+ mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
+ PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+
+ mem_cgroup_put(memcg);
+ WARN_ON(!list_empty(folio_list));
+ return nr_remaining ? -EAGAIN : 0;
+}
#endif /* CONFIG_NUMA_BALANCING */
--
2.34.1
* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios()
2026-05-04 6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
@ 2026-05-04 18:14 ` Donet Tom
0 siblings, 0 replies; 12+ messages in thread
From: Donet Tom @ 2026-05-04 18:14 UTC (permalink / raw)
To: Bharata B Rao, linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg
Hi Bharata
On 5/4/26 11:39 AM, Bharata B Rao wrote:
> +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
> +{
> + struct mem_cgroup *memcg = NULL;
> + unsigned int nr_succeeded = 0;
> + struct folio *first;
> + int nr_remaining;
> +
> + if (list_empty(folio_list))
> + return 0;
> +
> + first = list_first_entry(folio_list, struct folio, lru);
> +#ifdef CONFIG_DEBUG_VM
> + {
> + struct folio *f;
> + list_for_each_entry(f, folio_list, lru)
> + VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first));
It looks like the indentation might be off here.
> + }
> +#endif
> + memcg = get_mem_cgroup_from_folio(first);
> +
> + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> + NULL, node, MIGRATE_ASYNC,
> + MR_NUMA_MISPLACED, &nr_succeeded);
> + if (nr_remaining)
> + putback_movable_pages(folio_list);
> +
> + if (nr_succeeded) {
> + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> + count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
> + mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
> + PGPROMOTE_SUCCESS, nr_succeeded);
> + }
> +
> + mem_cgroup_put(memcg);
> + WARN_ON(!list_empty(folio_list));
> + return nr_remaining ? -EAGAIN : 0;
> +}
> #endif /* CONFIG_NUMA_BALANCING */
* [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04 6:09 ` [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-05-04 6:09 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
` (5 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
pghot is a subsystem that collects memory access information from
multiple sources, classifies hot pages resident in lower-tier memory,
and promotes them to faster tiers. It stores per-PFN hotness metadata
and performs asynchronous, batched promotion via a per-lower-tier-node
kernel thread (kmigrated).
This change introduces the default (compact) mode of pghot:
- Per-PFN hotness record (phi_t = u8) embedded via mem_section:
- 2 bits: access frequency (4 levels)
- 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
- 1 bit : migration-ready flag (MSB)
The LSB of mem_section->hot_map pointer is used as a per-section
"hot" flag to gate scanning.
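A userspace sketch of the 8-bit record layout (widths mirror the description above; the macro names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t phi_t;

/* Field layout from the patch: 2-bit access frequency at bits 0-1,
 * 5-bit bucketed time at bits 2-6, migration-ready flag at bit 7. */
#define FREQ_SHIFT 0
#define FREQ_MASK  0x3u
#define TIME_SHIFT 2
#define TIME_MASK  0x1fu
#define READY_BIT  (1u << 7)

static phi_t pack(unsigned int freq, unsigned int time_bucket, int ready)
{
	phi_t v = 0;

	v |= (freq & FREQ_MASK) << FREQ_SHIFT;
	v |= (time_bucket & TIME_MASK) << TIME_SHIFT;
	if (ready)
		v |= READY_BIT;
	return v;
}

static unsigned int unpack_freq(phi_t v) { return (v >> FREQ_SHIFT) & FREQ_MASK; }
static unsigned int unpack_time(phi_t v) { return (v >> TIME_SHIFT) & TIME_MASK; }
```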
- Event recording API:
int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (subsystem) that generated the
access info
@now: The access time in jiffies
- Sources (e.g., NUMA hint faults, HW hints) call this to report
accesses.
- In default mode, the nid is not stored/used for targeting;
promotion goes to a configurable toptier node (pghot_target_nid).
- Promotion engine:
- One kmigrated thread per lower-tier node.
- Scans only sections whose "hot" flag was raised, iterates PFNs,
and batches candidates by destination node.
- Uses promote_misplaced_memcg_folios() to move batched folios.
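The per-destination batching step can be sketched as follows (hypothetical userspace model; BATCH_NR stands in for kmigrated_batch_nr, and flush_batch() for one call into the migration core):

```c
#include <assert.h>

#define MAX_NODES 4
#define BATCH_NR  2	/* stands in for kmigrated_batch_nr */

/* Per-destination-node batches, flushed whenever one fills up: a
 * simplified model of how kmigrated groups migration-ready PFNs
 * before handing each batch off for migration. */
static unsigned long batch[MAX_NODES][BATCH_NR];
static int batch_len[MAX_NODES];
static int flushes[MAX_NODES];

static void flush_batch(int nid)
{
	if (batch_len[nid]) {
		flushes[nid]++;		/* one batched migration per flush */
		batch_len[nid] = 0;
	}
}

static void queue_pfn(unsigned long pfn, int nid)
{
	batch[nid][batch_len[nid]++] = pfn;
	if (batch_len[nid] == BATCH_NR)
		flush_batch(nid);
}
```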
- Tunables & stats:
- debugfs: enabled_sources, target_nid, freq_threshold,
kmigrated_sleep_ms, kmigrated_batch_nr
- sysctl : vm.pghot_promote_freq_window_ms
- vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
pghot_recorded_hwhints
Memory overhead
---------------
Default mode uses 1 byte of hotness metadata per PFN on lower-tier
nodes.
Behavior & policy
-----------------
- Default mode promotion target:
The nid passed by sources is not stored; hot pages promote to
pghot_target_nid (toptier). Precision mode (added later in the
series) changes this.
- Record consumption:
kmigrated consumes (clears) the "migration-ready" bit before
attempting isolation. Additionally, the hotness record is reset.
If isolation/migration fails, the folio is not re-queued automatically;
subsequent accesses will re-arm it. This avoids retry storms and
keeps batching stable.
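The consume-and-clear step can be modelled with C11 atomics (userspace stand-in for the kernel's try_cmpxchg() loop in pghot_get_record(); names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define READY_BIT (1u << 7)

/* Consume a hotness record only if the migration-ready bit is set.
 * The whole record is reset to 0 on success, so a failed isolation or
 * migration is not retried until fresh accesses re-arm the record. */
static int consume_record(_Atomic uint8_t *phi, uint8_t *out)
{
	uint8_t old = atomic_load(phi);

	do {
		if (!(old & READY_BIT))
			return -1;	/* not ready: nothing consumed */
	} while (!atomic_compare_exchange_weak(phi, &old, 0));
	*out = old;
	return 0;
}
```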
- Wakeups:
kmigrated wakeups are intentionally timeout-driven. We set
the per-pgdat "activate" flag on access, and kmigrated checks this
flag on its next sleep interval. This keeps the first cut simple
and avoids potential wake storms; active wakeups can be considered
in a follow-up.
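The timeout-driven scheme can be modelled in a few lines (userspace sketch; atomic_exchange() stands in for a test-and-clear of the per-pgdat activate flag):

```c
#include <assert.h>
#include <stdatomic.h>

/* One "activate" flag per node: set on every recorded access, and
 * tested-and-cleared by kmigrated once per sleep interval. There are
 * no direct wakeups, so a burst of accesses triggers at most one scan
 * per interval, avoiding wake storms. */
static _Atomic int activate;
static int scans;

static void record_access(void)
{
	atomic_store(&activate, 1);
}

static void kmigrated_tick(void)	/* runs once per sleep interval */
{
	if (atomic_exchange(&activate, 0))
		scans++;
}
```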
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/pghot.rst | 80 ++++
include/linux/migrate.h | 4 +-
include/linux/mmzone.h | 20 +
include/linux/pghot.h | 82 ++++
include/linux/vm_event_item.h | 5 +
mm/Kconfig | 14 +
mm/Makefile | 1 +
mm/migrate.c | 16 +-
mm/mm_init.c | 10 +
mm/pghot-default.c | 79 ++++
mm/pghot-tunables.c | 182 +++++++++
mm/pghot.c | 494 +++++++++++++++++++++++++
mm/vmstat.c | 5 +
14 files changed, 986 insertions(+), 7 deletions(-)
create mode 100644 Documentation/admin-guide/mm/pghot.rst
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot-default.c
create mode 100644 mm/pghot-tunables.c
create mode 100644 mm/pghot.c
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index bbb563cba5d2..4d6810b02365 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -43,3 +43,4 @@ the Linux memory management.
userfaultfd
zswap
kho
+ pghot
diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-guide/mm/pghot.rst
new file mode 100644
index 000000000000..5f51dd1d4d45
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes tunables available via **debugfs** and **sysctl** for
+PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+ - Bitmask to enable/disable hotness sources.
+ - Bits:
+ - 0: Hint faults (value 0x1)
+ - 1: Hardware hints (value 0x2)
+ - Default: 0 (disabled)
+ - Example:
+ # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
+ Enables all sources.
+
+2. **target_nid**
+ - Toptier NUMA node ID to which hot pages should be promoted when source
+ does not provide nid. Used when hotness source can't provide accessing
+ NID or when the tracking mode is default.
+ - Default: 0
+ - Example:
+ # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+ - Minimum access frequency before a page is marked ready for promotion.
+ - Range: 1 to 3
+ - Default: 2
+ - Example:
+ # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+ - Sleep interval (ms) for kmigrated thread between scans.
+ - Default: 100
+
+5. **kmigrated_batch_nr**
+ - Maximum number of folios migrated in one batch.
+ - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+ considered hot only when **freq_threshold** number of accesses occur within
+ this time period.
+- Default: 3000 (3 seconds)
+- Example:
+ # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+The following vmstat counters provide statistics about the pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+ - Number of total hot page accesses recorded by pghot.
+
+2. **pghot_recorded_hintfaults**
+ - Number of recorded accesses reported by NUMA Balancing based
+ hotness source.
+
+3. **pghot_recorded_hwhints**
+ - Number of recorded accesses reported by hwhints source.
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d136612eef9d..53bae80d11ae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
#endif /* CONFIG_MIGRATION */
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
@@ -126,7 +126,7 @@ static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, i
{
return -EAGAIN; /* can't migrate now */
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
#ifdef CONFIG_MIGRATION
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..eb08431dc9fb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1155,6 +1155,7 @@ enum pgdat_flags {
* many pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
+ PGDAT_KMIGRATED_ACTIVATE, /* activates kmigrated */
};
enum zone_flags {
@@ -1609,6 +1610,10 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+#ifdef CONFIG_PGHOT
+ struct task_struct *kmigrated;
+ wait_queue_head_t kmigrated_wait;
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -2019,12 +2024,27 @@ struct mem_section {
unsigned long section_mem_map;
struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+ /*
+ * Per-PFN hotness data for this section.
+ * Array of phi_t (u8 in default mode).
+ * LSB is used as PGHOT_SECTION_HOT_BIT flag.
+ */
+ void *hot_map;
+#endif
#ifdef CONFIG_PAGE_EXTENSION
/*
* If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
* section. (see page_ext.h about this.)
*/
struct page_ext *page_ext;
+#endif
+ /*
+ * Padding to maintain consistent mem_section size when exactly
+ * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+ * optimal alignment regardless of configuration.
+ */
+#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
unsigned long pad;
#endif
/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..525d4dd28fc1
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+ PGHOT_HINTFAULTS = 0,
+ PGHOT_HWHINTS,
+ PGHOT_SRC_MAX
+};
+
+#ifdef CONFIG_PGHOT
+#include <linux/static_key.h>
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+
+#define PGHOT_HINTFAULTS_ENABLED BIT(PGHOT_HINTFAULTS)
+#define PGHOT_HWHINTS_ENABLED BIT(PGHOT_HWHINTS)
+#define PGHOT_SRC_ENABLED_MASK GENMASK(PGHOT_SRC_MAX - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD 2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS 100
+#define KMIGRATED_DEFAULT_BATCH_NR 512
+
+#define PGHOT_DEFAULT_NODE 0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW (3 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY 7
+
+#define PGHOT_FREQ_WIDTH 2
+/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_SHIFT 7
+#define PGHOT_TIME_WIDTH 5
+#define PGHOT_NID_WIDTH 10
+
+#define PGHOT_FREQ_SHIFT 0
+#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK (PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
+
+#define PGHOT_NID_MAX ((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX ((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT 0
+#define PGHOT_SECTION_HOT_MASK BIT(PGHOT_SECTION_HOT_BIT)
+
+bool pghot_nid_valid(int nid);
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+ return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..58d510711bd4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -175,6 +175,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSTACK_REST,
#endif
#endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+ PGHOT_RECORDED_ACCESSES,
+ PGHOT_RECORDED_HINTFAULTS,
+ PGHOT_RECORDED_HWHINTS,
+#endif /* CONFIG_PGHOT */
NR_VM_EVENT_ITEMS
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 0a43bb80df4f..ebfa149d8123 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1469,6 +1469,20 @@ config LAZY_MMU_MODE_KUNIT_TEST
If unsure, say N.
+config PGHOT
+ bool "Hot page tracking and promotion"
+ default n
+ depends on NUMA_MIGRATION && SPARSEMEM
+ help
+ A sub-system to track page accesses in lower tier memory and
+ maintain hot page information. Promotes hot pages from lower
+ tiers to top tier by using the memory access information provided
+ by various sources. Asynchronous promotion is done by per-node
+ kernel threads.
+
+ This adds 1 byte of metadata overhead per page in lower-tier
+ memory nodes.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..33014de43acc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 747277aadf19..726d27b61a46 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2625,7 +2625,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
}
#endif /* CONFIG_NUMA_MIGRATION */
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
/*
* Returns true if this is a safe migration target node for misplaced NUMA
* pages. Currently it only checks the watermarks which is crude.
@@ -2745,12 +2745,10 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
*/
int migrate_misplaced_folio(struct folio *folio, int node)
{
- pg_data_t *pgdat = NODE_DATA(node);
int nr_remaining;
unsigned int nr_succeeded;
LIST_HEAD(migratepages);
struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
list_add(&folio->lru, &migratepages);
nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
@@ -2759,12 +2757,18 @@ int migrate_misplaced_folio(struct folio *folio, int node)
if (nr_remaining && !list_empty(&migratepages))
putback_movable_pages(&migratepages);
if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
&& !node_is_toptier(folio_nid(folio))
- && node_is_toptier(node))
+ && node_is_toptier(node)) {
+ pg_data_t *pgdat = NODE_DATA(node);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+#endif
}
mem_cgroup_put(memcg);
BUG_ON(!list_empty(&migratepages));
@@ -2817,14 +2821,16 @@ int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
putback_movable_pages(folio_list);
if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
PGPROMOTE_SUCCESS, nr_succeeded);
+#endif
}
mem_cgroup_put(memcg);
WARN_ON(!list_empty(folio_list));
return nr_remaining ? -EAGAIN : 0;
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..2396c42028ae 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1384,6 +1384,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
#endif
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+ init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
int i;
@@ -1393,6 +1402,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_init_split_queue(pgdat);
pgdat_init_kcompactd(pgdat);
+ pgdat_init_kmigrated(pgdat);
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot-default.c b/mm/pghot-default.c
new file mode 100644
index 000000000000..e610062345e4
--- /dev/null
+++ b/mm/pghot-default.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Default mode
+ *
+ * 1 byte hotness record per PFN.
+ * Bucketed time and frequency tracked as part of the record.
+ * Promotion to @pghot_target_nid by default.
+ */
+
+#include <linux/pghot.h>
+#include <linux/jiffies.h>
+
+/* pghot-default doesn't store the NID, hence no NID validation is required */
+bool pghot_nid_valid(int nid)
+{
+ return true;
+}
+
+/*
+ * @time is regular time, @old_time is bucketed time.
+ */
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+ time &= PGHOT_TIME_BUCKETS_MASK;
+ old_time <<= PGHOT_TIME_BUCKETS_SHIFT;
+
+ return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+ phi_t freq, old_freq, hotness, old_hotness, old_time;
+ phi_t time = now >> PGHOT_TIME_BUCKETS_SHIFT;
+
+ old_hotness = READ_ONCE(*phi);
+ do {
+ bool new_window = false;
+
+ hotness = old_hotness;
+ old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+ if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window)
+ new_window = true;
+
+ if (new_window)
+ freq = 1;
+ else if (old_freq < PGHOT_FREQ_MAX)
+ freq = old_freq + 1;
+ else
+ freq = old_freq;
+
+ hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+ hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+ hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+ hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+ if (freq >= pghot_freq_threshold)
+ hotness |= BIT(PGHOT_MIGRATE_READY);
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+ return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+ phi_t old_hotness, hotness = 0;
+
+ old_hotness = READ_ONCE(*phi);
+ do {
+ if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+ return -EINVAL;
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+ *nid = pghot_target_nid;
+ *freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ *time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+ return 0;
+}
diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
new file mode 100644
index 000000000000..f04e2137309e
--- /dev/null
+++ b/mm/pghot-tunables.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot tunables in debugfs
+ */
+#include <linux/pghot.h>
+#include <linux/memory-tiers.h>
+#include <linux/debugfs.h>
+
+static struct dentry *debugfs_pghot;
+static DEFINE_MUTEX(pghot_tunables_lock);
+
+static ssize_t pghot_freq_th_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[16];
+ unsigned int freq;
+
+ if (cnt > 15)
+ cnt = 15;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+ buf[cnt] = '\0';
+
+ if (kstrtouint(buf, 10, &freq))
+ return -EINVAL;
+
+ if (!freq || freq > PGHOT_FREQ_MAX)
+ return -EINVAL;
+
+ mutex_lock(&pghot_tunables_lock);
+ pghot_freq_threshold = freq;
+ mutex_unlock(&pghot_tunables_lock);
+
+ *ppos += cnt;
+ return cnt;
+}
+
+static int pghot_freq_th_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%d\n", pghot_freq_threshold);
+ return 0;
+}
+
+static int pghot_freq_th_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, pghot_freq_th_show, NULL);
+}
+
+static const struct file_operations pghot_freq_th_fops = {
+ .open = pghot_freq_th_open,
+ .write = pghot_freq_th_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static ssize_t pghot_target_nid_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[16];
+ unsigned int nid;
+
+ if (cnt > 15)
+ cnt = 15;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+ buf[cnt] = '\0';
+
+ if (kstrtouint(buf, 10, &nid))
+ return -EINVAL;
+
+ if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid))
+ return -EINVAL;
+ mutex_lock(&pghot_tunables_lock);
+ pghot_target_nid = nid;
+ mutex_unlock(&pghot_tunables_lock);
+
+ *ppos += cnt;
+ return cnt;
+}
+
+static int pghot_target_nid_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%d\n", pghot_target_nid);
+ return 0;
+}
+
+static int pghot_target_nid_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, pghot_target_nid_show, NULL);
+}
+
+static const struct file_operations pghot_target_nid_fops = {
+ .open = pghot_target_nid_open,
+ .write = pghot_target_nid_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static void pghot_src_enabled_update(unsigned int enabled)
+{
+ unsigned int changed = pghot_src_enabled ^ enabled;
+
+ if (changed & PGHOT_HINTFAULTS_ENABLED) {
+ if (enabled & PGHOT_HINTFAULTS_ENABLED)
+ static_branch_enable(&pghot_src_hintfaults);
+ else
+ static_branch_disable(&pghot_src_hintfaults);
+ }
+
+ if (changed & PGHOT_HWHINTS_ENABLED) {
+ if (enabled & PGHOT_HWHINTS_ENABLED)
+ static_branch_enable(&pghot_src_hwhints);
+ else
+ static_branch_disable(&pghot_src_hwhints);
+ }
+}
+
+static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[16];
+ unsigned int enabled;
+
+ if (cnt > 15)
+ cnt = 15;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+ buf[cnt] = '\0';
+
+ if (kstrtouint(buf, 0, &enabled))
+ return -EINVAL;
+
+ if (enabled & ~PGHOT_SRC_ENABLED_MASK)
+ return -EINVAL;
+
+ mutex_lock(&pghot_tunables_lock);
+ pghot_src_enabled_update(enabled);
+ pghot_src_enabled = enabled;
+ mutex_unlock(&pghot_tunables_lock);
+
+ *ppos += cnt;
+ return cnt;
+}
+
+static int pghot_src_enabled_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%u\n", pghot_src_enabled);
+ return 0;
+}
+
+static int pghot_src_enabled_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, pghot_src_enabled_show, NULL);
+}
+
+static const struct file_operations pghot_src_enabled_fops = {
+ .open = pghot_src_enabled_open,
+ .write = pghot_src_enabled_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+void pghot_debug_init(void)
+{
+ debugfs_pghot = debugfs_create_dir("pghot", NULL);
+ debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL,
+ &pghot_src_enabled_fops);
+ debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL,
+ &pghot_target_nid_fops);
+ debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL,
+ &pghot_freq_th_fops);
+ debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot,
+ &kmigrated_sleep_ms);
+ debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot,
+ &kmigrated_batch_nr);
+}
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..02e6959b647a
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,494 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Per-PFN hotness information is stored for lower tier nodes in
+ * mem_section.
+ *
+ * In the default mode, a single byte (u8) is used to store
+ * the frequency of access and last access time. Promotions are done
+ * to a default toptier NID.
+ *
+ * A kernel thread named kmigrated is provided to migrate or promote
+ * the hot pages. kmigrated runs for each lower tier node. It iterates
+ * over the node's PFNs and migrates pages marked for migration into
+ * their targeted nodes.
+ */
+#include <linux/mm.h>
+#include <linux/migrate.h>
+#include <linux/memory.h>
+#include <linux/memory-tiers.h>
+#include <linux/pghot.h>
+
+unsigned int pghot_target_nid = PGHOT_DEFAULT_NODE;
+unsigned int pghot_src_enabled;
+unsigned int pghot_freq_threshold = PGHOT_DEFAULT_FREQ_THRESHOLD;
+unsigned int kmigrated_sleep_ms = KMIGRATED_DEFAULT_SLEEP_MS;
+unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
+
+unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
+
+DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
+DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table pghot_sysctls[] = {
+ {
+ .procname = "pghot_promote_freq_window_ms",
+ .data = &sysctl_pghot_freq_window,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+};
+#endif
+
+static bool kmigrated_started __ro_after_init;
+
+/**
+ * pghot_record_access() - Record page accesses from lower tier memory
+ * for the purpose of tracking page hotness and subsequent promotion.
+ *
+ * @pfn: PFN of the page
+ * @nid: Unused
+ * @src: The identifier of the sub-system that reports the access
+ * @now: Access time in jiffies
+ *
+ * Updates the frequency and time of access and marks the page as
+ * ready for migration if the frequency crosses a threshold. The pages
+ * marked for migration are migrated by the kmigrated kernel thread.
+ *
+ * Return: 0 on success and -EINVAL on failure to record the access.
+ */
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+ struct mem_section *ms;
+ struct folio *folio;
+ phi_t *phi, *hot_map;
+ struct page *page;
+ int src_nid;
+
+ if (!kmigrated_started)
+ return 0;
+
+ if (!pghot_nid_valid(nid))
+ return -EINVAL;
+
+ switch (src) {
+ case PGHOT_HINTFAULTS:
+ if (!static_branch_unlikely(&pghot_src_hintfaults))
+ return 0;
+ count_vm_event(PGHOT_RECORDED_HINTFAULTS);
+ break;
+ case PGHOT_HWHINTS:
+ if (!static_branch_unlikely(&pghot_src_hwhints))
+ return 0;
+ count_vm_event(PGHOT_RECORDED_HWHINTS);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ src_nid = pfn_to_nid(pfn);
+ if (src_nid == nid)
+ return 0;
+
+ /*
+ * Record only accesses from lower tiers.
+ */
+ if (node_is_toptier(src_nid))
+ return 0;
+
+ /*
+ * Reject the non-migratable pages right away.
+ */
+ page = pfn_to_online_page(pfn);
+ if (!page || is_zone_device_page(page))
+ return 0;
+
+ folio = page_folio(page);
+ if (!folio_try_get(folio))
+ return 0;
+
+ if (unlikely(page_folio(page) != folio))
+ goto out;
+
+ if (!folio_test_lru(folio))
+ goto out;
+
+ /* Get the hotness slot corresponding to the 1st PFN of the folio */
+ pfn = folio_pfn(folio);
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->hot_map)
+ goto out;
+
+ hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+ phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+ count_vm_event(PGHOT_RECORDED_ACCESSES);
+
+ /*
+ * Update the hotness parameters.
+ */
+ if (pghot_update_record(phi, nid, now)) {
+ set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map);
+ set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
+ }
+out:
+ folio_put(folio);
+ return 0;
+}
+
+static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
+ unsigned long *time)
+{
+ phi_t *phi, *hot_map;
+ struct mem_section *ms;
+
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->hot_map)
+ return -EINVAL;
+
+ hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+ phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+ return pghot_get_record(phi, nid, freq, time);
+}
+
+/*
+ * Walks the given PFN range, isolating and migrating hot folios in batches.
+ */
+static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
+ int src_nid)
+{
+ struct mem_cgroup *cur_memcg = NULL;
+ int cur_nid = NUMA_NO_NODE;
+ LIST_HEAD(migrate_list);
+ int batch_count = 0;
+ struct folio *folio;
+ struct page *page;
+ unsigned long pfn;
+
+ pfn = start_pfn;
+ do {
+ int nid = NUMA_NO_NODE, nr = 1;
+ struct mem_cgroup *memcg;
+ unsigned long time = 0;
+ int freq = 0;
+
+ if (!pfn_valid(pfn))
+ goto out_next;
+
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ goto out_next;
+
+ folio = page_folio(page);
+ if (!folio_try_get(folio))
+ goto out_next;
+
+ if (unlikely(page_folio(page) != folio)) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ nr = folio_nr_pages(folio);
+ if (folio_nid(folio) != src_nid) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ if (!folio_test_lru(folio)) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ if (pghot_get_hotness(pfn, &nid, &freq, &time)) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ if (nid == NUMA_NO_NODE)
+ nid = pghot_target_nid;
+
+ if (folio_nid(folio) == nid) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
+ folio_put(folio);
+ goto out_next;
+ }
+
+ memcg = folio_memcg(folio);
+ if (cur_nid == NUMA_NO_NODE) {
+ cur_nid = nid;
+ cur_memcg = memcg;
+ }
+
+ /* If NID or memcg changed, flush the previous batch first */
+ if (cur_nid != nid || cur_memcg != memcg) {
+ if (!list_empty(&migrate_list))
+ promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+ cur_nid = nid;
+ cur_memcg = memcg;
+ batch_count = 0;
+ cond_resched();
+ }
+
+ list_add(&folio->lru, &migrate_list);
+ folio_put(folio);
+
+ if (++batch_count > kmigrated_batch_nr) {
+ promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+ batch_count = 0;
+ cond_resched();
+ }
+out_next:
+ pfn += nr;
+ } while (pfn < end_pfn);
+ if (!list_empty(&migrate_list))
+ promote_misplaced_memcg_folios(&migrate_list, cur_nid);
+}
+
+static void kmigrated_do_work(pg_data_t *pgdat)
+{
+ unsigned long section_nr, s_begin, start_pfn;
+ struct mem_section *ms;
+ int nid;
+
+ clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+ s_begin = next_present_section_nr(-1);
+ for_each_present_section_nr(s_begin, section_nr) {
+ start_pfn = section_nr_to_pfn(section_nr);
+ ms = __nr_to_section(section_nr);
+
+ if (!pfn_valid(start_pfn))
+ continue;
+
+ nid = pfn_to_nid(start_pfn);
+ if (node_is_toptier(nid) || nid != pgdat->node_id)
+ continue;
+
+ if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map))
+ continue;
+
+ kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION,
+ pgdat->node_id);
+ }
+}
+
+static inline bool kmigrated_work_requested(pg_data_t *pgdat)
+{
+ return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+}
+
+/*
+ * Per-node kthread that iterates over its PFNs and migrates the
+ * pages that have been marked for migration.
+ */
+static int kmigrated(void *p)
+{
+ pg_data_t *pgdat = p;
+
+ while (!kthread_should_stop()) {
+ long timeout = msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms));
+
+ if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(pgdat),
+ timeout))
+ kmigrated_do_work(pgdat);
+ }
+ return 0;
+}
+
+static int kmigrated_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret;
+
+ if (!pgdat->kmigrated) {
+ pgdat->kmigrated = kthread_create_on_node(kmigrated, pgdat, nid,
+ "kmigrated%d", nid);
+ if (IS_ERR(pgdat->kmigrated)) {
+ ret = PTR_ERR(pgdat->kmigrated);
+ pgdat->kmigrated = NULL;
+ pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret);
+ return ret;
+ }
+ pr_info("pghot: Started kmigrated thread for node %d\n", nid);
+ }
+ wake_up_process(pgdat->kmigrated);
+ return 0;
+}
+
+static void pghot_free_hot_map(struct mem_section *ms)
+{
+ kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK));
+ ms->hot_map = NULL;
+}
+
+static int pghot_alloc_hot_map(struct mem_section *ms, int nid)
+{
+ ms->hot_map = kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KERNEL,
+ nid);
+ if (!ms->hot_map)
+ return -ENOMEM;
+ return 0;
+}
+
+static void pghot_offline_sec_hotmap(unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ unsigned long start, end, pfn;
+ struct mem_section *ms;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->hot_map)
+ continue;
+
+ pghot_free_hot_map(ms);
+ }
+}
+
+static int pghot_online_sec_hotmap(unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ int nid = pfn_to_nid(start_pfn);
+ unsigned long start, end, pfn;
+ struct mem_section *ms;
+ int fail = 0;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+ if (!ms || ms->hot_map)
+ continue;
+
+ fail = pghot_alloc_hot_map(ms, nid);
+ }
+
+ if (!fail)
+ return 0;
+
+ /* rollback */
+ end = pfn - PAGES_PER_SECTION;
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+ if (ms && ms->hot_map)
+ pghot_free_hot_map(ms);
+ }
+ return -ENOMEM;
+}
+
+static int pghot_memhp_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages);
+ break;
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages);
+ break;
+ }
+
+ return notifier_from_errno(ret);
+}
+
+static struct notifier_block pghot_mem_notifier = {
+ .notifier_call = pghot_memhp_callback,
+ .priority = DEFAULT_CALLBACK_PRI,
+};
+
+static void pghot_destroy_hot_map(void)
+{
+ unsigned long section_nr, s_begin;
+ struct mem_section *ms;
+
+ s_begin = next_present_section_nr(-1);
+ for_each_present_section_nr(s_begin, section_nr) {
+ ms = __nr_to_section(section_nr);
+ pghot_free_hot_map(ms);
+ }
+
+ unregister_memory_notifier(&pghot_mem_notifier);
+}
+
+static int pghot_setup_hot_map(void)
+{
+ unsigned long section_nr, s_begin, start_pfn;
+ struct mem_section *ms;
+ int nid, ret;
+
+ ret = register_memory_notifier(&pghot_mem_notifier);
+ if (ret)
+ return ret;
+
+ s_begin = next_present_section_nr(-1);
+ for_each_present_section_nr(s_begin, section_nr) {
+ ms = __nr_to_section(section_nr);
+ start_pfn = section_nr_to_pfn(section_nr);
+ if (!pfn_valid(start_pfn))
+ continue;
+
+ nid = pfn_to_nid(start_pfn);
+ if (node_is_toptier(nid))
+ continue;
+
+ if (pghot_alloc_hot_map(ms, nid))
+ goto out_free_hot_map;
+ }
+ return 0;
+
+out_free_hot_map:
+ pghot_destroy_hot_map();
+ return -ENOMEM;
+}
+
+static int __init pghot_init(void)
+{
+ pg_data_t *pgdat;
+ int nid, ret;
+
+ ret = pghot_setup_hot_map();
+ if (ret)
+ return ret;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (node_is_toptier(nid))
+ continue;
+
+ ret = kmigrated_run(nid);
+ if (ret)
+ goto out_stop_kthread;
+ }
+ register_sysctl_init("vm", pghot_sysctls);
+ pghot_debug_init();
+
+ kmigrated_started = true;
+ return 0;
+
+out_stop_kthread:
+ for_each_node_state(nid, N_MEMORY) {
+ pgdat = NODE_DATA(nid);
+ if (pgdat->kmigrated) {
+ kthread_stop(pgdat->kmigrated);
+ pgdat->kmigrated = NULL;
+ }
+ }
+ pghot_destroy_hot_map();
+ return ret;
+}
+
+late_initcall_sync(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..4064ead568cc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1489,6 +1489,11 @@ const char * const vmstat_text[] = {
[I(KSTACK_REST)] = "kstack_rest",
#endif
#endif
+#ifdef CONFIG_PGHOT
+ [I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
+ [I(PGHOT_RECORDED_HINTFAULTS)] = "pghot_recorded_hintfaults",
+ [I(PGHOT_RECORDED_HWHINTS)] = "pghot_recorded_hwhints",
+#endif /* CONFIG_PGHOT */
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
};
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread

* [PATCH v7 4/7] mm: pghot: Precision mode for pghot
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (2 preceding siblings ...)
2026-05-04 6:09 ` [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 18:41 ` Donet Tom
2026-05-04 6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
` (4 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
In the default mode, pghot stores hotness in a 1-byte record per
PFN, limiting the access frequency to 2 bits and the access time
to a 5-bit bucket, and leaving no room to store a per-PFN toptier
NID. This restricts time granularity and forces all promotions to
use the global pghot_target_nid.
This patch adds an optional precision mode (CONFIG_PGHOT_PRECISE)
that expands the hotness record to 4 bytes (u32) and provides:
- 10-bit NID field for per-PFN promotion target,
- 3-bit frequency field (freq_threshold range 1 to 7),
- 14-bit time field offering finer recency tracking,
- MSB migrate-ready bit.
Precision mode improves placement accuracy on systems with multiple
toptier nodes and provides higher-resolution hotness tracking, at
the cost of increasing metadata to 4 bytes per PFN.
Documentation, tunables, and the record layout are updated accordingly.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
Documentation/admin-guide/mm/pghot.rst | 4 +-
include/linux/mmzone.h | 2 +-
include/linux/pghot.h | 31 ++++++++++
mm/Kconfig | 11 ++++
mm/Makefile | 7 ++-
mm/pghot-precise.c | 81 ++++++++++++++++++++++++++
mm/pghot.c | 13 +++--
7 files changed, 141 insertions(+), 8 deletions(-)
create mode 100644 mm/pghot-precise.c
diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-guide/mm/pghot.rst
index 5f51dd1d4d45..7b84e911afe7 100644
--- a/Documentation/admin-guide/mm/pghot.rst
+++ b/Documentation/admin-guide/mm/pghot.rst
@@ -37,7 +37,7 @@ Path: /sys/kernel/debug/pghot/
3. **freq_threshold**
- Minimum access frequency before a page is marked ready for promotion.
- - Range: 1 to 3
+ - Range: 1 to 3 in default mode, 1 to 7 in precision mode.
- Default: 2
- Example:
# echo 3 > /sys/kernel/debug/pghot/freq_threshold
@@ -59,7 +59,7 @@ Path: /proc/sys/vm/pghot_promote_freq_window_ms
- Controls the time window (in ms) for counting access frequency. A page is
considered hot only when **freq_threshold** number of accesses occur with
this time period.
-- Default: 3000 (3 seconds)
+- Default: 3000 (3 seconds) in default mode and 5000 (5 seconds) in precision mode.
- Example:
# sysctl vm.pghot_promote_freq_window_ms=3000
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eb08431dc9fb..9577bdc575d9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2027,7 +2027,7 @@ struct mem_section {
#ifdef CONFIG_PGHOT
/*
* Per-PFN hotness data for this section.
- * Array of phi_t (u8 in default mode).
+ * Array of phi_t (u8 in default mode, u32 in precision mode).
* LSB is used as PGHOT_SECTION_HOT_BIT flag.
*/
void *hot_map;
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 525d4dd28fc1..2e1742b8caee 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -35,6 +35,36 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
#define PGHOT_DEFAULT_NODE 0
+#if defined(CONFIG_PGHOT_PRECISE)
+#define PGHOT_DEFAULT_FREQ_WINDOW (5 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-26 are used to store nid, frequency and time.
+ * Bits 27-30 are unused now.
+ * Bit 31 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY 31
+
+#define PGHOT_NID_WIDTH 10
+#define PGHOT_FREQ_WIDTH 3
+/* time is stored in 14 bits which can represent up to 16s with HZ=1000 */
+#define PGHOT_TIME_WIDTH 14
+
+#define PGHOT_NID_SHIFT 0
+#define PGHOT_FREQ_SHIFT (PGHOT_NID_SHIFT + PGHOT_NID_WIDTH)
+#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_NID_MASK GENMASK(PGHOT_NID_WIDTH - 1, 0)
+#define PGHOT_FREQ_MASK GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+
+#define PGHOT_NID_MAX ((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX ((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u32 phi_t;
+
+#else /* !CONFIG_PGHOT_PRECISE */
#define PGHOT_DEFAULT_FREQ_WINDOW (3 * MSEC_PER_SEC)
/*
@@ -61,6 +91,7 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1)
typedef u8 phi_t;
+#endif /* CONFIG_PGHOT_PRECISE */
#define PGHOT_RECORD_SIZE sizeof(phi_t)
diff --git a/mm/Kconfig b/mm/Kconfig
index ebfa149d8123..cc4b5685ecd4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1483,6 +1483,17 @@ config PGHOT
This adds 1 byte of metadata overhead per page in lower-tier
memory nodes.
+config PGHOT_PRECISE
+ bool "Hot page tracking precision mode"
+ default n
+ depends on PGHOT
+ help
+ Enables precision mode for tracking hot pages with the pghot
+ sub-system. Adds fine-grained access time tracking and explicit
+ toptier target NID tracking. Precise hot page tracking comes at
+ the cost of using 4 bytes per page instead of the default one
+ byte per page. Enabling this is preferable on systems with
+ multiple toptier nodes.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 33014de43acc..dc61f4d955f8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,4 +150,9 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
-obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o
+ifdef CONFIG_PGHOT_PRECISE
+obj-$(CONFIG_PGHOT) += pghot-precise.o
+else
+obj-$(CONFIG_PGHOT) += pghot-default.o
+endif
diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c
new file mode 100644
index 000000000000..8e571988b4ce
--- /dev/null
+++ b/mm/pghot-precise.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Precision mode
+ *
+ * 4 byte hotness record per PFN (u32)
+ * NID, time and frequency tracked as part of the record.
+ */
+
+#include <linux/pghot.h>
+#include <linux/jiffies.h>
+#include <linux/memory-tiers.h>
+
+bool pghot_nid_valid(int nid)
+{
+ if (nid != NUMA_NO_NODE &&
+ (!numa_valid_node(nid) || nid > PGHOT_NID_MAX ||
+ !node_online(nid) || !node_is_toptier(nid)))
+ return false;
+
+ return true;
+}
+
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+ return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+ phi_t freq, old_freq, hotness, old_hotness, old_time;
+ phi_t time = now & PGHOT_TIME_MASK;
+
+ nid = (nid == NUMA_NO_NODE) ? pghot_target_nid : nid;
+ old_hotness = READ_ONCE(*phi);
+
+ do {
+ bool new_window = false;
+
+ hotness = old_hotness;
+ old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+ if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window)
+ new_window = true;
+
+ if (new_window)
+ freq = 1;
+ else if (old_freq < PGHOT_FREQ_MAX)
+ freq = old_freq + 1;
+ else
+ freq = old_freq;
+
+ hotness &= ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT);
+ hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+ hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+ hotness |= (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT;
+ hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+ hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+ if (freq >= pghot_freq_threshold)
+ hotness |= BIT(PGHOT_MIGRATE_READY);
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+ return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+ phi_t old_hotness, hotness = 0;
+
+ old_hotness = READ_ONCE(*phi);
+ do {
+ if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+ return -EINVAL;
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+ *nid = (old_hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK;
+ *freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ *time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+ return 0;
+}
diff --git a/mm/pghot.c b/mm/pghot.c
index 02e6959b647a..0b31d5917833 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -10,6 +10,9 @@
* the frequency of access and last access time. Promotions are done
* to a default toptier NID.
*
+ * In the precision mode, 4 bytes are used to store the frequency
+ * of access, last access time and the accessing NID.
+ *
* A kernel thread named kmigrated is provided to migrate or promote
* the hot pages. kmigrated runs for each lower tier node. It iterates
* over the node's PFNs and migrates pages marked for migration into
@@ -52,13 +55,15 @@ static bool kmigrated_started __ro_after_init;
* for the purpose of tracking page hotness and subsequent promotion.
*
* @pfn: PFN of the page
- * @nid: Unused
+ * @nid: Target NID to which the page needs to be migrated in precision
+ * mode; unused in default mode
* @src: The identifier of the sub-system that reports the access
* @now: Access time in jiffies
*
- * Updates the frequency and time of access and marks the page as
- * ready for migration if the frequency crosses a threshold. The pages
- * marked for migration are migrated by kmigrated kernel thread.
+ * Updates the NID (in precision mode only), frequency and time of access
+ * and marks the page as ready for migration if the frequency crosses a
+ * threshold. The pages marked for migration are migrated by the
+ * kmigrated kernel thread.
*
* Return: 0 on success and -EINVAL on failure to record the access.
*/
--
2.34.1
* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot
2026-05-04 6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
@ 2026-05-04 18:41 ` Donet Tom
0 siblings, 0 replies; 12+ messages in thread
From: Donet Tom @ 2026-05-04 18:41 UTC (permalink / raw)
To: Bharata B Rao, linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg
Hi Bharata
On 5/4/26 11:39 AM, Bharata B Rao wrote:
> +#include <linux/pghot.h>
> +#include <linux/jiffies.h>
> +#include <linux/memory-tiers.h>
> +
> +bool pghot_nid_valid(int nid)
I might be missing something, but since pghot_nid_valid() exists in both
pghot-default.c and pghot-precise.c, would it make sense to move it to a
header file as a static inline function?
-Donet
> +{
> + if (nid != NUMA_NO_NODE &&
> + (!numa_valid_node(nid) || nid > PGHOT_NID_MAX ||
> + !node_online(nid) || !node_is_toptier(nid)))
> + return false;
> +
> + return true;
> +}
> +
> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
> +{
> + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
> +}
> +
> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
> +{
* [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (3 preceding siblings ...)
2026-05-04 6:09 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
` (3 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
Currently, hot page promotion (the NUMA_BALANCING_MEMORY_TIERING
mode of NUMA Balancing) does hot page detection (via hint faults),
hot page classification and eventual promotion all by itself, and
this logic sits within the scheduler.
Now that pghot, the new hot page tracking and promotion mechanism,
is available, NUMA Balancing can limit itself to detecting hot
pages (via hint faults) and off-load the rest of the functionality
to pghot.
To achieve this, the pghot_record_access(PGHOT_HINTFAULTS) API
is used to feed the hot page info to pghot. In addition, the
migration rate limiting and dynamic threshold logic are moved to
kmigrated so that they can be used for hot pages reported by
other sources too. Hence it becomes necessary to introduce a
new config option, CONFIG_NUMA_BALANCING_TIERING, to control
the hint fault source for hot page promotion. This option
controls the NUMA_BALANCING_MEMORY_TIERING mode of
kernel.numa_balancing.
This movement of hot page promotion to pghot results in the following
changes to the behaviour of hint faults based hot page promotion:
1. Promotion is no longer done in the fault path but instead is
deferred to kmigrated and happens in batches.
2. NUMA_BALANCING_MEMORY_TIERING mode used to promote on first
access. Pghot, by default, promotes on the second access, though
this can be changed by setting /sys/kernel/debug/pghot/freq_threshold.
The hot_threshold_ms debugfs tunable is now replaced by pghot's
freq_threshold.
3. In NUMA_BALANCING_MEMORY_TIERING mode, hint fault latency is the
difference between the PTE update time (during scanning) and the
access time (hint fault). However with pghot, a single latency
threshold is used for two purposes:
a) If the time difference between successive accesses are within
the threshold, the page is marked as hot.
b) Later when kmigrated picks up the page for migration, it will
migrate only if the difference between the current time and
the time when the page was marked hot is within the threshold.
4. Batch migration of misplaced folios is done from non-process
context where VMA info is not readily available. Without the VMA
and its exec check, it is not possible to filter out exec pages
during the migration prep stage. Hence shared executable pages
will also be subjected to misplaced migration.
5. The max scan period, which is used in the dynamic threshold
logic, was a debugfs tunable. However, this has been converted
to a scalar metric in pghot.
6. In the uncommon case of using NUMA_BALANCING_NORMAL mode
to balance between lower and higher tier nodes, we end up
waking kswapd when there is no headroom in the toptier.
Key code changes due to this movement are detailed below to aid
understanding of the restructuring.
1. Scanning and access times are no longer tracked in last_cpupid
field of folio flags. Hence all code related to this (like
folio_xchg_access_time(), cpupid_valid()) is removed.
2. The misplaced migration routines become conditional on
CONFIG_PGHOT in addition to CONFIG_NUMA_BALANCING.
3. The promotion related stats (like PGPROMOTE_SUCCESS etc.) are
now moved under CONFIG_PGHOT as these stats are part of the
promotion engine, which will be used for other hotness sources
as well.
4. Routines responsible for migration rate limiting, dynamic
thresholding, pgdat balancing during promotion, etc. are moved
to pghot with appropriate renaming.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/mm.h | 35 ++------
include/linux/mmzone.h | 4 +-
init/Kconfig | 13 +++
kernel/sched/core.c | 7 ++
kernel/sched/debug.c | 1 -
kernel/sched/fair.c | 177 ++---------------------------------------
kernel/sched/sched.h | 1 -
mm/huge_memory.c | 24 +++++-
mm/memcontrol.c | 6 +-
mm/memory-tiers.c | 15 ++--
mm/memory.c | 28 +++++--
mm/mempolicy.c | 3 -
mm/migrate.c | 16 +++-
mm/pghot.c | 134 +++++++++++++++++++++++++++++++
mm/vmstat.c | 2 +-
15 files changed, 239 insertions(+), 227 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b776907152e..3b237946b322 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2271,17 +2271,6 @@ static inline int folio_nid(const struct folio *folio)
}
#ifdef CONFIG_NUMA_BALANCING
-/* page access time bits needs to hold at least 4 seconds */
-#define PAGE_ACCESS_TIME_MIN_BITS 12
-#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
-#define PAGE_ACCESS_TIME_BUCKETS \
- (PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
-#else
-#define PAGE_ACCESS_TIME_BUCKETS 0
-#endif
-
-#define PAGE_ACCESS_TIME_MASK \
- (LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
static inline int cpu_pid_to_cpupid(int cpu, int pid)
{
@@ -2347,15 +2336,6 @@ static inline void page_cpupid_reset_last(struct page *page)
}
#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
- int last_time;
-
- last_time = folio_xchg_last_cpupid(folio,
- time >> PAGE_ACCESS_TIME_BUCKETS);
- return last_time << PAGE_ACCESS_TIME_BUCKETS;
-}
-
static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
{
unsigned int pid_bit;
@@ -2366,18 +2346,12 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
}
}
-bool folio_use_access_time(struct folio *folio);
#else /* !CONFIG_NUMA_BALANCING */
static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
{
return folio_nid(folio); /* XXX */
}
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
- return 0;
-}
-
static inline int folio_last_cpupid(struct folio *folio)
{
return folio_nid(folio); /* XXX */
@@ -2420,11 +2394,16 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
{
}
-static inline bool folio_use_access_time(struct folio *folio)
+#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_NUMA_BALANCING_TIERING
+bool folio_is_promo_candidate(struct folio *folio);
+#else
+static inline bool folio_is_promo_candidate(struct folio *folio)
{
return false;
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING_TIERING */
#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9577bdc575d9..b29d06168826 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -287,7 +287,7 @@ enum node_stat_item {
#ifdef CONFIG_SWAP
NR_SWAPCACHE,
#endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
PGPROMOTE_SUCCESS, /* promote successfully */
/**
* Candidate pages for promotion based on hint fault latency. This
@@ -1566,7 +1566,7 @@ typedef struct pglist_data {
struct deferred_split deferred_split_queue;
#endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
/* start time in ms of current promote rate limit period */
unsigned int nbp_rl_start;
/* number of promote candidate pages at start time of current rate limit period */
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..7624be1c739a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1027,6 +1027,19 @@ config NUMA_BALANCING_DEFAULT_ENABLED
If set, automatic NUMA balancing will be enabled if running on a NUMA
machine.
+config NUMA_BALANCING_TIERING
+ bool "NUMA balancing memory tiering promotion"
+ depends on NUMA_BALANCING && PGHOT
+ help
+ Enable NUMA balancing mode 2 (memory tiering). This allows
+ automatic promotion of hot pages from slower memory tiers to
+ faster tiers using the pghot subsystem.
+
+ This requires CONFIG_PGHOT for the hot page tracking engine.
+ This option is required for kernel.numa_balancing=2.
+
+ If unsure, say N.
+
config SLAB_OBJ_EXT
bool
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..46ce75f00b40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4523,6 +4523,7 @@ void set_numabalancing_state(bool enabled)
}
#ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_NUMA_BALANCING_TIERING
static void reset_memory_tiering(void)
{
struct pglist_data *pgdat;
@@ -4533,6 +4534,7 @@ static void reset_memory_tiering(void)
pgdat->nbp_th_start = jiffies_to_msecs(jiffies);
}
}
+#endif
static int sysctl_numa_balancing(const struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
@@ -4550,9 +4552,14 @@ static int sysctl_numa_balancing(const struct ctl_table *table, int write,
if (err < 0)
return err;
if (write) {
+ if ((state & NUMA_BALANCING_MEMORY_TIERING) &&
+ !IS_ENABLED(CONFIG_NUMA_BALANCING_TIERING))
+ return -EOPNOTSUPP;
+#ifdef CONFIG_NUMA_BALANCING_TIERING
if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
(state & NUMA_BALANCING_MEMORY_TIERING))
reset_memory_tiering();
+#endif
sysctl_numa_balancing_mode = state;
__set_numabalancing_state(state);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74c1617cf652..abf53f3071ea 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,7 +623,6 @@ static __init int sched_init_debug(void)
debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
- debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
#endif /* CONFIG_NUMA_BALANCING */
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353a..f1da4fa95598 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
#ifdef CONFIG_SYSCTL
static const struct ctl_table sched_fair_sysctls[] = {
#ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
.extra1 = SYSCTL_ONE,
},
#endif
-#ifdef CONFIG_NUMA_BALANCING
- {
- .procname = "numa_balancing_promote_rate_limit_MBps",
- .data = &sysctl_numa_balancing_promote_rate_limit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- },
-#endif /* CONFIG_NUMA_BALANCING */
};
static int __init sched_fair_sysctl_init(void)
@@ -1612,9 +1597,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
-
struct numa_group {
refcount_t refcount;
@@ -1957,120 +1939,6 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
return 1000 * faults / total_faults;
}
-/*
- * If memory tiering mode is enabled, cpupid of slow memory page is
- * used to record scan time instead of CPU and PID. When tiering mode
- * is disabled at run time, the scan time (in cpupid) will be
- * interpreted as CPU and PID. So CPU needs to be checked to avoid to
- * access out of array bound.
- */
-static inline bool cpupid_valid(int cpupid)
-{
- return cpupid_to_cpu(cpupid) < nr_cpu_ids;
-}
-
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
- int z;
- unsigned long enough_wmark;
-
- enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
- pgdat->node_present_pages >> 4);
- for (z = pgdat->nr_zones - 1; z >= 0; z--) {
- struct zone *zone = pgdat->node_zones + z;
-
- if (!populated_zone(zone))
- continue;
-
- if (zone_watermark_ok(zone, 0,
- promo_wmark_pages(zone) + enough_wmark,
- ZONE_MOVABLE, 0))
- return true;
- }
- return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- * hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
- int last_time, time;
-
- time = jiffies_to_msecs(jiffies);
- last_time = folio_xchg_access_time(folio, time);
-
- return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency. So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
- unsigned long rate_limit, int nr)
-{
- unsigned long nr_cand;
- unsigned int now, start;
-
- now = jiffies_to_msecs(jiffies);
- mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- start = pgdat->nbp_rl_start;
- if (now - start > MSEC_PER_SEC &&
- cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
- pgdat->nbp_rl_nr_cand = nr_cand;
- if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
- return true;
- return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS 16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
- unsigned long rate_limit,
- unsigned int ref_th)
-{
- unsigned int now, start, th_period, unit_th, th;
- unsigned long nr_cand, ref_cand, diff_cand;
-
- now = jiffies_to_msecs(jiffies);
- th_period = sysctl_numa_balancing_scan_period_max;
- start = pgdat->nbp_th_start;
- if (now - start > th_period &&
- cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
- ref_cand = rate_limit *
- sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
- unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
- th = pgdat->nbp_threshold ? : ref_th;
- if (diff_cand > ref_cand * 11 / 10)
- th = max(th - unit_th, unit_th);
- else if (diff_cand < ref_cand * 9 / 10)
- th = min(th + unit_th, ref_th * 2);
- pgdat->nbp_th_nr_cand = nr_cand;
- pgdat->nbp_threshold = th;
- }
-}
-
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu)
{
@@ -2086,41 +1954,15 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
/*
* The pages in slow memory node should be migrated according
- * to hot/cold instead of private/shared.
- */
- if (folio_use_access_time(folio)) {
- struct pglist_data *pgdat;
- unsigned long rate_limit;
- unsigned int latency, th, def_th;
- long nr = folio_nr_pages(folio);
-
- pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat)) {
- /* workload changed, reset hot threshold */
- pgdat->nbp_threshold = 0;
- mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
- return true;
- }
-
- def_th = sysctl_numa_balancing_hot_threshold;
- rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
- numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
- th = pgdat->nbp_threshold ? : def_th;
- latency = numa_hint_fault_latency(folio);
- if (latency >= th)
- return false;
-
- return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
- }
+ * to hot/cold instead of private/shared. Also the migration
+ * of such pages is handled by kmigrated.
+ */
+ if (folio_is_promo_candidate(folio))
+ return true;
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
- !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
- return false;
-
/*
* Allow first faults or private faults to migrate immediately early in
* the lifetime of a task. The magic number 4 is based on waiting for
@@ -3330,15 +3172,6 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
if (!p->mm)
return;
- /*
- * NUMA faults statistics are unnecessary for the slow memory
- * node for memory tiering mode.
- */
- if (!node_is_toptier(mem_node) &&
- (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
- !cpupid_valid(last_cpupid)))
- return;
-
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..f176643516b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3066,7 +3066,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
-extern unsigned int sysctl_numa_balancing_hot_threshold;
#ifdef CONFIG_SCHED_HRTICK
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..1890b1e534a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
#include <linux/pgalloc.h>
#include <linux/pgalloc_tag.h>
#include <linux/pagewalk.h>
+#include <linux/pghot.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -2267,7 +2268,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
int nid = NUMA_NO_NODE;
int target_nid, last_cpupid;
pmd_t pmd, old_pmd;
- bool writable = false;
+ bool writable = false, needs_promotion = false;
int flags = 0;
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -2294,11 +2295,23 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
goto out_map;
nid = folio_nid(folio);
+ needs_promotion = folio_is_promo_candidate(folio);
target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
&last_cpupid);
if (target_nid == NUMA_NO_NODE)
goto out_map;
+
+ if (needs_promotion) {
+ /*
+ * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+ * Isolation and migration are handled by pghot.
+ */
+ nid = target_nid;
+ goto out_map;
+ }
+
+ /* Balancing between toptier nodes, mode=NUMA_BALANCING_NORMAL */
if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
flags |= TNF_MIGRATE_FAIL;
goto out_map;
@@ -2330,8 +2343,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
- if (nid != NUMA_NO_NODE)
- task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+ if (nid != NUMA_NO_NODE) {
+ if (needs_promotion)
+ pghot_record_access(folio_pfn(folio), nid,
+ PGHOT_HINTFAULTS, jiffies);
+ else
+ task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+ }
return 0;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..033b80ad248e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,7 +400,7 @@ static const unsigned int memcg_node_stat_items[] = {
#ifdef CONFIG_SWAP
NR_SWAPCACHE,
#endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
PGPROMOTE_SUCCESS,
#endif
PGDEMOTE_KSWAPD,
@@ -1594,7 +1594,7 @@ static const struct memory_stat memory_stats[] = {
{ "pgscan_khugepaged", PGSCAN_KHUGEPAGED },
{ "pgscan_proactive", PGSCAN_PROACTIVE },
{ "pgrefill", PGREFILL },
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
{ "pgpromote_success", PGPROMOTE_SUCCESS },
#endif
};
@@ -1646,7 +1646,7 @@ static int memcg_page_state_output_unit(int item)
case PGSCAN_KHUGEPAGED:
case PGSCAN_PROACTIVE:
case PGREFILL:
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
case PGPROMOTE_SUCCESS:
#endif
return 1;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 54851d8a195b..be134a32f5bf 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,18 +51,19 @@ static const struct bus_type memory_tier_subsys = {
.dev_name = "memory_tier",
};
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_NUMA_BALANCING_TIERING
/**
- * folio_use_access_time - check if a folio reuses cpupid for page access time
+ * folio_is_promo_candidate - check if the folio qualifies for promotion
+ *
* @folio: folio to check
*
- * folio's _last_cpupid field is repurposed by memory tiering. In memory
- * tiering mode, cpupid of slow memory folio (not toptier memory) is used to
- * record page access time.
+ * Checks if NUMA Balancing tiering mode is set and the folio belongs
+ * to a lower tier. If so, it qualifies for promotion to toptier when
+ * it is categorized as hot.
*
- * Return: the folio _last_cpupid is used to record page access time
+ * Return: true if the above conditions are met, else false.
*/
-bool folio_use_access_time(struct folio *folio)
+bool folio_is_promo_candidate(struct folio *folio)
{
return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
!node_is_toptier(folio_nid(folio));
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..17ea31750573 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/pghot.h>
#include <linux/sched/sysctl.h>
#include <linux/pgalloc.h>
#include <linux/uaccess.h>
@@ -6062,10 +6063,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
*flags |= TNF_SHARED;
/*
- * For memory tiering mode, cpupid of slow memory page is used
- * to record page access time. So use default value.
+ * For memory tiering mode, last_cpupid is unused. So use default value.
*/
- if (folio_use_access_time(folio))
+ if (folio_is_promo_candidate(folio))
*last_cpupid = (-1 & LAST_CPUPID_MASK);
else
*last_cpupid = folio_last_cpupid(folio);
@@ -6146,6 +6146,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
int nid = NUMA_NO_NODE;
bool writable = false, ignore_writable = false;
bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+ bool needs_promotion = false;
int last_cpupid;
int target_nid;
pte_t pte, old_pte;
@@ -6180,12 +6181,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
goto out_map;
nid = folio_nid(folio);
+ needs_promotion = folio_is_promo_candidate(folio);
nr_pages = folio_nr_pages(folio);
target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
writable, &last_cpupid);
if (target_nid == NUMA_NO_NODE)
goto out_map;
+
+ if (needs_promotion) {
+ /*
+ * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+ * Isolation and migration are handled by pghot.
+ */
+ nid = target_nid;
+ goto out_map;
+ }
+
+ /* Balancing between toptier nodes, mode=NUMA_BALANCING_NORMAL */
if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
flags |= TNF_MIGRATE_FAIL;
goto out_map;
@@ -6225,8 +6238,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- if (nid != NUMA_NO_NODE)
- task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ if (nid != NUMA_NO_NODE) {
+ if (needs_promotion)
+ pghot_record_access(folio_pfn(folio), nid,
+ PGHOT_HINTFAULTS, jiffies);
+ else
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ }
return 0;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..aef9bb8a6cd4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -872,9 +872,6 @@ bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma,
node_is_toptier(nid))
return false;
- if (folio_use_access_time(folio))
- folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
-
return true;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 726d27b61a46..a468fa4f7963 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2709,8 +2709,18 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
int z;
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
+ /*
+ * Kswapd wakeup for creating headroom in toptier is done only
+ * for hot page promotion case and not for misplaced migrations
+ * between toptier nodes.
+ *
+ * In the uncommon case of using NUMA_BALANCING_NORMAL mode
+ * to balance between lower and higher tier nodes, we end up
+ * waking the kswapd.
+ */
+ if (node_is_toptier(folio_nid(folio)))
return -EAGAIN;
+
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
if (managed_zone(pgdat->node_zones + z))
break;
@@ -2760,6 +2770,8 @@ int migrate_misplaced_folio(struct folio *folio, int node)
#ifdef CONFIG_NUMA_BALANCING
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_PGHOT
if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
&& !node_is_toptier(folio_nid(folio))
&& node_is_toptier(node)) {
@@ -2824,6 +2836,8 @@ int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
#ifdef CONFIG_NUMA_BALANCING
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_PGHOT
mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
PGPROMOTE_SUCCESS, nr_succeeded);
#endif
diff --git a/mm/pghot.c b/mm/pghot.c
index 0b31d5917833..1f204a8613eb 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -17,6 +17,9 @@
* the hot pages. kmigrated runs for each lower tier node. It iterates
* over the node's PFNs and migrates pages marked for migration into
* their targeted nodes.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
*/
#include <linux/mm.h>
#include <linux/migrate.h>
@@ -32,6 +35,12 @@ unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
+#define KMIGRATED_MIGRATION_ADJUST_STEPS 16
+#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW 60000
+
DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
@@ -45,6 +54,22 @@ static const struct ctl_table pghot_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "pghot_promote_rate_limit_MBps",
+ .data = &sysctl_pghot_promote_rate_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+ {
+ .procname = "numa_balancing_promote_rate_limit_MBps",
+ .data = &sysctl_pghot_promote_rate_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
};
#endif
@@ -146,6 +171,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
return 0;
}
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+ int z;
+ unsigned long enough_wmark;
+
+ enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_watermark_ok(zone, 0,
+ promo_wmark_pages(zone) + enough_wmark,
+ ZONE_MOVABLE, 0))
+ return true;
+ }
+ return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency. So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit,
+ int nr, unsigned int now_ms)
+{
+ unsigned long nr_cand;
+ unsigned int start;
+
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ start = pgdat->nbp_rl_start;
+ if (now_ms - start > MSEC_PER_SEC &&
+ cmpxchg(&pgdat->nbp_rl_start, start, now_ms) == start)
+ pgdat->nbp_rl_nr_cand = nr_cand;
+ if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+ return true;
+ return false;
+}
+
+static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat,
+ unsigned long rate_limit, unsigned int ref_th,
+ unsigned int now_ms)
+{
+ unsigned int start, th_period, unit_th, th;
+ unsigned long nr_cand, ref_cand, diff_cand;
+
+ th_period = KMIGRATED_PROMOTION_THRESHOLD_WINDOW;
+ start = pgdat->nbp_th_start;
+ if (now_ms - start > th_period &&
+ cmpxchg(&pgdat->nbp_th_start, start, now_ms) == start) {
+ ref_cand = rate_limit *
+ KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+ unit_th = ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS;
+ th = pgdat->nbp_threshold ? : ref_th;
+ if (diff_cand > ref_cand * 11 / 10)
+ th = max(th - unit_th, unit_th);
+ else if (diff_cand < ref_cand * 9 / 10)
+ th = min(th + unit_th, ref_th * 2);
+ pgdat->nbp_th_nr_cand = nr_cand;
+ pgdat->nbp_threshold = th;
+ }
+}
+
+static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
+ unsigned long time)
+{
+ struct pglist_data *pgdat;
+ unsigned long rate_limit;
+ unsigned int th, def_th;
+ unsigned int now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
+ unsigned long now = jiffies;
+
+ pgdat = NODE_DATA(nid);
+ if (pgdat_free_space_enough(pgdat)) {
+ /* workload changed, reset hot threshold */
+ pgdat->nbp_threshold = 0;
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages);
+ return true;
+ }
+
+ def_th = sysctl_pghot_freq_window;
+ rate_limit = MB_TO_PAGES(sysctl_pghot_promote_rate_limit);
+ kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms);
+
+ th = pgdat->nbp_threshold ? : def_th;
+ if (pghot_access_latency(time, now) >= th)
+ return false;
+
+ return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
+}
+
static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
unsigned long *time)
{
@@ -223,6 +352,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
goto out_next;
}
+ if (!kmigrated_should_migrate_memory(nr, nid, time)) {
+ folio_put(folio);
+ goto out_next;
+ }
+
if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
folio_put(folio);
goto out_next;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4064ead568cc..da668ff05032 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1268,7 +1268,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_SWAP
[I(NR_SWAPCACHE)] = "nr_swapcached",
#endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
[I(PGPROMOTE_SUCCESS)] = "pgpromote_success",
[I(PGPROMOTE_CANDIDATE)] = "pgpromote_candidate",
[I(PGPROMOTE_CANDIDATE_NRL)] = "pgpromote_candidate_nrl",
--
2.34.1
* [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (4 preceding siblings ...)
2026-05-04 6:09 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
` (2 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
A subsequent patch adds an IBS Memory Profiler driver that is
independent of the perf subsystem but needs the CPUID
0x8000001B capability bits. Hence move those bit definitions
out of asm/perf_event.h into a dedicated header so the new
driver can consume them without pulling in perf.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/include/asm/ibs-caps.h | 85 +++++++++++++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 81 +----------------------------
2 files changed, 86 insertions(+), 80 deletions(-)
create mode 100644 arch/x86/include/asm/ibs-caps.h
diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
new file mode 100644
index 000000000000..ddf6c512c8f9
--- /dev/null
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_CAPS_H
+#define _ASM_X86_IBS_CAPS_H
+
+/*
+ * IBS cpuid feature detection
+ */
+
+#define IBS_CPUID_FEATURES 0x8000001b
+
+/*
+ * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
+ * bit 0 is used to indicate the existence of IBS.
+ */
+#define IBS_CAPS_AVAIL (1U<<0)
+#define IBS_CAPS_FETCHSAM (1U<<1)
+#define IBS_CAPS_OPSAM (1U<<2)
+#define IBS_CAPS_RDWROPCNT (1U<<3)
+#define IBS_CAPS_OPCNT (1U<<4)
+#define IBS_CAPS_BRNTRGT (1U<<5)
+#define IBS_CAPS_OPCNTEXT (1U<<6)
+#define IBS_CAPS_RIPINVALIDCHK (1U<<7)
+#define IBS_CAPS_OPBRNFUSE (1U<<8)
+#define IBS_CAPS_FETCHCTLEXTD (1U<<9)
+#define IBS_CAPS_OPDATA4 (1U<<10)
+#define IBS_CAPS_ZEN4 (1U<<11)
+#define IBS_CAPS_OPLDLAT (1U<<12)
+#define IBS_CAPS_DIS (1U<<13)
+#define IBS_CAPS_FETCHLAT (1U<<14)
+#define IBS_CAPS_BIT63_FILTER (1U<<15)
+#define IBS_CAPS_STRMST_RMTSOCKET (1U<<16)
+#define IBS_CAPS_OPDTLBPGSIZE (1U<<19)
+
+#define IBS_CAPS_DEFAULT (IBS_CAPS_AVAIL \
+ | IBS_CAPS_FETCHSAM \
+ | IBS_CAPS_OPSAM)
+
+/*
+ * IBS APIC setup
+ */
+#define IBSCTL 0x1cc
+#define IBSCTL_LVT_OFFSET_VALID (1ULL<<8)
+#define IBSCTL_LVT_OFFSET_MASK 0x0F
+
+/* IBS fetch bits/masks */
+#define IBS_FETCH_L3MISSONLY (1ULL << 59)
+#define IBS_FETCH_RAND_EN (1ULL << 57)
+#define IBS_FETCH_VAL (1ULL << 49)
+#define IBS_FETCH_ENABLE (1ULL << 48)
+#define IBS_FETCH_CNT 0xFFFF0000ULL
+#define IBS_FETCH_MAX_CNT 0x0000FFFFULL
+
+#define IBS_FETCH_2_DIS (1ULL << 0)
+#define IBS_FETCH_2_FETCHLAT_FILTER (0xFULL << 1)
+#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT (1)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_1 (1ULL << 5)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_0 (1ULL << 6)
+
+/*
+ * IBS op bits/masks
+ * The lower 7 bits of the current count are random bits
+ * preloaded by hardware and ignored in software
+ */
+#define IBS_OP_LDLAT_EN (1ULL << 63)
+#define IBS_OP_LDLAT_THRSH (0xFULL << 59)
+#define IBS_OP_LDLAT_THRSH_SHIFT (59)
+#define IBS_OP_CUR_CNT (0xFFF80ULL << 32)
+#define IBS_OP_CUR_CNT_RAND (0x0007FULL << 32)
+#define IBS_OP_CUR_CNT_EXT_MASK (0x7FULL << 52)
+#define IBS_OP_CNT_CTL (1ULL << 19)
+#define IBS_OP_VAL (1ULL << 18)
+#define IBS_OP_ENABLE (1ULL << 17)
+#define IBS_OP_L3MISSONLY (1ULL << 16)
+#define IBS_OP_MAX_CNT 0x0000FFFFULL
+#define IBS_OP_MAX_CNT_EXT 0x007FFFFFULL /* not a register bit mask */
+#define IBS_OP_MAX_CNT_EXT_MASK (0x7FULL << 20) /* separate upper 7 bits */
+#define IBS_RIP_INVALID (1ULL << 38)
+
+#define IBS_OP_2_DIS (1ULL << 0)
+#define IBS_OP_2_EXCL_RIP_63_EQ_0 (1ULL << 1)
+#define IBS_OP_2_EXCL_RIP_63_EQ_1 (1ULL << 2)
+#define IBS_OP_2_STRM_ST_FILTER (1ULL << 3)
+#define IBS_OP_2_STRM_ST_FILTER_SHIFT (3)
+
+#endif /* _ASM_X86_IBS_CAPS_H */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 752cb319d5ea..655a54c77f4e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -3,6 +3,7 @@
#define _ASM_X86_PERF_EVENT_H
#include <linux/static_call.h>
+#include <asm/ibs-caps.h>
/*
* Performance event hw details:
@@ -620,86 +621,6 @@ struct arch_pebs_cntr_header {
*/
#define EXT_PERFMON_DEBUG_FEATURES 0x80000022
-/*
- * IBS cpuid feature detection
- */
-
-#define IBS_CPUID_FEATURES 0x8000001b
-
-/*
- * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
- * bit 0 is used to indicate the existence of IBS.
- */
-#define IBS_CAPS_AVAIL (1U<<0)
-#define IBS_CAPS_FETCHSAM (1U<<1)
-#define IBS_CAPS_OPSAM (1U<<2)
-#define IBS_CAPS_RDWROPCNT (1U<<3)
-#define IBS_CAPS_OPCNT (1U<<4)
-#define IBS_CAPS_BRNTRGT (1U<<5)
-#define IBS_CAPS_OPCNTEXT (1U<<6)
-#define IBS_CAPS_RIPINVALIDCHK (1U<<7)
-#define IBS_CAPS_OPBRNFUSE (1U<<8)
-#define IBS_CAPS_FETCHCTLEXTD (1U<<9)
-#define IBS_CAPS_OPDATA4 (1U<<10)
-#define IBS_CAPS_ZEN4 (1U<<11)
-#define IBS_CAPS_OPLDLAT (1U<<12)
-#define IBS_CAPS_DIS (1U<<13)
-#define IBS_CAPS_FETCHLAT (1U<<14)
-#define IBS_CAPS_BIT63_FILTER (1U<<15)
-#define IBS_CAPS_STRMST_RMTSOCKET (1U<<16)
-#define IBS_CAPS_OPDTLBPGSIZE (1U<<19)
-
-#define IBS_CAPS_DEFAULT (IBS_CAPS_AVAIL \
- | IBS_CAPS_FETCHSAM \
- | IBS_CAPS_OPSAM)
-
-/*
- * IBS APIC setup
- */
-#define IBSCTL 0x1cc
-#define IBSCTL_LVT_OFFSET_VALID (1ULL<<8)
-#define IBSCTL_LVT_OFFSET_MASK 0x0F
-
-/* IBS fetch bits/masks */
-#define IBS_FETCH_L3MISSONLY (1ULL << 59)
-#define IBS_FETCH_RAND_EN (1ULL << 57)
-#define IBS_FETCH_VAL (1ULL << 49)
-#define IBS_FETCH_ENABLE (1ULL << 48)
-#define IBS_FETCH_CNT 0xFFFF0000ULL
-#define IBS_FETCH_MAX_CNT 0x0000FFFFULL
-
-#define IBS_FETCH_2_DIS (1ULL << 0)
-#define IBS_FETCH_2_FETCHLAT_FILTER (0xFULL << 1)
-#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT (1)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_1 (1ULL << 5)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_0 (1ULL << 6)
-
-/*
- * IBS op bits/masks
- * The lower 7 bits of the current count are random bits
- * preloaded by hardware and ignored in software
- */
-#define IBS_OP_LDLAT_EN (1ULL << 63)
-#define IBS_OP_LDLAT_THRSH (0xFULL << 59)
-#define IBS_OP_LDLAT_THRSH_SHIFT (59)
-#define IBS_OP_CUR_CNT (0xFFF80ULL << 32)
-#define IBS_OP_CUR_CNT_RAND (0x0007FULL << 32)
-#define IBS_OP_CUR_CNT_EXT_MASK (0x7FULL << 52)
-#define IBS_OP_CNT_CTL (1ULL << 19)
-#define IBS_OP_VAL (1ULL << 18)
-#define IBS_OP_ENABLE (1ULL << 17)
-#define IBS_OP_L3MISSONLY (1ULL << 16)
-#define IBS_OP_MAX_CNT 0x0000FFFFULL
-#define IBS_OP_MAX_CNT_EXT 0x007FFFFFULL /* not a register bit mask */
-#define IBS_OP_MAX_CNT_EXT_MASK (0x7FULL << 20) /* separate upper 7 bits */
-#define IBS_RIP_INVALID (1ULL << 38)
-
-#define IBS_OP_2_DIS (1ULL << 0)
-#define IBS_OP_2_EXCL_RIP_63_EQ_0 (1ULL << 1)
-#define IBS_OP_2_EXCL_RIP_63_EQ_1 (1ULL << 2)
-#define IBS_OP_2_STRM_ST_FILTER (1ULL << 3)
-#define IBS_OP_2_STRM_ST_FILTER_SHIFT (3)
-
#ifdef CONFIG_X86_LOCAL_APIC
extern u32 get_ibs_caps(void);
extern int forward_event_to_ibs(struct perf_event *event);
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread

* [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (5 preceding siblings ...)
2026-05-04 6:09 ` [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header Bharata B Rao
@ 2026-05-04 6:09 ` Bharata B Rao
2026-05-04 6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-04 20:36 ` Matthew Wilcox
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:09 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom, bharata
Use the IBS (Instruction Based Sampling) Memory Profiler feature
present in AMD Zen6 processors for memory access tracking. The
access information obtained from the IBS Memory Profiler is fed
to the pghot sub-system for further action via the
pghot_record_access(..., PGHOT_HWHINTS, ...) API.
IBS Memory Profiler as a page hotness source is enabled by the
new config option AMD_IBS_MEMPROF (which selects HWMEM_PROFILER)
and is also gated by the existing pghot_src_hwhints static key
set via debugfs.
More details about IBS Memory Profiler can be obtained from
the AMD document titled "AMD64 Zen6 Instruction Based Sampling (IBS)
Extensions and Features".
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/Kconfig | 16 ++
arch/x86/include/asm/ibs-caps.h | 8 +
arch/x86/include/asm/ibs-mprof.h | 46 +++++
arch/x86/include/asm/msr-index.h | 8 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/ibs-mprof.c | 308 +++++++++++++++++++++++++++++++
include/linux/cpuhotplug.h | 1 +
include/linux/vm_event_item.h | 6 +
mm/Kconfig | 9 +
mm/vmstat.c | 6 +
10 files changed, 409 insertions(+)
create mode 100644 arch/x86/include/asm/ibs-mprof.h
create mode 100644 arch/x86/mm/ibs-mprof.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 99bb5217649a..f06c0c44ecce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1514,6 +1514,22 @@ config AMD_MEM_ENCRYPT
This requires an AMD processor that supports Secure Memory
Encryption (SME).
+config AMD_IBS_MEMPROF
+ bool "AMD IBS Memory Profiler"
+ depends on X86_64 && CPU_SUP_AMD
+ depends on PGHOT
+ select HWMEM_PROFILER
+ help
+ Use the AMD Instruction Based Sampling (IBS) Memory Profiler
+ facility (present on Zen6 and later AMD CPUs) to feed
+ hardware-observed memory accesses into the pghot subsystem
+ for hot-page detection and promotion.
+
+ When disabled, no IBS Memory Profiler MSRs are programmed and
+ the corresponding NMI handler is not installed.
+
+ If unsure, say N.
+
# Common NUMA Features
config NUMA
bool "NUMA Memory Allocation and Scheduler Support"
diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
index ddf6c512c8f9..1f6c4058a0e3 100644
--- a/arch/x86/include/asm/ibs-caps.h
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -29,6 +29,7 @@
#define IBS_CAPS_FETCHLAT (1U<<14)
#define IBS_CAPS_BIT63_FILTER (1U<<15)
#define IBS_CAPS_STRMST_RMTSOCKET (1U<<16)
+#define IBS_CAPS_MEM_PROFILER (1U<<18)
#define IBS_CAPS_OPDTLBPGSIZE (1U<<19)
#define IBS_CAPS_DEFAULT (IBS_CAPS_AVAIL \
@@ -42,6 +43,13 @@
#define IBSCTL_LVT_OFFSET_VALID (1ULL<<8)
#define IBSCTL_LVT_OFFSET_MASK 0x0F
+/*
+ * IBS Memprofiler setup
+ */
+#define IBSCTL_MPROF_LVT_OFFSET_VALID (1ULL << 24)
+#define IBSCTL_MPROF_LVT_OFFSET_SHIFT 16
+#define IBSCTL_MPROF_LVT_OFFSET_MASK (0xFULL << IBSCTL_MPROF_LVT_OFFSET_SHIFT)
+
/* IBS fetch bits/masks */
#define IBS_FETCH_L3MISSONLY (1ULL << 59)
#define IBS_FETCH_RAND_EN (1ULL << 57)
diff --git a/arch/x86/include/asm/ibs-mprof.h b/arch/x86/include/asm/ibs-mprof.h
new file mode 100644
index 000000000000..91b1ce51d667
--- /dev/null
+++ b/arch/x86/include/asm/ibs-mprof.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_MPROF_H
+#define _ASM_X86_IBS_MPROF_H
+
+/*
+ * All bits are documented here for clarity even if the current
+ * driver doesn't use all of them.
+ */
+
+/* MSR_AMD64_IBS_MPROF_DATA2 bits */
+#define IBS_MPROF_DATA2_DATASRC_MASK 0x7
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH 0xC0
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT 0x3
+#define IBS_MPROF_DATA2_DATASRC_LCL_CCX 0x1
+#define IBS_MPROF_DATA2_DATASRC_PEER_CCX_NEAR 0x2
+#define IBS_MPROF_DATA2_DATASRC_DRAM 0x3
+#define IBS_MPROF_DATA2_DATASRC_CCX_FAR 0x5
+#define IBS_MPROF_DATA2_DATASRC_EXT_MEM 0x8
+#define IBS_MPROF_DATA2_RMT_NODE BIT_ULL(4)
+#define IBS_MPROF_DATA2_RMT_SOCKET BIT_ULL(9)
+
+/* MSR_AMD64_IBS_MPROF_DATA3 bits */
+#define IBS_MPROF_DATA3_LDOP BIT_ULL(0)
+#define IBS_MPROF_DATA3_STOP BIT_ULL(1)
+#define IBS_MPROF_DATA3_DCMISS BIT_ULL(7)
+#define IBS_MPROF_DATA3_LADDR_VALID BIT_ULL(17)
+#define IBS_MPROF_DATA3_PADDR_VALID BIT_ULL(18)
+#define IBS_MPROF_DATA3_L2MISS BIT_ULL(20)
+#define IBS_MPROF_DATA3_SW_PREFETCH BIT_ULL(21)
+
+/* MSR_AMD64_IBS_MPROF_CTL bits */
+#define IBS_MPROF_CTL_CNT_CTL BIT_ULL(19)
+#define IBS_MPROF_CTL_VAL BIT_ULL(18)
+#define IBS_MPROF_CTL_ENABLE BIT_ULL(17)
+#define IBS_MPROF_CTL_L3MISSONLY BIT_ULL(16)
+#define IBS_MPROF_CTL_MAXCNT_MASK 0x0000FFFFULL
+#define IBS_MPROF_CTL_MAXCNT_EXT_MASK (0x7FULL << 20) /* separate upper 7 bits */
+
+/* MSR_AMD64_IBS_MPROF_CTL2 bits */
+#define IBS_MPROF_CTL2_DISABLE BIT_ULL(0)
+#define IBS_MPROF_CTL2_EXCLUDE_USER BIT_ULL(1)
+#define IBS_MPROF_CTL2_EXCLUDE_KERNEL BIT_ULL(2)
+
+#define IBS_MPROF_SAMPLE_PERIOD 10000
+
+#endif /* _ASM_X86_IBS_MPROF_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a14a0f43e04a..c44b68940f43 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1315,4 +1315,12 @@
* a #GP
*/
+/* AMD IBS Memory Profiler MSRs */
+#define MSR_AMD64_IBS_MPROF_CTL 0xc0010380
+#define MSR_AMD64_IBS_MPROF_CTL2 0xc0010381
+#define MSR_AMD64_IBS_MPROF_DATA2 0xc0010382
+#define MSR_AMD64_IBS_MPROF_DATA3 0xc0010383
+#define MSR_AMD64_IBS_MPROF_LINADDR 0xc0010384
+#define MSR_AMD64_IBS_MPROF_PHYADDR 0xc0010385
+
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 3a5364853eab..050a7379d9f7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+obj-$(CONFIG_AMD_IBS_MEMPROF) += ibs-mprof.o
diff --git a/arch/x86/mm/ibs-mprof.c b/arch/x86/mm/ibs-mprof.c
new file mode 100644
index 000000000000..b3d59b21c8c9
--- /dev/null
+++ b/arch/x86/mm/ibs-mprof.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "amd_ibs_memprof: " fmt
+
+#include <linux/init.h>
+#include <linux/pghot.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/irq_work.h>
+#include <linux/mm.h>
+#include <linux/vm_event_item.h>
+#include <linux/vmstat.h>
+#include <linux/cpuhotplug.h>
+
+#include <asm/ibs-mprof.h>
+#include <asm/ibs-caps.h>
+#include <asm/nmi.h>
+#include <asm/apic.h>
+
+#define IBS_NR_SAMPLES 150 /* Percpu sample buffer size */
+
+static DEFINE_PER_CPU(bool, mprof_work_pending);
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct mprof_sample {
+ unsigned long pfn;
+ unsigned long time; /* jiffies when accessed */
+ int nid; /* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to pghot sub-system for further action.
+ */
+struct mprof_sample_pcpu {
+ struct mprof_sample samples[IBS_NR_SAMPLES];
+ int head, tail;
+};
+
+static struct mprof_sample_pcpu __percpu *mprof_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to pghot sub-system.
+ */
+static DEFINE_PER_CPU(struct work_struct, mprof_work);
+static DEFINE_PER_CPU(struct irq_work, mprof_irq_work);
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static bool mprof_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+ struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+ int head = READ_ONCE(pcpu->head);
+ int tail = READ_ONCE(pcpu->tail);
+ int next = head + 1;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ if (next == tail)
+ return false;
+
+ pcpu->samples[head].pfn = pfn;
+ pcpu->samples[head].time = time;
+ pcpu->samples[head].nid = nid;
+
+ smp_store_release(&pcpu->head, next);
+ return true;
+}
+
+static bool mprof_pop_sample(struct mprof_sample *s)
+{
+ struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+ int tail = READ_ONCE(pcpu->tail);
+ int head = smp_load_acquire(&pcpu->head);
+ int next = tail + 1;
+
+ if (head == tail)
+ return false;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ *s = pcpu->samples[tail];
+
+ WRITE_ONCE(pcpu->tail, next);
+ return true;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to pghot sub-system for further action.
+ */
+static void mprof_work_handler(struct work_struct *work)
+{
+ struct mprof_sample s;
+
+ while (mprof_pop_sample(&s))
+ pghot_record_access(s.pfn, s.nid, PGHOT_HWHINTS, s.time);
+
+ this_cpu_write(mprof_work_pending, false);
+}
+
+static void mprof_irq_handler(struct irq_work *i)
+{
+ struct work_struct *w = this_cpu_ptr(&mprof_work);
+
+ /*
+ * FIXME: pending samples on a CPU that goes offline before the
+ * work runs may be lost or migrated to the wrong CPU's ring;
+ * needs a teardown-time drain.
+ */
+ schedule_work_on(smp_processor_id(), w);
+}
+
+/*
+ * L3MissOnly + Exclude kernel RIP
+ */
+static void mprof_enable_profiling(void)
+{
+ u64 mprof_config = IBS_MPROF_CTL_CNT_CTL | IBS_MPROF_CTL_ENABLE |
+ IBS_MPROF_CTL_L3MISSONLY;
+ unsigned int period = IBS_MPROF_SAMPLE_PERIOD;
+ u64 ctl, ctl2;
+
+ /*
+ * Assemble bits 26:20 and 19:4 of periodic op counter in ctl.
+ * The lower 4 bits are always 0000b.
+ */
+ ctl = (period >> 4) & IBS_MPROF_CTL_MAXCNT_MASK;
+ ctl |= (period & IBS_MPROF_CTL_MAXCNT_EXT_MASK);
+ ctl |= mprof_config;
+ wrmsrq(MSR_AMD64_IBS_MPROF_CTL, ctl);
+
+ /*
+ * Exclude samples that have bit 63 of their RIP set.
+ */
+ ctl2 = IBS_MPROF_CTL2_EXCLUDE_KERNEL;
+ wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, ctl2);
+}
+
+static void mprof_disable_profiling(u64 mem_ctl)
+{
+ mem_ctl &= ~IBS_MPROF_CTL_ENABLE;
+ mem_ctl &= ~IBS_MPROF_CTL_VAL;
+ wrmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+
+ wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, IBS_MPROF_CTL2_DISABLE);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int mprof_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+ u64 mem_ctl, mem_data3, mem_data2, paddr, data_src;
+ unsigned long pfn;
+ struct page *page;
+
+ rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+ if (!(mem_ctl & IBS_MPROF_CTL_VAL))
+ return NMI_DONE;
+
+ mprof_disable_profiling(mem_ctl);
+ count_vm_event(HWHINT_TOTAL_EVENTS);
+
+ rdmsrq(MSR_AMD64_IBS_MPROF_DATA3, mem_data3);
+ rdmsrq(MSR_AMD64_IBS_MPROF_DATA2, mem_data2);
+
+ data_src = mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK;
+ data_src |= ((mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK_HIGH) >>
+ IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT);
+
+ switch (data_src) {
+ case IBS_MPROF_DATA2_DATASRC_DRAM:
+ count_vm_event(HWHINT_DRAM_ACCESSES);
+ break;
+ case IBS_MPROF_DATA2_DATASRC_EXT_MEM:
+ count_vm_event(HWHINT_EXTMEM_ACCESSES);
+ break;
+ }
+
+ /* Is linear addr valid? */
+ if (!(mem_data3 & IBS_MPROF_DATA3_LADDR_VALID))
+ goto handled;
+
+ /* Is phys addr valid? */
+ if (!(mem_data3 & IBS_MPROF_DATA3_PADDR_VALID))
+ goto handled;
+ rdmsrq(MSR_AMD64_IBS_MPROF_PHYADDR, paddr);
+
+ pfn = PHYS_PFN(paddr);
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ goto handled;
+
+ /*
+ * Use the accessing CPU's node as the migration target. On
+ * topologies where all CPUs reside on toptier nodes (the common
+ * case), this is the desired behaviour. Topologies that place
+ * CPUs on lower-tier nodes are rejected later by
+ * pghot_record_access() via the src_nid == nid early return.
+ */
+ if (!mprof_push_sample(pfn, numa_node_id(), jiffies))
+ goto handled;
+
+ if (!this_cpu_read(mprof_work_pending)) {
+ this_cpu_write(mprof_work_pending, true);
+ irq_work_queue(this_cpu_ptr(&mprof_irq_work));
+ }
+ count_vm_event(HWHINT_USEFUL_EVENTS);
+
+handled:
+ mprof_enable_profiling();
+ return NMI_HANDLED;
+}
+
+static int get_mprof_lvt_offset(void)
+{
+ u64 val;
+
+ rdmsrq(MSR_AMD64_IBSCTL, val);
+ if (!(val & IBSCTL_MPROF_LVT_OFFSET_VALID))
+ return -EINVAL;
+
+ return (val & IBSCTL_MPROF_LVT_OFFSET_MASK) >>
+ IBSCTL_MPROF_LVT_OFFSET_SHIFT;
+}
+
+static int x86_amd_ibs_mprof_startup(unsigned int cpu)
+{
+ int offset = get_mprof_lvt_offset();
+
+ if (offset < 0) {
+ pr_warn("offset not valid on cpu #%d\n", cpu);
+ return 0;
+ }
+
+ if (setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_NMI, 0)) {
+ pr_warn("APIC setup failed on cpu #%d\n", cpu);
+ return 0;
+ }
+
+ mprof_enable_profiling();
+ return 0;
+}
+
+static int x86_amd_ibs_mprof_teardown(unsigned int cpu)
+{
+ int offset = get_mprof_lvt_offset();
+ u64 mem_ctl;
+
+ if (offset >= 0)
+ setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_FIXED, 1);
+
+ rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+ mprof_disable_profiling(mem_ctl);
+
+ return 0;
+}
+
+static int __init mprof_access_profiling_init(void)
+{
+ u32 mprof_caps = cpuid_eax(IBS_CPUID_FEATURES);
+ int cpu, ret;
+
+ if (!(mprof_caps & IBS_CAPS_MEM_PROFILER)) {
+ pr_info("capability is unavailable for access profiling\n");
+ return 0;
+ }
+
+ mprof_s = alloc_percpu_gfp(struct mprof_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
+ if (!mprof_s) {
+ pr_err("alloc_percpu_gfp failed\n");
+ return 0;
+ }
+
+ for_each_possible_cpu(cpu) {
+ INIT_WORK(per_cpu_ptr(&mprof_work, cpu), mprof_work_handler);
+ init_irq_work(per_cpu_ptr(&mprof_irq_work, cpu), mprof_irq_handler);
+ }
+
+ register_nmi_handler(NMI_LOCAL, mprof_overflow_handler, 0, "ibs-memprof");
+
+ ret = cpuhp_setup_state(CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
+ "x86/amd/ibs_mprof:starting",
+ x86_amd_ibs_mprof_startup,
+ x86_amd_ibs_mprof_teardown);
+
+ if (ret) {
+ unregister_nmi_handler(NMI_LOCAL, "ibs-memprof");
+ free_percpu(mprof_s);
+ pr_err("cpuhp_setup_state failed: %d\n", ret);
+ } else {
+ pr_info("IBS Memory Profiler setup for memory access profiling\n");
+ }
+ return 0;
+}
+
+device_initcall(mprof_access_profiling_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22ba327ec227..feaa3f571726 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -150,6 +150,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
CPUHP_AP_PERF_X86_STARTING,
CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+ CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
CPUHP_AP_PERF_XTENSA_STARTING,
CPUHP_AP_ARM_VFP_STARTING,
CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 58d510711bd4..a9c04a9735c6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -179,6 +179,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGHOT_RECORDED_ACCESSES,
PGHOT_RECORDED_HINTFAULTS,
PGHOT_RECORDED_HWHINTS,
+#ifdef CONFIG_HWMEM_PROFILER
+ HWHINT_TOTAL_EVENTS,
+ HWHINT_DRAM_ACCESSES,
+ HWHINT_EXTMEM_ACCESSES,
+ HWHINT_USEFUL_EVENTS,
+#endif /* CONFIG_HWMEM_PROFILER */
#endif /* CONFIG_PGHOT */
NR_VM_EVENT_ITEMS
};
diff --git a/mm/Kconfig b/mm/Kconfig
index cc4b5685ecd4..674cfcea7bb0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1494,6 +1494,15 @@ config PGHOT_PRECISE
4 bytes per page against the default one byte per page. Preferable
to enable this on systems with multiple nodes in toptier.
+config HWMEM_PROFILER
+ bool
+ depends on PGHOT
+ help
+ Umbrella symbol enabled by any in-kernel driver that forwards
+ hardware-observed memory accesses to the pghot subsystem (for
+ example AMD_IBS_MEMPROF on x86_64). Drivers select this; users
+ do not enable it directly.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index da668ff05032..06e7ae06519e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1493,6 +1493,12 @@ const char * const vmstat_text[] = {
[I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
[I(PGHOT_RECORDED_HINTFAULTS)] = "pghot_recorded_hintfaults",
[I(PGHOT_RECORDED_HWHINTS)] = "pghot_recorded_hwhints",
+#ifdef CONFIG_HWMEM_PROFILER
+ [I(HWHINT_TOTAL_EVENTS)] = "hwhint_total_events",
+ [I(HWHINT_DRAM_ACCESSES)] = "hwhint_dram_accesses",
+ [I(HWHINT_EXTMEM_ACCESSES)] = "hwhint_extmem_accesses",
+ [I(HWHINT_USEFUL_EVENTS)] = "hwhint_useful_events",
+#endif /* CONFIG_HWMEM_PROFILER */
#endif /* CONFIG_PGHOT */
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (6 preceding siblings ...)
2026-05-04 6:09 ` [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler Bharata B Rao
@ 2026-05-04 6:23 ` Bharata B Rao
2026-05-04 20:36 ` Matthew Wilcox
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2026-05-04 6:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom
On 04-May-26 11:39 AM, Bharata B Rao wrote:
>
> Results
> =======
> Posted as replies to this mail thread.
Micro-benchmark numbers for IBS Memory Profiler pghot source
Test system details
-------------------
2 node AMD system with 1 regular NUMA node (0) and a CXL node (1)
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-255
node 0 size: 515563 MB
node 1 cpus:
node 1 size: 258034 MB
node distances:
node 0 1
0: 10 50
1: 255 10
Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case
HWHINTS - IBS Memory Profiler as source for pghot
Pghot by default promotes after two accesses, but for the NUMAB2
source promotion is done after one access to match the base
behaviour (/sys/kernel/debug/pghot/freq_threshold=1).
==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory (8G) at
4K granularity, repetitively and randomly. The number of accesses per
thread and the randomness pattern for each thread are fixed beforehand.
The accesses are divided into stores and loads in the ratio of 50:50.
Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 1 before the accesses start.
Repeated accesses result in lower-tier pages becoming hot; kmigrated
detects and migrates them. The benchmark score is the time taken to
finish the accesses, in microseconds; the sooner it finishes, the better.
All the numbers shown below are average of 3 runs.
Time taken (microseconds, lower is better)
---------------------------------------------------------
Source Base Pghot-default
---------------------------------------------------------
NUMAB0 181,393,365 184,331,381
NUMAB2 42,287,528
HWHINTS NA 50,422,862
---------------------------------------------------------
Stats comparison between base-NUMAB2 and pghot-default-hwhints
---------------------------------------------------------------------
Base-NUMAB2 Pghot-default-hwhints
---------------------------------------------------------------------
pgpromote_success 2097152 1961087
numa_hint_faults 2358069 0
pghot_recorded_accesses NA 1962696
pghot_recorded_hintfaults NA 0
pghot_recorded_hwhints NA 5532979
hwhint_total_events NA 5532979
---------------------------------------------------------------------
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
2026-05-04 6:09 [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (7 preceding siblings ...)
2026-05-04 6:23 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-04 20:36 ` Matthew Wilcox
8 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2026-05-04 20:36 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, donettom
On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
I continue to think we should not do this.
^ permalink raw reply [flat|nested] 12+ messages in thread