Re: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

From: Donet Tom <donettom@linux.ibm.com>
To: Bharata B Rao <bharata@amd.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Jonathan.Cameron@huawei.com, dave.hansen@intel.com,
	gourry@gourry.net, mgorman@techsingularity.net, mingo@redhat.com,
	peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
	rientjes@google.com, sj@kernel.org, weixugc@google.com,
	willy@infradead.org, ying.huang@linux.alibaba.com,
	ziy@nvidia.com, dave@stgolabs.net, nifan.cxl@gmail.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
	kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
	balbirs@nvidia.com, alok.rathore@samsung.com, shivankg@amd.com
Subject: Re: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot
Date: Fri, 24 Apr 2026 18:27:02 +0530	[thread overview]
Message-ID: <250e68f3-3664-4148-bfbf-52fd4230a3b9@linux.ibm.com> (raw)
In-Reply-To: <20260323095104.238982-4-bharata@amd.com>

Hi Bharata

On 3/23/26 3:21 PM, Bharata B Rao wrote:
> pghot is a subsystem that collects memory access information from
> multiple sources, classifies hot pages resident in lower-tier memory,
> and promotes them to faster tiers. It stores per-PFN hotness metadata
> and performs asynchronous, batched promotion via a per-lower-tier-node
> kernel thread (kmigrated).
>
> This change introduces the default (compact) mode of pghot:
>
> - Per-PFN hotness record (phi_t = u8) embedded via mem_section:
>    - 2 bits: access frequency (4 levels)
>    - 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
>    - 1 bit : migration-ready flag (MSB)
>    The LSB of mem_section->hot_map pointer is used as a per-section
>    "hot" flag to gate scanning.
>
> - Event recording API:
>    int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>    @pfn: The PFN of the memory accessed
>    @nid: The accessing NUMA node ID
>    @src: The temperature source (subsystem) that generated the
>          access info
>    @time: The access time in jiffies
>    - Sources (e.g., NUMA hint faults, HW hints) call this to report
>      accesses.
>    - In default mode, the nid is not stored/used for targeting;
>      promotion goes to a configurable toptier node (pghot_target_nid).
>
> - Promotion engine:
>    - One kmigrated thread per lower-tier node.
>    - Scans only sections whose "hot" flag was raised, iterates PFNs,
>      and batches candidates by destination node.
>    - Uses migrate_misplaced_folios_batch() to move batched folios.
>
> - Tunables & stats:
>    - debugfs: enabled_sources, target_nid, freq_threshold,
>               kmigrated_sleep_ms, kmigrated_batch_nr
>    - sysctl : vm.pghot_promote_freq_window_ms
>    - vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
>               pghot_recorded_hwhints
>
> Memory overhead
> ---------------
> Default mode uses 1 byte of hotness metadata per PFN on lower-tier
> nodes.
>
> Behavior & policy
> -----------------
> - Default mode promotion target:
>    The nid passed by sources is not stored; hot pages promote to
>    pghot_target_nid (toptier). Precision mode (added later in the
>    series) changes this.
>
> - Record consumption:
>    kmigrated consumes (clears) the "migration-ready" bit before
>    attempting isolation. If isolation/migration fails, the folio is
>    not re-queued automatically; subsequent accesses will re-arm it.
>    This avoids retry storms and keeps batching stable.
>
> - Wakeups:
>    kmigrated wakeups are intentionally timeout-driven in v6. We set
>    the per-pgdat "activate" flag on access, and kmigrated checks this
>    flag on its next sleep interval. This keeps the first cut simple
>    and avoids potential wake storms; active wakeups can be considered
>    in a follow-up.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>   Documentation/admin-guide/mm/pghot.txt |  80 +++++
>   include/linux/migrate.h                |   4 +-
>   include/linux/mmzone.h                 |  20 ++
>   include/linux/pghot.h                  |  82 +++++
>   include/linux/vm_event_item.h          |   5 +
>   mm/Kconfig                             |  14 +
>   mm/Makefile                            |   1 +
>   mm/migrate.c                           |  19 +-
>   mm/mm_init.c                           |  10 +
>   mm/pghot-default.c                     |  79 ++++
>   mm/pghot-tunables.c                    | 182 ++++++++++
>   mm/pghot.c                             | 479 +++++++++++++++++++++++++
>   mm/vmstat.c                            |   5 +
>   13 files changed, 971 insertions(+), 9 deletions(-)
>   create mode 100644 Documentation/admin-guide/mm/pghot.txt
>   create mode 100644 include/linux/pghot.h
>   create mode 100644 mm/pghot-default.c
>   create mode 100644 mm/pghot-tunables.c
>   create mode 100644 mm/pghot.c
>
> diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
> new file mode 100644
> index 000000000000..5f51dd1d4d45
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/pghot.txt
> @@ -0,0 +1,80 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=================================
> +PGHOT: Hot Page Tracking Tunables
> +=================================
> +
> +Overview
> +========
> +The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
> +promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
> +migration via per-node kernel threads (kmigrated).
> +
> +This document describes tunables available via **debugfs** and **sysctl** for
> +PGHOT.
> +
> +Debugfs Interface
> +=================
> +Path: /sys/kernel/debug/pghot/
> +
> +1. **enabled_sources**
> +   - Bitmask to enable/disable hotness sources.
> +   - Bits:
> +     - 0: Hint faults (value 0x1)
> +     - 1: Hardware hints (value 0x2)
> +   - Default: 0 (disabled)
> +   - Example:
> +     # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
> +     Enables all sources.
> +
> +2. **target_nid**
> +   - Toptier NUMA node ID to which hot pages should be promoted when source
> +     does not provide nid. Used when hotness source can't provide accessing
> +     NID or when the tracking mode is default.
> +   - Default: 0
> +   - Example:
> +     # echo 1 > /sys/kernel/debug/pghot/target_nid
> +
> +3. **freq_threshold**
> +   - Minimum access frequency before a page is marked ready for promotion.
> +   - Range: 1 to 3
> +   - Default: 2
> +   - Example:
> +     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
> +
> +4. **kmigrated_sleep_ms**
> +   - Sleep interval (ms) for kmigrated thread between scans.
> +   - Default: 100
> +
> +5. **kmigrated_batch_nr**
> +   - Maximum number of folios migrated in one batch.
> +   - Default: 512
> +
> +Sysctl Interface
> +================
> +1. pghot_promote_freq_window_ms
> +
> +Path: /proc/sys/vm/pghot_promote_freq_window_ms
> +
> +- Controls the time window (in ms) for counting access frequency. A page is
> +  considered hot only when **freq_threshold** number of accesses occur with
> +  this time period.
> +- Default: 3000 (3 seconds)
> +- Example:
> +  # sysctl vm.pghot_promote_freq_window_ms=3000
> +
> +Vmstat Counters
> +===============
> +Following vmstat counters provide some stats about pghot subsystem.
> +
> +Path: /proc/vmstat
> +
> +1. **pghot_recorded_accesses**
> +   - Number of total hot page accesses recorded by pghot.
> +
> +2. **pghot_recorded_hintfaults**
> +   - Number of recorded accesses reported by NUMA Balancing based
> +     hotness source.
> +
> +3. **pghot_recorded_hwhints**
> +   - Number of recorded accesses reported by hwhints source.
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 5c1e2691cec2..7f912b6ebf02 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
>   
>   #endif /* CONFIG_MIGRATION */
>   
> -#ifdef CONFIG_NUMA_BALANCING
> +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
>   int migrate_misplaced_folio_prepare(struct folio *folio,
>   		struct vm_area_struct *vma, int node);
>   int migrate_misplaced_folio(struct folio *folio, int node);
> @@ -127,7 +127,7 @@ static inline int migrate_misplaced_folios_batch(struct list_head *folio_list,
>   {
>   	return -EAGAIN; /* can't migrate now */
>   }
> -#endif /* CONFIG_NUMA_BALANCING */
> +#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
>   
>   #ifdef CONFIG_MIGRATION
>   
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..d7ed60956543 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1064,6 +1064,7 @@ enum pgdat_flags {
>   					 * many pages under writeback
>   					 */
>   	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
> +	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
>   };
>   
>   enum zone_flags {
> @@ -1518,6 +1519,10 @@ typedef struct pglist_data {
>   #ifdef CONFIG_MEMORY_FAILURE
>   	struct memory_failure_stats mf_stats;
>   #endif
> +#ifdef CONFIG_PGHOT
> +	struct task_struct *kmigrated;
> +	wait_queue_head_t kmigrated_wait;
> +#endif
>   } pg_data_t;
>   
>   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> @@ -1930,12 +1935,27 @@ struct mem_section {
>   	unsigned long section_mem_map;
>   
>   	struct mem_section_usage *usage;
> +#ifdef CONFIG_PGHOT
> +	/*
> +	 * Per-PFN hotness data for this section.
> +	 * Array of phi_t (u8 in default mode).
> +	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
> +	 */
> +	void *hot_map;
> +#endif
>   #ifdef CONFIG_PAGE_EXTENSION
>   	/*
>   	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
>   	 * section. (see page_ext.h about this.)
>   	 */
>   	struct page_ext *page_ext;
> +#endif
> +	/*
> +	 * Padding to maintain consistent mem_section size when exactly
> +	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
> +	 * optimal alignment regardless of configuration.
> +	 */
> +#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
>   	unsigned long pad;
>   #endif
>   	/*
> diff --git a/include/linux/pghot.h b/include/linux/pghot.h
> new file mode 100644
> index 000000000000..525d4dd28fc1
> --- /dev/null
> +++ b/include/linux/pghot.h
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PGHOT_H
> +#define _LINUX_PGHOT_H
> +
> +/* Page hotness temperature sources */
> +enum pghot_src {
> +	PGHOT_HINTFAULTS = 0,
> +	PGHOT_HWHINTS,
> +	PGHOT_SRC_MAX
> +};
> +
> +#ifdef CONFIG_PGHOT
> +#include <linux/static_key.h>
> +
> +extern unsigned int pghot_target_nid;
> +extern unsigned int pghot_src_enabled;
> +extern unsigned int pghot_freq_threshold;
> +extern unsigned int kmigrated_sleep_ms;
> +extern unsigned int kmigrated_batch_nr;
> +extern unsigned int sysctl_pghot_freq_window;
> +
> +void pghot_debug_init(void);
> +
> +DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
> +DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
> +
> +#define PGHOT_HINTFAULTS_ENABLED	BIT(PGHOT_HINTFAULTS)
> +#define PGHOT_HWHINTS_ENABLED		BIT(PGHOT_HWHINTS)
> +#define PGHOT_SRC_ENABLED_MASK		GENMASK(PGHOT_SRC_MAX - 1, 0)
> +
> +#define PGHOT_DEFAULT_FREQ_THRESHOLD	2
> +
> +#define KMIGRATED_DEFAULT_SLEEP_MS	100
> +#define KMIGRATED_DEFAULT_BATCH_NR	512
> +
> +#define PGHOT_DEFAULT_NODE		0
> +
> +#define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
> +
> +/*
> + * Bits 0-6 are used to store frequency and time.
> + * Bit 7 is used to indicate the page is ready for migration.
> + */
> +#define PGHOT_MIGRATE_READY		7
> +
> +#define PGHOT_FREQ_WIDTH		2
> +/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
> +#define PGHOT_TIME_BUCKETS_SHIFT	7
> +#define PGHOT_TIME_WIDTH		5
> +#define PGHOT_NID_WIDTH			10
> +
> +#define PGHOT_FREQ_SHIFT		0
> +#define PGHOT_TIME_SHIFT		(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
> +
> +#define PGHOT_FREQ_MASK			GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
> +#define PGHOT_TIME_MASK			GENMASK(PGHOT_TIME_WIDTH - 1, 0)
> +#define PGHOT_TIME_BUCKETS_MASK		(PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
> +
> +#define PGHOT_NID_MAX			((1 << PGHOT_NID_WIDTH) - 1)
> +#define PGHOT_FREQ_MAX			((1 << PGHOT_FREQ_WIDTH) - 1)
> +#define PGHOT_TIME_MAX			((1 << PGHOT_TIME_WIDTH) - 1)
> +
> +typedef u8 phi_t;
> +
> +#define PGHOT_RECORD_SIZE		sizeof(phi_t)
> +
> +#define PGHOT_SECTION_HOT_BIT		0
> +#define PGHOT_SECTION_HOT_MASK		BIT(PGHOT_SECTION_HOT_BIT)
> +
> +bool pghot_nid_valid(int nid);
> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
> +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
> +
> +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
> +#else
> +static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PGHOT */
> +#endif /* _LINUX_PGHOT_H */
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 22a139f82d75..4ce670c1bb02 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -188,6 +188,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>   		KSTACK_REST,
>   #endif
>   #endif /* CONFIG_DEBUG_STACK_USAGE */
> +#ifdef CONFIG_PGHOT
> +		PGHOT_RECORDED_ACCESSES,
> +		PGHOT_RECORDED_HINTFAULTS,
> +		PGHOT_RECORDED_HWHINTS,
> +#endif /* CONFIG_PGHOT */
>   		NR_VM_EVENT_ITEMS
>   };
>   
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebd8ea353687..4aeab6aee535 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1471,6 +1471,20 @@ config LAZY_MMU_MODE_KUNIT_TEST
>   
>   	  If unsure, say N.
>   
> +config PGHOT
> +	bool "Hot page tracking and promotion"
> +	def_bool n
> +	depends on NUMA && MIGRATION && SPARSEMEM && MMU
> +	help
> +	  A sub-system to track page accesses in lower tier memory and
> +	  maintain hot page information. Promotes hot pages from lower
> +	  tiers to top tier by using the memory access information provided
> +	  by various sources. Asynchronous promotion is done by per-node
> +	  kernel threads.
> +
> +	  This adds 1 byte of metadata overhead per page in lower-tier
> +	  memory nodes.
> +
>   source "mm/damon/Kconfig"
>   
>   endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 8ad2ab08244e..33014de43acc 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
>   obj-$(CONFIG_EXECMEM) += execmem.o
>   obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
>   obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
> +obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 94daec0f49ef..a5f48984ed3e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2606,7 +2606,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
>   	return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
>   }
>   
> -#ifdef CONFIG_NUMA_BALANCING
> +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
>   /*
>    * Returns true if this is a safe migration target node for misplaced NUMA
>    * pages. Currently it only checks the watermarks which is crude.
> @@ -2726,12 +2726,10 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>    */
>   int migrate_misplaced_folio(struct folio *folio, int node)
>   {
> -	pg_data_t *pgdat = NODE_DATA(node);
>   	int nr_remaining;
>   	unsigned int nr_succeeded;
>   	LIST_HEAD(migratepages);
>   	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
> -	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>   
>   	list_add(&folio->lru, &migratepages);
>   	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
> @@ -2740,12 +2738,18 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>   	if (nr_remaining && !list_empty(&migratepages))
>   		putback_movable_pages(&migratepages);
>   	if (nr_succeeded) {
> +#ifdef CONFIG_NUMA_BALANCING
>   		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>   		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>   		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
>   		    && !node_is_toptier(folio_nid(folio))
> -		    && node_is_toptier(node))
> +		    && node_is_toptier(node)) {
> +			pg_data_t *pgdat = NODE_DATA(node);
> +			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
>   			mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
> +		}
> +#endif
>   	}
>   	mem_cgroup_put(memcg);
>   	BUG_ON(!list_empty(&migratepages));
> @@ -2773,7 +2777,6 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>    */
>   int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>   {
> -	pg_data_t *pgdat = NODE_DATA(node);
>   	struct mem_cgroup *memcg = NULL;
>   	unsigned int nr_succeeded = 0;
>   	int nr_remaining;
> @@ -2790,14 +2793,16 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>   		putback_movable_pages(folio_list);
>   
>   	if (nr_succeeded) {
> +#ifdef CONFIG_NUMA_BALANCING
>   		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> -		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
>   		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
> +		mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
> +#endif
>   	}
>   
>   	mem_cgroup_put(memcg);
>   	WARN_ON(!list_empty(folio_list));
>   	return nr_remaining ? -EAGAIN : 0;
>   }
> -#endif /* CONFIG_NUMA_BALANCING */
> +#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
>   #endif /* CONFIG_NUMA */
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..c777c54cfe69 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1398,6 +1398,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
>   static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>   #endif
>   
> +#ifdef CONFIG_PGHOT
> +static void pgdat_init_kmigrated(struct pglist_data *pgdat)
> +{
> +	init_waitqueue_head(&pgdat->kmigrated_wait);
> +}
> +#else
> +static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
> +#endif
> +
>   static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>   {
>   	int i;
> @@ -1407,6 +1416,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>   
>   	pgdat_init_split_queue(pgdat);
>   	pgdat_init_kcompactd(pgdat);
> +	pgdat_init_kmigrated(pgdat);
>   
>   	init_waitqueue_head(&pgdat->kswapd_wait);
>   	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> diff --git a/mm/pghot-default.c b/mm/pghot-default.c
> new file mode 100644
> index 000000000000..e610062345e4
> --- /dev/null
> +++ b/mm/pghot-default.c
> @@ -0,0 +1,79 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * pghot: Default mode
> + *
> + * 1 byte hotness record per PFN.
> + * Bucketed time and frequency tracked as part of the record.
> + * Promotion to @pghot_target_nid by default.
> + */
> +
> +#include <linux/pghot.h>
> +#include <linux/jiffies.h>
> +
> +/* pghot-default doesn't store and hence no NID validation is required */
> +bool pghot_nid_valid(int nid)
> +{
> +	return true;
> +}
> +
> +/*
> + * @time is regular time, @old_time is bucketed time.
> + */
> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
> +{
> +	time &= PGHOT_TIME_BUCKETS_MASK;
> +	old_time <<= PGHOT_TIME_BUCKETS_SHIFT;
> +
> +	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK);
> +}
> +
> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
> +{
> +	phi_t freq, old_freq, hotness, old_hotness, old_time;
> +	phi_t time = now >> PGHOT_TIME_BUCKETS_SHIFT;
> +
> +	old_hotness = READ_ONCE(*phi);
> +	do {
> +		bool new_window = false;
> +
> +		hotness = old_hotness;
> +		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
> +		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
> +
> +		if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window)
> +			new_window = true;
> +
> +		if (new_window)
> +			freq = 1;
> +		else if (old_freq < PGHOT_FREQ_MAX)
> +			freq = old_freq + 1;
> +		else
> +			freq = old_freq;
> +
> +		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
> +		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
> +
> +		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
> +		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
> +
> +		if (freq >= pghot_freq_threshold)
> +			hotness |= BIT(PGHOT_MIGRATE_READY);
> +	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
> +	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
> +}
> +
> +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
> +{
> +	phi_t old_hotness, hotness = 0;
> +
> +	old_hotness = READ_ONCE(*phi);
> +	do {
> +		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
> +			return -EINVAL;
> +	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
> +
> +	*nid = pghot_target_nid;
> +	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
> +	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
> +	return 0;
> +}
> diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
> new file mode 100644
> index 000000000000..f04e2137309e
> --- /dev/null
> +++ b/mm/pghot-tunables.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * pghot tunables in debugfs
> + */
> +#include <linux/pghot.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/debugfs.h>
> +
> +static struct dentry *debugfs_pghot;
> +static DEFINE_MUTEX(pghot_tunables_lock);
> +
> +static ssize_t pghot_freq_th_write(struct file *filp, const char __user *ubuf,
> +				   size_t cnt, loff_t *ppos)
> +{
> +	char buf[16];
> +	unsigned int freq;
> +
> +	if (cnt > 15)
> +		cnt = 15;
> +
> +	if (copy_from_user(&buf, ubuf, cnt))
> +		return -EFAULT;
> +	buf[cnt] = '\0';
> +
> +	if (kstrtouint(buf, 10, &freq))
> +		return -EINVAL;
> +
> +	if (!freq || freq > PGHOT_FREQ_MAX)
> +		return -EINVAL;
> +
> +	mutex_lock(&pghot_tunables_lock);
> +	pghot_freq_threshold = freq;
> +	mutex_unlock(&pghot_tunables_lock);
> +
> +	*ppos += cnt;
> +	return cnt;
> +}
> +
> +static int pghot_freq_th_show(struct seq_file *m, void *v)
> +{
> +	seq_printf(m, "%d\n", pghot_freq_threshold);
> +	return 0;
> +}
> +
> +static int pghot_freq_th_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, pghot_freq_th_show, NULL);
> +}
> +
> +static const struct file_operations pghot_freq_th_fops = {
> +	.open		= pghot_freq_th_open,
> +	.write		= pghot_freq_th_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +};
> +
> +static ssize_t pghot_target_nid_write(struct file *filp, const char __user *ubuf,
> +				      size_t cnt, loff_t *ppos)
> +{
> +	char buf[16];
> +	unsigned int nid;
> +
> +	if (cnt > 15)
> +		cnt = 15;
> +
> +	if (copy_from_user(&buf, ubuf, cnt))
> +		return -EFAULT;
> +	buf[cnt] = '\0';
> +
> +	if (kstrtouint(buf, 10, &nid))
> +		return -EINVAL;
> +
> +	if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid))
> +		return -EINVAL;
> +	mutex_lock(&pghot_tunables_lock);
> +	pghot_target_nid = nid;
> +	mutex_unlock(&pghot_tunables_lock);
> +
> +	*ppos += cnt;
> +	return cnt;
> +}
> +
> +static int pghot_target_nid_show(struct seq_file *m, void *v)
> +{
> +	seq_printf(m, "%d\n", pghot_target_nid);
> +	return 0;
> +}
> +
> +static int pghot_target_nid_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, pghot_target_nid_show, NULL);
> +}
> +
> +static const struct file_operations pghot_target_nid_fops = {
> +	.open		= pghot_target_nid_open,
> +	.write		= pghot_target_nid_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +};
> +
> +static void pghot_src_enabled_update(unsigned int enabled)
> +{
> +	unsigned int changed = pghot_src_enabled ^ enabled;
> +
> +	if (changed & PGHOT_HINTFAULTS_ENABLED) {
> +		if (enabled & PGHOT_HINTFAULTS_ENABLED)
> +			static_branch_enable(&pghot_src_hintfaults);
> +		else
> +			static_branch_disable(&pghot_src_hintfaults);
> +	}
> +
> +	if (changed & PGHOT_HWHINTS_ENABLED) {
> +		if (enabled & PGHOT_HWHINTS_ENABLED)
> +			static_branch_enable(&pghot_src_hwhints);
> +		else
> +			static_branch_disable(&pghot_src_hwhints);
> +	}
> +}
> +
> +static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
> +					   size_t cnt, loff_t *ppos)
> +{
> +	char buf[16];
> +	unsigned int enabled;
> +
> +	if (cnt > 15)
> +		cnt = 15;
> +
> +	if (copy_from_user(&buf, ubuf, cnt))
> +		return -EFAULT;
> +	buf[cnt] = '\0';
> +
> +	if (kstrtouint(buf, 0, &enabled))
> +		return -EINVAL;
> +
> +	if (enabled & ~PGHOT_SRC_ENABLED_MASK)
> +		return -EINVAL;
> +
> +	mutex_lock(&pghot_tunables_lock);
> +	pghot_src_enabled_update(enabled);
> +	pghot_src_enabled = enabled;
> +	mutex_unlock(&pghot_tunables_lock);
> +
> +	*ppos += cnt;
> +	return cnt;
> +}
> +
> +static int pghot_src_enabled_show(struct seq_file *m, void *v)
> +{
> +	seq_printf(m, "%u\n", pghot_src_enabled);
> +	return 0;
> +}
> +
> +static int pghot_src_enabled_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, pghot_src_enabled_show, NULL);
> +}
> +
> +static const struct file_operations pghot_src_enabled_fops = {
> +	.open		= pghot_src_enabled_open,
> +	.write		= pghot_src_enabled_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +};
> +
> +void pghot_debug_init(void)
> +{
> +	debugfs_pghot = debugfs_create_dir("pghot", NULL);
> +	debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL,
> +			    &pghot_src_enabled_fops);
> +	debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL,
> +			    &pghot_target_nid_fops);
> +	debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL,
> +			    &pghot_freq_th_fops);
> +	debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot,
> +			    &kmigrated_sleep_ms);
> +	debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot,
> +			    &kmigrated_batch_nr);
> +}
> diff --git a/mm/pghot.c b/mm/pghot.c
> new file mode 100644
> index 000000000000..dac9e6f3b61e
> --- /dev/null
> +++ b/mm/pghot.c
> @@ -0,0 +1,479 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Maintains information about hot pages from slower tier nodes and
> + * promotes them.
> + *
> + * Per-PFN hotness information is stored for lower tier nodes in
> + * mem_section.
> + *
> + * In the default mode, a single byte (u8) is used to store
> + * the frequency of access and last access time. Promotions are done
> + * to a default toptier NID.
> + *
> + * A kernel thread named kmigrated is provided to migrate or promote
> + * the hot pages. kmigrated runs for each lower tier node. It iterates
> + * over the node's PFNs and  migrates pages marked for migration into
> + * their targeted nodes.
> + */
> +#include <linux/mm.h>
> +#include <linux/migrate.h>
> +#include <linux/memory.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/pghot.h>
> +
> +unsigned int pghot_target_nid = PGHOT_DEFAULT_NODE;
> +unsigned int pghot_src_enabled;
> +unsigned int pghot_freq_threshold = PGHOT_DEFAULT_FREQ_THRESHOLD;
> +unsigned int kmigrated_sleep_ms = KMIGRATED_DEFAULT_SLEEP_MS;
> +unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
> +
> +unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
> +
> +DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
> +DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
> +
> +#ifdef CONFIG_SYSCTL
> +static const struct ctl_table pghot_sysctls[] = {
> +	{
> +		.procname       = "pghot_promote_freq_window_ms",
> +		.data           = &sysctl_pghot_freq_window,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1         = SYSCTL_ZERO,
> +	},
> +};
> +#endif
> +
> +static bool kmigrated_started __ro_after_init;
> +
> +/**
> + * pghot_record_access() - Record page accesses from lower tier memory
> + * for the purpose of tracking page hotness and subsequent promotion.
> + *
> + * @pfn: PFN of the page
> + * @nid: Unused
> + * @src: The identifier of the sub-system that reports the access
> + * @now: Access time in jiffies
> + *
> + * Updates the frequency and time of access and marks the page as
> + * ready for migration if the frequency crosses a threshold. The pages
> + * marked for migration are migrated by kmigrated kernel thread.
> + *
> + * Return: 0 on success and -EINVAL on failure to record the access.
> + */
> +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
> +{
> +	struct mem_section *ms;
> +	struct folio *folio;
> +	phi_t *phi, *hot_map;
> +	struct page *page;
> +
> +	if (!kmigrated_started)
> +		return 0;
> +
> +	if (!pghot_nid_valid(nid))
> +		return -EINVAL;
> +
> +	switch (src) {
> +	case PGHOT_HINTFAULTS:
> +		if (!static_branch_unlikely(&pghot_src_hintfaults))
> +			return 0;
> +		count_vm_event(PGHOT_RECORDED_HINTFAULTS);
> +		break;
> +	case PGHOT_HWHINTS:
> +		if (!static_branch_unlikely(&pghot_src_hwhints))
> +			return 0;
> +		count_vm_event(PGHOT_RECORDED_HWHINTS);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Record only accesses from lower tiers.
> +	 */
> +	if (node_is_toptier(pfn_to_nid(pfn)))
> +		return 0;


Just a thought—could we check this at the beginning of the function, 
before the switch case?


> +
> +	/*
> +	 * Reject the non-migratable pages right away.
> +	 */
> +	page = pfn_to_online_page(pfn);
> +	if (!page || is_zone_device_page(page))
> +		return 0;
> +
> +	folio = page_folio(page);
> +	if (!folio_try_get(folio))
> +		return 0;
> +
> +	if (unlikely(page_folio(page) != folio))
> +		goto out;
> +
> +	if (!folio_test_lru(folio))
> +		goto out;
> +
> +	/* Get the hotness slot corresponding to the 1st PFN of the folio */
> +	pfn = folio_pfn(folio);
> +	ms = __pfn_to_section(pfn);
> +	if (!ms || !ms->hot_map)
> +		goto out;
> +
> +	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
> +	phi = &hot_map[pfn % PAGES_PER_SECTION];
> +
> +	count_vm_event(PGHOT_RECORDED_ACCESSES);
> +
> +	/*
> +	 * Update the hotness parameters.
> +	 */
> +	if (pghot_update_record(phi, nid, now)) {
> +		set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map);
> +		set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
> +	}
> +out:
> +	folio_put(folio);
> +	return 0;
> +}
> +
> +static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
> +			     unsigned long *time)
> +{
> +	phi_t *phi, *hot_map;
> +	struct mem_section *ms;
> +
> +	ms = __pfn_to_section(pfn);
> +	if (!ms || !ms->hot_map)
> +		return -EINVAL;
> +
> +	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
> +	phi = &hot_map[pfn % PAGES_PER_SECTION];
> +
> +	return pghot_get_record(phi, nid, freq, time);
> +}
> +
> +/*
> + * Walks the PFNs of the zone, isolates and migrates them in batches.
> + */
> +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
> +				int src_nid)
> +{
> +	struct mem_cgroup *cur_memcg = NULL;
> +	int cur_nid = NUMA_NO_NODE;
> +	LIST_HEAD(migrate_list);
> +	int batch_count = 0;
> +	struct folio *folio;
> +	struct page *page;
> +	unsigned long pfn;
> +
> +	pfn = start_pfn;
> +	do {
> +		int nid = NUMA_NO_NODE, nr = 1;
> +		struct mem_cgroup *memcg;
> +		unsigned long time = 0;
> +		int freq = 0;
> +
> +		if (!pfn_valid(pfn))
> +			goto out_next;
> +
> +		page = pfn_to_online_page(pfn);
> +		if (!page)
> +			goto out_next;
> +
> +		folio = page_folio(page);
> +		if (!folio_try_get(folio))
> +			goto out_next;
> +
> +		if (unlikely(page_folio(page) != folio)) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		nr = folio_nr_pages(folio);
> +		if (folio_nid(folio) != src_nid) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		if (!folio_test_lru(folio)) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		if (pghot_get_hotness(pfn, &nid, &freq, &time)) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		if (nid == NUMA_NO_NODE)
> +			nid = pghot_target_nid;
> +
> +		if (folio_nid(folio) == nid) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
> +			folio_put(folio);
> +			goto out_next;
> +		}
> +
> +		memcg = folio_memcg(folio);
> +		if (cur_nid == NUMA_NO_NODE) {
> +			cur_nid = nid;
> +			cur_memcg = memcg;
> +		}
> +
> +		/* If NID or memcg changed, flush the previous batch first */
> +		if (cur_nid != nid || cur_memcg != memcg) {
> +			if (!list_empty(&migrate_list))
> +				migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +			cur_nid = nid;
> +			cur_memcg = memcg;
> +			batch_count = 0;
> +			cond_resched();
> +		}
> +
> +		list_add(&folio->lru, &migrate_list);
> +		folio_put(folio);
> +
> +		if (++batch_count > kmigrated_batch_nr) {
> +			migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +			batch_count = 0;
> +			cond_resched();
> +		}
> +out_next:
> +		pfn += nr;
> +	} while (pfn < end_pfn);
> +	if (!list_empty(&migrate_list))
> +		migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +}
> +
> +static void kmigrated_do_work(pg_data_t *pgdat)
> +{
> +	unsigned long section_nr, s_begin, start_pfn;
> +	struct mem_section *ms;
> +	int nid;
> +
> +	clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +	s_begin = next_present_section_nr(-1);
> +	for_each_present_section_nr(s_begin, section_nr) {
> +		start_pfn = section_nr_to_pfn(section_nr);


I may be missing something, but in pghot_setup_hot_map() and 
kmigrated_do_work() we seem to iterate over all memory sections. On 
large memory systems, could this become a bottleneck right?

Since hot_map is allocated only for lower-tier memory and the hotness 
information is primarily used there, would it make sense to skip 
scanning higher-tier sections?

for_each_online_node(nid) {
         if (node_is_toptier(nid))
             continue;

         start_pfn = node_start_pfn(nid);
         end_pfn = node_end_pfn(nid);

         s_begin = pfn_to_section_nr(start_pfn);
         for_each_present_section_nr(s_begin, section_nr) {
     }
}

Would this approach be reasonable, or am I overlooking something?



> +		ms = __nr_to_section(section_nr);
> +
> +		if (!pfn_valid(start_pfn))
> +			continue;
> +
> +		nid = pfn_to_nid(start_pfn);
> +		if (node_is_toptier(nid) || nid != pgdat->node_id)
> +			continue;
> +
> +		if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map))
> +			continue;
> +
> +		kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION,
> +				    pgdat->node_id);
> +	}
> +}
> +
> +static inline bool kmigrated_work_requested(pg_data_t *pgdat)
> +{
> +	return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +}
> +
> +/*
> + * Per-node kthread that iterates over its PFNs and migrates the
> + * pages that have been marked for migration.
> + */
> +static int kmigrated(void *p)
> +{
> +	pg_data_t *pgdat = p;
> +
> +	while (!kthread_should_stop()) {
> +		long timeout = msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms));
> +
> +		if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(pgdat),
> +				       timeout))
> +			kmigrated_do_work(pgdat);
> +	}
> +	return 0;
> +}
> +
> +static int kmigrated_run(int nid)
> +{
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	int ret;
> +
> +	if (node_is_toptier(nid))
> +		return 0;

I might be missing something, but since this function is only called 
from pghot_init(), would it make sense to check the condition before 
calling kmigrated_run() to avoid the function call overhead?


> +
> +	if (!pgdat->kmigrated) {
> +		pgdat->kmigrated = kthread_create_on_node(kmigrated, pgdat, nid,
> +							  "kmigrated%d", nid);
> +		if (IS_ERR(pgdat->kmigrated)) {
> +			ret = PTR_ERR(pgdat->kmigrated);
> +			pgdat->kmigrated = NULL;
> +			pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret);
> +			return ret;
> +		}
> +		pr_info("pghot: Started kmigrated thread for node %d\n", nid);
> +	}
> +	wake_up_process(pgdat->kmigrated);
> +	return 0;
> +}
> +
> +static void pghot_free_hot_map(struct mem_section *ms)
> +{
> +	kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK));
> +	ms->hot_map = NULL;
> +}
> +
> +static int pghot_alloc_hot_map(struct mem_section *ms, int nid)
> +{
> +	ms->hot_map = kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KERNEL,
> +				   nid);
> +	if (!ms->hot_map)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void pghot_offline_sec_hotmap(unsigned long start_pfn,
> +				     unsigned long nr_pages)
> +{
> +	unsigned long start, end, pfn;
> +	struct mem_section *ms;
> +
> +	start = SECTION_ALIGN_DOWN(start_pfn);
> +	end = SECTION_ALIGN_UP(start_pfn + nr_pages);
> +
> +	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +		if (!ms || !ms->hot_map)
> +			continue;
> +
> +		pghot_free_hot_map(ms);
> +	}
> +}
> +
> +static int pghot_online_sec_hotmap(unsigned long start_pfn,
> +				   unsigned long nr_pages)
> +{
> +	int nid = pfn_to_nid(start_pfn);
> +	unsigned long start, end, pfn;
> +	struct mem_section *ms;
> +	int fail = 0;
> +
> +	start = SECTION_ALIGN_DOWN(start_pfn);
> +	end = SECTION_ALIGN_UP(start_pfn + nr_pages);
> +
> +	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +		if (!ms || ms->hot_map)
> +			continue;
> +
> +		fail = pghot_alloc_hot_map(ms, nid);

I may be missing something, but after pghot_alloc_hot_map fails, we 
continue the loop. Would it make sense to break and go to the cleanup 
logic instead?

-Donet


> +	}
> +
> +	if (!fail)
> +		return 0;
> +
> +	/* rollback */
> +	end = pfn - PAGES_PER_SECTION;
> +	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +		if (ms && ms->hot_map)
> +			pghot_free_hot_map(ms);
> +	}
> +	return -ENOMEM;
> +}
> +
> +static int pghot_memhp_callback(struct notifier_block *self,
> +				unsigned long action, void *arg)
> +{
> +	struct memory_notify *mn = arg;
> +	int ret = 0;
> +
> +	switch (action) {
> +	case MEM_GOING_ONLINE:
> +		ret = pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages);
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_CANCEL_ONLINE:
> +		pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages);
> +		break;
> +	}
> +
> +	return notifier_from_errno(ret);
> +}
> +
> +static void pghot_destroy_hot_map(void)
> +{
> +	unsigned long section_nr, s_begin;
> +	struct mem_section *ms;
> +
> +	s_begin = next_present_section_nr(-1);
> +	for_each_present_section_nr(s_begin, section_nr) {
> +		ms = __nr_to_section(section_nr);
> +		pghot_free_hot_map(ms);
> +	}
> +}
> +
> +static int pghot_setup_hot_map(void)
> +{
> +	unsigned long section_nr, s_begin, start_pfn;
> +	struct mem_section *ms;
> +	int nid;
> +
> +	s_begin = next_present_section_nr(-1);
> +	for_each_present_section_nr(s_begin, section_nr) {
> +		ms = __nr_to_section(section_nr);
> +		start_pfn = section_nr_to_pfn(section_nr);
> +		nid = pfn_to_nid(start_pfn);
> +
> +		if (node_is_toptier(nid) || !pfn_valid(start_pfn))
> +			continue;
> +
> +		if (pghot_alloc_hot_map(ms, nid))
> +			goto out_free_hot_map;
> +	}
> +	hotplug_memory_notifier(pghot_memhp_callback, DEFAULT_CALLBACK_PRI);
> +	return 0;
> +
> +out_free_hot_map:
> +	pghot_destroy_hot_map();
> +	return -ENOMEM;
> +}
> +
> +static int __init pghot_init(void)
> +{
> +	pg_data_t *pgdat;
> +	int nid, ret;
> +
> +	ret = pghot_setup_hot_map();
> +	if (ret)
> +		return ret;
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		ret = kmigrated_run(nid);
> +		if (ret)
> +			goto out_stop_kthread;
> +	}
> +	register_sysctl_init("vm", pghot_sysctls);
> +	pghot_debug_init();
> +
> +	kmigrated_started = true;
> +	return 0;
> +
> +out_stop_kthread:
> +	for_each_node_state(nid, N_MEMORY) {
> +		pgdat = NODE_DATA(nid);
> +		if (pgdat->kmigrated) {
> +			kthread_stop(pgdat->kmigrated);
> +			pgdat->kmigrated = NULL;
> +		}
> +	}
> +	pghot_destroy_hot_map();
> +	return ret;
> +}
> +
> +late_initcall_sync(pghot_init)
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 86b14b0f77b5..d3fbe2a5d0e6 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1486,6 +1486,11 @@ const char * const vmstat_text[] = {
>   	[I(KSTACK_REST)]			= "kstack_rest",
>   #endif
>   #endif
> +#ifdef CONFIG_PGHOT
> +	[I(PGHOT_RECORDED_ACCESSES)]		= "pghot_recorded_accesses",
> +	[I(PGHOT_RECORDED_HINTFAULTS)]		= "pghot_recorded_hintfaults",
> +	[I(PGHOT_RECORDED_HWHINTS)]		= "pghot_recorded_hwhints",
> +#endif /* CONFIG_PGHOT */
>   #undef I
>   #endif /* CONFIG_VM_EVENT_COUNTERS */
>   };

next prev parent reply	other threads:[~2026-04-24 12:58 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-23  9:50 [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 1/5] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 2/5] mm: migrate: Add migrate_misplaced_folios_batch() Bharata B Rao
2026-03-26  5:50   ` Bharata B Rao
2026-04-21 15:25   ` Donet Tom
2026-04-21 16:05     ` Gregory Price
2026-04-22  3:26       ` Bharata B Rao
2026-04-22  3:37         ` Gregory Price
2026-04-22  4:04           ` Donet Tom
2026-04-22  4:15             ` Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot Bharata B Rao
2026-04-24 12:57   ` Donet Tom [this message]
2026-04-24 13:21     ` Gregory Price
2026-04-24 15:40       ` Donet Tom
2026-04-27  5:24     ` Bharata B Rao
2026-04-30  7:06       ` Donet Tom
2026-03-23  9:51 ` [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot Bharata B Rao
2026-03-26 10:41   ` Bharata B Rao
2026-03-23  9:51 ` [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
2026-03-30  4:46   ` Bharata B Rao
2026-03-23  9:56 ` [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23  9:58 ` Bharata B Rao
2026-03-23  9:59 ` Bharata B Rao
2026-03-23 10:01 ` Bharata B Rao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=250e68f3-3664-4148-bfbf-52fd4230a3b9@linux.ibm.com \
    --to=donettom@linux.ibm.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=alok.rathore@samsung.com \
    --cc=balbirs@nvidia.com \
    --cc=bharata@amd.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=gourry@gourry.net \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kinseyho@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=nifan.cxl@gmail.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=shivankg@amd.com \
    --cc=sj@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=xuezhengchu@huawei.com \
    --cc=yiannis@zptcorp.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox