From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<dave.hansen@intel.com>, <gourry@gourry.net>,
<hannes@cmpxchg.org>, <mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>
Subject: Re: [RFC PATCH v2 3/8] mm: Hot page tracking and promotion
Date: Fri, 3 Oct 2025 12:17:24 +0100 [thread overview]
Message-ID: <20251003121724.00002e6b@huawei.com> (raw)
In-Reply-To: <20250910144653.212066-4-bharata@amd.com>
On Wed, 10 Sep 2025 20:16:48 +0530
Bharata B Rao <bharata@amd.com> wrote:
> This introduces a sub-system for collecting memory access
> information from different sources. It maintains the hotness
> information based on the access history and time of access.
>
> Additionally, it provides per-lowertier-node kernel threads
> (named kpromoted) that periodically promote the pages that
> are eligible for promotion.
>
> Sub-systems that generate hot page access info can report that
> using this API:
>
> int pghot_record_access(u64 pfn, int nid, int src,
> unsigned long time)
>
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (sub-system) that generated the
> access info
> @time: The access time in jiffies
>
> Some temperature sources may not provide the nid from which
> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. For such sources,
> the default toptier node to which such pages should be promoted
> is hard coded.
>
> Also, the access time provided some sources may at best be
> considered approximate. This is especially true for hot pages
> detected by PTE A bit scanning.
>
> The hot PFN records are stored in hash lists hashed by PFN value.
> The PFN records that are categorized as hot enough to be promoted
> are maintained in a per-lowertier-node max heap from which
> kpromoted extracts and promotes them.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
A fairly superficial review only of this. At some point I'll aim to take a closer
look at the heap bit.
> ---
> include/linux/mmzone.h | 11 +
> include/linux/pghot.h | 96 +++++++
> include/linux/vm_event_item.h | 9 +
> mm/Kconfig | 11 +
> mm/Makefile | 1 +
> mm/mm_init.c | 10 +
> mm/pghot.c | 524 ++++++++++++++++++++++++++++++++++
> mm/vmstat.c | 9 +
> 8 files changed, 671 insertions(+)
> create mode 100644 include/linux/pghot.h
> create mode 100644 mm/pghot.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0c5da9141983..f7094babed10 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> diff --git a/include/linux/pghot.h b/include/linux/pghot.h
> new file mode 100644
> index 000000000000..1443643aab13
> --- /dev/null
> +++ b/include/linux/pghot.h
> +
> +struct pghot_info {
> + unsigned long pfn;
> +
> + /*
> + * The following three fundamental parameters
> + * required to track the hotness of page/PFN are
> + * packed within a single u32.
> + */
> + u32 frequency:PGHOT_FREQ_BITS; /* Number of accesses within current window */
> + u32 nid:PGHOT_NID_BITS; /* Most recent access from this node */
> + u32 last_update:PGHOT_TIME_BITS; /* Most recent access time */
Add spaces around : I think to help the eye parse those.
> +
> + struct hlist_node hnode;
> + size_t heap_idx; /* Position in max heap for quick retreival */
> +};
> +
> +struct max_heap {
> + size_t nr;
> + size_t size;
> + struct pghot_info **data;
> + DECLARE_FLEX_ARRAY(struct pghot_info *, preallocated);
That macro is all about use in unions rather than generally being needed.
Do you need that here rather than
struct pg_hot_info *preallocated[];
Can you add a __counted_by() marking?
> +};
> diff --git a/mm/pghot.c b/mm/pghot.c
> new file mode 100644
> index 000000000000..9f7581818b8f
> --- /dev/null
> +++ b/mm/pghot.c
> @@ -0,0 +1,524 @@
> +
> +static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
> +{
> + struct page *page = pfn_to_page(phi->pfn);
> + struct folio *folio;
> +
> + if (!page)
> + return NULL;
> +
> + folio = page_folio(page);
> + if (migrate_misplaced_folio_prepare(folio, NULL, phi->nid))
> + return NULL;
> + else
else not needed.
> + return folio;
> +}
> +static int phi_heap_extract(pg_data_t *pgdat, int batch_count, int freq_th,
> + struct list_head *migrate_list, int *count)
> +{
> + spinlock_t *phi_heap_lock = &pgdat->heap_lock;
> + struct max_heap *phi_heap = &pgdat->heap;
> + int max_retries = 10;
> + int bkt, i = 0;
> +
> + if (batch_count < 0 || !migrate_list || !count || freq_th < 1 ||
> + freq_th > KPROMOTED_FREQ_THRESHOLD)
> + return -EINVAL;
> +
> + *count = 0;
> + for (i = 0; i < batch_count; i++) {
> + struct pghot_info *top = NULL;
> + bool should_continue = false;
> + struct folio *folio;
> + int retries = 0;
> +
> + while (retries < max_retries) {
> + spin_lock(phi_heap_lock);
> + if (phi_heap->nr > 0 && phi_heap->data[0]->frequency >= freq_th) {
> + should_continue = true;
> + bkt = hash_min(phi_heap->data[0]->pfn, phi_hash_order);
> + top = phi_heap->data[0];
> + }
> + spin_unlock(phi_heap_lock);
> +
> + if (!should_continue)
> + goto done;
> +
> + spin_lock(&phi_hash[bkt].lock);
> + spin_lock(phi_heap_lock);
> + if (phi_heap->nr == 0 || phi_heap->data[0] != top ||
> + phi_heap->data[0]->frequency < freq_th) {
> + spin_unlock(phi_heap_lock);
> + spin_unlock(&phi_hash[bkt].lock);
> + retries++;
> + continue;
> + }
> +
> + top = phi_heap->data[0];
> + hlist_del_init(&top->hnode);
> +
> + phi_heap->nr--;
> + if (phi_heap->nr > 0) {
> + phi_heap->data[0] = phi_heap->data[phi_heap->nr];
> + phi_heap->data[0]->heap_idx = 0;
> + min_heap_sift_down(phi_heap, 0, &phi_heap_cb,
> + phi_heap->data);
> + }
> +
> + spin_unlock(phi_heap_lock);
> + spin_unlock(&phi_hash[bkt].lock);
> +
> + if (!phi_is_pfn_hot(top)) {
> + count_vm_event(KPROMOTED_DROPPED);
> + goto skip;
> + }
> +
> + folio = kpromoted_isolate_folio(top);
> + if (folio) {
> + list_add(&folio->lru, migrate_list);
> + (*count)++;
> + }
> +skip:
> + phi_free(top);
> + break;
> + }
> + if (retries >= max_retries) {
> + pr_warn("%s: Too many retries\n", __func__);
> + break;
> + }
> +
> + }
> +done:
If that is all there is, I'd use an early return as tends to give
simpler code.
> + return 0;
> +}
> +
> +static void phi_heap_add_or_adjust(struct pghot_info *phi)
> +{
> + pg_data_t *pgdat = NODE_DATA(phi->nid);
> + struct max_heap *phi_heap = &pgdat->heap;
> +
> + spin_lock(&pgdat->heap_lock);
guard() perhaps.
> + if (phi->heap_idx >= 0 && phi->heap_idx < phi_heap->nr &&
> + phi_heap->data[phi->heap_idx] == phi) {
> + /* Entry exists in heap */
> + if (phi->frequency < KPROMOTED_FREQ_THRESHOLD) {
> + /* Below threshold, remove from the heap */
> + phi_heap->nr--;
> + if (phi->heap_idx < phi_heap->nr) {
> + phi_heap->data[phi->heap_idx] =
> + phi_heap->data[phi_heap->nr];
> + phi_heap->data[phi->heap_idx]->heap_idx =
> + phi->heap_idx;
> + min_heap_sift_down(phi_heap, phi->heap_idx,
> + &phi_heap_cb, phi_heap->data);
> + }
> + phi->heap_idx = -1;
> +
> + } else {
> + /* Update position in heap */
> + phi_heap_update_entry(phi_heap, phi);
> + }
> + } else if (phi->frequency >= KPROMOTED_FREQ_THRESHOLD) {
> + /*
> + * Add to the heap. If heap is full we will have
> + * to wait for the next access reporting to elevate
> + * it to heap.
> + */
> + if (phi_heap_insert(phi_heap, phi))
> + count_vm_event(PGHOT_RECORDS_HEAP);
> + }
> + spin_unlock(&pgdat->heap_lock);
> +}
> +
> +/*
> + * Called by subsystems that generate page hotness/access information.
> + *
> + * @pfn: The PFN of the memory accessed
> + * @nid: The accessing NUMA node ID
> + * @src: The temperature source (sub-system) that generated the
> + * access info
> + * @time: The access time in jiffies
> + *
> + * Maintains the access records per PFN, classifies them as
> + * hot based on subsequent accesses and finally hands over
> + * them to kpromoted for migration.
> + */
> +int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
> +{
> + struct pghot_info *phi;
> + struct page *page;
> + struct folio *folio;
> + int bkt;
> + bool new_entry = false, new_window = false;
> + u32 cur_time = now & PGHOT_TIME_MASK;
> +
> + if (!kpromoted_started)
> + return -EINVAL;
> +
> + if (nid >= PGHOT_NID_MAX)
> + return -EINVAL;
> +
> + count_vm_event(PGHOT_RECORDED_ACCESSES);
> +
> + switch (src) {
> + case PGHOT_HW_HINTS:
> + count_vm_event(PGHOT_RECORD_HWHINTS);
> + break;
> + case PGHOT_PGTABLE_SCAN:
> + count_vm_event(PGHOT_RECORD_PGTSCANS);
> + break;
> + case PGHOT_HINT_FAULT:
> + count_vm_event(PGHOT_RECORD_HINTFAULTS);
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + /*
> + * Record only accesses from lower tiers.
> + */
> + if (node_is_toptier(pfn_to_nid(pfn)))
> + return 0;
> +
> + /*
> + * Reject the non-migratable pages right away.
> + */
> + page = pfn_to_online_page(pfn);
> + if (!page || is_zone_device_page(page))
> + return 0;
> +
> + folio = page_folio(page);
> + if (!folio_test_lru(folio))
> + return 0;
> +
> + bkt = hash_min(pfn, phi_hash_order);
> + spin_lock(&phi_hash[bkt].lock);
If this doesn't get more complex later, use guard() and early returns on error.
> + phi = phi_lookup(pfn, bkt);
> + if (!phi) {
> + phi = phi_alloc(pfn);
> + if (!phi)
> + goto out;
Not an error? Add a comment on why not perhaps.
> + new_entry = true;
> + }
> +
> + /*
> + * If the previous access was beyond the threshold window
> + * start frequency tracking afresh.
> + */
> + if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
> + (nid != NUMA_NO_NODE && phi->nid != nid))
> + new_window = true;
> +
> + if (new_entry || new_window) {
> + /* New window */
> + phi->frequency = 1; /* TODO: Factor in the history */
> + } else if (phi->frequency < PGHOT_FREQ_MAX)
> + phi->frequency++;
> + phi->last_update = cur_time;
> + phi->nid = (nid == NUMA_NO_NODE) ? KPROMOTED_DEFAULT_NODE : nid;
> +
> + if (new_entry) {
> + /* Insert the new entry into hash table */
> + hlist_add_head(&phi->hnode, &phi_hash[bkt].hash);
> + count_vm_event(PGHOT_RECORDS_HASH);
> + } else {
> + /* Add/update the position in heap */
> + phi_heap_add_or_adjust(phi);
> + }
> +out:
> + spin_unlock(&phi_hash[bkt].lock);
> + return 0;
> +}
> +
> +/*
> + * Extract the hot page records and batch-migrate the
> + * hot pages.
Wrap comments to 80 chars.
> + */
> +static void kpromoted_migrate(pg_data_t *pgdat)
> +{
> + int count, ret;
> + LIST_HEAD(migrate_list);
> +
> + /*
> + * Extract the top N elements from the heap that match
> + * the requested hotness threshold.
> + *
> + * PFNs ineligible from migration standpoint are removed
> + * from the heap and hash.
> + *
> + * Folios eligible for migration are isolated and returned
> + * in @migrate_list.
> + */
> + ret = phi_heap_extract(pgdat, KPROMOTED_MIGRATE_BATCH,
> + KPROMOTED_FREQ_THRESHOLD, &migrate_list, &count);
> + if (ret)
> + return;
> +
> + if (!list_empty(&migrate_list))
> + migrate_misplaced_folios_batch(&migrate_list, pgdat->node_id);
> +}
> +
> +static int kpromoted(void *p)
> +{
> + pg_data_t *pgdat = (pg_data_t *)p;
Cast not needed.
pg_data_t *pgdat = p;
> +
> + while (!kthread_should_stop()) {
> + wait_event_timeout(pgdat->kpromoted_wait, false,
> + msecs_to_jiffies(KPROMOTE_DELAY));
> + kpromoted_migrate(pgdat);
> + }
> + return 0;
> +}
> +
> +static int kpromoted_run(int nid)
> +{
> + pg_data_t *pgdat = NODE_DATA(nid);
> + int ret = 0;
> +
> + if (!node_is_toptier(nid))
> + return 0;
> +
> + if (!pgdat->phi_buf) {
> + pgdat->phi_buf = vzalloc_node(phi_heap_entries * sizeof(struct pghot_info *),
> + nid);
I'd use sizeof(*pgdat->phi_buf) here to avoid need to check types match when reading the
code. Sadly there isn't a vcalloc_node().
> + if (!pgdat->phi_buf)
> + return -ENOMEM;
> +
> + min_heap_init(&pgdat->heap, pgdat->phi_buf, phi_heap_entries);
> + spin_lock_init(&pgdat->heap_lock);
> + }
> +
> + if (!pgdat->kpromoted)
> + pgdat->kpromoted = kthread_create_on_node(kpromoted, pgdat, nid,
> + "kpromoted%d", nid);
> + if (IS_ERR(pgdat->kpromoted)) {
> + ret = PTR_ERR(pgdat->kpromoted);
> + pgdat->kpromoted = NULL;
> + pr_info("Failed to start kpromoted%d, ret %d\n", nid, ret);
Unless there is going to be more in later patches that prevents it. Just
return here.
> + } else {
> + wake_up_process(pgdat->kpromoted);
> + }
> + return ret;
return 0; //after change suggested above.
> +}
> +
> +/*
> + * TODO: Handle cleanup during node offline.
> + */
> +static int __init pghot_init(void)
> +{
> + unsigned int hash_size;
> + size_t hash_entries;
> + size_t nr_pages = 0;
> + pg_data_t *pgdat;
> + int i, nid, ret;
> +
> + /*
> + * Arrive at the hash and heap sizes based on the
> + * number of pages present in the lower tier nodes.
Trivial: Wrap closer to 80 chars.
> + */
> + for_each_node_state(nid, N_MEMORY) {
> + if (!node_is_toptier(nid))
> + nr_pages += NODE_DATA(nid)->node_present_pages;
> + }
> +
> + if (!nr_pages)
> + return 0;
> +
> + hash_entries = nr_pages * PGHOT_HASH_PCT / 100;
> + hash_size = hash_entries / PGHOT_HASH_ENTRIES;
> + phi_hash_order = ilog2(hash_size);
> +
> + phi_hash = vmalloc(sizeof(struct pghot_hash) * hash_size);
Prefer sizeof(*phy_hash) so I don't need to check types match :)
vcalloc() probably appropriate here.
> + if (!phi_hash) {
> + ret = -ENOMEM;
> + goto out;
Out label isn't clearly an 'error' which is a little confusing.
> + }
> +
> + for (i = 0; i < hash_size; i++) {
> + INIT_HLIST_HEAD(&phi_hash[i].hash);
> + spin_lock_init(&phi_hash[i].lock);
> + }
> +
> + phi_cache = KMEM_CACHE(pghot_info, 0);
> + if (unlikely(!phi_cache)) {
> + ret = -ENOMEM;
> + goto out;
Whilst not strictly necessary I'd add multiple labels so only things
that have been allocated are free rather than relying on them being
NULL otherwise. Whilst not a correctness thing it makes it slightly
easier to check tear down paths are correct.
> + }
> +
> + phi_heap_entries = hash_entries * PGHOT_HEAP_PCT / 100;
> + for_each_node_state(nid, N_CPU) {
> + ret = kpromoted_run(nid);
> + if (ret)
> + goto out_stop_kthread;
> + }
> +
> + register_sysctl_init("vm", pghot_sysctls);
> + kpromoted_started = true;
> + pr_info("pghot: Started page hotness monitoring and promotion thread\n");
> + pr_info("pghot: nr_pages %ld hash_size %d hash_entries %ld hash_order %d heap_entries %d\n",
> + nr_pages, hash_size, hash_entries, phi_hash_order, phi_heap_entries);
> + return 0;
> +
> +out_stop_kthread:
> + for_each_node_state(nid, N_CPU) {
> + pgdat = NODE_DATA(nid);
> + if (pgdat->kpromoted) {
> + kthread_stop(pgdat->kpromoted);
> + pgdat->kpromoted = NULL;
> + vfree(pgdat->phi_buf);
> + }
> + }
> +out:
> + kmem_cache_destroy(phi_cache);
> + vfree(phi_hash);
> + return ret;
> +}
> +
> +late_initcall(pghot_init)
next prev parent reply other threads:[~2025-10-03 11:17 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36 ` Jonathan Cameron
2025-10-03 11:02 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17 ` Jonathan Cameron [this message]
2025-10-06 4:13 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19 ` Jonathan Cameron
2025-10-06 4:28 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38 ` Jonathan Cameron
2025-10-06 5:57 ` Bharata B Rao
2025-10-06 9:53 ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01 ` Gregory Price
2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
2025-09-17 0:30 ` Wei Xu
2025-09-17 3:20 ` Balbir Singh
2025-09-17 4:15 ` Bharata B Rao
2025-09-17 16:49 ` Jonathan Cameron
2025-09-25 14:03 ` Yiannis Nikolakopoulos
2025-09-25 14:41 ` Gregory Price
2025-10-16 11:48 ` Yiannis Nikolakopoulos
2025-09-25 15:00 ` Jonathan Cameron
2025-09-25 15:08 ` Gregory Price
2025-09-25 15:18 ` Gregory Price
2025-09-25 15:24 ` Jonathan Cameron
2025-09-25 16:06 ` Gregory Price
2025-09-25 17:23 ` Jonathan Cameron
2025-09-25 19:02 ` Gregory Price
2025-10-01 7:22 ` Gregory Price
2025-10-17 9:53 ` Yiannis Nikolakopoulos
2025-10-17 14:15 ` Gregory Price
2025-10-17 14:36 ` Jonathan Cameron
2025-10-17 14:59 ` Gregory Price
2025-10-20 14:05 ` Jonathan Cameron
2025-10-21 18:52 ` Gregory Price
2025-10-21 18:57 ` Gregory Price
2025-10-22 9:09 ` Jonathan Cameron
2025-10-22 15:05 ` Gregory Price
2025-10-23 15:29 ` Jonathan Cameron
2025-10-16 16:16 ` Yiannis Nikolakopoulos
2025-10-20 14:23 ` Jonathan Cameron
2025-10-20 15:05 ` Gregory Price
2025-10-08 17:59 ` Vinicius Petrucci
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251003121724.00002e6b@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=alok.rathore@samsung.com \
--cc=balbirs@nvidia.com \
--cc=bharata@amd.com \
--cc=byungchul@sk.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=gourry@gourry.net \
--cc=hannes@cmpxchg.org \
--cc=joshua.hahnjy@gmail.com \
--cc=kinseyho@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mingo@redhat.com \
--cc=nifan.cxl@gmail.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@amd.com \
--cc=riel@surriel.com \
--cc=rientjes@google.com \
--cc=sj@kernel.org \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=xuezhengchu@huawei.com \
--cc=yiannis@zptcorp.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yuanchu@google.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.