From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>
Subject: Re: [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot
Date: Thu, 26 Mar 2026 16:11:57 +0530 [thread overview]
Message-ID: <ce3b51f8-812d-424d-9848-3ba408253c9d@amd.com> (raw)
In-Reply-To: <20260323095104.238982-5-bharata@amd.com>
Replies to review from Shashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 4aeab6aee535..14383bb1d890 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1485,6 +1485,17 @@ config PGHOT
>> This adds 1 byte of metadata overhead per page in lower-tier
>> memory nodes.
>>
>> +config PGHOT_PRECISE
>> + bool "Hot page tracking precision mode"
>> + def_bool n
> This isn't a bug, but having both bool with a prompt and def_bool is
> redundant and can be confusing. Should this use default n instead?
Sure.
>> + depends on PGHOT
>> + help
>> + Enables precision mode for tracking hot pages with pghot sub-system.
>[ ... ]
>> diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c
>> new file mode 100644
>> index 000000000000..9e8007adfff9
>> --- /dev/null
>> +++ b/mm/pghot-precise.c
>> @@ -0,0 +1,81 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * pghot: Precision mode
>> + *
>> + * 4 byte hotness record per PFN (u32)
>> + * NID, time and frequency tracked as part of the record.
>> + */
>> +
>> +#include <linux/pghot.h>
>> +#include <linux/jiffies.h>
>> +
>> +bool pghot_nid_valid(int nid)
>> +{
>> + /*
>> + * TODO: Add node_online() and node_is_toptier() checks?
>> + */
>> + if (nid != NUMA_NO_NODE && (nid < 0 || nid >= PGHOT_NID_MAX))
>> + return false;
> Could this validation allow node IDs that exceed the system's actual
> MAX_NUMNODES?
Unlikely as the nid gets passed down from other sub-systems like NUMA Balancing
etc.
> If PGHOT_NID_MAX is 1023 but the system supports fewer nodes, could passing
> a large node ID bypass this check and cause out-of-bounds accesses in NUMA
> node arrays like node_data or node_states?
> Also, is there an off-by-one regression here? PGHOT_NID_MAX is the maximum
> value that fits in the bitfield, so should this check be nid > PGHOT_NID_MAX
> to allow the maximum node ID?
Yes, will adjust this in the next iteration.
>> +
>> + return true;
>> +}
>> +
>> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
>> +{
>> + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
>> +}
> Does this timestamp wraparound allow cold pages to be improperly promoted?
> If a page is accessed exactly at intervals matching the wraparound of the
> 14-bit time mask (e.g., every 16.384 seconds at HZ=1000), the latency would
> evaluate to 0. Would this bypass the reset of the frequency counter and
> falsely trigger migration?
The hotness timestamp is stored in a limited number of bits (14 bits in precise
mode, 5 bucketed bits in default mode), which means it can only represent a
finite window of time before wrapping around — approximately 16 seconds for
precise mode and 4 seconds for default mode at HZ=1000. Latency calculations
rely on modular subtraction of these truncated timestamps. This produces the
correct elapsed time as long as the true interval between two events is shorter
than the wrap period. When the interval exceeds the wrap period, there could be
approximations.
In practice, this should be acceptable because the system has two independent
gates that a page must pass through before migration actually occurs:
1. Access frequency gate (recording time): pghot_update_record() determines
whether consecutive accesses to a page fall within the same frequency-counting
window. A page must be accessed at least pghot_freq_threshold times within
sysctl_pghot_freq_window (default 5 seconds) to be marked migration-ready.
2. Recency gate (migration time): kmigrated_should_migrate_memory() checks
whether the page's last recorded access is recent enough relative to the current
time before actually initiating migration.
>> +
>> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
>> +{
>> + phi_t freq, old_freq, hotness, old_hotness, old_time;
>> + phi_t time = now & PGHOT_TIME_MASK;
>> +
>> + nid = (nid == NUMA_NO_NODE) ? pghot_target_nid : nid;
>> + old_hotness = READ_ONCE(*phi);
>> +
>> + do {
>> + bool new_window = false;
>> +
>> + hotness = old_hotness;
>> + old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
>> + old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
>> +
>> + if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window)
>> + new_window = true;
> Is there a missing upper bound on the sysctl_pghot_freq_window configuration?
> The time is tracked in a 14-bit field, which represents a maximum latency of
> 16,383 jiffies. If a user configures a window larger than this maximum
> representable latency, could this condition permanently evaluate to false,
> causing sparse accesses to accumulate over arbitrarily long periods?
I can add an upper bound check.
Regards,
Bharata.
next prev parent reply other threads:[~2026-03-26 10:42 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-23 9:50 [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23 9:51 ` [RFC PATCH v6 1/5] mm: migrate: Allow misplaced migration without VMA Bharata B Rao
2026-03-23 9:51 ` [RFC PATCH v6 2/5] mm: migrate: Add migrate_misplaced_folios_batch() Bharata B Rao
2026-03-26 5:50 ` Bharata B Rao
2026-03-23 9:51 ` [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot Bharata B Rao
2026-03-23 9:51 ` [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot Bharata B Rao
2026-03-26 10:41 ` Bharata B Rao [this message]
2026-03-23 9:51 ` [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot Bharata B Rao
2026-03-23 9:56 ` [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-03-23 9:58 ` Bharata B Rao
2026-03-23 9:59 ` Bharata B Rao
2026-03-23 10:01 ` Bharata B Rao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ce3b51f8-812d-424d-9848-3ba408253c9d@amd.com \
--to=bharata@amd.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=alok.rathore@samsung.com \
--cc=balbirs@nvidia.com \
--cc=byungchul@sk.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=gourry@gourry.net \
--cc=joshua.hahnjy@gmail.com \
--cc=kinseyho@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mingo@redhat.com \
--cc=nifan.cxl@gmail.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@amd.com \
--cc=riel@surriel.com \
--cc=rientjes@google.com \
--cc=shivankg@amd.com \
--cc=sj@kernel.org \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=xuezhengchu@huawei.com \
--cc=yiannis@zptcorp.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yuanchu@google.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox