* [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
@ 2026-03-26 7:12 Donet Tom
2026-03-26 10:29 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Donet Tom @ 2026-03-26 7:12 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Ingo Molnar, Peter Zijlstra
Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang,
Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Donet Tom
If memory tiering is disabled, cpupid of slow memory pages may
contain a valid CPU and PID. If tiering is enabled at runtime,
there is a chance that in should_numa_migrate_memory(), this
valid CPU/PID is treated as a last access timestamp, leading
to unnecessary promotion.
Prevent this by skipping promotion when cpupid is valid.
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b43809a3fb1..f5830a5a94d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
unsigned int latency, th, def_th;
long nr = folio_nr_pages(folio);
+ /* When tiering is enabled at runtime, last_cpupid may
+ * hold a valid cpupid instead of an access timestamp.
+ * If so, skip page promotion.
+ */
+ if (cpupid_valid(folio_last_cpupid(folio)))
+ return false;
+
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-26 7:12 [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid Donet Tom
@ 2026-03-26 10:29 ` David Hildenbrand (Arm)
2026-03-27 18:54 ` Donet Tom
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-26 10:29 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Ingo Molnar, Peter Zijlstra
Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang,
Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt
On 3/26/26 08:12, Donet Tom wrote:
> If memory tiering is disabled, cpupid of slow memory pages may
> contain a valid CPU and PID. If tiering is enabled at runtime,
> there is a chance that in should_numa_migrate_memory(), this
> valid CPU/PID is treated as a last access timestamp, leading
> to unnecessary promotion.
Is that measurable? Should we at least have a Fixes: ?
>
> Prevent this by skipping promotion when cpupid is valid.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b43809a3fb1..f5830a5a94d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> unsigned int latency, th, def_th;
> long nr = folio_nr_pages(folio);
>
/*
* When ...
> + /* When tiering is enabled at runtime, last_cpupid may
> + * hold a valid cpupid instead of an access timestamp.
> + * If so, skip page promotion.
> + */
> + if (cpupid_valid(folio_last_cpupid(folio)))
> + return false;
> +
IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
we would no longer get false positives for cpupid_valid().
I suppose overflows are not a problem, correct?
So what we're saying is that folio_use_access_time()==true does not
imply that there is actually a valid time in there.
In numa_migrate_check() we could still use the valid cpuid I guess and
make that code a bit clearer?
diff --git a/mm/memory.c b/mm/memory.c
index 631205a384e1..ba68933a9e4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
*/
- if (folio_use_access_time(folio))
+ *last_cpupid = folio_last_cpupid(folio);
+ if (!cpupid_valid(*last_cpupid))
*last_cpupid = (-1 & LAST_CPUPID_MASK);
- else
- *last_cpupid = folio_last_cpupid(folio);
/* Record the current PID accessing VMA */
vma_set_access_pid_bit(vma);
The change itself here looks reasonable to me.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-26 10:29 ` David Hildenbrand (Arm)
@ 2026-03-27 18:54 ` Donet Tom
2026-03-31 8:33 ` Huang, Ying
0 siblings, 1 reply; 10+ messages in thread
From: Donet Tom @ 2026-03-27 18:54 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar,
Peter Zijlstra
Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang,
Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt
On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
> On 3/26/26 08:12, Donet Tom wrote:
>> If memory tiering is disabled, cpupid of slow memory pages may
>> contain a valid CPU and PID. If tiering is enabled at runtime,
>> there is a chance that in should_numa_migrate_memory(), this
>> valid CPU/PID is treated as a last access timestamp, leading
>> to unnecessary promotion.
> Is that measurable? Should we at least have a Fixes: ?
>
>> Prevent this by skipping promotion when cpupid is valid.
>>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>> kernel/sched/fair.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4b43809a3fb1..f5830a5a94d5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>> unsigned int latency, th, def_th;
>> long nr = folio_nr_pages(folio);
>>
> /*
> * When ...
>
>> + /* When tiering is enabled at runtime, last_cpupid may
>> + * hold a valid cpupid instead of an access timestamp.
>> + * If so, skip page promotion.
>> + */
>> + if (cpupid_valid(folio_last_cpupid(folio)))
>> + return false;
>> +
> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
> we would no longer get false positives for cpupid_valid().
> I suppose overflows are not a problem, correct?
Thank you, David, for guiding me in the right direction.
I initially thought that overflows would not occur, and therefore
cpupid_valid() would not produce false positives. However,
after looking into it further, it appears that overflow can
happen when storing the access time.
The last_cpupid field is used to store the last access time.
From the code, it appears that 21 bits are used for this
(#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
With 21 bits, the maximum value that can be stored is
2097151ms (35Hrs) . If the access time exceeds this
range, it can overflow, which may lead to cpupid_valid()
returning false positives.
I think we need a reliable way to determine cpupid_valid() that
does not produce false positives.
-Donet
>
> So what we're saying is that folio_use_access_time()==true does not
> imply that there is actually a valid time in there.
>
> In numa_migrate_check() we could still use the valid cpuid I guess and
> make that code a bit clearer?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 631205a384e1..ba68933a9e4a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> * For memory tiering mode, cpupid of slow memory page is used
> * to record page access time. So use default value.
> */
> - if (folio_use_access_time(folio))
> + *last_cpupid = folio_last_cpupid(folio);
> + if (!cpupid_valid(*last_cpupid))
> *last_cpupid = (-1 & LAST_CPUPID_MASK);
> - else
> - *last_cpupid = folio_last_cpupid(folio);
>
> /* Record the current PID accessing VMA */
> vma_set_access_pid_bit(vma);
>
>
> The change itself here looks reasonable to me.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-27 18:54 ` Donet Tom
@ 2026-03-31 8:33 ` Huang, Ying
2026-03-31 9:03 ` Donet Tom
2026-03-31 10:04 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 10+ messages in thread
From: Huang, Ying @ 2026-03-31 8:33 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar,
Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel,
Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt
Hi, Donet,
Donet Tom <donettom@linux.ibm.com> writes:
> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>> On 3/26/26 08:12, Donet Tom wrote:
>>> If memory tiering is disabled, cpupid of slow memory pages may
>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>> there is a chance that in should_numa_migrate_memory(), this
>>> valid CPU/PID is treated as a last access timestamp, leading
>>> to unnecessary promotion.
>> Is that measurable? Should we at least have a Fixes: ?
>>
>>> Prevent this by skipping promotion when cpupid is valid.
>>>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>> kernel/sched/fair.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>> unsigned int latency, th, def_th;
>>> long nr = folio_nr_pages(folio);
>>>
>> /*
>> * When ...
>>
>>> + /* When tiering is enabled at runtime, last_cpupid may
>>> + * hold a valid cpupid instead of an access timestamp.
>>> + * If so, skip page promotion.
>>> + */
>>> + if (cpupid_valid(folio_last_cpupid(folio)))
>>> + return false;
>>> +
>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>> we would no longer get false positives for cpupid_valid().
>> I suppose overflows are not a problem, correct?
>
> Thank you, David, for guiding me in the right direction.
>
> I initially thought that overflows would not occur, and therefore
> cpupid_valid() would not produce false positives. However,
> after looking into it further, it appears that overflow can
> happen when storing the access time.
>
> The last_cpupid field is used to store the last access time.
> From the code, it appears that 21 bits are used for this
> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>
> With 21 bits, the maximum value that can be stored is
It can be less than 21 bits, if CONFIG_NR_CPUS is small.
DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
> 2097151ms (35Hrs) . If the access time exceeds this
> range, it can overflow, which may lead to cpupid_valid()
> returning false positives.
>
> I think we need a reliable way to determine cpupid_valid() that
> does not produce false positives.
Yes. IMHO, false positives is unavoidable. So, the patch fixes a
temporal performance issue at the cost of a longstanding performance
issue. Right?
---
Best Regards,
Huang, Ying
>
>>
>> So what we're saying is that folio_use_access_time()==true does not
>> imply that there is actually a valid time in there.
>>
>> In numa_migrate_check() we could still use the valid cpuid I guess and
>> make that code a bit clearer?
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 631205a384e1..ba68933a9e4a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>> * For memory tiering mode, cpupid of slow memory page is used
>> * to record page access time. So use default value.
>> */
>> - if (folio_use_access_time(folio))
>> + *last_cpupid = folio_last_cpupid(folio);
>> + if (!cpupid_valid(*last_cpupid))
>> *last_cpupid = (-1 & LAST_CPUPID_MASK);
>> - else
>> - *last_cpupid = folio_last_cpupid(folio);
>> /* Record the current PID accessing VMA */
>> vma_set_access_pid_bit(vma);
>>
>>
>> The change itself here looks reasonable to me.
>>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 8:33 ` Huang, Ying
@ 2026-03-31 9:03 ` Donet Tom
2026-03-31 9:17 ` Huang, Ying
2026-03-31 10:04 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 10+ messages in thread
From: Donet Tom @ 2026-03-31 9:03 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar,
Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel,
Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt
Hi
On 3/31/26 2:03 PM, Huang, Ying wrote:
> Hi, Donet,
>
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>> On 3/26/26 08:12, Donet Tom wrote:
>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>> there is a chance that in should_numa_migrate_memory(), this
>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>> to unnecessary promotion.
>>> Is that measurable? Should we at least have a Fixes: ?
>>>
>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> ---
>>>> kernel/sched/fair.c | 7 +++++++
>>>> 1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>> unsigned int latency, th, def_th;
>>>> long nr = folio_nr_pages(folio);
>>>>
>>> /*
>>> * When ...
>>>
>>>> + /* When tiering is enabled at runtime, last_cpupid may
>>>> + * hold a valid cpupid instead of an access timestamp.
>>>> + * If so, skip page promotion.
>>>> + */
>>>> + if (cpupid_valid(folio_last_cpupid(folio)))
>>>> + return false;
>>>> +
>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>> we would no longer get false positives for cpupid_valid().
>>> I suppose overflows are not a problem, correct?
>> Thank you, David, for guiding me in the right direction.
>>
>> I initially thought that overflows would not occur, and therefore
>> cpupid_valid() would not produce false positives. However,
>> after looking into it further, it appears that overflow can
>> happen when storing the access time.
>>
>> The last_cpupid field is used to store the last access time.
>> From the code, it appears that 21 bits are used for this
>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>
>> With 21 bits, the maximum value that can be stored is
> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>
> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>
>> 2097151ms (35Hrs) . If the access time exceeds this
>> range, it can overflow, which may lead to cpupid_valid()
>> returning false positives.
>>
>> I think we need a reliable way to determine cpupid_valid() that
>> does not produce false positives.
> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
> temporal performance issue at the cost of a longstanding performance
> issue. Right?
I was trying to fix a functional issue. When memory tiering is
enabled at runtime, treating last_cpupid as access time is incorrect, right?
-Donet
> ---
> Best Regards,
> Huang, Ying
>
>>> So what we're saying is that folio_use_access_time()==true does not
>>> imply that there is actually a valid time in there.
>>>
>>> In numa_migrate_check() we could still use the valid cpuid I guess and
>>> make that code a bit clearer?
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 631205a384e1..ba68933a9e4a 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>>> * For memory tiering mode, cpupid of slow memory page is used
>>> * to record page access time. So use default value.
>>> */
>>> - if (folio_use_access_time(folio))
>>> + *last_cpupid = folio_last_cpupid(folio);
>>> + if (!cpupid_valid(*last_cpupid))
>>> *last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> - else
>>> - *last_cpupid = folio_last_cpupid(folio);
>>> /* Record the current PID accessing VMA */
>>> vma_set_access_pid_bit(vma);
>>>
>>>
>>> The change itself here looks reasonable to me.
>>>
>>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 9:03 ` Donet Tom
@ 2026-03-31 9:17 ` Huang, Ying
2026-03-31 10:00 ` Donet Tom
0 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2026-03-31 9:17 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar,
Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel,
Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt
Donet Tom <donettom@linux.ibm.com> writes:
> Hi
>
> On 3/31/26 2:03 PM, Huang, Ying wrote:
>> Hi, Donet,
>>
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> On 3/26/26 08:12, Donet Tom wrote:
>>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>>> there is a chance that in should_numa_migrate_memory(), this
>>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>>> to unnecessary promotion.
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>>
>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>> ---
>>>>> kernel/sched/fair.c | 7 +++++++
>>>>> 1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>>> unsigned int latency, th, def_th;
>>>>> long nr = folio_nr_pages(folio);
>>>>>
>>>> /*
>>>> * When ...
>>>>
>>>>> + /* When tiering is enabled at runtime, last_cpupid may
>>>>> + * hold a valid cpupid instead of an access timestamp.
>>>>> + * If so, skip page promotion.
>>>>> + */
>>>>> + if (cpupid_valid(folio_last_cpupid(folio)))
>>>>> + return false;
>>>>> +
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>> From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>
>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue. Right?
>
>
> I was trying to fix a functional issue. When memory tiering is
>
> enabled at runtime, treating last_cpupid as access time is incorrect, right?
I don't think that it's a functional issue. It has only performance
impact. Did you find any functionality bug?
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 9:17 ` Huang, Ying
@ 2026-03-31 10:00 ` Donet Tom
0 siblings, 0 replies; 10+ messages in thread
From: Donet Tom @ 2026-03-31 10:00 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar,
Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel,
Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt
On 3/31/26 2:47 PM, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> Hi
>>
>> On 3/31/26 2:03 PM, Huang, Ying wrote:
>>> Hi, Donet,
>>>
>>> Donet Tom <donettom@linux.ibm.com> writes:
>>>
>>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>>> On 3/26/26 08:12, Donet Tom wrote:
>>>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>>>> there is a chance that in should_numa_migrate_memory(), this
>>>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>>>> to unnecessary promotion.
>>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>>
>>>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>>>
>>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 7 +++++++
>>>>>> 1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>>>> unsigned int latency, th, def_th;
>>>>>> long nr = folio_nr_pages(folio);
>>>>>>
>>>>> /*
>>>>> * When ...
>>>>>
>>>>>> + /* When tiering is enabled at runtime, last_cpupid may
>>>>>> + * hold a valid cpupid instead of an access timestamp.
>>>>>> + * If so, skip page promotion.
>>>>>> + */
>>>>>> + if (cpupid_valid(folio_last_cpupid(folio)))
>>>>>> + return false;
>>>>>> +
>>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>>> we would no longer get false positives for cpupid_valid().
>>>>> I suppose overflows are not a problem, correct?
>>>> Thank you, David, for guiding me in the right direction.
>>>>
>>>> I initially thought that overflows would not occur, and therefore
>>>> cpupid_valid() would not produce false positives. However,
>>>> after looking into it further, it appears that overflow can
>>>> happen when storing the access time.
>>>>
>>>> The last_cpupid field is used to store the last access time.
>>>> From the code, it appears that 21 bits are used for this
>>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>>
>>>> With 21 bits, the maximum value that can be stored is
>>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>>
>>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>>
>>>> 2097151ms (35Hrs) . If the access time exceeds this
>>>> range, it can overflow, which may lead to cpupid_valid()
>>>> returning false positives.
>>>>
>>>> I think we need a reliable way to determine cpupid_valid() that
>>>> does not produce false positives.
>>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
>>> temporal performance issue at the cost of a longstanding performance
>>> issue. Right?
>>
>> I was trying to fix a functional issue. When memory tiering is
>>
>> enabled at runtime, treating last_cpupid as access time is incorrect, right?
> I don't think that it's a functional issue. It has only performance
> impact. Did you find any functionality bug?
>
Thank you for the confirmation. I thought this was a functional issue.
In that case, we can drop this patch.
-Donet
>
> ---
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 8:33 ` Huang, Ying
2026-03-31 9:03 ` Donet Tom
@ 2026-03-31 10:04 ` David Hildenbrand (Arm)
2026-03-31 15:02 ` Donet Tom
2026-04-01 9:48 ` Huang, Ying
1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-31 10:04 UTC (permalink / raw)
To: Huang, Ying, Donet Tom
Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani,
linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli,
Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt
On 3/31/26 10:33, Huang, Ying wrote:
> Hi, Donet,
>
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>> Is that measurable? Should we at least have a Fixes: ?
>>>
>>> /*
>>> * When ...
>>>
>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>> we would no longer get false positives for cpupid_valid().
>>> I suppose overflows are not a problem, correct?
>>
>> Thank you, David, for guiding me in the right direction.
>>
>> I initially thought that overflows would not occur, and therefore
>> cpupid_valid() would not produce false positives. However,
>> after looking into it further, it appears that overflow can
>> happen when storing the access time.
>>
>> The last_cpupid field is used to store the last access time.
>> From the code, it appears that 21 bits are used for this
>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>
>> With 21 bits, the maximum value that can be stored is
>
> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>
> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>
>> 2097151ms (35Hrs) . If the access time exceeds this
>> range, it can overflow, which may lead to cpupid_valid()
>> returning false positives.
>>
>> I think we need a reliable way to determine cpupid_valid() that
>> does not produce false positives.
>
> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
> temporal performance issue at the cost of a longstanding performance
> issue. Right?
Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
bit for time, to we care?
Would make it all easier to get ...
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 10:04 ` David Hildenbrand (Arm)
@ 2026-03-31 15:02 ` Donet Tom
2026-04-01 9:48 ` Huang, Ying
1 sibling, 0 replies; 10+ messages in thread
From: Donet Tom @ 2026-03-31 15:02 UTC (permalink / raw)
To: David Hildenbrand (Arm), Huang, Ying
Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani,
linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli,
Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt
On 3/31/26 3:34 PM, David Hildenbrand (Arm) wrote:
> On 3/31/26 10:33, Huang, Ying wrote:
>> Hi, Donet,
>>
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>> /*
>>>> * When ...
>>>>
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>> From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>
>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue. Right?
>
> Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
> bit for time, to we care?
Thank you, David, for the input. This sounds like a good idea—I'll give
it a try.
-Donet
>
> Would make it all easier to get ...
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid
2026-03-31 10:04 ` David Hildenbrand (Arm)
2026-03-31 15:02 ` Donet Tom
@ 2026-04-01 9:48 ` Huang, Ying
1 sibling, 0 replies; 10+ messages in thread
From: Huang, Ying @ 2026-04-01 9:48 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Donet Tom, Andrew Morton, Ingo Molnar, Peter Zijlstra,
Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang,
Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 3/31/26 10:33, Huang, Ying wrote:
>> Hi, Donet,
>>
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>> /*
>>>> * When ...
>>>>
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>>
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>> From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>>
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>
>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>>
>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue. Right?
>
>
> Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
> bit for time, to we care?
Do we need one more bit for time and cpupid? However, page flags are
precious resources.
> Would make it all easier to get ...
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-04-01 9:48 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-26 7:12 [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid Donet Tom
2026-03-26 10:29 ` David Hildenbrand (Arm)
2026-03-27 18:54 ` Donet Tom
2026-03-31 8:33 ` Huang, Ying
2026-03-31 9:03 ` Donet Tom
2026-03-31 9:17 ` Huang, Ying
2026-03-31 10:00 ` Donet Tom
2026-03-31 10:04 ` David Hildenbrand (Arm)
2026-03-31 15:02 ` Donet Tom
2026-04-01 9:48 ` Huang, Ying
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox