* [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid @ 2026-03-26 7:12 Donet Tom 2026-03-26 10:29 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 10+ messages in thread From: Donet Tom @ 2026-03-26 7:12 UTC (permalink / raw) To: David Hildenbrand, Andrew Morton, Ingo Molnar, Peter Zijlstra Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Donet Tom If memory tiering is disabled, cpupid of slow memory pages may contain a valid CPU and PID. If tiering is enabled at runtime, there is a chance that in should_numa_migrate_memory(), this valid CPU/PID is treated as a last access timestamp, leading to unnecessary promotion. Prevent this by skipping promotion when cpupid is valid. Signed-off-by: Donet Tom <donettom@linux.ibm.com> --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4b43809a3fb1..f5830a5a94d5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, unsigned int latency, th, def_th; long nr = folio_nr_pages(folio); + /* When tiering is enabled at runtime, last_cpupid may + * hold a valid cpupid instead of an access timestamp. + * If so, skip page promotion. + */ + if (cpupid_valid(folio_last_cpupid(folio))) + return false; + pgdat = NODE_DATA(dst_nid); if (pgdat_free_space_enough(pgdat)) { /* workload changed, reset hot threshold */ -- 2.47.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-26 7:12 [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid Donet Tom @ 2026-03-26 10:29 ` David Hildenbrand (Arm) 2026-03-27 18:54 ` Donet Tom 0 siblings, 1 reply; 10+ messages in thread From: David Hildenbrand (Arm) @ 2026-03-26 10:29 UTC (permalink / raw) To: Donet Tom, Andrew Morton, Ingo Molnar, Peter Zijlstra Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 3/26/26 08:12, Donet Tom wrote: > If memory tiering is disabled, cpupid of slow memory pages may > contain a valid CPU and PID. If tiering is enabled at runtime, > there is a chance that in should_numa_migrate_memory(), this > valid CPU/PID is treated as a last access timestamp, leading > to unnecessary promotion. Is that measurable? Should we at least have a Fixes: ? > > Prevent this by skipping promotion when cpupid is valid. > > Signed-off-by: Donet Tom <donettom@linux.ibm.com> > --- > kernel/sched/fair.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 4b43809a3fb1..f5830a5a94d5 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, > unsigned int latency, th, def_th; > long nr = folio_nr_pages(folio); > /* * When ... > + /* When tiering is enabled at runtime, last_cpupid may > + * hold a valid cpupid instead of an access timestamp. > + * If so, skip page promotion. > + */ > + if (cpupid_valid(folio_last_cpupid(folio))) > + return false; > + IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, we would no longer get false positives for cpupid_valid(). I suppose overflows are not a problem, correct? So what we're saying is that folio_use_access_time()==true does not imply that there is actually a valid time in there. In numa_migrate_check() we could still use the valid cpuid I guess and make that code a bit clearer? diff --git a/mm/memory.c b/mm/memory.c index 631205a384e1..ba68933a9e4a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. */ - if (folio_use_access_time(folio)) + *last_cpupid = folio_last_cpupid(folio); + if (!cpupid_valid(*last_cpupid)) *last_cpupid = (-1 & LAST_CPUPID_MASK); - else - *last_cpupid = folio_last_cpupid(folio); /* Record the current PID accessing VMA */ vma_set_access_pid_bit(vma); The change itself here looks reasonable to me. Acked-by: David Hildenbrand (Arm) <david@kernel.org> -- Cheers, David ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-26 10:29 ` David Hildenbrand (Arm) @ 2026-03-27 18:54 ` Donet Tom 2026-03-31 8:33 ` Huang, Ying 0 siblings, 1 reply; 10+ messages in thread From: Donet Tom @ 2026-03-27 18:54 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar, Peter Zijlstra Cc: Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: > On 3/26/26 08:12, Donet Tom wrote: >> If memory tiering is disabled, cpupid of slow memory pages may >> contain a valid CPU and PID. If tiering is enabled at runtime, >> there is a chance that in should_numa_migrate_memory(), this >> valid CPU/PID is treated as a last access timestamp, leading >> to unnecessary promotion. > Is that measurable? Should we at least have a Fixes: ? > >> Prevent this by skipping promotion when cpupid is valid. >> >> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >> --- >> kernel/sched/fair.c | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 4b43809a3fb1..f5830a5a94d5 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >> unsigned int latency, th, def_th; >> long nr = folio_nr_pages(folio); >> > /* > * When ... > >> + /* When tiering is enabled at runtime, last_cpupid may >> + * hold a valid cpupid instead of an access timestamp. >> + * If so, skip page promotion. >> + */ >> + if (cpupid_valid(folio_last_cpupid(folio))) >> + return false; >> + > IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, > we would no longer get false positives for cpupid_valid(). > I suppose overflows are not a problem, correct? Thank you, David, for guiding me in the right direction. I initially thought that overflows would not occur, and therefore cpupid_valid() would not produce false positives. However, after looking into it further, it appears that overflow can happen when storing the access time. The last_cpupid field is used to store the last access time. From the code, it appears that 21 bits are used for this (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). With 21 bits, the maximum value that can be stored is 2097151ms (35Hrs) . If the access time exceeds this range, it can overflow, which may lead to cpupid_valid() returning false positives. I think we need a reliable way to determine cpupid_valid() that does not produce false positives. -Donet > > So what we're saying is that folio_use_access_time()==true does not > imply that there is actually a valid time in there. > > In numa_migrate_check() we could still use the valid cpuid I guess and > make that code a bit clearer? > > diff --git a/mm/memory.c b/mm/memory.c > index 631205a384e1..ba68933a9e4a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, > * For memory tiering mode, cpupid of slow memory page is used > * to record page access time. So use default value. > */ > - if (folio_use_access_time(folio)) > + *last_cpupid = folio_last_cpupid(folio); > + if (!cpupid_valid(*last_cpupid)) > *last_cpupid = (-1 & LAST_CPUPID_MASK); > - else > - *last_cpupid = folio_last_cpupid(folio); > > /* Record the current PID accessing VMA */ > vma_set_access_pid_bit(vma); > > > The change itself here looks reasonable to me. > > Acked-by: David Hildenbrand (Arm) <david@kernel.org> > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-27 18:54 ` Donet Tom @ 2026-03-31 8:33 ` Huang, Ying 2026-03-31 9:03 ` Donet Tom 2026-03-31 10:04 ` David Hildenbrand (Arm) 0 siblings, 2 replies; 10+ messages in thread From: Huang, Ying @ 2026-03-31 8:33 UTC (permalink / raw) To: Donet Tom Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt Hi, Donet, Donet Tom <donettom@linux.ibm.com> writes: > On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >> On 3/26/26 08:12, Donet Tom wrote: >>> If memory tiering is disabled, cpupid of slow memory pages may >>> contain a valid CPU and PID. If tiering is enabled at runtime, >>> there is a chance that in should_numa_migrate_memory(), this >>> valid CPU/PID is treated as a last access timestamp, leading >>> to unnecessary promotion. >> Is that measurable? Should we at least have a Fixes: ? >> >>> Prevent this by skipping promotion when cpupid is valid. >>> >>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>> --- >>> kernel/sched/fair.c | 7 +++++++ >>> 1 file changed, 7 insertions(+) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 4b43809a3fb1..f5830a5a94d5 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>> unsigned int latency, th, def_th; >>> long nr = folio_nr_pages(folio); >>> >> /* >> * When ... >> >>> + /* When tiering is enabled at runtime, last_cpupid may >>> + * hold a valid cpupid instead of an access timestamp. >>> + * If so, skip page promotion. >>> + */ >>> + if (cpupid_valid(folio_last_cpupid(folio))) >>> + return false; >>> + >> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >> we would no longer get false positives for cpupid_valid(). >> I suppose overflows are not a problem, correct? > > Thank you, David, for guiding me in the right direction. > > I initially thought that overflows would not occur, and therefore > cpupid_valid() would not produce false positives. However, > after looking into it further, it appears that overflow can > happen when storing the access time. > > The last_cpupid field is used to store the last access time. > From the code, it appears that 21 bits are used for this > (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). > > With 21 bits, the maximum value that can be stored is It can be less than 21 bits, if CONFIG_NR_CPUS is small. DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > 2097151ms (35Hrs) . If the access time exceeds this > range, it can overflow, which may lead to cpupid_valid() > returning false positives. > > I think we need a reliable way to determine cpupid_valid() that > does not produce false positives. Yes. IMHO, false positives is unavoidable. So, the patch fixes a temporal performance issue at the cost of a longstanding performance issue. Right? --- Best Regards, Huang, Ying > >> >> So what we're saying is that folio_use_access_time()==true does not >> imply that there is actually a valid time in there. >> >> In numa_migrate_check() we could still use the valid cpuid I guess and >> make that code a bit clearer? >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 631205a384e1..ba68933a9e4a 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, >> * For memory tiering mode, cpupid of slow memory page is used >> * to record page access time. So use default value. >> */ >> - if (folio_use_access_time(folio)) >> + *last_cpupid = folio_last_cpupid(folio); >> + if (!cpupid_valid(*last_cpupid)) >> *last_cpupid = (-1 & LAST_CPUPID_MASK); >> - else >> - *last_cpupid = folio_last_cpupid(folio); >> /* Record the current PID accessing VMA */ >> vma_set_access_pid_bit(vma); >> >> >> The change itself here looks reasonable to me. >> >> Acked-by: David Hildenbrand (Arm) <david@kernel.org> >> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 8:33 ` Huang, Ying @ 2026-03-31 9:03 ` Donet Tom 2026-03-31 9:17 ` Huang, Ying 2026-03-31 10:04 ` David Hildenbrand (Arm) 1 sibling, 1 reply; 10+ messages in thread From: Donet Tom @ 2026-03-31 9:03 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt Hi On 3/31/26 2:03 PM, Huang, Ying wrote: > Hi, Donet, > > Donet Tom <donettom@linux.ibm.com> writes: > >> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>> On 3/26/26 08:12, Donet Tom wrote: >>>> If memory tiering is disabled, cpupid of slow memory pages may >>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>> there is a chance that in should_numa_migrate_memory(), this >>>> valid CPU/PID is treated as a last access timestamp, leading >>>> to unnecessary promotion. >>> Is that measurable? Should we at least have a Fixes: ? >>> >>>> Prevent this by skipping promotion when cpupid is valid. >>>> >>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>> --- >>>> kernel/sched/fair.c | 7 +++++++ >>>> 1 file changed, 7 insertions(+) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>> unsigned int latency, th, def_th; >>>> long nr = folio_nr_pages(folio); >>>> >>> /* >>> * When ... >>> >>>> + /* When tiering is enabled at runtime, last_cpupid may >>>> + * hold a valid cpupid instead of an access timestamp. >>>> + * If so, skip page promotion. >>>> + */ >>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>> + return false; >>>> + >>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>> we would no longer get false positives for cpupid_valid(). >>> I suppose overflows are not a problem, correct? >> Thank you, David, for guiding me in the right direction. >> >> I initially thought that overflows would not occur, and therefore >> cpupid_valid() would not produce false positives. However, >> after looking into it further, it appears that overflow can >> happen when storing the access time. >> >> The last_cpupid field is used to store the last access time. >> From the code, it appears that 21 bits are used for this >> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >> >> With 21 bits, the maximum value that can be stored is > It can be less than 21 bits, if CONFIG_NR_CPUS is small. > > DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > >> 2097151ms (35Hrs) . If the access time exceeds this >> range, it can overflow, which may lead to cpupid_valid() >> returning false positives. >> >> I think we need a reliable way to determine cpupid_valid() that >> does not produce false positives. > Yes. IMHO, false positives is unavoidable. So, the patch fixes a > temporal performance issue at the cost of a longstanding performance > issue. Right? I was trying to fix a functional issue. When memory tiering is enabled at runtime, treating last_cpupid as access time is incorrect, right? -Donet > --- > Best Regards, > Huang, Ying > >>> So what we're saying is that folio_use_access_time()==true does not >>> imply that there is actually a valid time in there. >>> >>> In numa_migrate_check() we could still use the valid cpuid I guess and >>> make that code a bit clearer? >>> >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 631205a384e1..ba68933a9e4a 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, >>> * For memory tiering mode, cpupid of slow memory page is used >>> * to record page access time. So use default value. >>> */ >>> - if (folio_use_access_time(folio)) >>> + *last_cpupid = folio_last_cpupid(folio); >>> + if (!cpupid_valid(*last_cpupid)) >>> *last_cpupid = (-1 & LAST_CPUPID_MASK); >>> - else >>> - *last_cpupid = folio_last_cpupid(folio); >>> /* Record the current PID accessing VMA */ >>> vma_set_access_pid_bit(vma); >>> >>> >>> The change itself here looks reasonable to me. >>> >>> Acked-by: David Hildenbrand (Arm) <david@kernel.org> >>> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 9:03 ` Donet Tom @ 2026-03-31 9:17 ` Huang, Ying 2026-03-31 10:00 ` Donet Tom 0 siblings, 1 reply; 10+ messages in thread From: Huang, Ying @ 2026-03-31 9:17 UTC (permalink / raw) To: Donet Tom Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt Donet Tom <donettom@linux.ibm.com> writes: > Hi > > On 3/31/26 2:03 PM, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> On 3/26/26 08:12, Donet Tom wrote: >>>>> If memory tiering is disabled, cpupid of slow memory pages may >>>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>>> there is a chance that in should_numa_migrate_memory(), this >>>>> valid CPU/PID is treated as a last access timestamp, leading >>>>> to unnecessary promotion. >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>>> Prevent this by skipping promotion when cpupid is valid. >>>>> >>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>> --- >>>>> kernel/sched/fair.c | 7 +++++++ >>>>> 1 file changed, 7 insertions(+) >>>>> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>>> --- a/kernel/sched/fair.c >>>>> +++ b/kernel/sched/fair.c >>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>>> unsigned int latency, th, def_th; >>>>> long nr = folio_nr_pages(folio); >>>>> >>>> /* >>>> * When ... >>>> >>>>> + /* When tiering is enabled at runtime, last_cpupid may >>>>> + * hold a valid cpupid instead of an access timestamp. >>>>> + * If so, skip page promotion. >>>>> + */ >>>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>>> + return false; >>>>> + >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > > I was trying to fix a functional issue. When memory tiering is > > enabled at runtime, treating last_cpupid as access time is incorrect, right? I don't think that it's a functional issue. It has only performance impact. Did you find any functionality bug? --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 9:17 ` Huang, Ying @ 2026-03-31 10:00 ` Donet Tom 0 siblings, 0 replies; 10+ messages in thread From: Donet Tom @ 2026-03-31 10:00 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand (Arm), Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 3/31/26 2:47 PM, Huang, Ying wrote: > Donet Tom <donettom@linux.ibm.com> writes: > >> Hi >> >> On 3/31/26 2:03 PM, Huang, Ying wrote: >>> Hi, Donet, >>> >>> Donet Tom <donettom@linux.ibm.com> writes: >>> >>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>>> On 3/26/26 08:12, Donet Tom wrote: >>>>>> If memory tiering is disabled, cpupid of slow memory pages may >>>>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>>>> there is a chance that in should_numa_migrate_memory(), this >>>>>> valid CPU/PID is treated as a last access timestamp, leading >>>>>> to unnecessary promotion. >>>>> Is that measurable? Should we at least have a Fixes: ? >>>>> >>>>>> Prevent this by skipping promotion when cpupid is valid. >>>>>> >>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>>> --- >>>>>> kernel/sched/fair.c | 7 +++++++ >>>>>> 1 file changed, 7 insertions(+) >>>>>> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>>>> --- a/kernel/sched/fair.c >>>>>> +++ b/kernel/sched/fair.c >>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>>>> unsigned int latency, th, def_th; >>>>>> long nr = folio_nr_pages(folio); >>>>>> >>>>> /* >>>>> * When ... >>>>> >>>>>> + /* When tiering is enabled at runtime, last_cpupid may >>>>>> + * hold a valid cpupid instead of an access timestamp. >>>>>> + * If so, skip page promotion. >>>>>> + */ >>>>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>>>> + return false; >>>>>> + >>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>>> we would no longer get false positives for cpupid_valid(). >>>>> I suppose overflows are not a problem, correct? >>>> Thank you, David, for guiding me in the right direction. >>>> >>>> I initially thought that overflows would not occur, and therefore >>>> cpupid_valid() would not produce false positives. However, >>>> after looking into it further, it appears that overflow can >>>> happen when storing the access time. >>>> >>>> The last_cpupid field is used to store the last access time. >>>> From the code, it appears that 21 bits are used for this >>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>>> >>>> With 21 bits, the maximum value that can be stored is >>> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >>> >>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >>> >>>> 2097151ms (35Hrs) . If the access time exceeds this >>>> range, it can overflow, which may lead to cpupid_valid() >>>> returning false positives. >>>> >>>> I think we need a reliable way to determine cpupid_valid() that >>>> does not produce false positives. >>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >>> temporal performance issue at the cost of a longstanding performance >>> issue. Right? >> >> I was trying to fix a functional issue. When memory tiering is >> >> enabled at runtime, treating last_cpupid as access time is incorrect, right? > I don't think that it's a functional issue. It has only performance > impact. Did you find any functionality bug? > Thank you for the confirmation. I thought this was a functional issue. In that case, we can drop this patch. -Donet > > --- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 8:33 ` Huang, Ying 2026-03-31 9:03 ` Donet Tom @ 2026-03-31 10:04 ` David Hildenbrand (Arm) 2026-03-31 15:02 ` Donet Tom 2026-04-01 9:48 ` Huang, Ying 1 sibling, 2 replies; 10+ messages in thread From: David Hildenbrand (Arm) @ 2026-03-31 10:04 UTC (permalink / raw) To: Huang, Ying, Donet Tom Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 3/31/26 10:33, Huang, Ying wrote: > Hi, Donet, > > Donet Tom <donettom@linux.ibm.com> writes: > >> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>> Is that measurable? Should we at least have a Fixes: ? >>> >>> /* >>> * When ... >>> >>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>> we would no longer get false positives for cpupid_valid(). >>> I suppose overflows are not a problem, correct? >> >> Thank you, David, for guiding me in the right direction. >> >> I initially thought that overflows would not occur, and therefore >> cpupid_valid() would not produce false positives. However, >> after looking into it further, it appears that overflow can >> happen when storing the access time. >> >> The last_cpupid field is used to store the last access time. >> From the code, it appears that 21 bits are used for this >> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >> >> With 21 bits, the maximum value that can be stored is > > It can be less than 21 bits, if CONFIG_NR_CPUS is small. > > DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > >> 2097151ms (35Hrs) . If the access time exceeds this >> range, it can overflow, which may lead to cpupid_valid() >> returning false positives. >> >> I think we need a reliable way to determine cpupid_valid() that >> does not produce false positives. > > Yes. IMHO, false positives is unavoidable. So, the patch fixes a > temporal performance issue at the cost of a longstanding performance > issue. Right? Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one bit for time, to we care? Would make it all easier to get ... -- Cheers, David ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 10:04 ` David Hildenbrand (Arm) @ 2026-03-31 15:02 ` Donet Tom 2026-04-01 9:48 ` Huang, Ying 1 sibling, 0 replies; 10+ messages in thread From: Donet Tom @ 2026-03-31 15:02 UTC (permalink / raw) To: David Hildenbrand (Arm), Huang, Ying Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt On 3/31/26 3:34 PM, David Hildenbrand (Arm) wrote: > On 3/31/26 10:33, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>> /* >>>> * When ... >>>> >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one > bit for time, to we care? Thank you, David, for the input. This sounds like a good idea—I'll give it a try. -Donet > > Would make it all easier to get ... > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid 2026-03-31 10:04 ` David Hildenbrand (Arm) 2026-03-31 15:02 ` Donet Tom @ 2026-04-01 9:48 ` Huang, Ying 1 sibling, 0 replies; 10+ messages in thread From: Huang, Ying @ 2026-04-01 9:48 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Donet Tom, Andrew Morton, Ingo Molnar, Peter Zijlstra, Ritesh Harjani, linux-mm, linux-kernel, Baolin Wang, Ying Huang, Juri Lelli, Mel Gorman, Vincent Guittot, Dietmar Eggemann, Steven Rostedt "David Hildenbrand (Arm)" <david@kernel.org> writes: > On 3/31/26 10:33, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>> /* >>>> * When ... >>>> >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > > Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one > bit for time, to we care? Do we need one more bit for time and cpupid? However, page flags are precious resources. > Would make it all easier to get ... --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-04-01 9:48 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-03-26 7:12 [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid Donet Tom 2026-03-26 10:29 ` David Hildenbrand (Arm) 2026-03-27 18:54 ` Donet Tom 2026-03-31 8:33 ` Huang, Ying 2026-03-31 9:03 ` Donet Tom 2026-03-31 9:17 ` Huang, Ying 2026-03-31 10:00 ` Donet Tom 2026-03-31 10:04 ` David Hildenbrand (Arm) 2026-03-31 15:02 ` Donet Tom 2026-04-01 9:48 ` Huang, Ying
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox