* [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
@ 2025-07-22 14:16 Ruan Shiyang
2025-07-23 3:09 ` Huang, Ying
2025-07-24 3:35 ` Zhijian Li (Fujitsu)
0 siblings, 2 replies; 8+ messages in thread
From: Ruan Shiyang @ 2025-07-22 14:16 UTC
To: linux-mm
Cc: linux-kernel, lkp, ying.huang, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
From: Li Zhijian <lizhijian@fujitsu.com>
===
Changes since v2:
1. According to Huang's suggestion, add a new stat to not count these
pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
mechanism.
===
Goto-san reported confusing pgpromote statistics where the
pgpromote_success count significantly exceeded pgpromote_candidate.
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing
After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0
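For anyone who wants to check the two counters programmatically rather than by eye, here is a minimal userspace sketch in C. It assumes only the `name value` line format shown above; the helper name `vmstat_value()` is made up for illustration and is not a kernel or libc interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value of counter `name` found in
 * /proc/vmstat-formatted `text` (one "name value" pair per line),
 * or -1 if the counter is absent or its value cannot be parsed.
 */
long vmstat_value(const char *text, const char *name)
{
	size_t len;
	const char *line = text;

	assert(text && name);
	len = strlen(name);

	while (line && *line) {
		/* Require a space after the name so that a longer counter
		 * sharing the same prefix is not matched by mistake. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			long v;

			if (sscanf(line + len, "%ld", &v) == 1)
				return v;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Fed the two lines of output above, this returns 2579 for pgpromote_success and 0 for pgpromote_candidate, and -1 for a counter that is not present.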
In this scenario, after the first memhog is terminated, the conditions for
pgdat_free_space_enough() are quickly met, which triggers promotion.
However, these migrated pages are counted only in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.
To resolve these confusing statistics, introduce
PGPROMOTE_CANDIDATE_NOLIMIT to count the promoted pages that were missed.
These pages are deliberately not counted in PGPROMOTE_CANDIDATE, so that
the existing algorithm and performance of the promotion rate limit are
unchanged.
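The accounting can be illustrated with a small userspace model in C. This is only a sketch and not kernel code: the struct, its fields and both function names are hypothetical, the `enough_free_pages` field is an assumed stand-in for whatever pgdat_free_space_enough() checks, and the rate limit itself is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one fast-tier node; not a kernel structure. */
struct node_model {
	long free_pages;        /* free pages on the fast (DRAM) node */
	long enough_free_pages; /* assumed threshold for "free space enough" */
	long candidate;         /* models PGPROMOTE_CANDIDATE */
	long candidate_nolimit; /* models PGPROMOTE_CANDIDATE_NOLIMIT */
	long success;           /* models PGPROMOTE_SUCCESS */
};

/* Decide whether nr_pages may be promoted, updating the candidate counters. */
bool model_should_promote(struct node_model *n, long nr_pages, bool hot_enough)
{
	assert(n && nr_pages > 0);

	if (n->free_pages >= n->enough_free_pages) {
		/* Plenty of free space: the hot threshold and the rate
		 * limit are skipped, so only the new counter is bumped. */
		n->candidate_nolimit += nr_pages;
		return true;
	}
	if (!hot_enough)
		return false;

	/* Regular path: counted as an ordinary candidate
	 * (rate limiting is omitted from this model). */
	n->candidate += nr_pages;
	return true;
}

/* Promote nr_pages if allowed, assuming the migration always succeeds. */
void model_promote(struct node_model *n, long nr_pages, bool hot_enough)
{
	if (model_should_promote(n, nr_pages, hot_enough)) {
		n->success += nr_pages;
		n->free_pages -= nr_pages;
	}
}
```

Starting with plenty of free space and promoting 2579 pages through this model leaves candidate at 0, candidate_nolimit at 2579 and success at 2579, so candidate + candidate_nolimit accounts for every successful promotion.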
Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named; please comment if
you have a better idea.
Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
---
include/linux/mmzone.h | 2 ++
kernel/sched/fair.c | 6 ++++--
mm/vmstat.c | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..6216e2eecf3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -231,6 +231,8 @@ enum node_stat_item {
#ifdef CONFIG_NUMA_BALANCING
PGPROMOTE_SUCCESS, /* promote successfully */
PGPROMOTE_CANDIDATE, /* candidate pages to promote */
+ PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
+ * hot threshold */
#endif
/* PGDEMOTE_*: pages demoted */
PGDEMOTE_KSWAPD,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..12dac3519c49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
struct pglist_data *pgdat;
unsigned long rate_limit;
unsigned int latency, th, def_th;
+ long nr = folio_nr_pages(folio);
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NOLIMIT,
+ nr);
return true;
}
@@ -1958,8 +1961,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
if (latency >= th)
return false;
- return !numa_promotion_rate_limit(pgdat, rate_limit,
- folio_nr_pages(folio));
+ return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
}
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a78d70ddeacd..ca44a2dd5497 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1272,6 +1272,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_NUMA_BALANCING
"pgpromote_success",
"pgpromote_candidate",
+ "pgpromote_candidate_nolimit",
#endif
"pgdemote_kswapd",
"pgdemote_direct",
--
2.43.0
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-22 14:16 [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting Ruan Shiyang
@ 2025-07-23 3:09 ` Huang, Ying
2025-07-24 2:39 ` Shiyang Ruan
2025-07-24 3:35 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2025-07-23 3:09 UTC
To: Ruan Shiyang
Cc: linux-mm, linux-kernel, lkp, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
Ruan Shiyang <ruansy.fnst@fujitsu.com> writes:
> From: Li Zhijian <lizhijian@fujitsu.com>
>
> ===
> Changes since v2:
> 1. According to Huang's suggestion, add a new stat to not count these
> pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
> mechanism.
> ===
This isn't the usual place for the changelog; please refer to other patch
emails.
> Goto-san reported confusing pgpromote statistics where the
> pgpromote_success count significantly exceeded pgpromote_candidate.
>
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
> # Enable demotion only
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> numactl -m 0-1 memhog -r200 3500M >/dev/null &
> pid=$!
> sleep 2
> numactl memhog -r100 2500M >/dev/null &
> sleep 10
> kill -9 $pid # terminate the 1st memhog
> # Enable promotion
> echo 2 > /proc/sys/kernel/numa_balancing
>
> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 0
>
> In this scenario, after terminating the first memhog, the conditions for
> pgdat_free_space_enough() are quickly met, and triggers promotion.
> However, these migrated pages are only counted for in PGPROMOTE_SUCCESS,
> not in PGPROMOTE_CANDIDATE.
>
> To solve this confusing statistics, introduce this
> PGPROMOTE_CANDIDATE_NOLIMIT to count the missed promotion pages. And
> also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid
> changing the existing algorithm or performance of the promotion rate
> limit.
>
> Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named, please comment if
> you have a better idea.
Yes. Naming is hard. I guess that the name comes from the promotion
that isn't rate limited. I asked DeepSeek for a good abbreviation of
"not rate limited"; its answer was "NRL". I don't know whether it's
good. However, "NOT_RATE_LIMITED" appears too long.
>
>
The empty line is unnecessary.
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
> ---
> include/linux/mmzone.h | 2 ++
> kernel/sched/fair.c | 6 ++++--
> mm/vmstat.c | 1 +
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 283913d42d7b..6216e2eecf3b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -231,6 +231,8 @@ enum node_stat_item {
> #ifdef CONFIG_NUMA_BALANCING
> PGPROMOTE_SUCCESS, /* promote successfully */
> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
> + * hot threshold */
> #endif
> /* PGDEMOTE_*: pages demoted */
> PGDEMOTE_KSWAPD,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..12dac3519c49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> struct pglist_data *pgdat;
> unsigned long rate_limit;
> unsigned int latency, th, def_th;
> + long nr = folio_nr_pages(folio);
>
> pgdat = NODE_DATA(dst_nid);
> if (pgdat_free_space_enough(pgdat)) {
> /* workload changed, reset hot threshold */
> pgdat->nbp_threshold = 0;
> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NOLIMIT,
> + nr);
> return true;
> }
>
> @@ -1958,8 +1961,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> if (latency >= th)
> return false;
>
> - return !numa_promotion_rate_limit(pgdat, rate_limit,
> - folio_nr_pages(folio));
> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
> }
>
> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a78d70ddeacd..ca44a2dd5497 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1272,6 +1272,7 @@ const char * const vmstat_text[] = {
> #ifdef CONFIG_NUMA_BALANCING
> "pgpromote_success",
> "pgpromote_candidate",
> + "pgpromote_candidate_nolimit",
> #endif
> "pgdemote_kswapd",
> "pgdemote_direct",
---
Best Regards,
Huang, Ying
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-23 3:09 ` Huang, Ying
@ 2025-07-24 2:39 ` Shiyang Ruan
2025-07-24 7:36 ` Huang, Ying
0 siblings, 1 reply; 8+ messages in thread
From: Shiyang Ruan @ 2025-07-24 2:39 UTC
To: Huang, Ying
Cc: linux-mm, linux-kernel, lkp, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
On 2025/7/23 11:09, Huang, Ying wrote:
> Ruan Shiyang <ruansy.fnst@fujitsu.com> writes:
>
>> From: Li Zhijian <lizhijian@fujitsu.com>
>>
>> ===
>> Changes since v2:
>> 1. According to Huang's suggestion, add a new stat to not count these
>> pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
>> mechanism.
>> ===
>
> This isn't the popular place for changelog, please refer to other patch
> email.
OK. I'll move this part down below.
>> Goto-san reported confusing pgpromote statistics where the
>> pgpromote_success count significantly exceeded pgpromote_candidate.
>>
>> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>> # Enable demotion only
>> echo 1 > /sys/kernel/mm/numa/demotion_enabled
>> numactl -m 0-1 memhog -r200 3500M >/dev/null &
>> pid=$!
>> sleep 2
>> numactl memhog -r100 2500M >/dev/null &
>> sleep 10
>> kill -9 $pid # terminate the 1st memhog
>> # Enable promotion
>> echo 2 > /proc/sys/kernel/numa_balancing
>>
>> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 0
>>
>> In this scenario, after terminating the first memhog, the conditions for
>> pgdat_free_space_enough() are quickly met, and triggers promotion.
>> However, these migrated pages are only counted for in PGPROMOTE_SUCCESS,
>> not in PGPROMOTE_CANDIDATE.
>>
>> To solve this confusing statistics, introduce this
>> PGPROMOTE_CANDIDATE_NOLIMIT to count the missed promotion pages. And
>> also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid
>> changing the existing algorithm or performance of the promotion rate
>> limit.
>>
>> Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named, please comment if
>> you have a better idea.
>
> Yes. Naming is hard. I guess that the name comes from the promotion
> that isn't rate limited. I have asked Deepseek that what is the good
> abbreviation for "not rate limited". Its answer is "NRL". I don't know
> whether it's good. However, "NOT_RATE_LIMITED" appears too long.
"NRL" sounds good to me.
I'm thinking of another one: since it's not rate limited, it could be
migrated quickly. How about PGPROMOTE_CANDIDATE_FAST?
>
>>
>>
>
> The empty line is unnecessary.
OK.
>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>
> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
OK.
--
Thanks,
Ruan.
>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
>> ---
>> include/linux/mmzone.h | 2 ++
>> kernel/sched/fair.c | 6 ++++--
>> mm/vmstat.c | 1 +
>> 3 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 283913d42d7b..6216e2eecf3b 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -231,6 +231,8 @@ enum node_stat_item {
>> #ifdef CONFIG_NUMA_BALANCING
>> PGPROMOTE_SUCCESS, /* promote successfully */
>> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
>> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
>> + * hot threshold */
>> #endif
>> /* PGDEMOTE_*: pages demoted */
>> PGDEMOTE_KSWAPD,
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..12dac3519c49 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>> struct pglist_data *pgdat;
>> unsigned long rate_limit;
>> unsigned int latency, th, def_th;
>> + long nr = folio_nr_pages(folio);
>>
>> pgdat = NODE_DATA(dst_nid);
>> if (pgdat_free_space_enough(pgdat)) {
>> /* workload changed, reset hot threshold */
>> pgdat->nbp_threshold = 0;
>> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NOLIMIT,
>> + nr);
>> return true;
>> }
>>
>> @@ -1958,8 +1961,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>> if (latency >= th)
>> return false;
>>
>> - return !numa_promotion_rate_limit(pgdat, rate_limit,
>> - folio_nr_pages(folio));
>> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>> }
>>
>> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index a78d70ddeacd..ca44a2dd5497 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1272,6 +1272,7 @@ const char * const vmstat_text[] = {
>> #ifdef CONFIG_NUMA_BALANCING
>> "pgpromote_success",
>> "pgpromote_candidate",
>> + "pgpromote_candidate_nolimit",
>> #endif
>> "pgdemote_kswapd",
>> "pgdemote_direct",
>
> ---
> Best Regards,
> Huang, Ying
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-22 14:16 [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting Ruan Shiyang
2025-07-23 3:09 ` Huang, Ying
@ 2025-07-24 3:35 ` Zhijian Li (Fujitsu)
2025-07-24 7:35 ` Huang, Ying
1 sibling, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-07-24 3:35 UTC
To: Shiyang Ruan (Fujitsu), linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, lkp@intel.com,
ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
Yasunori Gotou (Fujitsu), mingo@redhat.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de,
vschneid@redhat.com, Ben Segall
On 22/07/2025 22:16, Ruan Shiyang wrote:
> From: Li Zhijian<lizhijian@fujitsu.com>
>
I believe you are the actual author of this patch, so please change the author to yourself :)
> Cc: Juri Lelli<juri.lelli@redhat.com>
> Cc: Vincent Guittot<vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann<dietmar.eggemann@arm.com>
> Cc: Steven Rostedt<rostedt@goodmis.org>
> Cc: Ben Segall<bsegall@google.com>
> Cc: Mel Gorman<mgorman@suse.de>
> Cc: Valentin Schneider<vschneid@redhat.com>
> Reported-by: Yasunori Gotou (Fujitsu)<y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian<lizhijian@fujitsu.com>
> Signed-off-by: Ruan Shiyang<ruansy.fnst@fujitsu.com>
> ---
> include/linux/mmzone.h | 2 ++
> kernel/sched/fair.c | 6 ++++--
> mm/vmstat.c | 1 +
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 283913d42d7b..6216e2eecf3b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -231,6 +231,8 @@ enum node_stat_item {
> #ifdef CONFIG_NUMA_BALANCING
> PGPROMOTE_SUCCESS, /* promote successfully */
> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
Additionally, I think the current comment for PGPROMOTE_CANDIDATE is inaccurate. If possible, I'd like to refine it along with this patch.
For example:
/*
* Candidate pages for promotion based on hint fault latency. This counter
* is used to control the promotion rate and adjust the hotness threshold.
*/
What are your thoughts, @Ying?
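To make the "control the promotion rate" part of the proposed comment concrete, below is a rough userspace model in C of how a cumulative candidate counter can drive a per-second rate limit. It is a simplification of what numa_promotion_rate_limit() appears to do; the struct and function names are invented for the example, time is passed in as plain milliseconds, and the hot threshold adjustment is not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; field names are illustrative only. */
struct rl_model {
	unsigned long nr_cand;      /* cumulative candidates, models PGPROMOTE_CANDIDATE */
	unsigned long window_start; /* start of the current window, in ms */
	unsigned long window_base;  /* nr_cand snapshot taken when the window started */
};

/*
 * Account nr candidate pages at time now_ms and report whether promotion
 * should be throttled, i.e. whether the candidates seen in the current
 * one-second window have reached rate_limit (pages per second).
 */
bool rl_model_limited(struct rl_model *m, unsigned long now_ms,
		      unsigned long rate_limit, unsigned long nr)
{
	assert(m);

	m->nr_cand += nr;
	if (now_ms - m->window_start > 1000) {
		/* A new window begins; pages counted so far no longer
		 * count against the limit. */
		m->window_start = now_ms;
		m->window_base = m->nr_cand;
	}
	return m->nr_cand - m->window_base >= rate_limit;
}
```

The point of the model is that every page added to the counter consumes part of the rate-limit budget, which is why pages promoted through the free-space path would change throttling behavior if they were added to PGPROMOTE_CANDIDATE.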
> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
> + * hot threshold */
Similarly, the comment for PGPROMOTE_CANDIDATE_NOLIMIT can also be made more precise.
Thanks
Zhijian
> #endif
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-24 3:35 ` Zhijian Li (Fujitsu)
@ 2025-07-24 7:35 ` Huang, Ying
0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2025-07-24 7:35 UTC
To: Zhijian Li (Fujitsu)
Cc: Shiyang Ruan (Fujitsu), linux-mm@kvack.org,
linux-kernel@vger.kernel.org, lkp@intel.com,
akpm@linux-foundation.org, Yasunori Gotou (Fujitsu),
mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, mgorman@suse.de, vschneid@redhat.com,
Ben Segall
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> On 22/07/2025 22:16, Ruan Shiyang wrote:
>> From: Li Zhijian<lizhijian@fujitsu.com>
>>
>
> I believe you are the actual author of this patch, so please change to yourself :)
>
>
>> Cc: Juri Lelli<juri.lelli@redhat.com>
>> Cc: Vincent Guittot<vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann<dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt<rostedt@goodmis.org>
>> Cc: Ben Segall<bsegall@google.com>
>> Cc: Mel Gorman<mgorman@suse.de>
>> Cc: Valentin Schneider<vschneid@redhat.com>
>> Reported-by: Yasunori Gotou (Fujitsu)<y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian<lizhijian@fujitsu.com>
>> Signed-off-by: Ruan Shiyang<ruansy.fnst@fujitsu.com>
>> ---
>> include/linux/mmzone.h | 2 ++
>> kernel/sched/fair.c | 6 ++++--
>> mm/vmstat.c | 1 +
>> 3 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 283913d42d7b..6216e2eecf3b 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -231,6 +231,8 @@ enum node_stat_item {
>> #ifdef CONFIG_NUMA_BALANCING
>> PGPROMOTE_SUCCESS, /* promote successfully */
>> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
>
> Additionally, I think the current comment for PGPROMOTE_CANDIDATE is inaccurate. If possible, I'd like to refine it along with this patch.
> For example:
> /*
> * Candidate pages for promotion based on hint fault latency. This counter
> * is used to control the promotion rate and adjust the hotness threshold.
> */
> What are your thoughts, @Ying?
This looks good to me, Thanks!
---
Best Regards,
Huang, Ying
>
>
>> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
>> + * hot threshold */
>
> Similarly, the comment for PGPROMOTE_CANDIDATE_NOLIMIT can also be made more precise.
>
>
> Thanks
> Zhijian
>
>> #endif
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-24 2:39 ` Shiyang Ruan
@ 2025-07-24 7:36 ` Huang, Ying
2025-07-25 2:20 ` Shiyang Ruan
0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2025-07-24 7:36 UTC
To: Shiyang Ruan
Cc: linux-mm, linux-kernel, lkp, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
Shiyang Ruan <ruansy.fnst@fujitsu.com> writes:
> On 2025/7/23 11:09, Huang, Ying wrote:
>> Ruan Shiyang <ruansy.fnst@fujitsu.com> writes:
>>
>>> From: Li Zhijian <lizhijian@fujitsu.com>
>>>
>>> ===
>>> Changes since v2:
>>> 1. According to Huang's suggestion, add a new stat to not count these
>>> pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
>>> mechanism.
>>> ===
>> This isn't the popular place for changelog, please refer to other
>> patch
>> email.
>
> OK. I'll move this part down below.
>>> Goto-san reported confusing pgpromote statistics where the
>>> pgpromote_success count significantly exceeded pgpromote_candidate.
>>>
>>> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>>> # Enable demotion only
>>> echo 1 > /sys/kernel/mm/numa/demotion_enabled
>>> numactl -m 0-1 memhog -r200 3500M >/dev/null &
>>> pid=$!
>>> sleep 2
>>> numactl memhog -r100 2500M >/dev/null &
>>> sleep 10
>>> kill -9 $pid # terminate the 1st memhog
>>> # Enable promotion
>>> echo 2 > /proc/sys/kernel/numa_balancing
>>>
>>> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
>>> $ grep -e pgpromote /proc/vmstat
>>> pgpromote_success 2579
>>> pgpromote_candidate 0
>>>
>>> In this scenario, after terminating the first memhog, the conditions for
>>> pgdat_free_space_enough() are quickly met, and triggers promotion.
>>> However, these migrated pages are only counted for in PGPROMOTE_SUCCESS,
>>> not in PGPROMOTE_CANDIDATE.
>>>
>>> To solve this confusing statistics, introduce this
>>> PGPROMOTE_CANDIDATE_NOLIMIT to count the missed promotion pages. And
>>> also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid
>>> changing the existing algorithm or performance of the promotion rate
>>> limit.
>>>
>>> Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named, please comment if
>>> you have a better idea.
>> Yes. Naming is hard. I guess that the name comes from the
>> promotion
>> that isn't rate limited. I have asked Deepseek that what is the good
>> abbreviation for "not rate limited". Its answer is "NRL". I don't know
>> whether it's good. However, "NOT_RATE_LIMITED" appears too long.
>
> "NRL" Sounds good to me.
>
> I'm thinking another one: since it's not rate limited, it could be
> migrated quickly/fast. How about PGPROMOTE_CANDIDATE_FAST?
This sounds good to me, Thanks!
---
Best Regards,
Huang, Ying
>
>>
>>>
>>>
>> The empty line is unnecessary.
>
> OK.
>>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
>
> OK.
>
>
> --
> Thanks,
> Ruan.
>
>>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>> Cc: Steven Rostedt <rostedt@goodmis.org>
>>> Cc: Ben Segall <bsegall@google.com>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Valentin Schneider <vschneid@redhat.com>
>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
>>> ---
>>> include/linux/mmzone.h | 2 ++
>>> kernel/sched/fair.c | 6 ++++--
>>> mm/vmstat.c | 1 +
>>> 3 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 283913d42d7b..6216e2eecf3b 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -231,6 +231,8 @@ enum node_stat_item {
>>> #ifdef CONFIG_NUMA_BALANCING
>>> PGPROMOTE_SUCCESS, /* promote successfully */
>>> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
>>> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
>>> + * hot threshold */
>>> #endif
>>> /* PGDEMOTE_*: pages demoted */
>>> PGDEMOTE_KSWAPD,
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a14da5396fb..12dac3519c49 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1940,11 +1940,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>> struct pglist_data *pgdat;
>>> unsigned long rate_limit;
>>> unsigned int latency, th, def_th;
>>> + long nr = folio_nr_pages(folio);
>>> pgdat = NODE_DATA(dst_nid);
>>> if (pgdat_free_space_enough(pgdat)) {
>>> /* workload changed, reset hot threshold */
>>> pgdat->nbp_threshold = 0;
>>> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NOLIMIT,
>>> + nr);
>>> return true;
>>> }
>>> @@ -1958,8 +1961,7 @@ bool should_numa_migrate_memory(struct
>>> task_struct *p, struct folio *folio,
>>> if (latency >= th)
>>> return false;
>>> - return !numa_promotion_rate_limit(pgdat, rate_limit,
>>> - folio_nr_pages(folio));
>>> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>> }
>>> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index a78d70ddeacd..ca44a2dd5497 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1272,6 +1272,7 @@ const char * const vmstat_text[] = {
>>> #ifdef CONFIG_NUMA_BALANCING
>>> "pgpromote_success",
>>> "pgpromote_candidate",
>>> + "pgpromote_candidate_nolimit",
>>> #endif
>>> "pgdemote_kswapd",
>>> "pgdemote_direct",
>> ---
>> Best Regards,
>> Huang, Ying
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-24 7:36 ` Huang, Ying
@ 2025-07-25 2:20 ` Shiyang Ruan
2025-07-25 6:39 ` Huang, Ying
0 siblings, 1 reply; 8+ messages in thread
From: Shiyang Ruan @ 2025-07-25 2:20 UTC
To: Huang, Ying
Cc: linux-mm, linux-kernel, lkp, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
On 2025/7/24 15:36, Huang, Ying wrote:
> Shiyang Ruan <ruansy.fnst@fujitsu.com> writes:
>
>> On 2025/7/23 11:09, Huang, Ying wrote:
>>> Ruan Shiyang <ruansy.fnst@fujitsu.com> writes:
>>>
>>>> From: Li Zhijian <lizhijian@fujitsu.com>
>>>>
>>>> ===
>>>> Changes since v2:
>>>> 1. According to Huang's suggestion, add a new stat to not count these
>>>> pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
>>>> mechanism.
>>>> ===
>>> This isn't the popular place for changelog, please refer to other
>>> patch
>>> email.
>>
>> OK. I'll move this part down below.
>>>> Goto-san reported confusing pgpromote statistics where the
>>>> pgpromote_success count significantly exceeded pgpromote_candidate.
>>>>
>>>> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>>>> # Enable demotion only
>>>> echo 1 > /sys/kernel/mm/numa/demotion_enabled
>>>> numactl -m 0-1 memhog -r200 3500M >/dev/null &
>>>> pid=$!
>>>> sleep 2
>>>> numactl memhog -r100 2500M >/dev/null &
>>>> sleep 10
>>>> kill -9 $pid # terminate the 1st memhog
>>>> # Enable promotion
>>>> echo 2 > /proc/sys/kernel/numa_balancing
>>>>
>>>> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
>>>> $ grep -e pgpromote /proc/vmstat
>>>> pgpromote_success 2579
>>>> pgpromote_candidate 0
>>>>
>>>> In this scenario, after terminating the first memhog, the conditions for
>>>> pgdat_free_space_enough() are quickly met, and triggers promotion.
>>>> However, these migrated pages are only counted for in PGPROMOTE_SUCCESS,
>>>> not in PGPROMOTE_CANDIDATE.
>>>>
>>>> To solve this confusing statistics, introduce this
>>>> PGPROMOTE_CANDIDATE_NOLIMIT to count the missed promotion pages. And
>>>> also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid
>>>> changing the existing algorithm or performance of the promotion rate
>>>> limit.
>>>>
>>>> Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named, please comment if
>>>> you have a better idea.
>>> Yes. Naming is hard. I guess that the name comes from the
>>> promotion
>>> that isn't rate limited. I have asked Deepseek that what is the good
>>> abbreviation for "not rate limited". Its answer is "NRL". I don't know
>>> whether it's good. However, "NOT_RATE_LIMITED" appears too long.
>>
>> "NRL" Sounds good to me.
>>
>> I'm thinking another one: since it's not rate limited, it could be
>> migrated quickly/fast. How about PGPROMOTE_CANDIDATE_FAST?
>
> This sounds good to me, Thanks!
Gemini 2.5 gave me a more radical name for it:
/*
* Candidate pages for promotion based on hint fault latency. This counter
* is used by the feedback mechanism to control the promotion rate and
* adjust the hot threshold.
*/
PGPROMOTE_CANDIDATE,
/*
* Pages promoted aggressively to a fast-tier node when it has sufficient
* free space. These promotions bypass the regular hotness checks and do
* NOT influence the promotion rate-limiter or threshold-adjustment logic.
* This is for statistics/monitoring purposes.
*/
PGPROMOTED_AGGRESSIVE,
I think this one is concise and easy to understand with the comments. What do
you think? If this one is not appropriate, then I will go with "_NRL" as you
suggested.
--
Thanks,
Ruan.
>
> ---
> Best Regards,
> Huang, Ying
>
>>
>>>
>>>>
>>>>
>>> The empty line is unnecessary.
>>
>> OK.
>>>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>>> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
>>
>> OK.
>>
>>
>> --
>> Thanks,
>> Ruan.
>>
>>>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>>> Cc: Steven Rostedt <rostedt@goodmis.org>
>>>> Cc: Ben Segall <bsegall@google.com>
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Valentin Schneider <vschneid@redhat.com>
>>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>>> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
>>>> ---
>>>> include/linux/mmzone.h | 2 ++
>>>> kernel/sched/fair.c | 6 ++++--
>>>> mm/vmstat.c | 1 +
>>>> 3 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 283913d42d7b..6216e2eecf3b 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -231,6 +231,8 @@ enum node_stat_item {
>>>> #ifdef CONFIG_NUMA_BALANCING
>>>> PGPROMOTE_SUCCESS, /* promote successfully */
>>>> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
>>>> + PGPROMOTE_CANDIDATE_NOLIMIT, /* candidate pages without considering
>>>> + * hot threshold */
>>>> #endif
>>>> /* PGDEMOTE_*: pages demoted */
>>>> PGDEMOTE_KSWAPD,
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 7a14da5396fb..12dac3519c49 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -1940,11 +1940,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>> struct pglist_data *pgdat;
>>>> unsigned long rate_limit;
>>>> unsigned int latency, th, def_th;
>>>> + long nr = folio_nr_pages(folio);
>>>> pgdat = NODE_DATA(dst_nid);
>>>> if (pgdat_free_space_enough(pgdat)) {
>>>> /* workload changed, reset hot threshold */
>>>> pgdat->nbp_threshold = 0;
>>>> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NOLIMIT,
>>>> + nr);
>>>> return true;
>>>> }
>>>> @@ -1958,8 +1961,7 @@ bool should_numa_migrate_memory(struct
>>>> task_struct *p, struct folio *folio,
>>>> if (latency >= th)
>>>> return false;
>>>> - return !numa_promotion_rate_limit(pgdat, rate_limit,
>>>> - folio_nr_pages(folio));
>>>> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>>> }
>>>> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index a78d70ddeacd..ca44a2dd5497 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1272,6 +1272,7 @@ const char * const vmstat_text[] = {
>>>> #ifdef CONFIG_NUMA_BALANCING
>>>> "pgpromote_success",
>>>> "pgpromote_candidate",
>>>> + "pgpromote_candidate_nolimit",
>>>> #endif
>>>> "pgdemote_kswapd",
>>>> "pgdemote_direct",
>>> ---
>>> Best Regards,
>>> Huang, Ying
* Re: [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting
2025-07-25 2:20 ` Shiyang Ruan
@ 2025-07-25 6:39 ` Huang, Ying
0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2025-07-25 6:39 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-mm, linux-kernel, lkp, akpm, y-goto, mingo, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
vschneid, Li Zhijian, Ben Segall
Shiyang Ruan <ruansy.fnst@fujitsu.com> writes:
> On 2025/7/24 15:36, Huang, Ying wrote:
>> Shiyang Ruan <ruansy.fnst@fujitsu.com> writes:
>>
>>> On 2025/7/23 11:09, Huang, Ying wrote:
>>>> Ruan Shiyang <ruansy.fnst@fujitsu.com> writes:
>>>>
>>>>> From: Li Zhijian <lizhijian@fujitsu.com>
>>>>>
>>>>> ===
>>>>> Changes since v2:
>>>>> 1. According to Huang's suggestion, add a new stat to not count these
>>>>> pages into PGPROMOTE_CANDIDATE, to avoid changing the rate limit
>>>>> mechanism.
>>>>> ===
>>>> This isn't the usual place for a changelog; please refer to other
>>>> patch emails.
>>>
>>> OK. I'll move this part down below.
>>>
>>>>> Goto-san reported confusing pgpromote statistics where the
>>>>> pgpromote_success count significantly exceeded pgpromote_candidate.
>>>>>
>>>>> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>>>>> # Enable demotion only
>>>>> echo 1 > /sys/kernel/mm/numa/demotion_enabled
>>>>> numactl -m 0-1 memhog -r200 3500M >/dev/null &
>>>>> pid=$!
>>>>> sleep 2
>>>>> numactl memhog -r100 2500M >/dev/null &
>>>>> sleep 10
>>>>> kill -9 $pid # terminate the 1st memhog
>>>>> # Enable promotion
>>>>> echo 2 > /proc/sys/kernel/numa_balancing
>>>>>
>>>>> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
>>>>> $ grep -e pgpromote /proc/vmstat
>>>>> pgpromote_success 2579
>>>>> pgpromote_candidate 0
>>>>>
>>>>> In this scenario, after terminating the first memhog, the conditions for
>>>>> pgdat_free_space_enough() are quickly met, which triggers promotion.
>>>>> However, these migrated pages are only counted in PGPROMOTE_SUCCESS,
>>>>> not in PGPROMOTE_CANDIDATE.
>>>>>
>>>>> To resolve these confusing statistics, introduce
>>>>> PGPROMOTE_CANDIDATE_NOLIMIT to count the missed promotion pages. These
>>>>> pages are deliberately not counted in PGPROMOTE_CANDIDATE, to avoid
>>>>> changing the existing algorithm or performance of the promotion rate
>>>>> limit.
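
The relationship described in the quoted paragraph can be checked from user space. The following is a minimal sketch, assuming vmstat-style text ("name value" per line) has already been read into a buffer. The helper names `vmstat_lookup()` and `unaccounted_promotions()` are hypothetical and not part of the patch; only the counter names come from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in vmstat-style text. The name must be followed
 * by a space, so "pgpromote_candidate" does not match the line for
 * "pgpromote_candidate_nolimit".
 */
static bool vmstat_lookup(const char *text, const char *name, long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	assert(text && name && value);

	while (line && *line) {
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			*value = strtol(line + len + 1, NULL, 10);
			return true;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}

/*
 * Number of successful promotions not covered by either candidate
 * counter. A counter missing from the text is treated as 0, which is
 * what a kernel without the new counter would look like.
 */
static long unaccounted_promotions(const char *text)
{
	long success = 0, candidate = 0, nolimit = 0, gap;

	vmstat_lookup(text, "pgpromote_success", &success);
	vmstat_lookup(text, "pgpromote_candidate", &candidate);
	vmstat_lookup(text, "pgpromote_candidate_nolimit", &nolimit);

	gap = success - candidate - nolimit;
	return gap > 0 ? gap : 0;
}
```

Fed the two lines from the report above, `unaccounted_promotions()` returns the full pgpromote_success value, because neither candidate counter covers those pages. With the new counter present, the gap is expected to shrink to zero.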
>>>>>
>>>>> Perhaps PGPROMOTE_CANDIDATE_NOLIMIT is not well named; please comment if
>>>>> you have a better idea.
>>>> Yes. Naming is hard. I guess that the name comes from the promotion
>>>> that isn't rate limited. I asked DeepSeek what a good abbreviation for
>>>> "not rate limited" would be. Its answer was "NRL". I don't know
>>>> whether it's good. However, "NOT_RATE_LIMITED" seems too long.
>>>
>>> "NRL" sounds good to me.
>>>
>>> I'm thinking of another one: since it's not rate limited, it could be
>>> migrated quickly/fast. How about PGPROMOTE_CANDIDATE_FAST?
>> This sounds good to me. Thanks!
>
> Gemini 2.5 gave me a more radical name for it:
>
> /*
> * Candidate pages for promotion based on hint fault latency. This counter
> * is used by the feedback mechanism to control the promotion rate and
> * adjust the hot threshold.
> */
> PGPROMOTE_CANDIDATE,
> /*
> * Pages promoted aggressively to a fast-tier node when it has sufficient
> * free space. These promotions bypass the regular hotness checks and do
> * NOT influence the promotion rate-limiter or threshold-adjustment logic.
> * This is for statistics/monitoring purposes.
> */
> PGPROMOTED_AGGRESSIVE,
>
> I think this one is concise and easy to understand with the
> comments. What do you think? If this one is not appropriate, then I
> will go with "_NRL" as you suggested.
In fact, we still count candidate pages here. Although there's enough
free space in the target node, the promotion may still fail, for example
because of an elevated refcount.
---
Best Regards,
Huang, Ying
[snip]
Thread overview: 8+ messages
2025-07-22 14:16 [PATCH RFC v3] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE counting Ruan Shiyang
2025-07-23 3:09 ` Huang, Ying
2025-07-24 2:39 ` Shiyang Ruan
2025-07-24 7:36 ` Huang, Ying
2025-07-25 2:20 ` Shiyang Ruan
2025-07-25 6:39 ` Huang, Ying
2025-07-24 3:35 ` Zhijian Li (Fujitsu)
2025-07-24 7:35 ` Huang, Ying