* [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Li Zhijian @ 2025-06-19 7:52 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, y-goto, Li Zhijian, Huang Ying, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.
The issue appears under memory pressure: once top-tier memory (DRAM) is
exhausted by memhog, new allocations fall back to lower-tier memory (CXL).
After memhog is terminated, the stats show:
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 1
This patch also increments PGPROMOTE_CANDIDATE in the free-space branch,
where a promotion is approved without going through
numa_promotion_rate_limit(). Because the rate limit is computed from the
same counter, this changes rate-limit behavior: the limit is now reached
more easily than before.
For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request
Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
This is marked as RFC because I am uncertain whether the current behavior
was intended or was an oversight.
However, the current situation, where pgpromote_candidate < pgpromote_success,
is confusing when the counters are read literally.
Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..4715cd4fa248 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
struct pglist_data *pgdat;
unsigned long rate_limit;
unsigned int latency, th, def_th;
+ long nr = folio_nr_pages(folio)
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
return true;
}
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
if (latency >= th)
return false;
- return !numa_promotion_rate_limit(pgdat, rate_limit,
- folio_nr_pages(folio));
+ return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
}
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
--
2.43.5
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: kernel test robot @ 2025-06-19 22:06 UTC (permalink / raw)
To: Li Zhijian; +Cc: oe-kbuild-all
Hi Li,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memory-tiering-Fix-PGPROMOTE_CANDIDATE-accounting/20250619-155351
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250619075245.3272384-1-lizhijian%40fujitsu.com
patch subject: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
config: x86_64-buildonly-randconfig-005-20250620 (https://download.01.org/0day-ci/archive/20250620/202506200524.r3rTtqLQ-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250620/202506200524.r3rTtqLQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506200524.r3rTtqLQ-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'should_numa_migrate_memory':
>> kernel/sched/fair.c:1945:17: error: expected ',' or ';' before 'pgdat'
1945 | pgdat = NODE_DATA(dst_nid);
| ^~~~~
vim +1945 kernel/sched/fair.c
c959924b0dc53b Ying Huang 2022-07-13 1921
8c9ae56dc73b5a Kefeng Wang 2023-09-21 1922 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
10f39042711ba2 Rik van Riel 2014-01-27 1923 int src_nid, int dst_cpu)
10f39042711ba2 Rik van Riel 2014-01-27 1924 {
cb361d8cdef699 Jann Horn 2019-07-16 1925 struct numa_group *ng = deref_curr_numa_group(p);
10f39042711ba2 Rik van Riel 2014-01-27 1926 int dst_nid = cpu_to_node(dst_cpu);
10f39042711ba2 Rik van Riel 2014-01-27 1927 int last_cpupid, this_cpupid;
10f39042711ba2 Rik van Riel 2014-01-27 1928
3fb43636876d98 Byungchul Park 2024-02-19 1929 /*
3fb43636876d98 Byungchul Park 2024-02-19 1930 * Cannot migrate to memoryless nodes.
3fb43636876d98 Byungchul Park 2024-02-19 1931 */
3fb43636876d98 Byungchul Park 2024-02-19 1932 if (!node_state(dst_nid, N_MEMORY))
3fb43636876d98 Byungchul Park 2024-02-19 1933 return false;
3fb43636876d98 Byungchul Park 2024-02-19 1934
33024536bafd91 Ying Huang 2022-07-13 1935 /*
33024536bafd91 Ying Huang 2022-07-13 1936 * The pages in slow memory node should be migrated according
33024536bafd91 Ying Huang 2022-07-13 1937 * to hot/cold instead of private/shared.
33024536bafd91 Ying Huang 2022-07-13 1938 */
2a28713a67fd28 Zi Yan 2024-07-24 1939 if (folio_use_access_time(folio)) {
33024536bafd91 Ying Huang 2022-07-13 1940 struct pglist_data *pgdat;
c959924b0dc53b Ying Huang 2022-07-13 1941 unsigned long rate_limit;
c959924b0dc53b Ying Huang 2022-07-13 1942 unsigned int latency, th, def_th;
675e22ff5b390e Li Zhijian 2025-06-19 1943 long nr = folio_nr_pages(folio)
33024536bafd91 Ying Huang 2022-07-13 1944
33024536bafd91 Ying Huang 2022-07-13 @1945 pgdat = NODE_DATA(dst_nid);
c959924b0dc53b Ying Huang 2022-07-13 1946 if (pgdat_free_space_enough(pgdat)) {
c959924b0dc53b Ying Huang 2022-07-13 1947 /* workload changed, reset hot threshold */
c959924b0dc53b Ying Huang 2022-07-13 1948 pgdat->nbp_threshold = 0;
675e22ff5b390e Li Zhijian 2025-06-19 1949 mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
33024536bafd91 Ying Huang 2022-07-13 1950 return true;
c959924b0dc53b Ying Huang 2022-07-13 1951 }
33024536bafd91 Ying Huang 2022-07-13 1952
c959924b0dc53b Ying Huang 2022-07-13 1953 def_th = sysctl_numa_balancing_hot_threshold;
c959924b0dc53b Ying Huang 2022-07-13 1954 rate_limit = sysctl_numa_balancing_promote_rate_limit << \
c959924b0dc53b Ying Huang 2022-07-13 1955 (20 - PAGE_SHIFT);
c959924b0dc53b Ying Huang 2022-07-13 1956 numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
c959924b0dc53b Ying Huang 2022-07-13 1957
c959924b0dc53b Ying Huang 2022-07-13 1958 th = pgdat->nbp_threshold ? : def_th;
8c9ae56dc73b5a Kefeng Wang 2023-09-21 1959 latency = numa_hint_fault_latency(folio);
33024536bafd91 Ying Huang 2022-07-13 1960 if (latency >= th)
33024536bafd91 Ying Huang 2022-07-13 1961 return false;
33024536bafd91 Ying Huang 2022-07-13 1962
675e22ff5b390e Li Zhijian 2025-06-19 1963 return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
33024536bafd91 Ying Huang 2022-07-13 1964 }
33024536bafd91 Ying Huang 2022-07-13 1965
10f39042711ba2 Rik van Riel 2014-01-27 1966 this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
1b143cc77f2074 Kefeng Wang 2023-10-18 1967 last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
37355bdc5a1298 Mel Gorman 2018-10-01 1968
33024536bafd91 Ying Huang 2022-07-13 1969 if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
33024536bafd91 Ying Huang 2022-07-13 1970 !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
33024536bafd91 Ying Huang 2022-07-13 1971 return false;
33024536bafd91 Ying Huang 2022-07-13 1972
37355bdc5a1298 Mel Gorman 2018-10-01 1973 /*
37355bdc5a1298 Mel Gorman 2018-10-01 1974 * Allow first faults or private faults to migrate immediately early in
37355bdc5a1298 Mel Gorman 2018-10-01 1975 * the lifetime of a task. The magic number 4 is based on waiting for
37355bdc5a1298 Mel Gorman 2018-10-01 1976 * two full passes of the "multi-stage node selection" test that is
37355bdc5a1298 Mel Gorman 2018-10-01 1977 * executed below.
37355bdc5a1298 Mel Gorman 2018-10-01 1978 */
98fa15f34cb379 Anshuman Khandual 2019-03-05 1979 if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
37355bdc5a1298 Mel Gorman 2018-10-01 1980 (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
37355bdc5a1298 Mel Gorman 2018-10-01 1981 return true;
10f39042711ba2 Rik van Riel 2014-01-27 1982
10f39042711ba2 Rik van Riel 2014-01-27 1983 /*
10f39042711ba2 Rik van Riel 2014-01-27 1984 * Multi-stage node selection is used in conjunction with a periodic
10f39042711ba2 Rik van Riel 2014-01-27 1985 * migration fault to build a temporal task<->page relation. By using
10f39042711ba2 Rik van Riel 2014-01-27 1986 * a two-stage filter we remove short/unlikely relations.
10f39042711ba2 Rik van Riel 2014-01-27 1987 *
10f39042711ba2 Rik van Riel 2014-01-27 1988 * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
10f39042711ba2 Rik van Riel 2014-01-27 1989 * a task's usage of a particular page (n_p) per total usage of this
10f39042711ba2 Rik van Riel 2014-01-27 1990 * page (n_t) (in a given time-span) to a probability.
10f39042711ba2 Rik van Riel 2014-01-27 1991 *
10f39042711ba2 Rik van Riel 2014-01-27 1992 * Our periodic faults will sample this probability and getting the
10f39042711ba2 Rik van Riel 2014-01-27 1993 * same result twice in a row, given these samples are fully
10f39042711ba2 Rik van Riel 2014-01-27 1994 * independent, is then given by P(n)^2, provided our sample period
10f39042711ba2 Rik van Riel 2014-01-27 1995 * is sufficiently short compared to the usage pattern.
10f39042711ba2 Rik van Riel 2014-01-27 1996 *
10f39042711ba2 Rik van Riel 2014-01-27 1997 * This quadric squishes small probabilities, making it less likely we
10f39042711ba2 Rik van Riel 2014-01-27 1998 * act on an unlikely task<->page relation.
10f39042711ba2 Rik van Riel 2014-01-27 1999 */
10f39042711ba2 Rik van Riel 2014-01-27 2000 if (!cpupid_pid_unset(last_cpupid) &&
10f39042711ba2 Rik van Riel 2014-01-27 2001 cpupid_to_nid(last_cpupid) != dst_nid)
10f39042711ba2 Rik van Riel 2014-01-27 2002 return false;
10f39042711ba2 Rik van Riel 2014-01-27 2003
10f39042711ba2 Rik van Riel 2014-01-27 2004 /* Always allow migrate on private faults */
10f39042711ba2 Rik van Riel 2014-01-27 2005 if (cpupid_match_pid(p, last_cpupid))
10f39042711ba2 Rik van Riel 2014-01-27 2006 return true;
10f39042711ba2 Rik van Riel 2014-01-27 2007
10f39042711ba2 Rik van Riel 2014-01-27 2008 /* A shared fault, but p->numa_group has not been set up yet. */
10f39042711ba2 Rik van Riel 2014-01-27 2009 if (!ng)
10f39042711ba2 Rik van Riel 2014-01-27 2010 return true;
10f39042711ba2 Rik van Riel 2014-01-27 2011
10f39042711ba2 Rik van Riel 2014-01-27 2012 /*
4142c3ebb685bb Rik van Riel 2016-01-25 2013 * Destination node is much more heavily used than the source
4142c3ebb685bb Rik van Riel 2016-01-25 2014 * node? Allow migration.
10f39042711ba2 Rik van Riel 2014-01-27 2015 */
4142c3ebb685bb Rik van Riel 2016-01-25 2016 if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) *
4142c3ebb685bb Rik van Riel 2016-01-25 2017 ACTIVE_NODE_FRACTION)
10f39042711ba2 Rik van Riel 2014-01-27 2018 return true;
10f39042711ba2 Rik van Riel 2014-01-27 2019
10f39042711ba2 Rik van Riel 2014-01-27 2020 /*
4142c3ebb685bb Rik van Riel 2016-01-25 2021 * Distribute memory according to CPU & memory use on each node,
4142c3ebb685bb Rik van Riel 2016-01-25 2022 * with 3/4 hysteresis to avoid unnecessary memory migrations:
4142c3ebb685bb Rik van Riel 2016-01-25 2023 *
4142c3ebb685bb Rik van Riel 2016-01-25 2024 * faults_cpu(dst) 3 faults_cpu(src)
4142c3ebb685bb Rik van Riel 2016-01-25 2025 * --------------- * - > ---------------
4142c3ebb685bb Rik van Riel 2016-01-25 2026 * faults_mem(dst) 4 faults_mem(src)
10f39042711ba2 Rik van Riel 2014-01-27 2027 */
4142c3ebb685bb Rik van Riel 2016-01-25 2028 return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
4142c3ebb685bb Rik van Riel 2016-01-25 2029 group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
10f39042711ba2 Rik van Riel 2014-01-27 2030 }
10f39042711ba2 Rik van Riel 2014-01-27 2031
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Zhijian Li (Fujitsu) @ 2025-06-20 2:04 UTC (permalink / raw)
To: kernel test robot; +Cc: oe-kbuild-all@lists.linux.dev
Thanks for the report. I will update it.
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Philip Li @ 2025-06-20 2:22 UTC (permalink / raw)
To: Zhijian Li (Fujitsu); +Cc: kernel test robot, oe-kbuild-all@lists.linux.dev
On Fri, Jun 20, 2025 at 02:04:46AM +0000, Zhijian Li (Fujitsu) wrote:
> Thanks for the report. I will update it.
Hi Zhijian, you are welcome :)
> > 4142c3ebb685bb Rik van Riel 2016-01-25 2029 group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> > 10f39042711ba2 Rik van Riel 2014-01-27 2030 }
> > 10f39042711ba2 Rik van Riel 2014-01-27 2031
> >
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
2025-06-19 7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
2025-06-19 22:06 ` kernel test robot
@ 2025-06-20 6:28 ` Huang, Ying
2025-06-23 8:54 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2025-06-20 6:28 UTC (permalink / raw)
To: Li Zhijian
Cc: linux-mm, akpm, linux-kernel, y-goto, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider
Li Zhijian <lizhijian@fujitsu.com> writes:
> Goto-san reported confusing pgpromote statistics where
> the pgpromote_success count significantly exceeded pgpromote_candidate.
> The issue manifests under specific memory pressure conditions:
> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
> in lower-tier memory (CXL). After terminating memhog, the stats show:
The above description is confusing. The page promotion occurs when the
size of the top-tier free space is large enough (after killing the
memhog above). The accessed lower-tier memory will be promoted upon
accessing to take full advantage of the more expensive top-tier memory.
> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 1
>
> This update increments PGPROMOTE_CANDIDATE within the free space branch
> when a promotion decision is made, which may alter the mechanism of the
> rate limit. Consequently, it becomes easier to reach the rate limit than
> it was previously.
>
> For example:
> Rate Limit = 100 pages/sec
> Scenario:
> T0: 90 free-space migrations
> T0+100ms: 20-page migration request
>
> Before:
> Rate limit is *not* reached: 0 + 20 = 20 < 100
> PGPROMOTE_CANDIDATE: 20
> After:
> Rate limit is reached: 90 + 20 = 110 > 100
> PGPROMOTE_CANDIDATE: 110
Yes. The rate limit will be influenced by the change. So, more tests
may be needed to verify that it does not incur regressions.
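The quoted before/after arithmetic can be replayed with a small user-space model. This is only a sketch under stated assumptions: `struct node_model`, `promote_free_space()` and `promote_rate_limited()` are hypothetical names, not kernel code, the one-second window reset is left out because both events fall inside the same window, and the comparison direction (limit reached once the in-window candidate count is at or above the limit) is an assumption about the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-node state; not the kernel's struct pglist_data. */
struct node_model {
	unsigned long candidate;   /* models PGPROMOTE_CANDIDATE */
	unsigned long window_base; /* candidate count when the current window began */
	bool count_free_space;     /* true models the behaviour with the patch applied */
};

/* Free-space branch: always promotes; only the patched variant counts the pages. */
static bool promote_free_space(struct node_model *n, long nr)
{
	assert(nr > 0);
	if (n->count_free_space)
		n->candidate += (unsigned long)nr;
	return true;
}

/*
 * Rate-limited branch: count the pages first, then allow the promotion only
 * while the candidates counted in this window stay below rate_limit.
 */
static bool promote_rate_limited(struct node_model *n,
				 unsigned long rate_limit, long nr)
{
	assert(nr > 0);
	n->candidate += (unsigned long)nr;
	return n->candidate - n->window_base < rate_limit;
}
```

With a limit of 100, calling `promote_free_space()` for 90 pages and then `promote_rate_limited()` for 20 pages allows the second request in the unpatched model (20 counted) and rejects it in the patched model (110 counted), matching the example quoted above.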
>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>
> This is marked as RFC because I am uncertain whether we originally
> intended for this or if it was overlooked.
>
> However, the current situation where pgpromote_candidate < pgpromote_success
> is indeed confusing when interpreted literally.
>
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> ---
> kernel/sched/fair.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..4715cd4fa248 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> struct pglist_data *pgdat;
> unsigned long rate_limit;
> unsigned int latency, th, def_th;
> + long nr = folio_nr_pages(folio)
>
> pgdat = NODE_DATA(dst_nid);
> if (pgdat_free_space_enough(pgdat)) {
> /* workload changed, reset hot threshold */
> pgdat->nbp_threshold = 0;
> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
> return true;
> }
>
> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> if (latency >= th)
> return false;
>
> - return !numa_promotion_rate_limit(pgdat, rate_limit,
> - folio_nr_pages(folio));
> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
> }
>
> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
---
Best Regards,
Huang, Ying
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
2025-06-20 6:28 ` Huang, Ying
@ 2025-06-23 8:54 ` Zhijian Li (Fujitsu)
2025-06-24 2:46 ` Huang, Ying
0 siblings, 1 reply; 7+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-06-23 8:54 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Yasunori Gotou (Fujitsu),
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, kernel test robot
On 20/06/2025 14:28, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
>
>> Goto-san reported confusing pgpromote statistics where
>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>> The issue manifests under specific memory pressure conditions:
>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>
> The above description is confusing. The page promotion occurs when the
> size of the top-tier free space is large enough (after killing the
> memhog above). The accessed lower-tier memory will be promoted upon
> accessing to take full advantage of the more expensive top-tier memory.
Yeah, that's what the promotion does.
Let's clarify the reproducer steps specifically (thanks to Goto-san for the reproducer):
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing
# After a few seconds, we observe `pgpromote_candidate < pgpromote_success`
In this scenario, after terminating the first memhog, the conditions for pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.
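The accounting gap described here can be illustrated with a toy counter model. It is a sketch, not kernel code: `struct promo_stats`, `account_free_space()` and `account_rate_limited()` are hypothetical names, every promotion is assumed to succeed, and the page counts used in the note below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the two /proc/vmstat counters discussed in the thread. */
struct promo_stats {
	unsigned long candidate; /* models pgpromote_candidate */
	unsigned long success;   /* models pgpromote_success */
};

/*
 * Promotion taken through the free-space branch. In the current code only
 * the success counter moves; with the patch the candidate counter moves too.
 * Migration is assumed to succeed for every page.
 */
static void account_free_space(struct promo_stats *s, long nr, bool patched)
{
	assert(nr > 0);
	if (patched)
		s->candidate += (unsigned long)nr;
	s->success += (unsigned long)nr;
}

/* Promotion taken through the hot-threshold and rate-limit path. */
static void account_rate_limited(struct promo_stats *s, long nr)
{
	assert(nr > 0);
	s->candidate += (unsigned long)nr;
	s->success += (unsigned long)nr;
}
```

For instance, one page through `account_rate_limited()` and eight pages through `account_free_space()` with `patched` set to false leaves candidate at 1 and success at 9, the same shape as the reported statistics; with `patched` set to true both counters end at 9.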
>
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 1
>>
>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>> when a promotion decision is made, which may alter the mechanism of the
>> rate limit. Consequently, it becomes easier to reach the rate limit than
>> it was previously.
>>
>> For example:
>> Rate Limit = 100 pages/sec
>> Scenario:
>> T0: 90 free-space migrations
>> T0+100ms: 20-page migration request
>>
>> Before:
>> Rate limit is *not* reached: 0 + 20 = 20 < 100
>> PGPROMOTE_CANDIDATE: 20
>> After:
>> Rate limit is reached: 90 + 20 = 110 > 100
>> PGPROMOTE_CANDIDATE: 110
>
> Yes. The rate limit will be influenced by the change. So, more tests
> may be needed to verify that it does not incur regressions.
Testing this might be challenging due to workload dependencies. Do you have any recommended workloads for evaluation?
Alternatively, could we rely on the LKP project for impact assessment? (The current patch has not really been tested
by LKP due to a compilation error; I will post a V2 soon.)
However, regarding the rate limit change itself, I consider this patch logically correct. As stated in the numa_promotion_rate_limit() comment:
> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
It seems there is no justification for excluding pgdat_free_space_enough() triggered promotions from the rate limiting mechanism.
>
>>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>
>> This is marked as RFC because I am uncertain whether we originally
>> intended for this or if it was overlooked.
>>
>> However, the current situation where pgpromote_candidate < pgpromote_success
>> is indeed confusing when interpreted literally.
>>
>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> ---
>> kernel/sched/fair.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..4715cd4fa248 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>> struct pglist_data *pgdat;
>> unsigned long rate_limit;
>> unsigned int latency, th, def_th;
>> + long nr = folio_nr_pages(folio)
Cc LKP
There is a compilation error which I overlooked at the time due to several ongoing refactors in
my local code. I appreciate LKP for detecting this issue.
Thanks
Zhijian
>>
>> pgdat = NODE_DATA(dst_nid);
>> if (pgdat_free_space_enough(pgdat)) {
>> /* workload changed, reset hot threshold */
>> pgdat->nbp_threshold = 0;
>> + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>> return true;
>> }
>>
>> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>> if (latency >= th)
>> return false;
>>
>> - return !numa_promotion_rate_limit(pgdat, rate_limit,
>> - folio_nr_pages(folio));
>> + return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>> }
>>
>> this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>
> ---
> Best Regards,
> Huang, Ying
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
2025-06-23 8:54 ` Zhijian Li (Fujitsu)
@ 2025-06-24 2:46 ` Huang, Ying
0 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2025-06-24 2:46 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Yasunori Gotou (Fujitsu),
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, kernel test robot
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> On 20/06/2025 14:28, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>
>>> Goto-san reported confusing pgpromote statistics where
>>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>>> The issue manifests under specific memory pressure conditions:
>>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>>
>> The above description is confusing. The page promotion occurs when the
>> size of the top-tier free space is large enough (after killing the
>> memhog above). The accessed lower-tier memory will be promoted upon
>> accessing to take full advantage of the more expensive top-tier memory.
>
> Yeah, that's what the promotion does.
>
> Let's clarify the reproducer steps specifically (thanks to Goto-san for the reproducer):
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>
> # Enable demotion only
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> numactl -m 0-1 memhog -r200 3500M >/dev/null &
> pid=$!
> sleep 2
> numactl memhog -r100 2500M >/dev/null &
> sleep 10
> kill -9 $pid
> # Enable promotion
> echo 2 > /proc/sys/kernel/numa_balancing
>
> # After a few seconds, we observe `pgpromote_candidate < pgpromote_success`
>
> In this scenario, after terminating the first memhog, the conditions
> for pgdat_free_space_enough() are quickly met, triggering promotion.
> However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.
Yes. This is the expected behavior of the current implementation.
>
>>
>>> $ grep -e pgpromote /proc/vmstat
>>> pgpromote_success 2579
>>> pgpromote_candidate 1
>>>
>>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>>> when a promotion decision is made, which may alter the mechanism of the
>>> rate limit. Consequently, it becomes easier to reach the rate limit than
>>> it was previously.
>>>
>>> For example:
>>> Rate Limit = 100 pages/sec
>>> Scenario:
>>> T0: 90 free-space migrations
>>> T0+100ms: 20-page migration request
>>>
>>> Before:
>>> Rate limit is *not* reached: 0 + 20 = 20 < 100
>>> PGPROMOTE_CANDIDATE: 20
>>> After:
>>> Rate limit is reached: 90 + 20 = 110 > 100
>>> PGPROMOTE_CANDIDATE: 110
>>
>> Yes. The rate limit will be influenced by the change. So, more tests
>> may be needed to verify that it does not incur regressions.
>
>
> Testing this might be challenging due to workload dependencies. Do you
> have any recommended workloads for evaluation?
Some in-memory databases should be good workloads, for example, Redis.
> Alternatively, could we rely on the LKP project for impact assessment? (The current patch has not really been tested
> by LKP due to a compilation error; I will post a V2 soon.)
LKP has some basic workloads to test this, for example, pmbench with
a Gauss-ih access pattern.
> However, regarding the rate limit change itself, I consider this patch
> logically correct. As stated in the numa_promotion_rate_limit()
> comment:
>> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
> It seems there is no justification for excluding
> pgdat_free_space_enough() triggered promotions from the rate limiting
> mechanism.
In fact, we don't rate limit promotion if there is enough free space in
fast memory, so that the fast memory is filled quickly. I think this is
necessary to stop the fast memory from being under-utilized ASAP.
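The branch order this reply relies on, as visible in the quoted diff context, can be sketched as a pure decision function. The enum and function names below are hypothetical and the inputs are plain values rather than kernel state; the point is only that the free-space check comes first and bypasses both the hot threshold and the rate limit.

```c
#include <stdbool.h>

/* Hypothetical labels for the possible outcomes of the promotion check. */
enum promo_decision {
	PROMOTE_FREE_SPACE,  /* enough free space on the fast node: promote at once */
	REJECT_NOT_HOT,      /* access latency is at or above the hot threshold */
	REJECT_RATE_LIMITED, /* hot page, but the promotion rate limit was reached */
	PROMOTE_HOT          /* hot page and still within the rate limit */
};

/*
 * Mirrors the order of checks in the quoted hunk: free space first, then the
 * latency threshold, then the rate limit.
 */
static enum promo_decision decide_promotion(bool free_space_enough,
					    unsigned int latency,
					    unsigned int th,
					    bool rate_limited)
{
	if (free_space_enough)
		return PROMOTE_FREE_SPACE;
	if (latency >= th)
		return REJECT_NOT_HOT;
	if (rate_limited)
		return REJECT_RATE_LIMITED;
	return PROMOTE_HOT;
}
```

In this model a cold page on a node that is already over its rate limit is still promoted whenever `free_space_enough` is true, which is the behaviour defended above as the way to fill fast memory quickly.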
>
>
>>
>>>
>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
[snip]
---
Best Regards,
Huang, Ying
Thread overview: 7+ messages
2025-06-19 7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
2025-06-19 22:06 ` kernel test robot
2025-06-20 2:04 ` Zhijian Li (Fujitsu)
2025-06-20 2:22 ` Philip Li
2025-06-20 6:28 ` Huang, Ying
2025-06-23 8:54 ` Zhijian Li (Fujitsu)
2025-06-24 2:46 ` Huang, Ying