* [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
@ 2025-06-19  7:52 Li Zhijian
  2025-06-19 22:06 ` kernel test robot
  2025-06-20  6:28 ` Huang, Ying
  0 siblings, 2 replies; 7+ messages in thread
From: Li Zhijian @ 2025-06-19  7:52 UTC
  To: linux-mm
  Cc: akpm, linux-kernel, y-goto, Li Zhijian, Huang Ying, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.
The issue manifests under specific memory pressure conditions:
when top-tier memory (DRAM) is exhausted by memhog and allocation begins
in lower-tier memory (CXL). After terminating memhog, the stats show:

$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 1

This update increments PGPROMOTE_CANDIDATE within the free space branch
when a promotion decision is made, which may alter the mechanism of the
rate limit. Consequently, it becomes easier to reach the rate limit than
it was previously.

For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request

Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110

Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---

This is marked as RFC because I am uncertain whether we originally
intended this or if it was overlooked.

However, the current situation where pgpromote_candidate < pgpromote_success
is indeed confusing when interpreted literally.

Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..4715cd4fa248 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio)
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
 			return true;
 		}
 
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-- 
2.43.5
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-19  7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
@ 2025-06-19 22:06 ` kernel test robot
  2025-06-20  2:04   ` Zhijian Li (Fujitsu)
  2025-06-20  6:28 ` Huang, Ying
  1 sibling, 1 reply; 7+ messages in thread
From: kernel test robot @ 2025-06-19 22:06 UTC
  To: Li Zhijian; +Cc: oe-kbuild-all

Hi Li,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memory-tiering-Fix-PGPROMOTE_CANDIDATE-accounting/20250619-155351
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250619075245.3272384-1-lizhijian%40fujitsu.com
patch subject: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
config: x86_64-buildonly-randconfig-005-20250620 (https://download.01.org/0day-ci/archive/20250620/202506200524.r3rTtqLQ-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250620/202506200524.r3rTtqLQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506200524.r3rTtqLQ-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'should_numa_migrate_memory':
>> kernel/sched/fair.c:1945:17: error: expected ',' or ';' before 'pgdat'
    1945 |                 pgdat = NODE_DATA(dst_nid);
         |                 ^~~~~


vim +1945 kernel/sched/fair.c

c959924b0dc53b Ying Huang        2022-07-13  1921  
8c9ae56dc73b5a Kefeng Wang       2023-09-21  1922  bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
10f39042711ba2 Rik van Riel      2014-01-27  1923  				int src_nid, int dst_cpu)
10f39042711ba2 Rik van Riel      2014-01-27  1924  {
cb361d8cdef699 Jann Horn         2019-07-16  1925  	struct numa_group *ng = deref_curr_numa_group(p);
10f39042711ba2 Rik van Riel      2014-01-27  1926  	int dst_nid = cpu_to_node(dst_cpu);
10f39042711ba2 Rik van Riel      2014-01-27  1927  	int last_cpupid, this_cpupid;
10f39042711ba2 Rik van Riel      2014-01-27  1928  
3fb43636876d98 Byungchul Park    2024-02-19  1929  	/*
3fb43636876d98 Byungchul Park    2024-02-19  1930  	 * Cannot migrate to memoryless nodes.
3fb43636876d98 Byungchul Park    2024-02-19  1931  	 */
3fb43636876d98 Byungchul Park    2024-02-19  1932  	if (!node_state(dst_nid, N_MEMORY))
3fb43636876d98 Byungchul Park    2024-02-19  1933  		return false;
3fb43636876d98 Byungchul Park    2024-02-19  1934  
33024536bafd91 Ying Huang        2022-07-13  1935  	/*
33024536bafd91 Ying Huang        2022-07-13  1936  	 * The pages in slow memory node should be migrated according
33024536bafd91 Ying Huang        2022-07-13  1937  	 * to hot/cold instead of private/shared.
33024536bafd91 Ying Huang        2022-07-13  1938  	 */
2a28713a67fd28 Zi Yan            2024-07-24  1939  	if (folio_use_access_time(folio)) {
33024536bafd91 Ying Huang        2022-07-13  1940  		struct pglist_data *pgdat;
c959924b0dc53b Ying Huang        2022-07-13  1941  		unsigned long rate_limit;
c959924b0dc53b Ying Huang        2022-07-13  1942  		unsigned int latency, th, def_th;
675e22ff5b390e Li Zhijian        2025-06-19  1943  		long nr = folio_nr_pages(folio)
33024536bafd91 Ying Huang        2022-07-13  1944  
33024536bafd91 Ying Huang        2022-07-13 @1945  		pgdat = NODE_DATA(dst_nid);
c959924b0dc53b Ying Huang        2022-07-13  1946  		if (pgdat_free_space_enough(pgdat)) {
c959924b0dc53b Ying Huang        2022-07-13  1947  			/* workload changed, reset hot threshold */
c959924b0dc53b Ying Huang        2022-07-13  1948  			pgdat->nbp_threshold = 0;
675e22ff5b390e Li Zhijian        2025-06-19  1949  			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
33024536bafd91 Ying Huang        2022-07-13  1950  			return true;
c959924b0dc53b Ying Huang        2022-07-13  1951  		}
33024536bafd91 Ying Huang        2022-07-13  1952  
c959924b0dc53b Ying Huang        2022-07-13  1953  		def_th = sysctl_numa_balancing_hot_threshold;
c959924b0dc53b Ying Huang        2022-07-13  1954  		rate_limit = sysctl_numa_balancing_promote_rate_limit << \
c959924b0dc53b Ying Huang        2022-07-13  1955  			(20 - PAGE_SHIFT);
c959924b0dc53b Ying Huang        2022-07-13  1956  		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
c959924b0dc53b Ying Huang        2022-07-13  1957  
c959924b0dc53b Ying Huang        2022-07-13  1958  		th = pgdat->nbp_threshold ? : def_th;
8c9ae56dc73b5a Kefeng Wang       2023-09-21  1959  		latency = numa_hint_fault_latency(folio);
33024536bafd91 Ying Huang        2022-07-13  1960  		if (latency >= th)
33024536bafd91 Ying Huang        2022-07-13  1961  			return false;
33024536bafd91 Ying Huang        2022-07-13  1962  
675e22ff5b390e Li Zhijian        2025-06-19  1963  		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
33024536bafd91 Ying Huang        2022-07-13  1964  	}
33024536bafd91 Ying Huang        2022-07-13  1965  
10f39042711ba2 Rik van Riel      2014-01-27  1966  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
1b143cc77f2074 Kefeng Wang       2023-10-18  1967  	last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
37355bdc5a1298 Mel Gorman        2018-10-01  1968  
33024536bafd91 Ying Huang        2022-07-13  1969  	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
33024536bafd91 Ying Huang        2022-07-13  1970  	    !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
33024536bafd91 Ying Huang        2022-07-13  1971  		return false;
33024536bafd91 Ying Huang        2022-07-13  1972  
37355bdc5a1298 Mel Gorman        2018-10-01  1973  	/*
37355bdc5a1298 Mel Gorman        2018-10-01  1974  	 * Allow first faults or private faults to migrate immediately early in
37355bdc5a1298 Mel Gorman        2018-10-01  1975  	 * the lifetime of a task. The magic number 4 is based on waiting for
37355bdc5a1298 Mel Gorman        2018-10-01  1976  	 * two full passes of the "multi-stage node selection" test that is
37355bdc5a1298 Mel Gorman        2018-10-01  1977  	 * executed below.
37355bdc5a1298 Mel Gorman        2018-10-01  1978  	 */
98fa15f34cb379 Anshuman Khandual 2019-03-05  1979  	if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
37355bdc5a1298 Mel Gorman        2018-10-01  1980  	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
37355bdc5a1298 Mel Gorman        2018-10-01  1981  		return true;
10f39042711ba2 Rik van Riel      2014-01-27  1982  
10f39042711ba2 Rik van Riel      2014-01-27  1983  	/*
10f39042711ba2 Rik van Riel      2014-01-27  1984  	 * Multi-stage node selection is used in conjunction with a periodic
10f39042711ba2 Rik van Riel      2014-01-27  1985  	 * migration fault to build a temporal task<->page relation. By using
10f39042711ba2 Rik van Riel      2014-01-27  1986  	 * a two-stage filter we remove short/unlikely relations.
10f39042711ba2 Rik van Riel      2014-01-27  1987  	 *
10f39042711ba2 Rik van Riel      2014-01-27  1988  	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
10f39042711ba2 Rik van Riel      2014-01-27  1989  	 * a task's usage of a particular page (n_p) per total usage of this
10f39042711ba2 Rik van Riel      2014-01-27  1990  	 * page (n_t) (in a given time-span) to a probability.
10f39042711ba2 Rik van Riel      2014-01-27  1991  	 *
10f39042711ba2 Rik van Riel      2014-01-27  1992  	 * Our periodic faults will sample this probability and getting the
10f39042711ba2 Rik van Riel      2014-01-27  1993  	 * same result twice in a row, given these samples are fully
10f39042711ba2 Rik van Riel      2014-01-27  1994  	 * independent, is then given by P(n)^2, provided our sample period
10f39042711ba2 Rik van Riel      2014-01-27  1995  	 * is sufficiently short compared to the usage pattern.
10f39042711ba2 Rik van Riel      2014-01-27  1996  	 *
10f39042711ba2 Rik van Riel      2014-01-27  1997  	 * This quadric squishes small probabilities, making it less likely we
10f39042711ba2 Rik van Riel      2014-01-27  1998  	 * act on an unlikely task<->page relation.
10f39042711ba2 Rik van Riel      2014-01-27  1999  	 */
10f39042711ba2 Rik van Riel      2014-01-27  2000  	if (!cpupid_pid_unset(last_cpupid) &&
10f39042711ba2 Rik van Riel      2014-01-27  2001  				cpupid_to_nid(last_cpupid) != dst_nid)
10f39042711ba2 Rik van Riel      2014-01-27  2002  		return false;
10f39042711ba2 Rik van Riel      2014-01-27  2003  
10f39042711ba2 Rik van Riel      2014-01-27  2004  	/* Always allow migrate on private faults */
10f39042711ba2 Rik van Riel      2014-01-27  2005  	if (cpupid_match_pid(p, last_cpupid))
10f39042711ba2 Rik van Riel      2014-01-27  2006  		return true;
10f39042711ba2 Rik van Riel      2014-01-27  2007  
10f39042711ba2 Rik van Riel      2014-01-27  2008  	/* A shared fault, but p->numa_group has not been set up yet. */
10f39042711ba2 Rik van Riel      2014-01-27  2009  	if (!ng)
10f39042711ba2 Rik van Riel      2014-01-27  2010  		return true;
10f39042711ba2 Rik van Riel      2014-01-27  2011  
10f39042711ba2 Rik van Riel      2014-01-27  2012  	/*
4142c3ebb685bb Rik van Riel      2016-01-25  2013  	 * Destination node is much more heavily used than the source
4142c3ebb685bb Rik van Riel      2016-01-25  2014  	 * node? Allow migration.
10f39042711ba2 Rik van Riel      2014-01-27  2015  	 */
4142c3ebb685bb Rik van Riel      2016-01-25  2016  	if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) *
4142c3ebb685bb Rik van Riel      2016-01-25  2017  					ACTIVE_NODE_FRACTION)
10f39042711ba2 Rik van Riel      2014-01-27  2018  		return true;
10f39042711ba2 Rik van Riel      2014-01-27  2019  
10f39042711ba2 Rik van Riel      2014-01-27  2020  	/*
4142c3ebb685bb Rik van Riel      2016-01-25  2021  	 * Distribute memory according to CPU & memory use on each node,
4142c3ebb685bb Rik van Riel      2016-01-25  2022  	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
4142c3ebb685bb Rik van Riel      2016-01-25  2023  	 *
4142c3ebb685bb Rik van Riel      2016-01-25  2024  	 * faults_cpu(dst)   3   faults_cpu(src)
4142c3ebb685bb Rik van Riel      2016-01-25  2025  	 * --------------- * - > ---------------
4142c3ebb685bb Rik van Riel      2016-01-25  2026  	 * faults_mem(dst)   4   faults_mem(src)
10f39042711ba2 Rik van Riel      2014-01-27  2027  	 */
4142c3ebb685bb Rik van Riel      2016-01-25  2028  	return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
4142c3ebb685bb Rik van Riel      2016-01-25  2029  	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
10f39042711ba2 Rik van Riel      2014-01-27  2030  }
10f39042711ba2 Rik van Riel      2014-01-27  2031  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-19 22:06 ` kernel test robot
@ 2025-06-20  2:04   ` Zhijian Li (Fujitsu)
  2025-06-20  2:22     ` Philip Li
  0 siblings, 1 reply; 7+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-06-20  2:04 UTC
  To: kernel test robot; +Cc: oe-kbuild-all@lists.linux.dev

Thanks for the report. I will update it.
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-20  2:04   ` Zhijian Li (Fujitsu)
@ 2025-06-20  2:22     ` Philip Li
  0 siblings, 0 replies; 7+ messages in thread
From: Philip Li @ 2025-06-20  2:22 UTC
  To: Zhijian Li (Fujitsu); +Cc: kernel test robot, oe-kbuild-all@lists.linux.dev

On Fri, Jun 20, 2025 at 02:04:46AM +0000, Zhijian Li (Fujitsu) wrote:
> Thanks for the report. I will update it.

Hi Zhijian, you are welcome :)
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-19  7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
  2025-06-19 22:06 ` kernel test robot
@ 2025-06-20  6:28 ` Huang, Ying
  2025-06-23  8:54   ` Zhijian Li (Fujitsu)
  1 sibling, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2025-06-20  6:28 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-mm, akpm, linux-kernel, y-goto, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider

Li Zhijian <lizhijian@fujitsu.com> writes:

> Goto-san reported confusing pgpromote statistics where
> the pgpromote_success count significantly exceeded pgpromote_candidate.
> The issue manifests under specific memory pressure conditions:
> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
> in lower-tier memory (CXL). After terminating memhog, the stats show:

The above description is confusing.  The page promotion occurs when the
size of the top-tier free space is large enough (after killing the
memhog above).  The accessed lower-tier memory will be promoted upon
access to take full advantage of the more expensive top-tier memory.

> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 1
>
> This update increments PGPROMOTE_CANDIDATE within the free space branch
> when a promotion decision is made, which may alter the mechanism of the
> rate limit. Consequently, it becomes easier to reach the rate limit than
> it was previously.
>
> For example:
> Rate Limit = 100 pages/sec
> Scenario:
>   T0: 90 free-space migrations
>   T0+100ms: 20-page migration request
>
> Before:
>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>   PGPROMOTE_CANDIDATE: 20
> After:
>   Rate limit is reached: 90 + 20 = 110 > 100
>   PGPROMOTE_CANDIDATE: 110

Yes.  The rate limit will be influenced by the change.  So, more tests
may be needed to verify it will not incur regressions.

>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>
> This is markes as RFC because I am uncertain whether we originally
> intended for this or if it was overlooked.
>
> However, the current situation where pgpromote_candidate < pgpromote_success
> is indeed confusing when interpreted literally.
>
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..4715cd4fa248 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		struct pglist_data *pgdat;
>  		unsigned long rate_limit;
>  		unsigned int latency, th, def_th;
> +		long nr = folio_nr_pages(folio)
>
>  		pgdat = NODE_DATA(dst_nid);
>  		if (pgdat_free_space_enough(pgdat)) {
>  			/* workload changed, reset hot threshold */
>  			pgdat->nbp_threshold = 0;
> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>  			return true;
>  		}
>
> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		if (latency >= th)
>  			return false;
>
> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
> -						  folio_nr_pages(folio));
> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>  	}
>
>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-20  6:28 ` Huang, Ying
@ 2025-06-23  8:54   ` Zhijian Li (Fujitsu)
  2025-06-24  2:46     ` Huang, Ying
  0 siblings, 1 reply; 7+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-06-23  8:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, Yasunori Gotou (Fujitsu),
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel test robot

On 20/06/2025 14:28, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
>
>> Goto-san reported confusing pgpromote statistics where
>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>> The issue manifests under specific memory pressure conditions:
>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>
> The above description is confusing.  The page promotion occurs when the
> size of the top-tier free space is large enough (after killing the
> memhog above).  The accessed lower-tier memory will be promoted upon
> accessing to take full advantage of the more expensive top-tier memory.

Yeah, that's what the promotion does.

Let's clarify the reproducer steps specifically (thanks Goto-san for the reproducer):
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):

# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

# After a few seconds, we observe `pgpromote_candidate < pgpromote_success`

In this scenario, after terminating the first memhog, the conditions
for pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.

>
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 1
>>
>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>> when a promotion decision is made, which may alter the mechanism of the
>> rate limit. Consequently, it becomes easier to reach the rate limit than
>> it was previously.
>>
>> For example:
>> Rate Limit = 100 pages/sec
>> Scenario:
>>   T0: 90 free-space migrations
>>   T0+100ms: 20-page migration request
>>
>> Before:
>>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>>   PGPROMOTE_CANDIDATE: 20
>> After:
>>   Rate limit is reached: 90 + 20 = 110 > 100
>>   PGPROMOTE_CANDIDATE: 110
>
> Yes.  The rate limit will be influenced by the change.  So, more tests
> may be needed to verify it will not incur regressions.

Testing this might be challenging due to workload dependencies. Do you
have any recommended workloads for evaluation?
Alternatively, could we rely on the LKP project for impact assessment? (The current patch has not really been tested
by LKP due to a compile error; I will post a V2 soon.)

However, regarding the rate limit change itself, I consider this patch
logically correct. As stated in the numa_promotion_rate_limit()
comment:
> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
It seems there is no justification for excluding
pgdat_free_space_enough() triggered promotions from the rate limiting
mechanism.

>
>>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>
>> This is markes as RFC because I am uncertain whether we originally
>> intended for this or if it was overlooked.
>>
>> However, the current situation where pgpromote_candidate < pgpromote_success
>> is indeed confusing when interpreted literally.
>>
>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  kernel/sched/fair.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..4715cd4fa248 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>  		struct pglist_data *pgdat;
>>  		unsigned long rate_limit;
>>  		unsigned int latency, th, def_th;
>> +		long nr = folio_nr_pages(folio)

Cc LKP

There is a compilation error which I overlooked at the time due to several ongoing refactors in my local code.
I appreciate LKP for detecting this issue.

Thanks
Zhijian

>>
>>  		pgdat = NODE_DATA(dst_nid);
>>  		if (pgdat_free_space_enough(pgdat)) {
>>  			/* workload changed, reset hot threshold */
>>  			pgdat->nbp_threshold = 0;
>> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>>  			return true;
>>  		}
>>
>> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>  		if (latency >= th)
>>  			return false;
>>
>> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
>> -						  folio_nr_pages(folio));
>> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>  	}
>>
>>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>
> ---
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 7+ messages in thread
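The symptom in the reproducer above is checked by eye with grep on /proc/vmstat. As a minimal sketch only, the same comparison could be scripted in C as below. The helper names (vmstat_lookup, promote_stats_inconsistent, check_vmstat_file) are hypothetical and exist only for this illustration; the sketch assumes the usual "name value" line layout of /proc/vmstat and nothing else about the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" pair in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 * (Hypothetical helper, not a kernel or libc interface.)
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *value)
{
	size_t len;
	const char *line = text;

	assert(text && name && value);
	len = strlen(name);

	while (line && *line) {
		/* Require a space after the name so longer names do not match. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Returns 1 if pgpromote_success exceeds pgpromote_candidate,
 * 0 if it does not, and -1 if either counter is missing.
 */
int promote_stats_inconsistent(const char *text)
{
	unsigned long long success, candidate;

	if (vmstat_lookup(text, "pgpromote_success", &success) ||
	    vmstat_lookup(text, "pgpromote_candidate", &candidate))
		return -1;
	return success > candidate;
}

/* Reads a vmstat-style file (for example /proc/vmstat) and runs the check. */
int check_vmstat_file(const char *path)
{
	static char buf[65536];
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return promote_stats_inconsistent(buf);
}
```

Calling check_vmstat_file("/proc/vmstat") a few seconds after the last step of the reproducer would be expected to return 1 on an unpatched kernel, matching the 2579 versus 1 output quoted above; whether it does on a given system depends on the memory layout and timing.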
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-23  8:54   ` Zhijian Li (Fujitsu)
@ 2025-06-24  2:46     ` Huang, Ying
  0 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2025-06-24  2:46 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, Yasunori Gotou (Fujitsu),
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel test robot

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 20/06/2025 14:28, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>
>>> Goto-san reported confusing pgpromote statistics where
>>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>>> The issue manifests under specific memory pressure conditions:
>>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>>
>> The above description is confusing.  The page promotion occurs when the
>> size of the top-tier free space is large enough (after killing the
>> memhog above).  The accessed lower-tier memory will be promoted upon
>> accessing to take full advantage of the more expensive top-tier memory.
>
> Yeah, that's what the promotion does.
>
> Let's clarify the reproducer steps specifically (thanks Goto-san for the reproducer):
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>
> # Enable demotion only
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> numactl -m 0-1 memhog -r200 3500M >/dev/null &
> pid=$!
> sleep 2
> numactl memhog -r100 2500M >/dev/null &
> sleep 10
> kill -9 $pid
> # Enable promotion
> echo 2 > /proc/sys/kernel/numa_balancing
>
> # After a few seconds, we observe `pgpromote_candidate < pgpromote_success`
>
> In this scenario, after terminating the first memhog, the conditions
> for pgdat_free_space_enough() are quickly met, triggering promotion.
> However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.

Yes.  This is the expected behavior of the current implementation.

>
>>
>>> $ grep -e pgpromote /proc/vmstat
>>> pgpromote_success 2579
>>> pgpromote_candidate 1
>>>
>>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>>> when a promotion decision is made, which may alter the mechanism of the
>>> rate limit. Consequently, it becomes easier to reach the rate limit than
>>> it was previously.
>>>
>>> For example:
>>> Rate Limit = 100 pages/sec
>>> Scenario:
>>>   T0: 90 free-space migrations
>>>   T0+100ms: 20-page migration request
>>>
>>> Before:
>>>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>>>   PGPROMOTE_CANDIDATE: 20
>>> After:
>>>   Rate limit is reached: 90 + 20 = 110 > 100
>>>   PGPROMOTE_CANDIDATE: 110
>>
>> Yes.  The rate limit will be influenced by the change.  So, more tests
>> may be needed to verify it will not incur regressions.
>
>
> Testing this might be challenging due to workload dependencies. Do you
> have any recommended workloads for evaluation?

Some in-memory databases should be good workloads, for example, redis.

> Alternatively, could we rely on the LKP project for impact assessment? (The current patch has not really been tested
> by LKP due to a compile error; I will post a V2 soon.)

LKP has some basic workloads to test this, for example, pmbench with
Gauss-ih access pattern.

> However, regarding the rate limit change itself, I consider this patch
> logically correct. As stated in the numa_promotion_rate_limit()
> comment:
>> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
> It seems there is no justification for excluding
> pgdat_free_space_enough() triggered promotions from the rate limiting
> mechanism.

In fact, we don't rate limit promotion if there is enough free space
on fast memory, so that the fast memory can be filled quickly.  I think
this is necessary so that the fast memory stops being under-utilized
ASAP.

>
>
>>
>>>
>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

[snip]

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 7+ messages in thread
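The "Before/After" arithmetic quoted in this thread can be replayed with a small userspace model. This is only a sketch under stated assumptions: struct node_model and its fields are hypothetical stand-ins, the one-second window and the "delta >= limit" test are simplifications chosen to reproduce the thread's numbers, and nothing here is the kernel's numa_promotion_rate_limit() itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-node state under discussion. */
struct node_model {
	unsigned long candidate;    /* models the PGPROMOTE_CANDIDATE counter */
	unsigned long window_start; /* start of the current rate window, in ms */
	unsigned long window_base;  /* candidate count when the window started */
};

/*
 * Free-space path: the promotion always goes ahead. With the RFC patch
 * applied ("patched"), the pages are also added to the candidate counter.
 */
void free_space_promote(struct node_model *n, unsigned long nr, bool patched)
{
	assert(n);
	if (patched)
		n->candidate += nr;
}

/*
 * Rate-limited path: count the request, start a new window once more than
 * a second has passed, and report whether the pages counted in the current
 * window have reached the limit. Returns true when the request is throttled.
 */
bool rate_limited(struct node_model *n, unsigned long now_ms,
		  unsigned long limit, unsigned long nr)
{
	assert(n);
	n->candidate += nr;
	if (now_ms - n->window_start > 1000) {
		n->window_start = now_ms;
		n->window_base = n->candidate;
	}
	return n->candidate - n->window_base >= limit;
}

/* Replays the thread's example: 90 free-space pages at T0, 20 pages at T0+100ms. */
bool example_hits_limit(bool patched)
{
	struct node_model n = { 0, 0, 0 };

	free_space_promote(&n, 90, patched);
	return rate_limited(&n, 100, 100, 20);
}
```

In this model example_hits_limit(false) stays under the limit (0 + 20 = 20) while example_hits_limit(true) reaches it (90 + 20 = 110), which is the side effect on the rate limit being debated: counting free-space promotions as candidates also charges them against the window that throttles later promotions.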
end of thread, other threads:[~2025-06-24  2:47 UTC | newest]

Thread overview: 7+ messages
2025-06-19  7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
2025-06-19 22:06 ` kernel test robot
2025-06-20  2:04   ` Zhijian Li (Fujitsu)
2025-06-20  2:22     ` Philip Li
2025-06-20  6:28 ` Huang, Ying
2025-06-23  8:54   ` Zhijian Li (Fujitsu)
2025-06-24  2:46     ` Huang, Ying