All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<dave.hansen@intel.com>, <gourry@gourry.net>,
	<hannes@cmpxchg.org>, <mgorman@techsingularity.net>,
	<mingo@redhat.com>, <peterz@infradead.org>,
	<raghavendra.kt@amd.com>, <riel@surriel.com>,
	<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
	<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
	<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
	<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
	<akpm@linux-foundation.org>, <david@redhat.com>,
	<byungchul@sk.com>, <kinseyho@google.com>,
	<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
	<balbirs@nvidia.com>, <alok.rathore@samsung.com>
Subject: Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
Date: Fri, 3 Oct 2025 13:38:18 +0100	[thread overview]
Message-ID: <20251003133818.000017af@huawei.com> (raw)
In-Reply-To: <20250910144653.212066-9-bharata@amd.com>

On Wed, 10 Sep 2025 20:16:53 +0530
Bharata B Rao <bharata@amd.com> wrote:

> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> mode of NUMA Balancing) does hot page detection (via hint faults),
> hot page classification and eventual promotion, all by itself and
> sits within the scheduler.
> 
> With the new hot page tracking and promotion mechanism being
> available, NUMA Balancing can limit itself to detection of
> hot pages (via hint faults) and off-load rest of the
> functionality to the common hot page tracking system.
> 
> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> hot page info. In addition, the migration rate limiting and
> dynamic threshold logic are moved to kpromoted so that the same
> can be used for hot pages reported by other sources too.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>

Making a direct replacement without any fallback to previous method
is going to need a lot of data to show there are no important regressions.

So bold move if that's the intent! 

J
> ---
>  include/linux/pghot.h |   2 +
>  kernel/sched/fair.c   | 149 ++----------------------------------------
>  mm/memory.c           |  32 ++-------
>  mm/pghot.c            | 132 +++++++++++++++++++++++++++++++++++--
>  4 files changed, 142 insertions(+), 173 deletions(-)
> 

> diff --git a/mm/pghot.c b/mm/pghot.c
> index 9f7581818b8f..9f5746892bce 100644
> --- a/mm/pghot.c
> +++ b/mm/pghot.c
> @@ -9,6 +9,9 @@
>   *
>   * kpromoted is a kernel thread that runs on each toptier node and
>   * promotes pages from max_heap.
> + *
> + * Migration rate-limiting and dynamic threshold logic implementations
> + * were moved from NUMA Balancing mode 2.
>   */
>  #include <linux/pghot.h>
>  #include <linux/kthread.h>
> @@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
>  
>  static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>  
> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;

If the comment correlates with the value, this is 64 GiB/s?  That seems
unlikely if I guess possible.

> +
>  #ifdef CONFIG_SYSCTL
>  static const struct ctl_table pghot_sysctls[] = {
>  	{
> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ZERO,
>  	},
> +	{
> +		.procname	= "pghot_promote_rate_limit_MBps",
> +		.data		= &sysctl_pghot_promote_rate_limit,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +	},
>  };
>  #endif
> +
Put that in earlier patch to reduce noise here.

>  static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
>  {
>  	return (*(struct pghot_info **)lhs)->frequency >
> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
>  	return true;
>  }
>  
> +/*
> + * For memory tiering mode, if there are enough free pages (more than
> + * enough watermark defined here) in fast memory node, to take full

I'd use enough_wmark   Just because "more than enough" is a common
English phrase and I at least tripped over that sentence as a result!

> + * advantage of fast memory capacity, all recently accessed slow
> + * memory pages will be migrated to fast memory node without
> + * considering hot threshold.
> + */
> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> +{
> +	int z;
> +	unsigned long enough_wmark;
> +
> +	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
> +			   pgdat->node_present_pages >> 4);
> +	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
> +		struct zone *zone = pgdat->node_zones + z;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (zone_watermark_ok(zone, 0,
> +				      promo_wmark_pages(zone) + enough_wmark,
> +				      ZONE_MOVABLE, 0))
> +			return true;
> +	}
> +	return false;
> +}

> +
> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,

Needs documentation of the algorithm and the reasons for various choices.

I see it is a code move though so maybe that's a job for another day.

> +						 unsigned long rate_limit,
> +						 unsigned int ref_th,
> +						 unsigned long now)
> +{
> +	unsigned int start, th_period, unit_th, th;
> +	unsigned long nr_cand, ref_cand, diff_cand;
> +
> +	now = jiffies_to_msecs(now);
> +	th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
> +	start = pgdat->nbp_th_start;
> +	if (now - start > th_period &&
> +	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
> +		ref_cand = rate_limit *
> +			KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
> +		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
> +		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
> +		unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
> +		th = pgdat->nbp_threshold ? : ref_th;
> +		if (diff_cand > ref_cand * 11 / 10)
> +			th = max(th - unit_th, unit_th);
> +		else if (diff_cand < ref_cand * 9 / 10)
> +			th = min(th + unit_th, ref_th * 2);
> +		pgdat->nbp_th_nr_cand = nr_cand;
> +		pgdat->nbp_threshold = th;
> +	}
> +}
 +
>  static bool phi_is_pfn_hot(struct pghot_info *phi)
>  {
>  	struct page *page = pfn_to_online_page(phi->pfn);
> -	unsigned long now = jiffies;
>  	struct folio *folio;
> +	struct pglist_data *pgdat;
> +	unsigned long rate_limit;
> +	unsigned int latency, th, def_th;
> +	unsigned long now = jiffies;
>  
Avoid the reorder.  Just put it here in first place if you prefer this.





  reply	other threads:[~2025-10-03 12:38 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36   ` Jonathan Cameron
2025-10-03 11:02     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17   ` Jonathan Cameron
2025-10-06  4:13     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19   ` Jonathan Cameron
2025-10-06  4:28     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22   ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30   ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38   ` Jonathan Cameron [this message]
2025-10-06  5:57     ` Bharata B Rao
2025-10-06  9:53       ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01   ` Gregory Price
2025-09-16 19:45     ` David Rientjes
2025-09-16 22:02       ` Gregory Price
2025-09-17  0:30       ` Wei Xu
2025-09-17  3:20         ` Balbir Singh
2025-09-17  4:15           ` Bharata B Rao
2025-09-17 16:49         ` Jonathan Cameron
2025-09-25 14:03           ` Yiannis Nikolakopoulos
2025-09-25 14:41             ` Gregory Price
2025-10-16 11:48               ` Yiannis Nikolakopoulos
2025-09-25 15:00             ` Jonathan Cameron
2025-09-25 15:08               ` Gregory Price
2025-09-25 15:18                 ` Gregory Price
2025-09-25 15:24                 ` Jonathan Cameron
2025-09-25 16:06                   ` Gregory Price
2025-09-25 17:23                     ` Jonathan Cameron
2025-09-25 19:02                       ` Gregory Price
2025-10-01  7:22                         ` Gregory Price
2025-10-17  9:53                           ` Yiannis Nikolakopoulos
2025-10-17 14:15                             ` Gregory Price
2025-10-17 14:36                               ` Jonathan Cameron
2025-10-17 14:59                                 ` Gregory Price
2025-10-20 14:05                                   ` Jonathan Cameron
2025-10-21 18:52                                     ` Gregory Price
2025-10-21 18:57                                       ` Gregory Price
2025-10-22  9:09                                         ` Jonathan Cameron
2025-10-22 15:05                                           ` Gregory Price
2025-10-23 15:29                                             ` Jonathan Cameron
2025-10-16 16:16               ` Yiannis Nikolakopoulos
2025-10-20 14:23                 ` Jonathan Cameron
2025-10-20 15:05                   ` Gregory Price
2025-10-08 17:59       ` Vinicius Petrucci

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251003133818.000017af@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=alok.rathore@samsung.com \
    --cc=balbirs@nvidia.com \
    --cc=bharata@amd.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kinseyho@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=nifan.cxl@gmail.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=sj@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=xuezhengchu@huawei.com \
    --cc=yiannis@zptcorp.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.