Re: [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node

From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: <AneeshKumar.KizhakeVeetil@arm.com>, <Michael.Day@amd.com>,
	<akpm@linux-foundation.org>, <bharata@amd.com>,
	<dave.hansen@intel.com>, <david@redhat.com>,
	<dongjoo.linux.dev@gmail.com>, <feng.tang@intel.com>,
	<gourry@gourry.net>, <hannes@cmpxchg.org>, <honggyu.kim@sk.com>,
	<hughd@google.com>, <jhubbard@nvidia.com>, <jon.grimm@amd.com>,
	<k.shutemov@gmail.com>, <kbusch@meta.com>,
	<kmanaouil.dev@gmail.com>, <leesuyeon0506@gmail.com>,
	<leillc@google.com>, <liam.howlett@oracle.com>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<mgorman@techsingularity.net>, <mingo@redhat.com>,
	<nadav.amit@gmail.com>, <nphamcs@gmail.com>,
	<peterz@infradead.org>, <riel@surriel.com>, <rientjes@google.com>,
	<rppt@kernel.org>, <santosh.shukla@amd.com>, <shivankg@amd.com>,
	<shy828301@gmail.com>, <sj@kernel.org>, <vbabka@suse.cz>,
	<weixugc@google.com>, <willy@infradead.org>,
	<ying.huang@linux.alibaba.com>, <ziy@nvidia.com>,
	<dave@stgolabs.net>, <yuanchu@google.com>, <kinseyho@google.com>,
	<hdanton@sina.com>, <harry.yoo@oracle.com>
Subject: Re: [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node
Date: Fri, 3 Oct 2025 11:04:53 +0100
Message-ID: <20251003110453.00007ca6@huawei.com>
In-Reply-To: <20250814153307.1553061-11-raghavendra.kt@amd.com>

On Thu, 14 Aug 2025 15:33:00 +0000
Raghavendra K T <raghavendra.kt@amd.com> wrote:

> One of the key challenges in PTE A bit based scanning is to find right
> target node to promote to.
> 
> Here is a simple heuristic based approach:
>  1. While scanning pages of any mm, also scan toptier pages that belong
> to that mm.
>  2. Accumulate the insight on the distribution of active pages on
> toptier nodes.
>  3. Walk all the top-tier nodes and pick the one with highest accesses.
> 
>  This method tries to consolidate application to a single node.
Nothing new in the following comment as we've discussed it before, but just
to keep everything together:

So for the pathological case of a task that has moved after its initial
allocations are done, this is effectively relying on conventional NUMA balancing
to ensure we don't keep promoting to the wrong node?

That makes me a little nervous. I guess the proof of this will be in mass
testing though. Maybe it works well enough - I have no idea yet!

A few comments inline.

Jonathan


> 
> TBD: Create a list of preferred nodes for fallback when highest access
>  node is nearly full.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>

> +/* Per memory node information used to caclulate target_node for migration */

calculate

> +struct kscand_nodeinfo {
> +	unsigned long nr_scanned;
> +	unsigned long nr_accessed;
> +	int node;
> +	bool is_toptier;
> +};
> +
>  /*
>   * Data structure passed to control scanning and also collect
I'd drop "also". The "and" implies that already.
>   * per memory node information

Wrap closer to 80 chars.  Also missing full stop.

>   */
>  struct kscand_scanctrl {
>  	struct list_head scan_list;
> +	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
>  	unsigned long address;
> +	unsigned long nr_to_scan;
>  };
>  
>  struct kscand_scanctrl kscand_scanctrl;
> @@ -218,15 +229,129 @@ static void kmigrated_wait_work(void)
>  			migrate_sleep_jiffies);
>  }
>  
> -/*
> - * Do not know what info to pass in the future to make
> - * decision on taget node. Keep it void * now.

Wrong patch to be raising this in, but "target".

> - */
> +static unsigned long get_slowtier_accesed(struct kscand_scanctrl *scanctrl)

accessed

> +{
> +	int node;
> +	unsigned long accessed = 0;
> +
> +	for_each_node_state(node, N_MEMORY) {
> +		if (!node_is_toptier(node) && scanctrl->nodeinfo[node])
> +			accessed += scanctrl->nodeinfo[node]->nr_accessed;
> +	}
> +	return accessed;
> +}
> +
> +static inline unsigned long get_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni)
> +{
> +	return ni->nr_accessed;
> +}
> +
> +static inline void set_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni, unsigned long val)
> +{
> +	ni->nr_accessed = val;
> +}
> +
> +static inline unsigned long get_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
> +{
> +	return ni->nr_scanned;
> +}
> +
> +static inline void set_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni, unsigned long val)
> +{
> +	ni->nr_scanned = val;
> +}

These helpers seem unnecessary given they are static, so we have full visibility
of the structure wherever they are called anyway.

Perhaps they get more complex in later patches though (in which case ignore this comment!)

> +
> +static inline void reset_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
> +{
> +	set_nodeinfo_nr_scanned(ni, 0);
> +}
> +
> +static inline void reset_nodeinfo(struct kscand_nodeinfo *ni)
> +{
> +	set_nodeinfo_nr_scanned(ni, 0);
> +	set_nodeinfo_nr_accessed(ni, 0);
> +}
> +
> +static void init_one_nodeinfo(struct kscand_nodeinfo *ni, int node)
> +{
> +	ni->nr_scanned = 0;
> +	ni->nr_accessed = 0;
> +	ni->node = node;
> +	ni->is_toptier = node_is_toptier(node) ? true : false;
	ni->is_toptier = node_is_toptier(node);

> +}
> +
> +static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
> +{
> +	struct kscand_nodeinfo *ni;
> +
> +	ni = kzalloc(sizeof(*ni), GFP_KERNEL);
> +
> +	if (!ni)
> +		return NULL;
> +
> +	init_one_nodeinfo(ni, node);
As this is only done in one place, I'd just do it inline:
	*ni = (struct kscand_nodeinfo) {
		.node = node,
		.is_toptier = node_is_toptier(node),
	};

Can set the zeros explicitly if you think that acts as useful documentation.

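To make the compound-literal suggestion above concrete, here is a minimal
userspace sketch. It is only an illustration: calloc() stands in for kzalloc(),
node_is_toptier_stub() is a made-up stand-in for node_is_toptier(), and the
assert() is my own addition. Members not named in the initializer are zeroed,
so the two counters need no explicit assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct kscand_nodeinfo {
	unsigned long nr_scanned;
	unsigned long nr_accessed;
	int node;
	bool is_toptier;
};

/* Hypothetical stand-in for node_is_toptier(): pretend nodes 0 and 1 are top tier. */
static bool node_is_toptier_stub(int node)
{
	return node == 0 || node == 1;
}

static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
{
	struct kscand_nodeinfo *ni;

	assert(node >= 0);

	/* calloc() is the userspace stand-in for kzalloc(..., GFP_KERNEL). */
	ni = calloc(1, sizeof(*ni));
	if (!ni)
		return NULL;

	/* nr_scanned and nr_accessed are not named, so they are set to zero. */
	*ni = (struct kscand_nodeinfo) {
		.node = node,
		.is_toptier = node_is_toptier_stub(node),
	};

	return ni;
}
```
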
> +
> +	return ni;
> +}
> +
> +/* TBD: Handle errors */
> +static void init_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	struct kscand_nodeinfo *ni;
Trivial: I'd move this into the for_each_node scope.


> +	int node;
> +
> +	for_each_node(node) {
i.e.
		struct kscand_nodeinfo *ni = alloc_one_nodeinfo(node);

> +		ni = alloc_one_nodeinfo(node);

If this isn't going to get a lot more complex, I'd squash the alloc_one_nodeinfo()
code in here and drop the helper. Up to you though, as this is a trade-off between
modularity and compact code.

> +		if (!ni)
> +			WARN_ON_ONCE(ni);
> +		scanctrl->nodeinfo[node] = ni;
> +	}
> +}
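
For reference, a rough userspace sketch of what init_scanctrl() might look like
with the allocation squashed in and the "TBD: Handle errors" filled in. Treat
all of this as assumption rather than a proposal for the final code: the
-ENOMEM return and the unwind loop are my guess at the error handling,
MAX_NUMNODES is shrunk to 4, a plain loop replaces for_each_node(), calloc()
replaces kzalloc(), and node_is_toptier_stub() is a made-up stand-in.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_NUMNODES 4	/* illustrative only; the kernel value is config dependent */

struct kscand_nodeinfo {
	unsigned long nr_scanned;
	unsigned long nr_accessed;
	int node;
	bool is_toptier;
};

struct kscand_scanctrl {
	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
};

/* Hypothetical stand-in for node_is_toptier(): nodes 0 and 1 are top tier. */
static bool node_is_toptier_stub(int node)
{
	return node < 2;
}

static void free_scanctrl(struct kscand_scanctrl *scanctrl)
{
	for (int node = 0; node < MAX_NUMNODES; node++) {
		free(scanctrl->nodeinfo[node]);
		scanctrl->nodeinfo[node] = NULL;
	}
}

/* Returns 0 on success, -ENOMEM if any per-node allocation fails. */
static int init_scanctrl(struct kscand_scanctrl *scanctrl)
{
	assert(scanctrl != NULL);

	for (int node = 0; node < MAX_NUMNODES; node++) {
		struct kscand_nodeinfo *ni = calloc(1, sizeof(*ni));

		if (!ni) {
			/* Unwind only the entries allocated so far. */
			while (node--) {
				free(scanctrl->nodeinfo[node]);
				scanctrl->nodeinfo[node] = NULL;
			}
			return -ENOMEM;
		}

		*ni = (struct kscand_nodeinfo) {
			.node = node,
			.is_toptier = node_is_toptier_stub(node),
		};
		scanctrl->nodeinfo[node] = ni;
	}

	return 0;
}
```

With an error return, callers no longer need to cope with NULL entries for
nodes whose allocation failed, at the cost of having to check the result.
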
> +
> +static void reset_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	int node;
> +
> +	for_each_node_state(node, N_MEMORY)
> +		reset_nodeinfo(scanctrl->nodeinfo[node]);
> +
> +	/* XXX: Not rellay required? */
> +	scanctrl->nr_to_scan = kscand_scan_size;
> +}
> +
> +static void free_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	int node;
> +
> +	for_each_node(node)
> +		kfree(scanctrl->nodeinfo[node]);
> +}
> +
>  static int kscand_get_target_node(void *data)
>  {
>  	return kscand_target_node;
>  }
>  
> +static int get_target_node(struct kscand_scanctrl *scanctrl)
> +{
> +	int node, target_node = NUMA_NO_NODE;
> +	unsigned long prev = 0;
> +
> +	for_each_node(node) {
> +		if (node_is_toptier(node) && scanctrl->nodeinfo[node]) {

Probably flip sense of one or more of the if statements just to reduce indent.

		if (!node_is_toptier(node) || !scanctrl->nodeinfo[node])
			continue;

etc.


> +			/* This creates a fallback migration node list */
> +			if (get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]) > prev) {
> +				prev = get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]);

Maybe a local variable given use in check and here.

> +				target_node = node;
> +			}
> +		}
> +	}
> +	if (target_node == NUMA_NO_NODE)
> +		target_node = kscand_get_target_node(NULL);
> +
> +	return target_node;
> +}
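
Putting the two comments on get_target_node() together (flipped condition plus
a local variable), something along these lines. This is a userspace sketch
under stated assumptions, not a drop-in replacement: MAX_NUMNODES is shrunk to
4, a plain loop replaces for_each_node(), the cached ni->is_toptier is used
where the patch calls node_is_toptier(node), and the fallback_node parameter
stands in for the kscand_get_target_node(NULL) call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NUMNODES 4		/* illustrative only */
#define NUMA_NO_NODE (-1)

struct kscand_nodeinfo {
	unsigned long nr_scanned;
	unsigned long nr_accessed;
	int node;
	bool is_toptier;
};

struct kscand_scanctrl {
	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
};

/* Pick the top-tier node with the most accesses, else fall back. */
static int get_target_node(const struct kscand_scanctrl *scanctrl,
			   int fallback_node)
{
	int target_node = NUMA_NO_NODE;
	unsigned long prev = 0;

	assert(scanctrl != NULL);

	for (int node = 0; node < MAX_NUMNODES; node++) {
		const struct kscand_nodeinfo *ni = scanctrl->nodeinfo[node];

		/* Early continue keeps the comparison at one indent level. */
		if (!ni || !ni->is_toptier)
			continue;

		if (ni->nr_accessed > prev) {
			prev = ni->nr_accessed;
			target_node = node;
		}
	}

	if (target_node == NUMA_NO_NODE)
		target_node = fallback_node;

	return target_node;
}
```

As in the patch, a top-tier node with zero recorded accesses is never selected,
so an idle scan falls through to the fallback.
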
> +
>  extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  					unsigned long nr_migrate_pages);
>  
> @@ -495,6 +620,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
>  	page_idle_clear_pte_refs(page, pte, walk);
>  	srcnid = folio_nid(folio);
>  
> +	scanctrl->nodeinfo[srcnid]->nr_scanned++;
> +	if (scanctrl->nr_to_scan)
> +		scanctrl->nr_to_scan--;
> +
> +	if (!scanctrl->nr_to_scan) {
> +		folio_put(folio);
> +		return 1;
> +	}
>  
>  	if (!folio_test_lru(folio)) {
>  		folio_put(folio);
> @@ -502,13 +635,17 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
>  	}
>  
>  	if (!kscand_eligible_srcnid(srcnid)) {
> +		if (folio_test_young(folio) || folio_test_referenced(folio)
> +				|| pte_young(pteval)) {
Unusual wrap position. I'd move the || to the line above and align pte_young()
after the ( on that line.
> +			scanctrl->nodeinfo[srcnid]->nr_accessed++;
> +		}
>  		folio_put(folio);

> +	/* Either Scan 25% of scan_size or cover vma size of scan_size */
> +	kscand_scanctrl.nr_to_scan =	mm_slot_scan_size >> PAGE_SHIFT;

Trivial but I'm not sure what you are forcing alignment for here.  I'd stick
to one space after =

> +	/* Reduce actual amount of pages scanned */
> +	kscand_scanctrl.nr_to_scan =	mm_slot_scan_size >> 1;

If my eyes aren't tricking me, this sets the value then immediately replaces it
with something else. Is that the intent?

> +
> +	/* XXX: skip scanning to avoid duplicates until all migrations done? */
>  	kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
>  
>  	for_each_vma(vmi, vma) {
>  		kscand_walk_page_vma(vma, &kscand_scanctrl);
>  		vma_scanned_size += vma->vm_end - vma->vm_start;
>  
> -		if (vma_scanned_size >= kscand_scan_size) {
> +		if (vma_scanned_size >= mm_slot_scan_size ||
> +					!kscand_scanctrl.nr_to_scan) {
>  			next_mm = true;
>  
>  			if (!list_empty(&kscand_scanctrl.scan_list)) {
>  				if (!kmigrated_mm_slot)
>  					kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
> +				/* Add scanned folios to migration list */
>  				spin_lock(&kmigrated_mm_slot->migrate_lock);
> +
>  				list_splice_tail_init(&kscand_scanctrl.scan_list,
>  						&kmigrated_mm_slot->migrate_head);
>  				spin_unlock(&kmigrated_mm_slot->migrate_lock);
> +				break;
>  			}
> -			break;
> +		}
> +		if (!list_empty(&kscand_scanctrl.scan_list)) {
> +			if (!kmigrated_mm_slot)
> +				kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
> +			spin_lock(&kmigrated_mm_slot->migrate_lock);

Use of guard() in these might be a useful readability improvement.

> +			list_splice_tail_init(&kscand_scanctrl.scan_list,
> +					&kmigrated_mm_slot->migrate_head);
> +			spin_unlock(&kmigrated_mm_slot->migrate_lock);

This code block is identical to the one just above, and that one breaks out to
run this. Do we need them both? Or is there some subtle difference my eyes are
jumping over?
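
To illustrate both points (scope-based unlock and a single shared helper), here
is a toy userspace sketch. In the kernel I'd expect this to be
guard(spinlock)(&kmigrated_mm_slot->migrate_lock) from <linux/cleanup.h>, which
is built on the same compiler cleanup attribute used below. Everything else is
invented for the example: toy_lock, TOY_GUARD(), struct migrate_slot and
queue_scanned() are hypothetical names, and an integer count stands in for the
list splice.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: a flag instead of a real spinlock, so no threads are needed. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handler: called with a pointer to the guard variable. */
static void toy_guard_release(struct toy_lock **lp)
{
	toy_lock_release(*lp);
}

/*
 * Take the lock now, drop it when the enclosing scope ends (GCC/Clang).
 * The fixed variable name allows one guard per scope, and the macro
 * argument is evaluated twice, so pass a side-effect-free expression.
 */
#define TOY_GUARD(lock)							\
	struct toy_lock *toy_guard_					\
		__attribute__((cleanup(toy_guard_release), unused)) =	\
		(toy_lock_acquire(lock), (lock))

struct migrate_slot {
	struct toy_lock migrate_lock;
	int nr_queued;		/* stands in for migrate_head */
};

/* One helper that both call sites could share. */
static void queue_scanned(struct migrate_slot *slot, int *nr_scanned)
{
	if (!*nr_scanned)
		return;

	TOY_GUARD(&slot->migrate_lock);
	slot->nr_queued += *nr_scanned;	/* stands in for list_splice_tail_init() */
	*nr_scanned = 0;
}
```
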


>  		}
>  	}
>
