linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v0 0/2] Batch migration for NUMA balancing
@ 2025-05-21  8:02 Bharata B Rao
  2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-21  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david, Bharata B Rao

Hi,

This is an attempt to convert NUMA balancing to do batched
migration instead of migrating one folio at a time. The basic
idea is to collect (from the hint fault handler) the folios to be
migrated in a list and batch-migrate them from task_work context.
More details about the specifics are present in patch 2/2.

During LSFMM[1] and subsequent discussions in MM alignment calls[2],
it was suggested that separate migration threads to handle migration
or promotion requests may be desirable. Existing NUMA balancing, hot
page promotion and other future promotion techniques could off-load
the migration part to these threads. Or if we manage to have a single
source of hotness truth like kpromoted[3], then that too can hand
over migration requests to the migration threads. I am envisaging
that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
and CXL HMU would push hot page info to kpromoted, which would
then isolate and push the folios to be promoted to the migrator
thread.

As a first step, this is an attempt to batch and perform NUMAB
migrations in an async manner. Separate migration threads aren't
implemented yet, but I am using Gregory's patch[7] that provides
the migrate_misplaced_folio_batch() API to do batch migration of
misplaced folios.

Some points for discussion
--------------------------
1. To isolate the misplaced folios or not?

To do batch migration, the misplaced folios need to be stored in
some manner. I thought isolating them and using the folio->lru
field to link them up would be the most straightforward way. But
concerns were expressed about folios remaining isolated for a long
time until they get migrated.

Or should we just maintain the PFNs instead of folios and
isolate them only just prior to migrating them?
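
For illustration, here is a minimal sketch of the PFN-based option,
assuming a hypothetical fixed-size per-task record (all names below
are made up and are not part of these patches):

#define NUMAB_PFN_BATCH		512

/* Hypothetical per-task record: remember misplaced pages by PFN only. */
struct numab_deferred {
	unsigned long	pfn[NUMAB_PFN_BATCH];
	int		target_nid[NUMAB_PFN_BATCH];
	unsigned int	nr;
};

/*
 * Called from the hint fault path instead of isolating the folio:
 * just note the PFN and the node the page should move to.
 */
static bool numab_defer_misplaced(struct numab_deferred *d,
				  struct folio *folio, int target_nid)
{
	if (d->nr >= NUMAB_PFN_BATCH)
		return false;		/* batch full, caller should flush */

	d->pfn[d->nr] = folio_pfn(folio);
	d->target_nid[d->nr] = target_nid;
	d->nr++;
	return true;
}

At flush time each PFN would be converted back with pfn_folio(),
re-validated (the page may have been freed, split or already moved)
and only then isolated and handed to the batch-migration API, so
folios would never sit isolated between the fault and the migration.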

2. Managing target_nid for misplaced pages

NUMAB provides the accurate target_nid for each folio that is
detected as misplaced. However, when we don't migrate the folio
right away but instead want to batch and do async migration later,
where do we keep track of the target_nid for each folio?

In this implementation, I am using the last_cpupid field as it
appeared that this field could be reused (with some challenges
mentioned in 2/2) for isolated folios. This approach may be specific
to NUMAB, but each sub-system that hands over pages to the migrator
thread should also provide a target_nid, and hence each sub-system
should be free to maintain and track the target_nid of the folios it
has isolated/batched for migration in its own specific manner.
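
As a point of comparison, a rough sketch of the alternative where each
sub-system hands over an explicit (folio, target_nid) request instead
of repurposing last_cpupid (struct and function names are illustrative
only):

/* Illustrative migration request that carries its own target nid. */
struct migration_request {
	struct list_head	list;
	struct folio		*folio;		/* isolated, reference held */
	int			target_nid;
};

static bool queue_migration_request(struct list_head *mig_list,
				    struct folio *folio, int target_nid)
{
	struct migration_request *req;

	/* The fault path may hold the PTL, so no sleeping allocation. */
	req = kmalloc(sizeof(*req), GFP_ATOMIC);
	if (!req)
		return false;	/* caller falls back to immediate migration */

	req->folio = folio;
	req->target_nid = target_nid;
	list_add_tail(&req->list, mig_list);
	return true;
}

The cost is one small allocation per request; the benefit is that no
folio field changes its meaning while the folio is isolated.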

3. How many folios to batch?

Currently I have a fixed threshold for the number of folios to batch.
It could become a sysctl to allow a setting between a min and a max,
and it could also be auto-tuned if required.
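
If the sysctl route is taken, a minimal sketch could look like the
following (the sysctl name and the min/max values are assumptions and
not part of this series):

static unsigned int sysctl_numab_batch_migration_threshold = 512;
static unsigned int numab_batch_min = 64;
static unsigned int numab_batch_max = 4096;

static struct ctl_table numab_batch_sysctls[] = {
	{
		.procname	= "numa_balancing_batch_migration_threshold",
		.data		= &sysctl_numab_batch_migration_threshold,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec_minmax,
		.extra1		= &numab_batch_min,
		.extra2		= &numab_batch_max,
	},
};

static int __init numab_batch_sysctl_init(void)
{
	register_sysctl("kernel", numab_batch_sysctls);
	return 0;
}
late_initcall(numab_batch_sysctl_init);

Writes would then be clamped between numab_batch_min and
numab_batch_max by proc_douintvec_minmax().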

The state of the patchset
-------------------------
* Still raw and very lightly tested
* Just posted to serve as base for subsequent discussions
  here and in MM alignment calls.

References
----------
[1] LSFMM LWN summary - https://lwn.net/Articles/1016519/
[2] MM alignment call summary - https://lore.kernel.org/linux-mm/263d7140-c343-e82e-b836-ec85c52b54eb@google.com/
[3] kpromoted patchset - https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
[4] Kmmscand: PTE A bit scanning - https://lore.kernel.org/linux-mm/20250319193028.29514-1-raghavendra.kt@amd.com/
[5] MGLRU scanning for page promotion - https://lore.kernel.org/lkml/20250324220301.1273038-1-kinseyho@google.com/
[6] IBS base hot page promotion - https://lore.kernel.org/linux-mm/20250306054532.221138-4-bharata@amd.com/
[7] Unmapped page cache folio promotion patchset - https://lore.kernel.org/linux-mm/20250411221111.493193-1-gourry@gourry.net/

Bharata B Rao (1):
  mm: sched: Batch-migrate misplaced pages

Gregory Price (1):
  migrate: implement migrate_misplaced_folio_batch

 include/linux/migrate.h |  6 ++++
 include/linux/sched.h   |  4 +++
 init/init_task.c        |  2 ++
 kernel/sched/fair.c     | 64 +++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             | 44 ++++++++++++++--------------
 mm/migrate.c            | 31 ++++++++++++++++++++
 6 files changed, 130 insertions(+), 21 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch
  2025-05-21  8:02 [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
@ 2025-05-21  8:02 ` Bharata B Rao
  2025-05-22 15:59   ` David Hildenbrand
  2025-05-26  8:16   ` Huang, Ying
  2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-21  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david, Bharata B Rao

From: Gregory Price <gourry@gourry.net>

A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio.  Expose a batch-variant of the same call for use
when doing batch migrations.

Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/migrate.h |  6 ++++++
 mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aaa2114498d6..c9496adcf192 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
+						int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 676d9cfc7059..32cc2eafb037 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2733,5 +2733,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ */
+int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	BUG_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21  8:02 [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
  2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
@ 2025-05-21  8:02 ` Bharata B Rao
  2025-05-21 18:25   ` Donet Tom
                     ` (2 more replies)
  2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
  2025-05-26  8:46 ` Huang, Ying
  3 siblings, 3 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-21  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david, Bharata B Rao

Currently the folios identified as misplaced by the NUMA
balancing sub-system are migrated one by one from the NUMA
hint fault handler as and when they are identified as
misplaced.

Instead of such single folio migrations, batch them and
migrate them at once.

Identified misplaced folios are isolated and stored in
a per-task list. A new task_work is queued from the task tick
handler to migrate them in batches. Migration is done
periodically or when the pending number of isolated folios
exceeds a threshold.

The PTEs for the isolated folios are restored to PRESENT
state right after isolation.

The last_cpupid field of isolated folios is used to store
the target_nid to which the folios need to be migrated.
This needs changes to (at least) a couple of places where
the last_cpupid field is updated/reset, which now should
happen conditionally. The update in folio_migrate_flags()
isn't handled yet, but the reset in the write page fault
case is handled.

The failed migration count isn't fed back to the scan period
update heuristics currently.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/sched.h |  4 +++
 init/init_task.c      |  2 ++
 kernel/sched/fair.c   | 64 +++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c           | 44 +++++++++++++++--------------
 4 files changed, 93 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..4177ecf53633 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1360,6 +1360,8 @@ struct task_struct {
 	u64				last_task_numa_placement;
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
+	struct callback_head		numa_mig_work;
+	unsigned long			numa_mig_interval;
 
 	/*
 	 * This pointer is only modified for current in syscall and
@@ -1397,6 +1399,8 @@ struct task_struct {
 	unsigned long			numa_faults_locality[3];
 
 	unsigned long			numa_pages_migrated;
+	struct list_head		migrate_list;
+	unsigned long			migrate_count;
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..997af6ab67a7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -187,6 +187,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_preferred_nid = NUMA_NO_NODE,
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
+	.migrate_count	= 0,
+	.migrate_list	= LIST_HEAD_INIT(init_task.migrate_list),
 #endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..d6cbf8be76e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -49,6 +49,7 @@
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
 #include <linux/rbtree_augmented.h>
+#include <linux/migrate.h>
 
 #include <asm/switch_to.h>
 
@@ -1463,6 +1464,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 /* The page with hint page fault latency < threshold in ms is considered hot */
 unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
 
+#define NUMAB_BATCH_MIGRATION_THRESHOLD	512
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -3297,6 +3300,46 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 
 #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
 
+/*
+ * TODO: Feed failed migration count back to scan period update
+ * mechanism.
+ */
+static void migrate_queued_pages(struct list_head *migrate_list)
+{
+	int cur_nid, nid;
+	struct folio *folio, *tmp;
+	LIST_HEAD(nid_list);
+
+	folio = list_entry(migrate_list, struct folio, lru);
+	cur_nid = folio_last_cpupid(folio);
+
+	list_for_each_entry_safe(folio, tmp, migrate_list, lru) {
+		nid = folio_xchg_last_cpupid(folio, -1);
+
+		if (cur_nid != nid) {
+			migrate_misplaced_folio_batch(&nid_list, cur_nid);
+			cur_nid = nid;
+		}
+		list_move(&folio->lru, &nid_list);
+	}
+	migrate_misplaced_folio_batch(&nid_list, cur_nid);
+}
+
+static void task_migration_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_mig_work));
+
+	work->next = work;
+
+	if (list_empty(&p->migrate_list))
+		return;
+
+	migrate_queued_pages(&p->migrate_list);
+	p->migrate_count = 0;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3567,14 +3610,19 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->numa_migrate_retry		= 0;
 	/* Protect against double add, see task_tick_numa and task_numa_work */
 	p->numa_work.next		= &p->numa_work;
+	p->numa_mig_work.next		= &p->numa_mig_work;
+	p->numa_mig_interval			= 0;
 	p->numa_faults			= NULL;
 	p->numa_pages_migrated		= 0;
 	p->total_numa_faults		= 0;
 	RCU_INIT_POINTER(p->numa_group, NULL);
 	p->last_task_numa_placement	= 0;
 	p->last_sum_exec_runtime	= 0;
+	p->migrate_count		= 0;
+	INIT_LIST_HEAD(&p->migrate_list);
 
 	init_task_work(&p->numa_work, task_numa_work);
+	init_task_work(&p->numa_mig_work, task_migration_work);
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -3596,6 +3644,20 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	}
 }
 
+static void task_check_pending_migrations(struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_mig_work;
+
+	if (work->next != work)
+		return;
+
+	if (time_after(jiffies, curr->numa_mig_interval) ||
+	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
+		curr->numa_mig_interval = jiffies + HZ;
+		task_work_add(curr, work, TWA_RESUME);
+	}
+}
+
 /*
  * Drive the periodic memory faults..
  */
@@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
 		return;
 
+	task_check_pending_migrations(curr);
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/mm/memory.c b/mm/memory.c
index 49199410805c..11d07004cb04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3375,8 +3375,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
 		 * Clear the folio's cpupid information as the existing
 		 * information potentially belongs to a now completely
 		 * unrelated process.
+		 *
+		 * If the page is found to be isolated pending migration,
+		 * then don't reset as last_cpupid will be holding the
+		 * target_nid information.
 		 */
-		folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
+		if (folio_test_lru(folio))
+			folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
 	}
 
 	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
@@ -5766,12 +5771,13 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
+	struct task_struct *task = current;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
 	int nid = NUMA_NO_NODE;
 	bool writable = false, ignore_writable = false;
 	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
-	int last_cpupid;
+	int last_cpupid = (-1 & LAST_CPUPID_MASK);
 	int target_nid;
 	pte_t pte, old_pte;
 	int flags = 0, nr_pages;
@@ -5807,6 +5813,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	nid = folio_nid(folio);
 	nr_pages = folio_nr_pages(folio);
 
+	/*
+	 * If it is a non-LRU folio, it has been already
+	 * isolated and is in migration list.
+	 */
+	if (!folio_test_lru(folio))
+		goto out_map;
+
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
@@ -5815,28 +5828,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
-	/* The folio is isolated and isolation code holds a folio reference. */
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	writable = false;
 	ignore_writable = true;
+	nid = target_nid;
 
-	/* Migrate to the requested node */
-	if (!migrate_misplaced_folio(folio, target_nid)) {
-		nid = target_nid;
-		flags |= TNF_MIGRATED;
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
-		return 0;
-	}
-
-	flags |= TNF_MIGRATE_FAIL;
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
-	if (unlikely(!vmf->pte))
-		return 0;
-	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	/*
+	 * Store target_nid in last_cpupid field for the isolated
+	 * folios.
+	 */
+	folio_xchg_last_cpupid(folio, target_nid);
+	list_add_tail(&folio->lru, &task->migrate_list);
+	task->migrate_count += nr_pages;
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
@ 2025-05-21 18:25   ` Donet Tom
  2025-05-21 18:40     ` Zi Yan
  2025-05-22  4:39     ` Bharata B Rao
  2025-05-22  3:55   ` Gregory Price
  2025-05-22 16:11   ` David Hildenbrand
  2 siblings, 2 replies; 41+ messages in thread
From: Donet Tom @ 2025-05-21 18:25 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david


On 5/21/25 1:32 PM, Bharata B Rao wrote:
> Currently the folios identified as misplaced by the NUMA
> balancing sub-system are migrated one by one from the NUMA
> hint fault handler as and when they are identified as
> misplaced.
>
> Instead of such singe folio migrations, batch them and
> migrate them at once.
>
> Identified misplaced folios are isolated and stored in
> a per-task list. A new task_work is queued from task tick
> handler to migrate them in batches. Migration is done
> periodically or if pending number of isolated foios exceeds
> a threshold.
>
> The PTEs for the isolated folios are restored to PRESENT
> state right after isolation.
>
> The last_cpupid field of isolated folios is used to store
> the target_nid to which the folios need to be migrated to.
> This needs changes to (at least) a couple of places where
> last_cpupid field is updated/reset which now should happen
> conditionally. The updation in folio_migrate_flags() isn't
> handled yet but the reset in write page fault case is
> handled.
>
> The failed migration count isn't fed back to the scan period
> update heuristics currently.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>   include/linux/sched.h |  4 +++
>   init/init_task.c      |  2 ++
>   kernel/sched/fair.c   | 64 +++++++++++++++++++++++++++++++++++++++++++
>   mm/memory.c           | 44 +++++++++++++++--------------
>   4 files changed, 93 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..4177ecf53633 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1360,6 +1360,8 @@ struct task_struct {
>   	u64				last_task_numa_placement;
>   	u64				last_sum_exec_runtime;
>   	struct callback_head		numa_work;
> +	struct callback_head		numa_mig_work;
> +	unsigned long			numa_mig_interval;
>   
>   	/*
>   	 * This pointer is only modified for current in syscall and
> @@ -1397,6 +1399,8 @@ struct task_struct {
>   	unsigned long			numa_faults_locality[3];
>   
>   	unsigned long			numa_pages_migrated;
> +	struct list_head		migrate_list;
> +	unsigned long			migrate_count;
>   #endif /* CONFIG_NUMA_BALANCING */
>   
>   #ifdef CONFIG_RSEQ
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..997af6ab67a7 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -187,6 +187,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>   	.numa_preferred_nid = NUMA_NO_NODE,
>   	.numa_group	= NULL,
>   	.numa_faults	= NULL,
> +	.migrate_count	= 0,
> +	.migrate_list	= LIST_HEAD_INIT(init_task.migrate_list),
>   #endif
>   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>   	.kasan_depth	= 1,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..d6cbf8be76e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -49,6 +49,7 @@
>   #include <linux/ratelimit.h>
>   #include <linux/task_work.h>
>   #include <linux/rbtree_augmented.h>
> +#include <linux/migrate.h>
>   
>   #include <asm/switch_to.h>
>   
> @@ -1463,6 +1464,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>   /* The page with hint page fault latency < threshold in ms is considered hot */
>   unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
>   
> +#define NUMAB_BATCH_MIGRATION_THRESHOLD	512
> +
>   struct numa_group {
>   	refcount_t refcount;
>   
> @@ -3297,6 +3300,46 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
>   
>   #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
>   
> +/*
> + * TODO: Feed failed migration count back to scan period update
> + * mechanism.
> + */
> +static void migrate_queued_pages(struct list_head *migrate_list)
> +{
> +	int cur_nid, nid;
> +	struct folio *folio, *tmp;
> +	LIST_HEAD(nid_list);
> +
> +	folio = list_entry(migrate_list, struct folio, lru);
> +	cur_nid = folio_last_cpupid(folio);

Hi Bharata,

This is target node ID right?


> +
> +	list_for_each_entry_safe(folio, tmp, migrate_list, lru) {
> +		nid = folio_xchg_last_cpupid(folio, -1);

Just one doubt: to get the last CPU ID (target node ID) here,
folio_xchg_last_cpupid() is used, whereas earlier folio_last_cpupid()
was used. Is there a specific reason for using different functions?


Thanks
Donet

> +
> +		if (cur_nid != nid) {
> +			migrate_misplaced_folio_batch(&nid_list, cur_nid);
> +			cur_nid = nid;
> +		}
> +		list_move(&folio->lru, &nid_list);
> +	}
> +	migrate_misplaced_folio_batch(&nid_list, cur_nid);
> +}
> +
> +static void task_migration_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +
> +	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_mig_work));
> +
> +	work->next = work;
> +
> +	if (list_empty(&p->migrate_list))
> +		return;
> +
> +	migrate_queued_pages(&p->migrate_list);
> +	p->migrate_count = 0;
> +}
> +
>   /*
>    * The expensive part of numa migration is done from task_work context.
>    * Triggered from task_tick_numa().
> @@ -3567,14 +3610,19 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>   	p->numa_migrate_retry		= 0;
>   	/* Protect against double add, see task_tick_numa and task_numa_work */
>   	p->numa_work.next		= &p->numa_work;
> +	p->numa_mig_work.next		= &p->numa_mig_work;
> +	p->numa_mig_interval			= 0;
>   	p->numa_faults			= NULL;
>   	p->numa_pages_migrated		= 0;
>   	p->total_numa_faults		= 0;
>   	RCU_INIT_POINTER(p->numa_group, NULL);
>   	p->last_task_numa_placement	= 0;
>   	p->last_sum_exec_runtime	= 0;
> +	p->migrate_count		= 0;
> +	INIT_LIST_HEAD(&p->migrate_list);
>   
>   	init_task_work(&p->numa_work, task_numa_work);
> +	init_task_work(&p->numa_mig_work, task_migration_work);
>   
>   	/* New address space, reset the preferred nid */
>   	if (!(clone_flags & CLONE_VM)) {
> @@ -3596,6 +3644,20 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>   	}
>   }
>   
> +static void task_check_pending_migrations(struct task_struct *curr)
> +{
> +	struct callback_head *work = &curr->numa_mig_work;
> +
> +	if (work->next != work)
> +		return;
> +
> +	if (time_after(jiffies, curr->numa_mig_interval) ||
> +	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
> +		curr->numa_mig_interval = jiffies + HZ;
> +		task_work_add(curr, work, TWA_RESUME);
> +	}
> +}
> +
>   /*
>    * Drive the periodic memory faults..
>    */
> @@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>   	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>   		return;
>   
> +	task_check_pending_migrations(curr);
> +
>   	/*
>   	 * Using runtime rather than walltime has the dual advantage that
>   	 * we (mostly) drive the selection from busy threads and that the
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..11d07004cb04 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3375,8 +3375,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
>   		 * Clear the folio's cpupid information as the existing
>   		 * information potentially belongs to a now completely
>   		 * unrelated process.
> +		 *
> +		 * If the page is found to be isolated pending migration,
> +		 * then don't reset as last_cpupid will be holding the
> +		 * target_nid information.
>   		 */
> -		folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
> +		if (folio_test_lru(folio))
> +			folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
>   	}
>   
>   	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> @@ -5766,12 +5771,13 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
>   
>   static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   {
> +	struct task_struct *task = current;
>   	struct vm_area_struct *vma = vmf->vma;
>   	struct folio *folio = NULL;
>   	int nid = NUMA_NO_NODE;
>   	bool writable = false, ignore_writable = false;
>   	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> -	int last_cpupid;
> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>   	int target_nid;
>   	pte_t pte, old_pte;
>   	int flags = 0, nr_pages;
> @@ -5807,6 +5813,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	nid = folio_nid(folio);
>   	nr_pages = folio_nr_pages(folio);
>   
> +	/*
> +	 * If it is a non-LRU folio, it has been already
> +	 * isolated and is in migration list.
> +	 */
> +	if (!folio_test_lru(folio))
> +		goto out_map;
> +
>   	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>   					writable, &last_cpupid);
>   	if (target_nid == NUMA_NO_NODE)
> @@ -5815,28 +5828,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> -	/* The folio is isolated and isolation code holds a folio reference. */
> -	pte_unmap_unlock(vmf->pte, vmf->ptl);
>   	writable = false;
>   	ignore_writable = true;
> +	nid = target_nid;
>   
> -	/* Migrate to the requested node */
> -	if (!migrate_misplaced_folio(folio, target_nid)) {
> -		nid = target_nid;
> -		flags |= TNF_MIGRATED;
> -		task_numa_fault(last_cpupid, nid, nr_pages, flags);
> -		return 0;
> -	}
> -
> -	flags |= TNF_MIGRATE_FAIL;
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -				       vmf->address, &vmf->ptl);
> -	if (unlikely(!vmf->pte))
> -		return 0;
> -	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -		return 0;
> -	}
> +	/*
> +	 * Store target_nid in last_cpupid field for the isolated
> +	 * folios.
> +	 */
> +	folio_xchg_last_cpupid(folio, target_nid);
> +	list_add_tail(&folio->lru, &task->migrate_list);
> +	task->migrate_count += nr_pages;
>   out_map:
>   	/*
>   	 * Make it present again, depending on how arch implements

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21 18:25   ` Donet Tom
@ 2025-05-21 18:40     ` Zi Yan
  2025-05-22  3:24       ` Gregory Price
  2025-05-22  4:42       ` Bharata B Rao
  2025-05-22  4:39     ` Bharata B Rao
  1 sibling, 2 replies; 41+ messages in thread
From: Zi Yan @ 2025-05-21 18:40 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Donet Tom, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen,
	gourry, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On 21 May 2025, at 14:25, Donet Tom wrote:

> On 5/21/25 1:32 PM, Bharata B Rao wrote:
>> Currently the folios identified as misplaced by the NUMA
>> balancing sub-system are migrated one by one from the NUMA
>> hint fault handler as and when they are identified as
>> misplaced.
>>
>> Instead of such singe folio migrations, batch them and
>> migrate them at once.
>>
>> Identified misplaced folios are isolated and stored in
>> a per-task list. A new task_work is queued from task tick
>> handler to migrate them in batches. Migration is done
>> periodically or if pending number of isolated foios exceeds
>> a threshold.
>>
>> The PTEs for the isolated folios are restored to PRESENT
>> state right after isolation.
>>
>> The last_cpupid field of isolated folios is used to store
>> the target_nid to which the folios need to be migrated to.
>> This needs changes to (at least) a couple of places where
>> last_cpupid field is updated/reset which now should happen
>> conditionally. The updation in folio_migrate_flags() isn't
>> handled yet but the reset in write page fault case is
>> handled.
>>
>> The failed migration count isn't fed back to the scan period
>> update heuristics currently.
>>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> ---
>>   include/linux/sched.h |  4 +++
>>   init/init_task.c      |  2 ++
>>   kernel/sched/fair.c   | 64 +++++++++++++++++++++++++++++++++++++++++++
>>   mm/memory.c           | 44 +++++++++++++++--------------
>>   4 files changed, 93 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f96ac1982893..4177ecf53633 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1360,6 +1360,8 @@ struct task_struct {
>>   	u64				last_task_numa_placement;
>>   	u64				last_sum_exec_runtime;
>>   	struct callback_head		numa_work;
>> +	struct callback_head		numa_mig_work;
>> +	unsigned long			numa_mig_interval;
>>    	/*
>>   	 * This pointer is only modified for current in syscall and
>> @@ -1397,6 +1399,8 @@ struct task_struct {
>>   	unsigned long			numa_faults_locality[3];
>>    	unsigned long			numa_pages_migrated;
>> +	struct list_head		migrate_list;
>> +	unsigned long			migrate_count;
>>   #endif /* CONFIG_NUMA_BALANCING */
>>    #ifdef CONFIG_RSEQ
>> diff --git a/init/init_task.c b/init/init_task.c
>> index e557f622bd90..997af6ab67a7 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -187,6 +187,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>>   	.numa_preferred_nid = NUMA_NO_NODE,
>>   	.numa_group	= NULL,
>>   	.numa_faults	= NULL,
>> +	.migrate_count	= 0,
>> +	.migrate_list	= LIST_HEAD_INIT(init_task.migrate_list),
>>   #endif
>>   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>>   	.kasan_depth	= 1,
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fb9bf995a47..d6cbf8be76e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -49,6 +49,7 @@
>>   #include <linux/ratelimit.h>
>>   #include <linux/task_work.h>
>>   #include <linux/rbtree_augmented.h>
>> +#include <linux/migrate.h>
>>    #include <asm/switch_to.h>
>>  @@ -1463,6 +1464,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>>   /* The page with hint page fault latency < threshold in ms is considered hot */
>>   unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
>>  +#define NUMAB_BATCH_MIGRATION_THRESHOLD	512
>> +
>>   struct numa_group {
>>   	refcount_t refcount;
>>  @@ -3297,6 +3300,46 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
>>    #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
>>  +/*
>> + * TODO: Feed failed migration count back to scan period update
>> + * mechanism.
>> + */
>> +static void migrate_queued_pages(struct list_head *migrate_list)
>> +{
>> +	int cur_nid, nid;
>> +	struct folio *folio, *tmp;
>> +	LIST_HEAD(nid_list);
>> +
>> +	folio = list_entry(migrate_list, struct folio, lru);
>> +	cur_nid = folio_last_cpupid(folio);
>
> Hi Bharatha,
>
> This is target node ID right?

In memory tiering mode, folio_last_cpupid() gives the page access time
for slow memory folios. In the !folio_use_access_time() case,
folio_last_cpupid() gives the last cpupid. Now it is also reused for a
node id, which is too confusing. At least, a new function like
folio_get_target_nid() should be added to return a nid only if the
folio is isolated.
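
For example, something along these lines (only a sketch of the
suggestion, not an existing helper), so that readers of the field are
forced to check the isolation state first:

/*
 * Sketch only: return the stashed target nid for a folio that has
 * been isolated for batch migration, and NUMA_NO_NODE otherwise (in
 * which case last_cpupid still holds a cpupid or an access time).
 */
static inline int folio_get_target_nid(struct folio *folio)
{
	if (folio_test_lru(folio))
		return NUMA_NO_NODE;

	return folio_last_cpupid(folio);
}
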

>
>
>> +
>> +	list_for_each_entry_safe(folio, tmp, migrate_list, lru) {
>> +		nid = folio_xchg_last_cpupid(folio, -1);
>
> Just one doubt: to get the last CPU ID (target node ID) here, folio_xchg_last_cpupid()
>
> is used, whereas earlier folio_last_cpupid() was used. Is there a specific reason for
>
> using different functions?
>
>
> Thanks
> Donet
>
>> +
>> +		if (cur_nid != nid) {
>> +			migrate_misplaced_folio_batch(&nid_list, cur_nid);
>> +			cur_nid = nid;
>> +		}
>> +		list_move(&folio->lru, &nid_list);
>> +	}
>> +	migrate_misplaced_folio_batch(&nid_list, cur_nid);
>> +}
>> +
>> +static void task_migration_work(struct callback_head *work)
>> +{
>> +	struct task_struct *p = current;
>> +
>> +	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_mig_work));
>> +
>> +	work->next = work;
>> +
>> +	if (list_empty(&p->migrate_list))
>> +		return;
>> +
>> +	migrate_queued_pages(&p->migrate_list);
>> +	p->migrate_count = 0;
>> +}
>> +
>>   /*
>>    * The expensive part of numa migration is done from task_work context.
>>    * Triggered from task_tick_numa().
>> @@ -3567,14 +3610,19 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>>   	p->numa_migrate_retry		= 0;
>>   	/* Protect against double add, see task_tick_numa and task_numa_work */
>>   	p->numa_work.next		= &p->numa_work;
>> +	p->numa_mig_work.next		= &p->numa_mig_work;
>> +	p->numa_mig_interval			= 0;
>>   	p->numa_faults			= NULL;
>>   	p->numa_pages_migrated		= 0;
>>   	p->total_numa_faults		= 0;
>>   	RCU_INIT_POINTER(p->numa_group, NULL);
>>   	p->last_task_numa_placement	= 0;
>>   	p->last_sum_exec_runtime	= 0;
>> +	p->migrate_count		= 0;
>> +	INIT_LIST_HEAD(&p->migrate_list);
>>    	init_task_work(&p->numa_work, task_numa_work);
>> +	init_task_work(&p->numa_mig_work, task_migration_work);
>>    	/* New address space, reset the preferred nid */
>>   	if (!(clone_flags & CLONE_VM)) {
>> @@ -3596,6 +3644,20 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>>   	}
>>   }
>>  +static void task_check_pending_migrations(struct task_struct *curr)
>> +{
>> +	struct callback_head *work = &curr->numa_mig_work;
>> +
>> +	if (work->next != work)
>> +		return;
>> +
>> +	if (time_after(jiffies, curr->numa_mig_interval) ||
>> +	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
>> +		curr->numa_mig_interval = jiffies + HZ;
>> +		task_work_add(curr, work, TWA_RESUME);
>> +	}
>> +}
>> +
>>   /*
>>    * Drive the periodic memory faults..
>>    */
>> @@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>   	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>>   		return;
>>  +	task_check_pending_migrations(curr);
>> +
>>   	/*
>>   	 * Using runtime rather than walltime has the dual advantage that
>>   	 * we (mostly) drive the selection from busy threads and that the
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 49199410805c..11d07004cb04 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3375,8 +3375,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
>>   		 * Clear the folio's cpupid information as the existing
>>   		 * information potentially belongs to a now completely
>>   		 * unrelated process.
>> +		 *
>> +		 * If the page is found to be isolated pending migration,
>> +		 * then don't reset as last_cpupid will be holding the
>> +		 * target_nid information.
>>   		 */
>> -		folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
>> +		if (folio_test_lru(folio))
>> +			folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
>>   	}
>>    	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>> @@ -5766,12 +5771,13 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
>>    static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   {
>> +	struct task_struct *task = current;
>>   	struct vm_area_struct *vma = vmf->vma;
>>   	struct folio *folio = NULL;
>>   	int nid = NUMA_NO_NODE;
>>   	bool writable = false, ignore_writable = false;
>>   	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
>> -	int last_cpupid;
>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>   	int target_nid;
>>   	pte_t pte, old_pte;
>>   	int flags = 0, nr_pages;
>> @@ -5807,6 +5813,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	nid = folio_nid(folio);
>>   	nr_pages = folio_nr_pages(folio);
>>  +	/*
>> +	 * If it is a non-LRU folio, it has been already
>> +	 * isolated and is in migration list.
>> +	 */
>> +	if (!folio_test_lru(folio))
>> +		goto out_map;
>> +
>>   	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>>   					writable, &last_cpupid);
>>   	if (target_nid == NUMA_NO_NODE)
>> @@ -5815,28 +5828,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   		flags |= TNF_MIGRATE_FAIL;
>>   		goto out_map;
>>   	}
>> -	/* The folio is isolated and isolation code holds a folio reference. */
>> -	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>   	writable = false;
>>   	ignore_writable = true;
>> +	nid = target_nid;
>>  -	/* Migrate to the requested node */
>> -	if (!migrate_misplaced_folio(folio, target_nid)) {
>> -		nid = target_nid;
>> -		flags |= TNF_MIGRATED;
>> -		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>> -		return 0;
>> -	}
>> -
>> -	flags |= TNF_MIGRATE_FAIL;
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -				       vmf->address, &vmf->ptl);
>> -	if (unlikely(!vmf->pte))
>> -		return 0;
>> -	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
>> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> -		return 0;
>> -	}
>> +	/*
>> +	 * Store target_nid in last_cpupid field for the isolated
>> +	 * folios.
>> +	 */
>> +	folio_xchg_last_cpupid(folio, target_nid);
>> +	list_add_tail(&folio->lru, &task->migrate_list);
>> +	task->migrate_count += nr_pages;
>>   out_map:
>>   	/*
>>   	 * Make it present again, depending on how arch implements


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-21  8:02 [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
  2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
  2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
@ 2025-05-21 18:45 ` SeongJae Park
  2025-05-22  3:08   ` Gregory Price
                     ` (2 more replies)
  2025-05-26  8:46 ` Huang, Ying
  3 siblings, 3 replies; 41+ messages in thread
From: SeongJae Park @ 2025-05-21 18:45 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: SeongJae Park, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, weixugc, willy, ying.huang, ziy,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

Hi Bharata,

On Wed, 21 May 2025 13:32:36 +0530 Bharata B Rao <bharata@amd.com> wrote:

> Hi,
> 
> This is an attempt to convert the NUMA balancing to do batched
> migration instead of migrating one folio at a time. The basic
> idea is to collect (from hint fault handler) the folios to be
> migrated in a list and batch-migrate them from task_work context.
> More details about the specifics are present in patch 2/2.
> 
> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
> it was suggested that separate migration threads to handle migration
> or promotion request may be desirable. Existing NUMA balancing, hot
> page promotion and other future promotion techniques could off-load
> migration part to these threads. Or if we manage to have a single
> source of hotness truth like kpromoted[3], then that too can hand
> over migration requests to the migration threads. I am envisaging
> that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
> and CXL HMU would push hot page info to kpromoted, which would
> then isolate and push the folios to be promoted to the migrator
> thread.

I hope it is not out of place to also mention here other existing and
ongoing works that have the potential to serve a similar purpose or to
collaborate in the future.

DAMON is designed for this sort of multi-source access information
handling.  In LSFMM, I proposed[1] the damon_report_access() interface
to make it easier to extend DAMON for more types of access information.
Currently damon_report_access() is under early development.  I think
this has the potential to serve something similar to your single-source
goal.

Also in LSFMM, I proposed damos_add_folio() for cases where callers
want to use the DAMON worker thread (kdamond) as an asynchronous
executor of memory management operations while using its other features
such as [auto-tuned] quotas.  I think this has the potential to serve
something similar to your migration threads.  I haven't started
damos_add_folio() development yet, though.

I remember we discussed DAMON a bit on the mailing list and in your
LSFMM session.  IIRC, you were also looking for time to see if there is
a chance to use DAMON in some way.  Due to a technical issue, we were
unable to discuss the two new proposals in my LSFMM session, and it has
been a while since our last discussion.  So if you don't mind, I'd like
to ask if you have any opinions or comments about these.

[1] https://lwn.net/Articles/1016525/


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
@ 2025-05-22  3:08   ` Gregory Price
  2025-05-22 16:30     ` SeongJae Park
  2025-05-22 18:43   ` Apologies and clarifications on DAMON-disruptions (was Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing) SeongJae Park
  2025-05-26  5:20   ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
  2 siblings, 1 reply; 41+ messages in thread
From: Gregory Price @ 2025-05-22  3:08 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Wed, May 21, 2025 at 11:45:52AM -0700, SeongJae Park wrote:
> Hi Bharata,
> 
> On Wed, 21 May 2025 13:32:36 +0530 Bharata B Rao <bharata@amd.com> wrote:
> 
> > Hi,
> > 
> > This is an attempt to convert the NUMA balancing to do batched
> > migration instead of migrating one folio at a time. The basic
> > idea is to collect (from hint fault handler) the folios to be
> > migrated in a list and batch-migrate them from task_work context.
> > More details about the specifics are present in patch 2/2.
> > 
> > During LSFMM[1] and subsequent discussions in MM alignment calls[2],
> > it was suggested that separate migration threads to handle migration
> > or promotion request may be desirable. Existing NUMA balancing, hot
> > page promotion and other future promotion techniques could off-load
> > migration part to these threads. Or if we manage to have a single
> > source of hotness truth like kpromoted[3], then that too can hand
> > over migration requests to the migration threads. I am envisaging
> > that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
> > and CXL HMU would push hot page info to kpromoted, which would
> > then isolate and push the folios to be promoted to the migrator
> > thread.
> 
> I think (or, hope) it would also be not very worthless or rude to mention other
> existing and ongoing works that have potentials to serve for similar purpose or
> collaborate in future, here.
> 
> DAMON is designed for a sort of multi-source access information handling.  In
> LSFMM, I proposed[1] damon_report_access() interface for making it easier to be
> extended for more types of access information.  Currenlty damon_report_access()
> is under early development.  I think this has a potential to serve something
> similar to your single source goal.
> 

It seems to me that DAMON might make use of the batch migration
interface, so if you need any changes or extensions, it might be good
for you (SJ) to take a look at that for us.

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21 18:40     ` Zi Yan
@ 2025-05-22  3:24       ` Gregory Price
  2025-05-22  5:23         ` Bharata B Rao
  2025-05-22  4:42       ` Bharata B Rao
  1 sibling, 1 reply; 41+ messages in thread
From: Gregory Price @ 2025-05-22  3:24 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bharata B Rao, Donet Tom, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Wed, May 21, 2025 at 02:40:53PM -0400, Zi Yan wrote:
> On 21 May 2025, at 14:25, Donet Tom wrote:
> > Hi Bharatha,
> >
> > This is target node ID right?
> 
> In memory tiering mode, folio_last_cpupid() gives page access time
> for slow memory folios. In !folio_use_access_time() case,
> folio_last_cpupid() gives last cpupid. Now it is reused for node
> id. It is too confusing. At least, a new function like folio_get_target_nid()
> should be added to return a nid only if folio is isolated.
>

The really annoying part of all of this is

#ifdef CONFIG_NUMA_BALANCING
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
static inline int folio_last_cpupid(struct folio *folio)
{
        return folio->_last_cpupid;
}
#else
static inline int folio_last_cpupid(struct folio *folio)
{
        return (folio->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
}
#endif
#else /* !CONFIG_NUMA_BALANCING */
static inline int folio_last_cpupid(struct folio *folio)
{
        return folio_nid(folio); /* XXX */
}
...
#endif

Obviously we don't have to care about the !NUMAB case, but what a silly
muxing we have going on here (I get it, space is tight - the interfaces
are just confusing is all).

My question is whether there's some kind of race condition here if the
mode changes between isolate and fetch.  Can we "fetch a node id" and
end up with a cpupid because someone toggled between tiering and
balancing?

If we can answer that, then folio_last_cpupid() and
folio_last_access_nid() could return -1 if called in the wrong mode
(assuming this check isn't too expensive).  That would be a nice cleanup
for readability's sake.
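
As a rough illustration of that cleanup (names are illustrative;
folio_last_access_nid() does not exist today):

/* Sketch: refuse to return a cpupid while in tiering mode. */
static inline int folio_last_cpupid_checked(struct folio *folio)
{
	if (folio_use_access_time(folio))
		return -1;	/* field currently holds an access time */

	return folio_last_cpupid(folio);
}

/* Sketch: only isolated (off-LRU) folios have a target nid stashed. */
static inline int folio_last_access_nid(struct folio *folio)
{
	if (folio_test_lru(folio))
		return -1;

	return folio_last_cpupid(folio);
}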

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
  2025-05-21 18:25   ` Donet Tom
@ 2025-05-22  3:55   ` Gregory Price
  2025-05-22  7:33     ` Bharata B Rao
  2025-05-22 16:11   ` David Hildenbrand
  2 siblings, 1 reply; 41+ messages in thread
From: Gregory Price @ 2025-05-22  3:55 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, hannes,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

On Wed, May 21, 2025 at 01:32:38PM +0530, Bharata B Rao wrote:
>  
> +static void task_check_pending_migrations(struct task_struct *curr)
> +{
> +	struct callback_head *work = &curr->numa_mig_work;
> +
> +	if (work->next != work)
> +		return;
> +
> +	if (time_after(jiffies, curr->numa_mig_interval) ||
> +	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
> +		curr->numa_mig_interval = jiffies + HZ;
> +		task_work_add(curr, work, TWA_RESUME);
> +	}
> +}
> +
>  /*
>   * Drive the periodic memory faults..
>   */
> @@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>  		return;
>  
> +	task_check_pending_migrations(curr);
> +

So I know this was discussed in the cover letter a bit and alluded to in
the patch, but I want to add my 2 cents from work on the unmapped page
cache set.

In that set, I chose to always schedule the task work on the next return
to user-space, rather than defer to a tick like the current numa-balance
code.  This was for two concerns:

1) I didn't want to leave a potentially large number of isolated folios
   on a list that may not be reaped for an unknown period of time.

   I don't know the real limitations on the number of isolated folios,
   but given what we have here I think we can represent a mathematical
   worst case on the number of stranded folios.

   If N (tasks) = 1,000,000 and M (isolated folios per task) = 511,
   then we could have ~1.8TB of pages stranded on these lists - never
   to be migrated because the count never hits the threshold.  In
   practice this won't happen to that extreme, but it absolutely will
   happen for some chunk of tasks.

   So I chose to never leave kernel space with isolated folios on the
   task numa_mig_list.

   This discussion changes if the numa_mig_list is not on the
   task_struct and instead some per-cpu list routinely reaped by a
   kthread (kpromoted or whatever).
 

2) I was not confident I could measure the performance implications of
   the migrations directly when it was deferred.  When would I even know
   it happened?  The actual goal is to *not* know it happened, right?

   But now it might happen during a page fault, or any random syscall.

   This concerned me - so I just didn't defer.  That was largely out of
   lack of confidence in my own understanding of the task_work system.


So I think this, as presented, is a half-measure - and I don't think
it's a good half-measure.  I think we might need to go all the way to a
set of per-cpu migration lists that a kernel worker can pluck the head
of on some interval.  That would bound the number of isolated folios to
the number of CPUs rather than the number of tasks.
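
For concreteness, a bare-bones sketch of that direction (all names
invented, locking kept simple, and the worker that drains the lists
is omitted):

struct numab_percpu_batch {
	spinlock_t		lock;
	struct list_head	folios;		/* isolated folios */
	unsigned long		nr_pages;
};

static DEFINE_PER_CPU(struct numab_percpu_batch, numab_batch);

static int __init numab_batch_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct numab_percpu_batch *b = per_cpu_ptr(&numab_batch, cpu);

		spin_lock_init(&b->lock);
		INIT_LIST_HEAD(&b->folios);
	}
	return 0;
}
early_initcall(numab_batch_init);

/* Fault path: hand an isolated folio to this CPU's batch. */
static void numab_queue_isolated_folio(struct folio *folio)
{
	struct numab_percpu_batch *b = get_cpu_ptr(&numab_batch);

	spin_lock(&b->lock);
	list_add_tail(&folio->lru, &b->folios);
	b->nr_pages += folio_nr_pages(folio);
	spin_unlock(&b->lock);
	put_cpu_ptr(&numab_batch);
}

A kthread (kpromoted or similar) would then periodically splice each
CPU's list, group it by target nid, and feed the groups to
migrate_misplaced_folio_batch().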

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21 18:25   ` Donet Tom
  2025-05-21 18:40     ` Zi Yan
@ 2025-05-22  4:39     ` Bharata B Rao
  2025-05-23  9:05       ` Donet Tom
  1 sibling, 1 reply; 41+ messages in thread
From: Bharata B Rao @ 2025-05-22  4:39 UTC (permalink / raw)
  To: Donet Tom, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david

Hi Donet,

On 21-May-25 11:55 PM, Donet Tom wrote:
> 
>> +static void migrate_queued_pages(struct list_head *migrate_list)
>> +{
>> +    int cur_nid, nid;
>> +    struct folio *folio, *tmp;
>> +    LIST_HEAD(nid_list);
>> +
>> +    folio = list_entry(migrate_list, struct folio, lru);
>> +    cur_nid = folio_last_cpupid(folio);
> 
> Hi Bharatha,
> 
> This is target node ID right?

Correct.

> 
> 
>> +
>> +    list_for_each_entry_safe(folio, tmp, migrate_list, lru) {
>> +        nid = folio_xchg_last_cpupid(folio, -1);
> 
> Just one doubt: to get the last CPU ID (target node ID) here, 
> folio_xchg_last_cpupid()
> 
> is used, whereas earlier folio_last_cpupid() was used. Is there a 
> specific reason for
> 
> using different functions?

This function iterates over the isolated folios looking for the same
target_nid so that all of them can be migrated at once to the given
target_nid. Hence the first call just reads the target_nid from the
last_cpupid field to note which nid is of interest in the current
iteration, while the subsequent calls read the target_nid and also
reset the last_cpupid field.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21 18:40     ` Zi Yan
  2025-05-22  3:24       ` Gregory Price
@ 2025-05-22  4:42       ` Bharata B Rao
  1 sibling, 0 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-22  4:42 UTC (permalink / raw)
  To: Zi Yan
  Cc: Donet Tom, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen,
	gourry, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On 22-May-25 12:10 AM, Zi Yan wrote:
> On 21 May 2025, at 14:25, Donet Tom wrote:
> 
> In memory tiering mode, folio_last_cpupid() gives page access time
> for slow memory folios. In !folio_use_access_time() case,
> folio_last_cpupid() gives last cpupid. Now it is reused for node
> id. It is too confusing. At least, a new function like folio_get_target_nid()
> should be added to return a nid only if folio is isolated.

Yes, it can be confusing. If this approach of using the last_cpupid
field to store the target_nid is found to be feasible, I will clean up
the names as you suggest.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22  3:24       ` Gregory Price
@ 2025-05-22  5:23         ` Bharata B Rao
  0 siblings, 0 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-22  5:23 UTC (permalink / raw)
  To: Gregory Price, Zi Yan
  Cc: Donet Tom, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ying.huang, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

On 22-May-25 8:54 AM, Gregory Price wrote:
> 
> The really annoying part of all of this is
> 
> #ifdef CONFIG_NUMA_BALANCING
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> static inline int folio_last_cpupid(struct folio *folio)
> {
>          return folio->_last_cpupid;
> }
> #else
> static inline int folio_last_cpupid(struct folio *folio)
> {
>          return (folio->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
> }
> #endif
> #else /* !CONFIG_NUMA_BALANCING */
> static inline int folio_last_cpupid(struct folio *folio)
> {
>          return folio_nid(folio); /* XXX */
> }
> ...
> #endif
> 
> Obviously we don't have to care about the !NUMAB case, but what a silly
> muxing we have going on here (I get it, space is tight - the interfaces
> are just confusing is all).

I really didn't realize the usage in !NUMAB case, thanks.

> 
> My question is whether there's some kind of race condition here if the
> mode changes between isolate and fetch.  Can we "fetch a node id" and
> end up with a cpupid because someone toggled the between tiering and
> balancing?

Good question. I need to check all such cases where inadvertent reset or 
reuse/repurposing of last_cpupid field of an isolated folio becomes 
possible.

> 
> If we can answer that, then implementing folio_last_cpupid and
> folio_last_access_nid can return -1 if called in the wrong mode
> (assuming this check isn't too expensive).  That would be a nice cleanup
> for readability sake.

Right.
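
For illustration, such mode-checked accessors might look roughly like the 
following (the names and the isolation check are made up here, not taken 
from the series; folio_use_access_time() is the existing tiering-mode 
check):

static inline int folio_last_cpupid_checked(struct folio *folio)
{
        /* In tiering mode the field holds an access time, not a cpupid. */
        return folio_use_access_time(folio) ? -1 : folio_last_cpupid(folio);
}

static inline int folio_last_access_nid(struct folio *folio)
{
        /*
         * Hypothetical: valid only while NUMAB batching has isolated the
         * folio and repurposed the field to hold a target node id.
         */
        return folio_test_numab_isolated(folio) ? folio_last_cpupid(folio) : -1;
}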

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22  3:55   ` Gregory Price
@ 2025-05-22  7:33     ` Bharata B Rao
  2025-05-22 15:38       ` Gregory Price
  0 siblings, 1 reply; 41+ messages in thread
From: Bharata B Rao @ 2025-05-22  7:33 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, hannes,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

On 22-May-25 9:25 AM, Gregory Price wrote:
> On Wed, May 21, 2025 at 01:32:38PM +0530, Bharata B Rao wrote:
>>   
>> +static void task_check_pending_migrations(struct task_struct *curr)
>> +{
>> +	struct callback_head *work = &curr->numa_mig_work;
>> +
>> +	if (work->next != work)
>> +		return;
>> +
>> +	if (time_after(jiffies, curr->numa_mig_interval) ||
>> +	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
>> +		curr->numa_mig_interval = jiffies + HZ;
>> +		task_work_add(curr, work, TWA_RESUME);
>> +	}
>> +}
>> +
>>   /*
>>    * Drive the periodic memory faults..
>>    */
>> @@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>   	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>>   		return;
>>   
>> +	task_check_pending_migrations(curr);
>> +
> 
> So I know this was discussed in the cover leter a bit and alluded to in
> the patch, but I want to add my 2cents from work on the unmapped page
> cache set.
> 
> In that set, I chose to always schedule the task work on the next return
> to user-space, rather than defer to a tick like the current numa-balance
> code.  This was for two concerns:
> 
> 1) I didn't want to leave a potentially large number of isolated folios
>     on a list that may not be reaped for an unknown period of time.
> 
>     I don't know the real limitations on the number of isolated folios,
>     but given what we have here I think we can represent a mathematical
>     worst case on the number of stranded folios.
> 
>     If (N=1,000,000, and M=511) then we could have ~1.8TB of pages
>     stranded on these lists - never to be migrated because it never hits
>     the threshold.  In practice this won't happen to that extreme, but
>     in practice it absolutely will happen for some chunk of tasks.

In addition to the threshold, I have a time limit too, and hence at the 
end of that period the isolated folios do get migrated even if the 
threshold isn't hit.

The other thing I haven't taken care of yet is putting back the isolated 
folios if the task exits while some are still pending on its list.

> 
>     So I chose to never leave kernel space with isolated folios on the
>     task numa_mig_list.
> 
>     This discussion changes if the numa_mig_list is not on the
>     task_struct and instead some per-cpu list routinely reaped by a
>     kthread (kpromoted or whatever).
>   
> 
> 2) I was not confident I could measure the performance implications of
>     the migrations directly when it was deferred.  When would I even know
>     it happened?  The actual goal is to *not* know it happened, right?
> 
>     But now it might happen during a page fault, or any random syscall.
> 
>     This concerned me - so i just didn't defer.  That was largely out of
>     lack of confidence in my own understanding of the task_work system.
> 
> 
> So i think this, as presented, is a half-measure - and I don't think
> it's a good half-measure.  I think we might need to go all the way to a
> set of per-cpu migration lists that a kernel work can pluck the head of
> on some interval.  That would bound the number of isolated folios to the
> number of CPUs rather than the number of tasks.

Why per-cpu and not per-node? All folios that are targeted for a node 
can be in that node's list.

I think if we are leaving the migration to be done by the migrator 
thread later, then isolating them beforehand may not be ideal. In such 
cases tracking the hot pages via PFNs like I did in kpromoted may be better.

Even when we have per-node migrator threads that would handle migration 
requests from multiple hot page sources (or a single unified layer), I 
still think there should be a "migrate now" kind of interface (which is 
essentially what your migrate_misplaced_folio_batch() is). That will be 
more suitable for handling migration requests originating from 
locality-based NUMA balancing (the NUMAB=1 case).
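
To make the per-node idea concrete, here is a purely illustrative sketch 
of a per-node pending list sitting next to the synchronous path; none of 
these names exist in the posted patches, and lock/list initialisation is 
omitted:

struct node_mig_queue {
        spinlock_t lock;
        struct list_head folios;        /* isolated folios bound for this node */
        unsigned long nr_pending;
};

static struct node_mig_queue node_mig_queues[MAX_NUMNODES];

/* Deferred path: queue an isolated folio for the node's migrator. */
static void node_mig_queue_add(struct folio *folio, int target_nid)
{
        struct node_mig_queue *q = &node_mig_queues[target_nid];

        spin_lock(&q->lock);
        list_add_tail(&folio->lru, &q->folios);
        q->nr_pending += folio_nr_pages(folio);
        spin_unlock(&q->lock);
}

/* "Migrate now" path: drain whatever is pending for the node. */
static void node_mig_queue_drain(int target_nid)
{
        struct node_mig_queue *q = &node_mig_queues[target_nid];
        LIST_HEAD(batch);

        spin_lock(&q->lock);
        list_splice_init(&q->folios, &batch);
        q->nr_pending = 0;
        spin_unlock(&q->lock);

        migrate_misplaced_folio_batch(&batch, target_nid);
}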

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22  7:33     ` Bharata B Rao
@ 2025-05-22 15:38       ` Gregory Price
  0 siblings, 0 replies; 41+ messages in thread
From: Gregory Price @ 2025-05-22 15:38 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, hannes,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

On Thu, May 22, 2025 at 01:03:35PM +0530, Bharata B Rao wrote:
> On 22-May-25 9:25 AM, Gregory Price wrote:
> > 
> > So i think this, as presented, is a half-measure - and I don't think
> > it's a good half-measure.  I think we might need to go all the way to a
> > set of per-cpu migration lists that a kernel work can pluck the head of
> > on some interval.  That would bound the number of isolated folios to the
> > number of CPUs rather than the number of tasks.
> 
> Why per-cpu and not per-node? All folios that are targeted for a node can be
> in that node's list.
> 

On systems with a significant number of threads (512-1024), these lists
may be highly contended.  I suppose we can start with per-node, but I
would not be surprised if this went straight to per-cpu.

> I think if we are leaving the migration to be done by the migrator thread
> later, then isolating them beforehand may not be ideal. In such cases
> tracking the hot pages via PFNs like I did in kpromoted may be better.
>

This seems like not a bad idea; you could do hot-swapped buffers to
prevent infinite growth / contention.  One of the problems with PFNs is
that the state of that page can change between candidacy and promotion.
I suppose the devil is in the details there.
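
For example, a recorded PFN would have to be re-validated at promotion 
time, along these lines (a simplified sketch, not from any posted series; 
a real version needs more checks):

static struct folio *revalidate_hot_pfn(unsigned long pfn)
{
        struct folio *folio;

        if (!pfn_valid(pfn))
                return NULL;

        folio = pfn_folio(pfn);
        if (!folio_try_get(folio))
                return NULL;                    /* freed since it was recorded */

        if (!folio_test_lru(folio)) {           /* reused, or already isolated */
                folio_put(folio);
                return NULL;
        }

        return folio;   /* caller still has to isolate before migrating */
}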

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch
  2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
@ 2025-05-22 15:59   ` David Hildenbrand
  2025-05-22 16:03     ` Gregory Price
  2025-05-26  8:16   ` Huang, Ying
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-05-22 15:59 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm

On 21.05.25 10:02, Bharata B Rao wrote:
> From: Gregory Price <gourry@gourry.net>
> 
> A common operation in tiering is to migrate multiple pages at once.
> The migrate_misplaced_folio function requires one call for each
> individual folio.  Expose a batch-variant of the same call for use
> when doing batch migrations.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>   include/linux/migrate.h |  6 ++++++
>   mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
>   2 files changed, 37 insertions(+)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index aaa2114498d6..c9496adcf192 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
>   int migrate_misplaced_folio_prepare(struct folio *folio,
>   		struct vm_area_struct *vma, int node);
>   int migrate_misplaced_folio(struct folio *folio, int node);
> +int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
>   #else
>   static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>   		struct vm_area_struct *vma, int node)
> @@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>   {
>   	return -EAGAIN; /* can't migrate now */
>   }
> +static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
> +						int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>   #endif /* CONFIG_NUMA_BALANCING */
>   
>   #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 676d9cfc7059..32cc2eafb037 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2733,5 +2733,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>   	BUG_ON(!list_empty(&migratepages));
>   	return nr_remaining ? -EAGAIN : 0;
>   }
> +
> +/*
> + * Batch variant of migrate_misplaced_folio. Attempts to migrate
> + * a folio list to the specified destination.
> + *
> + * Caller is expected to have isolated the folios by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio.
> + *
> + * This function will un-isolate the folios, dereference them, and
> + * remove them from the list before returning.
> + */
> +int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)

"migrate_misplaced_folios" ?

:)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch
  2025-05-22 15:59   ` David Hildenbrand
@ 2025-05-22 16:03     ` Gregory Price
  2025-05-22 16:08       ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Gregory Price @ 2025-05-22 16:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm

On Thu, May 22, 2025 at 05:59:01PM +0200, David Hildenbrand wrote:
> > +int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
> 
> "migrate_misplaced_folios" ?
> 
> :)

something something brevity is the soul of wit 

I think I went with _batch to match surrounding code (been a while since
I wrote this), but I don't have strong feelings either way.

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch
  2025-05-22 16:03     ` Gregory Price
@ 2025-05-22 16:08       ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-22 16:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22.05.25 18:03, Gregory Price wrote:
> On Thu, May 22, 2025 at 05:59:01PM +0200, David Hildenbrand wrote:
>>> +int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
>>
>> "migrate_misplaced_folios" ?
>>
>> :)
> 
> something something brevity is the soul of wit
> 
> I think I went with _batch to match surrounding code (been a while since
> I wrote this), but I don't have strong feelings either way.

I think we have migrate_pages_batch() and migrate_pages_sync() because 
... they are called from migrate_pages() :)

For something that "simply" calls migrate_pages() right now, probably we 
should just call it migrate_folios().

But maybe you were referring to yet another set of "_batch" functions.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
  2025-05-21 18:25   ` Donet Tom
  2025-05-22  3:55   ` Gregory Price
@ 2025-05-22 16:11   ` David Hildenbrand
  2025-05-22 16:24     ` Zi Yan
  2025-05-26  5:14     ` Bharata B Rao
  2 siblings, 2 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-22 16:11 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm

On 21.05.25 10:02, Bharata B Rao wrote:
> Currently the folios identified as misplaced by the NUMA
> balancing sub-system are migrated one by one from the NUMA
> hint fault handler as and when they are identified as
> misplaced.
> 
> Instead of such single folio migrations, batch them and
> migrate them at once.
> 
> Identified misplaced folios are isolated and stored in
> a per-task list. A new task_work is queued from task tick
> handler to migrate them in batches. Migration is done
> periodically or if pending number of isolated folios exceeds
> a threshold.

That means that these pages are effectively unmovable for other purposes 
(CMA, compaction, long-term pinning, whatever) until that list was drained.

Bad.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 16:11   ` David Hildenbrand
@ 2025-05-22 16:24     ` Zi Yan
  2025-05-22 16:26       ` David Hildenbrand
  2025-05-26  5:14     ` Bharata B Rao
  1 sibling, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-05-22 16:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22 May 2025, at 12:11, David Hildenbrand wrote:

> On 21.05.25 10:02, Bharata B Rao wrote:
>> Currently the folios identified as misplaced by the NUMA
>> balancing sub-system are migrated one by one from the NUMA
>> hint fault handler as and when they are identified as
>> misplaced.
>>
>> Instead of such single folio migrations, batch them and
>> migrate them at once.
>>
>> Identified misplaced folios are isolated and stored in
>> a per-task list. A new task_work is queued from task tick
>> handler to migrate them in batches. Migration is done
>> periodically or if pending number of isolated folios exceeds
>> a threshold.
>
> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>
> Bad.

Probably we can mark these pages and, when others want to migrate the page,
get_new_page() just looks at the page's target node and gets a new page from
the target node.
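
As a sketch of that idea (folio_get_target_nid() is hypothetical), the
allocation callback used during such a migration could simply prefer the
stored target node over the current one:

static struct folio *alloc_dst_honour_target_nid(struct folio *src,
                                                 unsigned long private)
{
        int nid = folio_get_target_nid(src);    /* hypothetical accessor */

        if (nid == NUMA_NO_NODE)
                nid = (int)private;             /* fall back to caller's node */

        return __folio_alloc_node(GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
                                  folio_order(src), nid);
}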

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 16:24     ` Zi Yan
@ 2025-05-22 16:26       ` David Hildenbrand
  2025-05-22 16:38         ` Zi Yan
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-05-22 16:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22.05.25 18:24, Zi Yan wrote:
> On 22 May 2025, at 12:11, David Hildenbrand wrote:
> 
>> On 21.05.25 10:02, Bharata B Rao wrote:
>>> Currently the folios identified as misplaced by the NUMA
>>> balancing sub-system are migrated one by one from the NUMA
>>> hint fault handler as and when they are identified as
>>> misplaced.
>>>
>>> Instead of such single folio migrations, batch them and
>>> migrate them at once.
>>>
>>> Identified misplaced folios are isolated and stored in
>>> a per-task list. A new task_work is queued from task tick
>>> handler to migrate them in batches. Migration is done
>>> periodically or if pending number of isolated folios exceeds
>>> a threshold.
>>
>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>
>> Bad.
> 
> Probably we can mark these pages and when others want to migrate the page,
> get_new_page() just looks at the page's target node and get a new page from
> the target node.

How do you envision that working when CMA needs to migrate this exact 
page to a different location?

It cannot isolate it for migration because ... it's already isolated ... 
so it will give up.

Marking might not be easy I assume ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-22  3:08   ` Gregory Price
@ 2025-05-22 16:30     ` SeongJae Park
  2025-05-22 17:40       ` Gregory Price
  0 siblings, 1 reply; 41+ messages in thread
From: SeongJae Park @ 2025-05-22 16:30 UTC (permalink / raw)
  To: Gregory Price
  Cc: SeongJae Park, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, weixugc, willy, ying.huang, ziy,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Wed, 21 May 2025 23:08:16 -0400 Gregory Price <gourry@gourry.net> wrote:

> On Wed, May 21, 2025 at 11:45:52AM -0700, SeongJae Park wrote:
> > Hi Bharata,
> > 
> > On Wed, 21 May 2025 13:32:36 +0530 Bharata B Rao <bharata@amd.com> wrote:
> > 
> > > Hi,
> > > 
> > > This is an attempt to convert the NUMA balancing to do batched
> > > migration instead of migrating one folio at a time. The basic
> > > idea is to collect (from hint fault handler) the folios to be
> > > migrated in a list and batch-migrate them from task_work context.
> > > More details about the specifics are present in patch 2/2.
> > > 
> > > During LSFMM[1] and subsequent discussions in MM alignment calls[2],
> > > it was suggested that separate migration threads to handle migration
> > > or promotion request may be desirable. Existing NUMA balancing, hot
> > > page promotion and other future promotion techniques could off-load
> > > migration part to these threads. Or if we manage to have a single
> > > source of hotness truth like kpromoted[3], then that too can hand
> > > over migration requests to the migration threads. I am envisaging
> > > that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
> > > and CXL HMU would push hot page info to kpromoted, which would
> > > then isolate and push the folios to be promoted to the migrator
> > > thread.
> > 
> > I think (or, hope) it would also be not very worthless or rude to mention other
> > existing and ongoing works that have potentials to serve for similar purpose or
> > collaborate in future, here.
> > 
> > DAMON is designed for a sort of multi-source access information handling.  In
> > LSFMM, I proposed[1] damon_report_access() interface for making it easier to be
> > extended for more types of access information.  Currently damon_report_access()
> > is under early development.  I think this has a potential to serve something
> > similar to your single source goal.
> > 
> 
> It seems to me that DAMON might make use of the batch migration
> interface, so if you need any changes or extensions, it might be good
> for you (SJ) to take a look at that for us.

I started this subthread not for batch migration but for the long-term goal.  I
have taken only a glance at the migration batching part, and I'm still trying to
find time to take a deeper look at it.

Nonetheless, yes, basically I believe DAMON and Bharata's works have great
opportunities to collaborate and use each other in very productive ways.  I'm
especially interested in kpromoted's AMD IBS code, and am trying to make DAMON
easier to use for Bharata's work.

For the batch migration interface, though, to be honest I don't yet see a clear
use of it by DAMON, since DAMON does a region-based sort of batched migration.
Again, I have taken only a glance at the migration batching part and will take
more time over the details.

Meanwhile, if you see some opportunities, and if you don't mind, it would be
very helpful if you could share your detailed view of them (in what way could
DAMON be better by using the migration batching?).


Thanks,
SJ


> 
> ~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 16:26       ` David Hildenbrand
@ 2025-05-22 16:38         ` Zi Yan
  2025-05-22 17:21           ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-05-22 16:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22 May 2025, at 12:26, David Hildenbrand wrote:

> On 22.05.25 18:24, Zi Yan wrote:
>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>
>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>> Currently the folios identified as misplaced by the NUMA
>>>> balancing sub-system are migrated one by one from the NUMA
>>>> hint fault handler as and when they are identified as
>>>> misplaced.
>>>>
>>>> Instead of such single folio migrations, batch them and
>>>> migrate them at once.
>>>>
>>>> Identified misplaced folios are isolated and stored in
>>>> a per-task list. A new task_work is queued from task tick
>>>> handler to migrate them in batches. Migration is done
>>>> periodically or if pending number of isolated folios exceeds
>>>> a threshold.
>>>
>>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>>
>>> Bad.
>>
>> Probably we can mark these pages and when others want to migrate the page,
>> get_new_page() just looks at the page's target node and get a new page from
>> the target node.
>
> How do you envision that working when CMA needs to migrate this exact page to a different location?
>
> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>
> Marking might not be easy I assume ...

I guess you mean we do not have any extra bit to indicate this page is isolated,
but it can be migrated. My point is that if this page is going to be migrated
due to other reasons, like CMA, compaction, why not migrate it to the target
node instead of moving it around within the same node.


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 16:38         ` Zi Yan
@ 2025-05-22 17:21           ` David Hildenbrand
  2025-05-22 17:30             ` Zi Yan
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-05-22 17:21 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22.05.25 18:38, Zi Yan wrote:
> On 22 May 2025, at 12:26, David Hildenbrand wrote:
> 
>> On 22.05.25 18:24, Zi Yan wrote:
>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>
>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>> Currently the folios identified as misplaced by the NUMA
>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>> hint fault handler as and when they are identified as
>>>>> misplaced.
>>>>>
>>>>> Instead of such single folio migrations, batch them and
>>>>> migrate them at once.
>>>>>
>>>>> Identified misplaced folios are isolated and stored in
>>>>> a per-task list. A new task_work is queued from task tick
>>>>> handler to migrate them in batches. Migration is done
>>>>> periodically or if pending number of isolated folios exceeds
>>>>> a threshold.
>>>>
>>>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>>>
>>>> Bad.
>>>
>>> Probably we can mark these pages and when others want to migrate the page,
>>> get_new_page() just looks at the page's target node and get a new page from
>>> the target node.
>>
>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>
>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>
>> Marking might not be easy I assume ...
> 
> I guess you mean we do not have any extra bit to indicate this page is isolated,
> but it can be migrated. My point is that if this page is going to be migrated
> due to other reasons, like CMA, compaction, why not migrate it to the target
> node instead of moving it around within the same node.

I think we'd have to identify that

a) This page is isolate for migration (could be isolated for other
    reasons)

b) The one responsible for the isolation is numa code (could be someone
    else)

c) We're allowed to grab that page from that list (IOW sync against
    others, and especially also against), to essentially "steal" the
    isolated page.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 17:21           ` David Hildenbrand
@ 2025-05-22 17:30             ` Zi Yan
  2025-05-26  8:33               ` Huang, Ying
  2025-05-26  9:29               ` David Hildenbrand
  0 siblings, 2 replies; 41+ messages in thread
From: Zi Yan @ 2025-05-22 17:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22 May 2025, at 13:21, David Hildenbrand wrote:

> On 22.05.25 18:38, Zi Yan wrote:
>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>
>>> On 22.05.25 18:24, Zi Yan wrote:
>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>
>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>> hint fault handler as and when they are identified as
>>>>>> misplaced.
>>>>>>
>>>>>> Instead of such single folio migrations, batch them and
>>>>>> migrate them at once.
>>>>>>
>>>>>> Identified misplaced folios are isolated and stored in
>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>> handler to migrate them in batches. Migration is done
>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>> a threshold.
>>>>>
>>>>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>>>>
>>>>> Bad.
>>>>
>>>> Probably we can mark these pages and when others want to migrate the page,
>>>> get_new_page() just looks at the page's target node and get a new page from
>>>> the target node.
>>>
>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>
>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>
>>> Marking might not be easy I assume ...
>>
>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>> but it can be migrated. My point is that if this page is going to be migrated
>> due to other reasons, like CMA, compaction, why not migrate it to the target
>> node instead of moving it around within the same node.
>
> I think we'd have to identify that
>
> a) This page is isolate for migration (could be isolated for other
>    reasons)
>
> b) The one responsible for the isolation is numa code (could be someone
>    else)
>
> c) We're allowed to grab that page from that list (IOW sync against
>    others, and especially also against), to essentially "steal" the
>    isolated page.

Right. c) sounds like adding more contention to the candidate list.
I wonder if we can just mark the page as a migration candidate (using
a page flag or something else), then migrate it whenever CMA,
compaction, long-term pinning and more look at the page. In addition,
periodically, the migration task would do a PFN scanning and migrate
any migration candidate. I remember Willy did some experiments showing
that PFN scanning is very fast.
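
A very rough sketch of what that periodic scan could look like (the
candidate flag is hypothetical -- no such page flag exists today -- and
zone/hole handling is glossed over):

static void scan_node_for_candidates(int nid, struct list_head *out)
{
        unsigned long pfn = node_start_pfn(nid);
        unsigned long end = pfn + node_spanned_pages(nid);

        for (; pfn < end; pfn++) {
                struct folio *folio;

                if (!pfn_valid(pfn))
                        continue;

                folio = pfn_folio(pfn);
                if (!folio_try_get(folio))
                        continue;

                if (!folio_test_promote_candidate(folio) ||     /* hypothetical */
                    !folio_isolate_lru(folio)) {
                        folio_put(folio);
                        continue;
                }

                folio_put(folio);       /* isolation holds its own reference */
                list_add_tail(&folio->lru, out);
        }
}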

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-22 16:30     ` SeongJae Park
@ 2025-05-22 17:40       ` Gregory Price
  2025-05-22 18:52         ` SeongJae Park
  0 siblings, 1 reply; 41+ messages in thread
From: Gregory Price @ 2025-05-22 17:40 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, hannes, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Thu, May 22, 2025 at 09:30:23AM -0700, SeongJae Park wrote:
> On Wed, 21 May 2025 23:08:16 -0400 Gregory Price <gourry@gourry.net> wrote:
> 
> > 
> > It seems to me that DAMON might make use of the batch migration
> > interface, so if you need any changes or extensions, it might be good
> > for you (SJ) to take a look at that for us.
> 
> For batch migration interface, though, to be honest I don't find very clear
> DAMON's usage of it, since DAMON does region-based sort of batched migration.
> Again, I took only a glance on migration batching part and gonna take more time
> to the details.
> 

DAMON would identify a set of PFNs or folios to migrate; at some point,
wouldn't it be beneficial for DAMON to simply re-use:

int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)

If not, then why?

That's what I mean by asking what DAMON would want out of such an interface.
Not the async part, but the underlying migration functions.

~Gregory

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Apologies and clarifications on DAMON-disruptions (was Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing)
  2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
  2025-05-22  3:08   ` Gregory Price
@ 2025-05-22 18:43   ` SeongJae Park
  2025-05-26  5:20   ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
  2 siblings, 0 replies; 41+ messages in thread
From: SeongJae Park @ 2025-05-22 18:43 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, weixugc, willy, ying.huang, ziy,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Wed, 21 May 2025 11:45:52 -0700 SeongJae Park <sj@kernel.org> wrote:

[...]
> I think (or, hope) it would also be not very worthless or rude to mention other
> existing and ongoing works that have potentials to serve for similar purpose or
> collaborate in future, here.
> 
> DAMON is designed for a sort of multi-source access information handling.  In
> LSFMM, I proposed[1] damon_report_access() interface for making it easier to be
> extended for more types of access information.  Currently damon_report_access()
> is under early development.  I think this has a potential to serve something
> similar to your single source goal.
[...]

I heard some people are feeling uncomfortable about patterns in my mails like
this.  I understand the pattern is that I suddenly reply to a thread saying
"hey, by the way DAMON is ...", and I understand it bothers people when they
want to discuss something more than DAMON.

I never intended to make others feel uncomfortable and was doing that only in
good faith, to make sure discussions happen with full information.  But if you
felt so, you felt so.  I sincerely apologize if you did, and will try my best
not to bother you next time.

But I'm human, and I cannot do more than my best effort.  I hence expect I
will unintentionally continue making people upset.  I think I might be able to
reduce such cases by explaining why and what I'm doing, and how you can avoid
it.

Yes, I might be bothering you with exactly that pattern right now, but let me
do this, hopefully for the last time, to reduce future recurrences.

TL;DR: please briefly mention DAMON, and optionally clarify that DAMON
discussion and/or clarifications are unwelcome for now, if you don't want to be
bothered by DAMON.

Why and What I'm Doing
======================

I sometimes find threads about work that seems related to DAMON.  From such
mails, I find opportunities to help the work using DAMON, shiny features of the
work that DAMON could adopt, or whatever makes me believe so.  Such mails
sometimes contain explanations of those relations, and sometimes not.

In some cases the description of those relations has details that look
important in the context but are missing, outdated, or wrong.  I believe I
have a responsibility, as a maintainer of DAMON, to add or fix those, so that
the Linux kernel stays healthier with discussions made under full information,
and so I jump in.

If there is no such description at all, I have no way to know whether the
authors just don't know about DAMON, mistakenly left that part out, or want to
ignore DAMON at the moment.  Hence, again, I believe it is a responsibility of
the DAMON maintainer to point it out, to help the original author's and other
reviewers' understanding.

I think sometimes you didn't want to discuss or know such details of DAMON,
but I jumped in and wasted your time because I misread your intention.

How You Can Prevent I Bothering You with DAMON
==============================================

Please help me better understand your intentions.  If you know DAMON and you
feel I may find your work related to DAMON, please take a moment to add a
brief paragraph explaining it.  If you don't want me adding more details that I
believe are worth adding there, or don't want to discuss DAMON at that level of
detail, please say so.  E.g., "this may have potential relations with DAMON,
but that's out of scope at the moment."

I will also do my best to read implicit intentions, but again, best effort is
best effort, and I cannot promise something I cannot do.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-22 17:40       ` Gregory Price
@ 2025-05-22 18:52         ` SeongJae Park
  0 siblings, 0 replies; 41+ messages in thread
From: SeongJae Park @ 2025-05-22 18:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: SeongJae Park, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, weixugc, willy, ying.huang, ziy,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Thu, 22 May 2025 13:40:59 -0400 Gregory Price <gourry@gourry.net> wrote:

> On Thu, May 22, 2025 at 09:30:23AM -0700, SeongJae Park wrote:
> > On Wed, 21 May 2025 23:08:16 -0400 Gregory Price <gourry@gourry.net> wrote:
> > 
> > > 
> > > It seems to me that DAMON might make use of the batch migration
> > > interface, so if you need any changes or extensions, it might be good
> > > for you (SJ) to take a look at that for us.
> > 
> > For the batch migration interface, though, to be honest I don't yet see a clear
> > use of it by DAMON, since DAMON does a region-based sort of batched migration.
> > Again, I have taken only a glance at the migration batching part and will take
> > more time over the details.
> > 
> 
> DAMON would identify a set of PFNs or folios to migrate; at some point,
> wouldn't it be beneficial for DAMON to simply re-use:
> 
> int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)

Good idea.  Actually, we implemented DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD
instead of DAMOS_PROMOTE and DAMOS_DEMOTE since we weren't sure the promotion
path logic was something everyone agreed upon.  FYI, I'm planning to revisit
the promotion path logic later (don't wait for me though ;) ).

> 
> If not, then why?

I'll need to look into the details, but from a high-level glance I think it is
a good idea.

> 
> That's what I mean by asking what DAMON would want out of such an interface.
> Not the async part, but the underlying migration functions.

Makes sense, thank you!


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22  4:39     ` Bharata B Rao
@ 2025-05-23  9:05       ` Donet Tom
  0 siblings, 0 replies; 41+ messages in thread
From: Donet Tom @ 2025-05-23  9:05 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david


On 5/22/25 10:09 AM, Bharata B Rao wrote:
> Hi Donet,
>
> On 21-May-25 11:55 PM, Donet Tom wrote:
>>
>>> +static void migrate_queued_pages(struct list_head *migrate_list)
>>> +{
>>> +    int cur_nid, nid;
>>> +    struct folio *folio, *tmp;
>>> +    LIST_HEAD(nid_list);
>>> +
>>> +    folio = list_entry(migrate_list, struct folio, lru);
>>> +    cur_nid = folio_last_cpupid(folio);
>>
>> Hi Bharatha,
>>
>> This is target node ID right?
>
> Correct.
>
>>
>>
>>> +
>>> +    list_for_each_entry_safe(folio, tmp, migrate_list, lru) {
>>> +        nid = folio_xchg_last_cpupid(folio, -1);
>>
>> Just one doubt: to get the last CPU ID (target node ID) here,
>> folio_xchg_last_cpupid() is used, whereas earlier folio_last_cpupid()
>> was used. Is there a specific reason for using different functions?
>
> This function iterates over the isolated folios looking for the same 
> target_nid so that all of them can be migrated at once to the given 
> target_nid. Hence the first call just reads the target_nid from 
> last_cpupid field to note which nid is of interest in the current 
> iteration and the next call actually reads target_nid and resets the 
> last_cpupid field.


Thank you, Bharata, for the clarification.

>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 16:11   ` David Hildenbrand
  2025-05-22 16:24     ` Zi Yan
@ 2025-05-26  5:14     ` Bharata B Rao
  1 sibling, 0 replies; 41+ messages in thread
From: Bharata B Rao @ 2025-05-26  5:14 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm

On 22-May-25 9:41 PM, David Hildenbrand wrote:
> On 21.05.25 10:02, Bharata B Rao wrote:
>> Currently the folios identified as misplaced by the NUMA
>> balancing sub-system are migrated one by one from the NUMA
>> hint fault handler as and when they are identified as
>> misplaced.
>>
>> Instead of such single folio migrations, batch them and
>> migrate them at once.
>>
>> Identified misplaced folios are isolated and stored in
>> a per-task list. A new task_work is queued from task tick
>> handler to migrate them in batches. Migration is done
>> periodically or if pending number of isolated folios exceeds
>> a threshold.
> 
> That means that these pages are effectively unmovable for other purposes 
> (CMA, compaction, long-term pinning, whatever) until that list was drained.
> 
> Bad.

During last week's MM alignment call on this subject, it was decided 
not to isolate and batch from fault context and migrate from task_work 
context like this.

However, since the amount of time the folios stay in the isolated state 
is bounded both by time (1s in my patchset) and by the number of folios 
isolated, I thought it should be okay, but I may be wrong.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
  2025-05-22  3:08   ` Gregory Price
  2025-05-22 18:43   ` Apologies and clarifications on DAMON-disruptions (was Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing) SeongJae Park
@ 2025-05-26  5:20   ` Bharata B Rao
  2025-05-27 18:50     ` SeongJae Park
  2 siblings, 1 reply; 41+ messages in thread
From: Bharata B Rao @ 2025-05-26  5:20 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

Hi SJ,

On 22-May-25 12:15 AM, SeongJae Park wrote:
> Hi Bharata,
> 
> On Wed, 21 May 2025 13:32:36 +0530 Bharata B Rao <bharata@amd.com> wrote:
> 
>> Hi,
>>
>> This is an attempt to convert the NUMA balancing to do batched
>> migration instead of migrating one folio at a time. The basic
>> idea is to collect (from hint fault handler) the folios to be
>> migrated in a list and batch-migrate them from task_work context.
>> More details about the specifics are present in patch 2/2.
>>
>> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
>> it was suggested that separate migration threads to handle migration
>> or promotion request may be desirable. Existing NUMA balancing, hot
>> page promotion and other future promotion techniques could off-load
>> migration part to these threads. Or if we manage to have a single
>> source of hotness truth like kpromoted[3], then that too can hand
>> over migration requests to the migration threads. I am envisaging
>> that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
>> and CXL HMU would push hot page info to kpromoted, which would
>> then isolate and push the folios to be promoted to the migrator
>> thread.
> 
> I think (or, hope) it would also be not very worthless or rude to mention other
> existing and ongoing works that have potentials to serve for similar purpose or
> collaborate in future, here.
> 
> DAMON is designed for a sort of multi-source access information handling.  In
> LSFMM, I proposed[1] damon_report_access() interface for making it easier to be
> extended for more types of access information.  Currently damon_report_access()
> is under early development.  I think this has a potential to serve something
> similar to your single source goal.
> 
> Also in LSFMM, I proposed damos_add_folio() for a case that callers want to
> utilize DAMON worker thread (kdamond) as an asynchronous memory
> management operations execution thread while using its other features such as
> [auto-tuned] quotas.  I think this has a potential to serve something similar
> to your migration threads.  I haven't started damos_add_folio() development
> yet, though.
> 
> I remember we discussed about DAMON on mailing list and in LSFMM a bit, on your
> session.  IIRC, you were also looking for a time to see if there is a chance to
> use DAMON in some way.  Due to the technical issue, we were unable to discuss
> on the two new proposals on my LSFMM session, and it has been a bit while since
> our last discussion.  So if you don't mind, I'd like to ask if you have some
> opinions or comments about these.
> 
> [1] https://lwn.net/Articles/1016525/

Since this patchset was just about making the migration batched and 
async for NUMAB, I didn't mention DAMON as an alternative here.

One of the concerns I have always had about DAMON, when it is considered 
as a replacement for the existing hot page migration, is its current 
inability to gather and maintain hot page info at per-folio granularity. 
How much that eventually matters to workloads remains to be seen.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch
  2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
  2025-05-22 15:59   ` David Hildenbrand
@ 2025-05-26  8:16   ` Huang, Ying
  1 sibling, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2025-05-26  8:16 UTC (permalink / raw)
  To: Bharata B Rao, gourry
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, hannes,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ziy, dave, nifan.cxl, joshua.hahnjy, xuezhengchu,
	yiannis, akpm, david

Bharata B Rao <bharata@amd.com> writes:

> From: Gregory Price <gourry@gourry.net>
>
> A common operation in tiering is to migrate multiple pages at once.
> The migrate_misplaced_folio function requires one call for each
> individual folio.  Expose a batch-variant of the same call for use
> when doing batch migrations.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  include/linux/migrate.h |  6 ++++++
>  mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index aaa2114498d6..c9496adcf192 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
>  int migrate_misplaced_folio_prepare(struct folio *folio,
>  		struct vm_area_struct *vma, int node);
>  int migrate_misplaced_folio(struct folio *folio, int node);
> +int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
>  #else
>  static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>  		struct vm_area_struct *vma, int node)
> @@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>  {
>  	return -EAGAIN; /* can't migrate now */
>  }
> +static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
> +						int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 676d9cfc7059..32cc2eafb037 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2733,5 +2733,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>  	BUG_ON(!list_empty(&migratepages));
>  	return nr_remaining ? -EAGAIN : 0;
>  }
> +
> +/*
> + * Batch variant of migrate_misplaced_folio. Attempts to migrate
> + * a folio list to the specified destination.
> + *
> + * Caller is expected to have isolated the folios by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio.
> + *
> + * This function will un-isolate the folios, dereference them, and
> + * remove them from the list before returning.
> + */
> +int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
> +{
> +	pg_data_t *pgdat = NODE_DATA(node);
> +	unsigned int nr_succeeded;
> +	int nr_remaining;
> +
> +	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> +				     NULL, node, MIGRATE_ASYNC,
> +				     MR_NUMA_MISPLACED, &nr_succeeded);
> +	if (nr_remaining)
> +		putback_movable_pages(folio_list);
> +
> +	if (nr_succeeded) {
> +		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> +		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
> +	}
> +	BUG_ON(!list_empty(folio_list));
> +	return nr_remaining ? -EAGAIN : 0;
> +}
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */

migrate_misplaced_folio_batch() looks quite similar to
migrate_misplaced_folio(); can we merge them?
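
One possible shape of such a merge, as a sketch (modulo the small
differences in how the two currently account vmstat events), would be to
keep the batch function as the real implementation and make the
single-folio variant a thin wrapper:

int migrate_misplaced_folio(struct folio *folio, int node)
{
        LIST_HEAD(migratepages);

        list_add(&folio->lru, &migratepages);
        return migrate_misplaced_folio_batch(&migratepages, node);
}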

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 17:30             ` Zi Yan
@ 2025-05-26  8:33               ` Huang, Ying
  2025-05-26  9:29               ` David Hildenbrand
  1 sibling, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2025-05-26  8:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, dave,
	nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

Zi Yan <ziy@nvidia.com> writes:

> On 22 May 2025, at 13:21, David Hildenbrand wrote:
>
>> On 22.05.25 18:38, Zi Yan wrote:
>>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>>
>>>> On 22.05.25 18:24, Zi Yan wrote:
>>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>>
>>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>>> hint fault handler as and when they are identified as
>>>>>>> misplaced.
>>>>>>>
>>>>>>> Instead of such single folio migrations, batch them and
>>>>>>> migrate them at once.
>>>>>>>
>>>>>>> Identified misplaced folios are isolated and stored in
>>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>>> handler to migrate them in batches. Migration is done
>>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>>> a threshold.
>>>>>>
>>>>>> That means that these pages are effectively unmovable for other
>>>>>> purposes (CMA, compaction, long-term pinning, whatever) until
>>>>>> that list was drained.
>>>>>>
>>>>>> Bad.
>>>>>
>>>>> Probably we can mark these pages and when others want to migrate the page,
>>>>> get_new_page() just looks at the page's target node and get a new page from
>>>>> the target node.
>>>>
>>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>>
>>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>>
>>>> Marking might not be easy I assume ...
>>>
>>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>>> but it can be migrated. My point is that if this page is going to be migrated
>>> due to other reasons, like CMA, compaction, why not migrate it to the target
>>> node instead of moving it around within the same node.
>>
>> I think we'd have to identify that
>>
>> a) This page is isolate for migration (could be isolated for other
>>    reasons)
>>
>> b) The one responsible for the isolation is numa code (could be someone
>>    else)
>>
>> c) We're allowed to grab that page from that list (IOW sync against
>>    others, and especially also against), to essentially "steal" the
>>    isolated page.
>
> Right. c) sounds like adding more contention to the candidate list.
> I wonder if we can just mark the page as migration candidate (using
> a page flag or something else), then migrate it whenever CMA,
> compaction, long-term pinning and more look at the page. In addition,
> periodically, the migration task would do a PFN scanning and migrate
> any migration candidate. I remember Willy did some experiments showing
> that PFN scanning is very fast.

I think that this could be a second step optimization after the simple
implementation has been done.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-21  8:02 [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
                   ` (2 preceding siblings ...)
  2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
@ 2025-05-26  8:46 ` Huang, Ying
  2025-05-27  8:53   ` Bharata B Rao
  3 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2025-05-26  8:46 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

Hi, Bharata,

Bharata B Rao <bharata@amd.com> writes:

> Hi,
>
> This is an attempt to convert the NUMA balancing to do batched
> migration instead of migrating one folio at a time. The basic
> idea is to collect (from hint fault handler) the folios to be
> migrated in a list and batch-migrate them from task_work context.
> More details about the specifics are present in patch 2/2.
>
> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
> it was suggested that separate migration threads to handle migration
> or promotion request may be desirable. Existing NUMA balancing, hot
> page promotion and other future promotion techniques could off-load
> migration part to these threads.

What is the expected benefit of the change?

For code reuse, we can use migrate_misplaced_folio() or
migrate_misplaced_folio_batch() in the various promotion paths.

For workload latency impact, per my understanding, PTE scanning is a much
more serious concern than migration.  Why not start from that?

> Or if we manage to have a single
> source of hotness truth like kpromoted[3], then that too can hand
> over migration requests to the migration threads. I am envisaging
> that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
> and CXL HMU would push hot page info to kpromoted, which would
> then isolate and push the folios to be promoted to the migrator
> thread.
>
> As a first step, this is an attempt to batch and perform NUMAB
> migrations in async manner. Separate migration threads aren't
> yet implemented but I am using Gregory's patch[7] that provides
> migrate_misplaced_folio_batch() API to do batch migration of
> misplaced folios.
>
> Some points for discussion
> --------------------------
> 1. To isolate the misplaced folios or not?
>
> To do batch migration, the misplaced folios need to be stored in
> some manner. I thought isolating them and using the folio->lru
> field to link them up would be the most straight-forward way. But
> then there were concerns expressed about folios remaining isolated
> for long until they get migrated.
>
> Or should we just maintain the PFNs instead of folios and
> isolate them only just prior to migrating them?
>
> 2. Managing target_nid for misplaced pages
>
> NUMAB provides the accurate target_nid for each folio that is
> detected as misplaced. However when we don't migrate the folio
> right away, but instead want to batch and do async migration later,
> then where do we keep track of target_nid for each folio?
>
> In this implementation, I am using last_cpupid field as it appeared
> that this field could be reused (with some challenges mentioned
> in 2/2) for isolated folios. This approach may be specific to NUMAB
> but then each sub-system that hands over pages to the migrator thread
> should also provide a target_nid and hence each sub-system should be
> free to maintain and track the target_nid of folios that it has
> isolated/batched for migration in its own specific manner.
>
> 3. How many folios to batch?
>
> Currently I have a fixed threshold for number of folios to batch.
> It could be a sysctl to allow a setting between a min and max. It
> could also be auto-tuned if required.
>
> The state of the patchset
> -------------------------
> * Still raw and very lightly tested
> * Just posted to serve as base for subsequent discussions
>   here and in MM alignment calls.
>
> References
> ----------
> [1] LSFMM LWN summary - https://lwn.net/Articles/1016519/
> [2] MM alignment call summary - https://lore.kernel.org/linux-mm/263d7140-c343-e82e-b836-ec85c52b54eb@google.com/
> [3] kpromoted patchset - https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
> [4] Kmmscand: PTE A bit scanning - https://lore.kernel.org/linux-mm/20250319193028.29514-1-raghavendra.kt@amd.com/
> [5] MGLRU scanning for page promotion - https://lore.kernel.org/lkml/20250324220301.1273038-1-kinseyho@google.com/
> [6] IBS base hot page promotion - https://lore.kernel.org/linux-mm/20250306054532.221138-4-bharata@amd.com/
> [7] Unmapped page cache folio promotion patchset - https://lore.kernel.org/linux-mm/20250411221111.493193-1-gourry@gourry.net/
>
> Bharata B Rao (1):
>   mm: sched: Batch-migrate misplaced pages
>
> Gregory Price (1):
>   migrate: implement migrate_misplaced_folio_batch
>
>  include/linux/migrate.h |  6 ++++
>  include/linux/sched.h   |  4 +++
>  init/init_task.c        |  2 ++
>  kernel/sched/fair.c     | 64 +++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c             | 44 ++++++++++++++--------------
>  mm/migrate.c            | 31 ++++++++++++++++++++
>  6 files changed, 130 insertions(+), 21 deletions(-)

---
Best Regards,
Huang, Ying


* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-22 17:30             ` Zi Yan
  2025-05-26  8:33               ` Huang, Ying
@ 2025-05-26  9:29               ` David Hildenbrand
  2025-05-26 14:20                 ` Zi Yan
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:29 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 22.05.25 19:30, Zi Yan wrote:
> On 22 May 2025, at 13:21, David Hildenbrand wrote:
> 
>> On 22.05.25 18:38, Zi Yan wrote:
>>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>>
>>>> On 22.05.25 18:24, Zi Yan wrote:
>>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>>
>>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>>> hint fault handler as and when they are identified as
>>>>>>> misplaced.
>>>>>>>
>>>>>>> Instead of such single folio migrations, batch them and
>>>>>>> migrate them at once.
>>>>>>>
>>>>>>> Identified misplaced folios are isolated and stored in
>>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>>> handler to migrate them in batches. Migration is done
>>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>>> a threshold.
>>>>>>
>>>>>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>>>>>
>>>>>> Bad.
>>>>>
>>>>> Probably we can mark these pages and when others want to migrate the page,
>>>>> get_new_page() just looks at the page's target node and get a new page from
>>>>> the target node.
>>>>
>>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>>
>>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>>
>>>> Marking might not be easy I assume ...
>>>
>>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>>> but it can be migrated. My point is that if this page is going to be migrated
>>> due to other reasons, like CMA, compaction, why not migrate it to the target
>>> node instead of moving it around within the same node.
>>
>> I think we'd have to identify that
>>
>> a) This page is isolated for migration (could be isolated for other
>>     reasons)
>>
>> b) The one responsible for the isolation is numa code (could be someone
>>     else)
>>
>> c) We're allowed to grab that page from that list (IOW sync against
>>     others, and especially also against), to essentially "steal" the
>>     isolated page.
> 
> Right. c) sounds like adding more contention to the candidate list.
> I wonder if we can just mark the page as migration candidate (using
> a page flag or something else), then migrate it whenever CMA,
> compaction, long-term pinning and more look at the page.

I mean, all these will migrate the page either way, no need to add 
another flag for that.

I guess what you mean, indicating that the migration destination should 
be on a different node than the current one.

Well, and for the NUMA scanner (below) to find which pages to migrate.

... to me this raises some questions: like, if we don't migrate 
immediately, could that information ("migrate this page") actually now 
be wrong? I guess a way to obtain the destination node would suffice: if 
the destination node matches, no need to migrate from that NUMA scanner.

In addition,
> periodically, the migration task would do a PFN scanning and migrate
> any migration candidate. I remember Willy did some experiments showing
> that PFN scanning is very fast.

PFN scanning can be faster than walking lists, but I suspect it depends 
on how many pages there really are to be migrated ... and some other 
factors :)

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-26  9:29               ` David Hildenbrand
@ 2025-05-26 14:20                 ` Zi Yan
  2025-05-27  1:18                   ` Huang, Ying
  2025-05-28 12:25                   ` Karim Manaouil
  0 siblings, 2 replies; 41+ messages in thread
From: Zi Yan @ 2025-05-26 14:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 26 May 2025, at 5:29, David Hildenbrand wrote:

> On 22.05.25 19:30, Zi Yan wrote:
>> On 22 May 2025, at 13:21, David Hildenbrand wrote:
>>
>>> On 22.05.25 18:38, Zi Yan wrote:
>>>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>>>
>>>>> On 22.05.25 18:24, Zi Yan wrote:
>>>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>>>> hint fault handler as and when they are identified as
>>>>>>>> misplaced.
>>>>>>>>
>>>>>>>> Instead of such single folio migrations, batch them and
>>>>>>>> migrate them at once.
>>>>>>>>
>>>>>>>> Identified misplaced folios are isolated and stored in
>>>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>>>> handler to migrate them in batches. Migration is done
>>>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>>>> a threshold.
>>>>>>>
>>>>>>> That means that these pages are effectively unmovable for other purposes (CMA, compaction, long-term pinning, whatever) until that list was drained.
>>>>>>>
>>>>>>> Bad.
>>>>>>
>>>>>> Probably we can mark these pages and when others want to migrate the page,
>>>>>> get_new_page() just looks at the page's target node and get a new page from
>>>>>> the target node.
>>>>>
>>>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>>>
>>>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>>>
>>>>> Marking might not be easy I assume ...
>>>>
>>>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>>>> but it can be migrated. My point is that if this page is going to be migrated
>>>> due to other reasons, like CMA, compaction, why not migrate it to the target
>>>> node instead of moving it around within the same node.
>>>
>>> I think we'd have to identify that
>>>
>>> a) This page is isolated for migration (could be isolated for other
>>>     reasons)
>>>
>>> b) The one responsible for the isolation is numa code (could be someone
>>>     else)
>>>
>>> c) We're allowed to grab that page from that list (IOW sync against
>>>     others, and especially also against), to essentially "steal" the
>>>     isolated page.
>>
>> Right. c) sounds like adding more contention to the candidate list.
>> I wonder if we can just mark the page as migration candidate (using
>> a page flag or something else), then migrate it whenever CMA,
>> compaction, long-term pinning and more look at the page.
>
> I mean, all these will migrate the page either way, no need to add another flag for that.
>
> I guess what you mean, indicating that the migration destination should be on a different node than the current one.

Yes.

>
> Well, and for the NUMA scanner (below) to find which pages to migrate.
>
> ... to me this raises some questions: like, if we don't migrate immediately, could that information ("migrate this page") actually now be wrong? I guess a way to

Could be. So it is better to evaluate the page before the actual migration, in
case the page is no longer needed in a remote node.

> obtain the destination node would suffice: if the destination node matches, no need to migrate from that NUMA scanner.

Right. The destination node could be calculated from certain metrics like
most recent accesses or last remote node access time. If most recent accesses
are still coming from a remote node and/or the last remote node access time
is within a short time frame, the page should be migrated. It is possible
that the page is frequently accessed by a remote node but, by the time it
comes to migration, it is no longer needed by a remote node; the access
pattern would then look like 1) a lot of remote node accesses, but 2) the
last remote node access was a long time ago.
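
As a purely illustrative sketch of that re-evaluation (none of the fields,
constants or helpers below exist in the kernel today; they just stand in for
whatever hotness framework ends up tracking this):

#define PROMOTE_MIN_ACCESSES    4       /* hypothetical access-count threshold */
#define PROMOTE_MAX_IDLE_MS     1000    /* hypothetical recency cutoff */

/* Hypothetical per-folio access record kept by a hotness framework. */
struct folio_hot_info {
        unsigned int remote_accesses;           /* metric 1): remote access count */
        unsigned long last_remote_access_ms;    /* metric 2): when the last one happened */
};

static bool still_worth_promoting(const struct folio_hot_info *info,
                                  unsigned long now_ms)
{
        /* Case 2): the last remote access was long ago -> drop the candidate. */
        if (now_ms - info->last_remote_access_ms > PROMOTE_MAX_IDLE_MS)
                return false;

        /* Case 1): enough remote accesses, and they are recent -> still migrate. */
        return info->remote_accesses >= PROMOTE_MIN_ACCESSES;
}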

>
> In addition,
>> periodically, the migration task would do a PFN scanning and migrate
>> any migration candidate. I remember Willy did some experiments showing
>> that PFN scanning is very fast.
>
> PFN scanning can be faster than walking lists, but I suspect it depends on how many pages there really are to be migrated ... and some other factors :)

Yes. The LRU list is good since it restricts the scanning range, but PFN
scanning itself does not have such a restriction. PFN scanning with some
filter mechanism might work, and that filter mechanism is a way of marking
to-be-migrated pages. Of course, a quick re-evaluation of the to-be-migrated
pages right before a migration would avoid unnecessary work, like we
discussed above.
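
A rough sketch of that PFN-scan-with-filter idea, for illustration:
pfn_to_online_page(), page_folio(), folio_nid() and folio_isolate_lru() are
existing helpers, while folio_is_migration_candidate() and folio_target_nid()
are hypothetical stand-ins for the marking/destination mechanism discussed
above:

/* Collect marked folios from one zone, re-evaluating right before isolation. */
static void scan_zone_for_candidates(struct zone *zone, struct list_head *out)
{
        unsigned long pfn;

        for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone); pfn++) {
                struct page *page = pfn_to_online_page(pfn);
                struct folio *folio;

                if (!page)
                        continue;
                folio = page_folio(page);

                /* Filter: skip anything never marked as a migration candidate. */
                if (!folio_is_migration_candidate(folio))
                        continue;

                /* Quick re-evaluation: it may already sit on the right node. */
                if (folio_nid(folio) == folio_target_nid(folio))
                        continue;

                if (folio_isolate_lru(folio))
                        list_add_tail(&folio->lru, out);
        }
}

A real scanner would also skip the tail pages of large folios and deal with
folios disappearing under it; that is omitted here for brevity.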

--
Best Regards,
Yan, Zi


* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-26 14:20                 ` Zi Yan
@ 2025-05-27  1:18                   ` Huang, Ying
  2025-05-27  1:27                     ` Zi Yan
  2025-05-28 12:25                   ` Karim Manaouil
  1 sibling, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2025-05-27  1:18 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, dave,
	nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

Zi Yan <ziy@nvidia.com> writes:

> On 26 May 2025, at 5:29, David Hildenbrand wrote:
>
>> On 22.05.25 19:30, Zi Yan wrote:
>>> On 22 May 2025, at 13:21, David Hildenbrand wrote:
>>>
>>>> On 22.05.25 18:38, Zi Yan wrote:
>>>>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>>>>
>>>>>> On 22.05.25 18:24, Zi Yan wrote:
>>>>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>>>>> hint fault handler as and when they are identified as
>>>>>>>>> misplaced.
>>>>>>>>>
>>>>>>>>> Instead of such single folio migrations, batch them and
>>>>>>>>> migrate them at once.
>>>>>>>>>
>>>>>>>>> Identified misplaced folios are isolated and stored in
>>>>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>>>>> handler to migrate them in batches. Migration is done
>>>>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>>>>> a threshold.
>>>>>>>>
>>>>>>>> That means that these pages are effectively unmovable for
>>>>>>>> other purposes (CMA, compaction, long-term pinning, whatever)
>>>>>>>> until that list was drained.
>>>>>>>>
>>>>>>>> Bad.
>>>>>>>
>>>>>>> Probably we can mark these pages and when others want to migrate the page,
>>>>>>> get_new_page() just looks at the page's target node and get a new page from
>>>>>>> the target node.
>>>>>>
>>>>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>>>>
>>>>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>>>>
>>>>>> Marking might not be easy I assume ...
>>>>>
>>>>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>>>>> but it can be migrated. My point is that if this page is going to be migrated
>>>>> due to other reasons, like CMA, compaction, why not migrate it to the target
>>>>> node instead of moving it around within the same node.
>>>>
>>>> I think we'd have to identify that
>>>>
>>>> a) This page is isolated for migration (could be isolated for other
>>>>     reasons)
>>>>
>>>> b) The one responsible for the isolation is numa code (could be someone
>>>>     else)
>>>>
>>>> c) We're allowed to grab that page from that list (IOW sync against
>>>>     others, and especially also against), to essentially "steal" the
>>>>     isolated page.
>>>
>>> Right. c) sounds like adding more contention to the candidate list.
>>> I wonder if we can just mark the page as migration candidate (using
>>> a page flag or something else), then migrate it whenever CMA,
>>> compaction, long-term pinning and more look at the page.
>>
>> I mean, all these will migrate the page either way, no need to add another flag for that.
>>
>> I guess what you mean, indicating that the migration destination
>> should be on a different node than the current one.
>
> Yes.
>
>>
>> Well, and for the NUMA scanner (below) to find which pages to migrate.
>>
>> ... to me this raises some questions: like, if we don't migrate
>> immediately, could that information ("migrate this page") actually
>> now be wrong? I guess a way to
>
> Could be. So it is better to evaluate the page before the actual migration, in
> case the page is no longer needed in a remote node.
>
>> obtain the destination node would suffice: if the destination node
>> matches, no need to migrate from that NUMA scanner.
>
> Right. The destination node could be calculated by certain metric like most recent
> accesses or last remote node access time.

Do we have the necessary information available?  last_cpupid has either the
last accessing CPU or the last scanning timestamp, not both.  Any other
information source?
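
For reference, a hedged sketch of the default-mode decoding (the helpers used
here exist in current kernels; the wrapper itself is only illustrative).  With
NUMA_BALANCING_MEMORY_TIERING, as I understand it, the same field is instead
overwritten with a scan timestamp via folio_xchg_access_time(), which is why
only one of the two pieces of information is available at a time:

/* Decode the node of the last accessing CPU, if last_cpupid holds one. */
static int last_accessing_nid(struct folio *folio)
{
        int cpupid = folio_last_cpupid(folio);

        if (cpupid_pid_unset(cpupid))
                return NUMA_NO_NODE;    /* nothing usable recorded */

        return cpupid_to_nid(cpupid);
}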

---
Best Regards,
Huang, Ying

> If most recent accesses are still coming
> from a remote node and/or last remote node access time is within a short time frame,
> the page should be migrated. Since it is possible that the page is frequently accessed
> by a remote node but when it comes to migration, it is no longer needed by a remote
> node and the access pattern would look like 1) a lot of remote node accesses, but
> 2) the last remote node access is long time ago.
>
>>
>> In addition,
>>> periodically, the migration task would do a PFN scanning and migrate
>>> any migration candidate. I remember Willy did some experiments showing
>>> that PFN scanning is very fast.
>>
>> PFN scanning can be faster than walking lists, but I suspect it
>> depends on how many pages there really are to be migrated ... and
>> some other factors :)
>
> Yes. LRU list is good since it restricts the scanning range, but PFN scanning
> itself does not have it. PFN scanning with some filter mechanism might work
> and that filter mechanism is a way of marking to-be-migrated pages. Of course,
> a quick re-evaluation of the to-be-migrated pages right before a migration
> would avoid unnecessary work like we discussed above.
>
> --
> Best Regards,
> Yan, Zi


* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-27  1:18                   ` Huang, Ying
@ 2025-05-27  1:27                     ` Zi Yan
  0 siblings, 0 replies; 41+ messages in thread
From: Zi Yan @ 2025-05-27  1:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, dave,
	nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm

On 26 May 2025, at 21:18, Huang, Ying wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 26 May 2025, at 5:29, David Hildenbrand wrote:
>>
>>> On 22.05.25 19:30, Zi Yan wrote:
>>>> On 22 May 2025, at 13:21, David Hildenbrand wrote:
>>>>
>>>>> On 22.05.25 18:38, Zi Yan wrote:
>>>>>> On 22 May 2025, at 12:26, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 22.05.25 18:24, Zi Yan wrote:
>>>>>>>> On 22 May 2025, at 12:11, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> On 21.05.25 10:02, Bharata B Rao wrote:
>>>>>>>>>> Currently the folios identified as misplaced by the NUMA
>>>>>>>>>> balancing sub-system are migrated one by one from the NUMA
>>>>>>>>>> hint fault handler as and when they are identified as
>>>>>>>>>> misplaced.
>>>>>>>>>>
>>>>>>>>>> Instead of such single folio migrations, batch them and
>>>>>>>>>> migrate them at once.
>>>>>>>>>>
>>>>>>>>>> Identified misplaced folios are isolated and stored in
>>>>>>>>>> a per-task list. A new task_work is queued from task tick
>>>>>>>>>> handler to migrate them in batches. Migration is done
>>>>>>>>>> periodically or if pending number of isolated folios exceeds
>>>>>>>>>> a threshold.
>>>>>>>>>
>>>>>>>>> That means that these pages are effectively unmovable for
>>>>>>>>> other purposes (CMA, compaction, long-term pinning, whatever)
>>>>>>>>> until that list was drained.
>>>>>>>>>
>>>>>>>>> Bad.
>>>>>>>>
>>>>>>>> Probably we can mark these pages and when others want to migrate the page,
>>>>>>>> get_new_page() just looks at the page's target node and get a new page from
>>>>>>>> the target node.
>>>>>>>
>>>>>>> How do you envision that working when CMA needs to migrate this exact page to a different location?
>>>>>>>
>>>>>>> It cannot isolate it for migration because ... it's already isolated ... so it will give up.
>>>>>>>
>>>>>>> Marking might not be easy I assume ...
>>>>>>
>>>>>> I guess you mean we do not have any extra bit to indicate this page is isolated,
>>>>>> but it can be migrated. My point is that if this page is going to be migrated
>>>>>> due to other reasons, like CMA, compaction, why not migrate it to the target
>>>>>> node instead of moving it around within the same node.
>>>>>
>>>>> I think we'd have to identify that
>>>>>
>>>>> a) This page is isolated for migration (could be isolated for other
>>>>>     reasons)
>>>>>
>>>>> b) The one responsible for the isolation is numa code (could be someone
>>>>>     else)
>>>>>
>>>>> c) We're allowed to grab that page from that list (IOW sync against
>>>>>     others, and especially also against), to essentially "steal" the
>>>>>     isolated page.
>>>>
>>>> Right. c) sounds like adding more contention to the candidate list.
>>>> I wonder if we can just mark the page as migration candidate (using
>>>> a page flag or something else), then migrate it whenever CMA,
>>>> compaction, long-term pinning and more look at the page.
>>>
>>> I mean, all these will migrate the page either way, no need to add another flag for that.
>>>
>>> I guess what you mean, indicating that the migration destination
>>> should be on a different node than the current one.
>>
>> Yes.
>>
>>>
>>> Well, and for the NUMA scanner (below) to find which pages to migrate.
>>>
>>> ... to me this raises some questions: like, if we don't migrate
>>> immediately, could that information ("migrate this page") actually
>>> now be wrong? I guess a way to
>>
>> Could be. So it is better to evaluate the page before the actual migration, in
>> case the page is no longer needed in a remote node.
>>
>>> obtain the destination node would suffice: if the destination node
>>> matches, no need to migrate from that NUMA scanner.
>>
>> Right. The destination node could be calculated by certain metric like most recent
>> accesses or last remote node access time.
>
> Do we have the necessary information available?  last_cpupid has either
> last accessing CPU or last scanning timestamp, not both.  Any other
> information source?

Not at the moment. A unified page access information framework
is probably needed. The recent LSFMM had a related discussion[1]. We
also have a biweekly discussion on it[2].

[1] https://lwn.net/Articles/1016722/
[2] https://lore.kernel.org/linux-mm/ae6e7b19-f221-9a5d-a3eb-799ed271de11@google.com/

Best Regards,
Yan, Zi


* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-26  8:46 ` Huang, Ying
@ 2025-05-27  8:53   ` Bharata B Rao
  2025-05-27  9:05     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Bharata B Rao @ 2025-05-27  8:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

On 26-May-25 2:16 PM, Huang, Ying wrote:
> Hi, Bharata,
> 
> Bharata B Rao <bharata@amd.com> writes:
> 
>> Hi,
>>
>> This is an attempt to convert the NUMA balancing to do batched
>> migration instead of migrating one folio at a time. The basic
>> idea is to collect (from hint fault handler) the folios to be
>> migrated in a list and batch-migrate them from task_work context.
>> More details about the specifics are present in patch 2/2.
>>
>> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
>> it was suggested that separate migration threads to handle migration
>> or promotion request may be desirable. Existing NUMA balancing, hot
>> page promotion and other future promotion techniques could off-load
>> migration part to these threads.
> 
> What is the expected benefit of the change?

Initially it is about cleanliness and separation of migration into its 
own thread/sub-system.

> 
> For code reuse, we can use migrate_misplaced_folio() or
> migrate_misplaced_folio_batch() in various promotion path.

That's what I have done in this patchset at least. We thought we could 
go full length and off-load migration to its own thread.
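
As a purely illustrative sketch of such a migrator thread (struct migrate_req
and the pending queue below are made up here, and the batch API signature is
assumed from patch 1/2):

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/migrate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* A request handed to the migrator: a list of isolated folios + destination. */
struct migrate_req {
        struct list_head node;          /* link on the pending queue */
        struct list_head folios;        /* isolated folios to migrate */
        int target_nid;
};

static LIST_HEAD(migrate_pending);
static DEFINE_SPINLOCK(migrate_lock);
static DECLARE_WAIT_QUEUE_HEAD(migrate_wq);

static int kmigrated(void *unused)
{
        while (!kthread_should_stop()) {
                struct migrate_req *req, *tmp;
                LIST_HEAD(todo);

                wait_event_interruptible(migrate_wq,
                                         !list_empty(&migrate_pending) ||
                                         kthread_should_stop());

                spin_lock(&migrate_lock);
                list_splice_init(&migrate_pending, &todo);
                spin_unlock(&migrate_lock);

                list_for_each_entry_safe(req, tmp, &todo, node) {
                        migrate_misplaced_folio_batch(&req->folios,
                                                      req->target_nid);
                        list_del(&req->node);
                        kfree(req);
                }
        }
        return 0;
}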

> 
> For workload latency influence, per my understanding, PTE scanning is
> much more serious than migration.  Why not start from that?

Raghu's PTE A bit scanning is one effort towards that (removing PTE 
scanning from the task context).

Regards,
Bharata.


* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-27  8:53   ` Bharata B Rao
@ 2025-05-27  9:05     ` Huang, Ying
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2025-05-27  9:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, joshua.hahnjy,
	xuezhengchu, yiannis, akpm, david

Bharata B Rao <bharata@amd.com> writes:

> On 26-May-25 2:16 PM, Huang, Ying wrote:
>> Hi, Bharata,
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> Hi,
>>>
>>> This is an attempt to convert the NUMA balancing to do batched
>>> migration instead of migrating one folio at a time. The basic
>>> idea is to collect (from hint fault handler) the folios to be
>>> migrated in a list and batch-migrate them from task_work context.
>>> More details about the specifics are present in patch 2/2.
>>>
>>> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
>>> it was suggested that separate migration threads to handle migration
>>> or promotion request may be desirable. Existing NUMA balancing, hot
>>> page promotion and other future promotion techniques could off-load
>>> migration part to these threads.
>> What is the expected benefit of the change?
>
> Initially it is about cleanliness and separation of migration into its
> own thread/sub-system.
>
>> For code reuse, we can use migrate_misplaced_folio() or
>> migrate_misplaced_folio_batch() in various promotion path.
>
> That's what I have done in this patchset at least. We thought we could
> go full length and off-load migration to its own thread.

Even if we migrate pages in another thread, the migrated pages will be
unmapped, copied, and remapped during migration.  That is, the workload
threads may be stalled waiting for the migration to complete.  So, we need
to measure the real benefit first.
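
To spell out where the stall comes from, a simplified sketch of the fault-side
behaviour (modelled on the migration-entry handling in do_swap_page(); the
wrapper function here is illustrative):

#include <linux/mm.h>
#include <linux/swapops.h>

/* A thread touching a page that is currently being migrated blocks here. */
static vm_fault_t wait_if_under_migration(struct vm_fault *vmf, swp_entry_t entry)
{
        if (is_migration_entry(entry)) {
                /* Sleeps until migrate_pages() removes the migration PTE. */
                migration_entry_wait(vmf->vma->vm_mm, vmf->pmd, vmf->address);
        }
        return 0;
}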

>> For workload latency influence, per my understanding, PTE scanning
>> is
>> much more serious than migration.  Why not start from that?
>
> Raghu's PTE A bit scanning is one effort towards that (Removing PTE
> scanning from task context.

---
Best Regards,
Huang, Ying


* Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing
  2025-05-26  5:20   ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
@ 2025-05-27 18:50     ` SeongJae Park
  0 siblings, 0 replies; 41+ messages in thread
From: SeongJae Park @ 2025-05-27 18:50 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: SeongJae Park, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, gourry, hannes, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, weixugc, willy, ying.huang, ziy,
	dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis, akpm, david

On Mon, 26 May 2025 10:50:02 +0530 Bharata B Rao <bharata@amd.com> wrote:

> Hi SJ,
> 
> On 22-May-25 12:15 AM, SeongJae Park wrote:
> > Hi Bharata,
> > 
> > On Wed, 21 May 2025 13:32:36 +0530 Bharata B Rao <bharata@amd.com> wrote:
> > 
> >> Hi,
> >>
> >> This is an attempt to convert the NUMA balancing to do batched
> >> migration instead of migrating one folio at a time. The basic
> >> idea is to collect (from hint fault handler) the folios to be
> >> migrated in a list and batch-migrate them from task_work context.
> >> More details about the specifics are present in patch 2/2.
> >>
> >> During LSFMM[1] and subsequent discussions in MM alignment calls[2],
> >> it was suggested that separate migration threads to handle migration
> >> or promotion request may be desirable. Existing NUMA balancing, hot
> >> page promotion and other future promotion techniques could off-load
> >> migration part to these threads. Or if we manage to have a single
> >> source of hotness truth like kpromoted[3], then that too can hand
> >> over migration requests to the migration threads. I am envisaging
> >> that different hotness sources like kmmscand[4], MGLRU[5], IBS[6]
> >> and CXL HMU would push hot page info to kpromoted, which would
> >> then isolate and push the folios to be promoted to the migrator
> >> thread.
> > 
> > I think (or, hope) it would also be not very worthless or rude to mention other
> > existing and ongoing works that have potentials to serve for similar purpose or
> > collaborate in future, here.
> > 
> > DAMON is designed for a sort of multi-source access information handling.  In
> > LSFMM, I proposed[1] damon_report_access() interface for making it easier to be
> > extended for more types of access information.  Currenlty damon_report_access()
> > is under early development.  I think this has a potential to serve something
> > similar to your single source goal.
> > 
> > Also in LSFMM, I proposed damos_add_folio() for a case that callers want to
> > utilize DAMON worker thread (kdamond) as an asynchronous memory
> > management operations execution thread while using its other features such as
> > [auto-tuned] quotas.  I think this has a potential to serve something similar
> > to your migration threads.  I haven't started damos_add_folio() development
> > yet, though.
> > 
> > I remember we discussed about DAMON on mailing list and in LSFMM a bit, on your
> > session.  IIRC, you were also looking for a time to see if there is a chance to
> > use DAMON in some way.  Due to the technical issue, we were unable to discuss
> > on the two new proposals on my LSFMM session, and it has been a bit while since
> > our last discussion.  So if you don't mind, I'd like to ask if you have some
> > opinions or comments about these.
> > 
> > [1] https://lwn.net/Articles/1016525/
> 
> Since this patchset was just about making the migration batched and 
> async for NUMAB, I didn't mention DAMON as an alternative here.

I was thinking a clarification like this could be useful for readers though,
since you were mentioning the future work together.  Thank you for clarifying.

> 
> One of the concerns I always had about DAMON when it is considered as 
> replacement for existing hot page migration is its current inability to 
> gather and maintain hot page info at per-folio granularity.

I think this is a very valid concern.  But I don't think DAMON should be a
_replacement_.  Rather, I'm looking for a chance to make existing approaches
help each other.  For example, I recommend running DAMON-based memory
tiering[1] together with the LRU-based demotion.  I think there is no reason to
discourage using it together with NUMAB-2 based promotion, if the folio
granularity is a real issue.  That is, NUMAB-2 will still do synchronous
promotion, but DAMON will do it asynchronously, so the amount of synchronous
promotion and its overhead will be reduced.

I didn't encourage using NUMAB-2 based promotion together with DAMON-based
memory tiering[1] not because I saw a problem with such co-usage, but just
because I found no clear benefit of that in my test setup.  In theory, I
think running those together makes sense.

That said, we're also making efforts to overcome the folio-granularity issue
on the DAMON side, too.  We implemented page-level filters motivated by SK
hynix's test results, and developed monitoring-intervals auto-tuning for
overall monitoring-results accuracy.  We proposed damon_report_access() and
damos_add_folios() as yet more opportunities to better deal with the issue.
That is why I was curious about your opinion on damon_report_access() and
damos_add_folios().  I understand that could be out of the scope of this
patch series, though.

> How much 
> that eventually matters to the workloads has to be really seen.

Cannot agree more.  Nonetheless, as mentioned above, my test setup[1] didn't
show the problem.  That said, I'm not really convinced by my test setup, and
I don't think it is good for verifying the problem.  Hence I'm trying to make
a better test setup for this.  I'll share more of the new setup if I make
some progress.  I will also be more than happy to learn about others' test
setups if they have a good one or suggestions.

[1] https://lore.kernel.org/20250420194030.75838-1-sj@kernel.org


Thanks,
SJ

[...]


* Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages
  2025-05-26 14:20                 ` Zi Yan
  2025-05-27  1:18                   ` Huang, Ying
@ 2025-05-28 12:25                   ` Karim Manaouil
  1 sibling, 0 replies; 41+ messages in thread
From: Karim Manaouil @ 2025-05-28 12:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Bharata B Rao, linux-kernel, linux-mm,
	Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, dave, nifan.cxl, joshua.hahnjy, xuezhengchu, yiannis,
	akpm

On Mon, May 26, 2025 at 10:20:39AM -0400, Zi Yan wrote:
> On 26 May 2025, at 5:29, David Hildenbrand wrote:
> > PFN scanning can be faster than walking lists, but I suspect it depends on how many pages there really are to be migrated ... and some other factors :)
> 
> Yes. LRU list is good since it restricts the scanning range, but PFN scanning
> itself does not have it. PFN scanning with some filter mechanism might work
> and that filter mechanism is a way of marking to-be-migrated pages. Of course,
> a quick re-evaluation of the to-be-migrated pages right before a migration
> would avoid unnecessary work like we discussed above.

PFN scanning could be faster because of prefetching, but it pollutes the
caches, which may not be nice to the application that was running on that CPU
core before the transition to kernel space.

-- 
~karim


end of thread

Thread overview: 41+ messages
2025-05-21  8:02 [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
2025-05-21  8:02 ` [RFC PATCH v0 1/2] migrate: implement migrate_misplaced_folio_batch Bharata B Rao
2025-05-22 15:59   ` David Hildenbrand
2025-05-22 16:03     ` Gregory Price
2025-05-22 16:08       ` David Hildenbrand
2025-05-26  8:16   ` Huang, Ying
2025-05-21  8:02 ` [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages Bharata B Rao
2025-05-21 18:25   ` Donet Tom
2025-05-21 18:40     ` Zi Yan
2025-05-22  3:24       ` Gregory Price
2025-05-22  5:23         ` Bharata B Rao
2025-05-22  4:42       ` Bharata B Rao
2025-05-22  4:39     ` Bharata B Rao
2025-05-23  9:05       ` Donet Tom
2025-05-22  3:55   ` Gregory Price
2025-05-22  7:33     ` Bharata B Rao
2025-05-22 15:38       ` Gregory Price
2025-05-22 16:11   ` David Hildenbrand
2025-05-22 16:24     ` Zi Yan
2025-05-22 16:26       ` David Hildenbrand
2025-05-22 16:38         ` Zi Yan
2025-05-22 17:21           ` David Hildenbrand
2025-05-22 17:30             ` Zi Yan
2025-05-26  8:33               ` Huang, Ying
2025-05-26  9:29               ` David Hildenbrand
2025-05-26 14:20                 ` Zi Yan
2025-05-27  1:18                   ` Huang, Ying
2025-05-27  1:27                     ` Zi Yan
2025-05-28 12:25                   ` Karim Manaouil
2025-05-26  5:14     ` Bharata B Rao
2025-05-21 18:45 ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing SeongJae Park
2025-05-22  3:08   ` Gregory Price
2025-05-22 16:30     ` SeongJae Park
2025-05-22 17:40       ` Gregory Price
2025-05-22 18:52         ` SeongJae Park
2025-05-22 18:43   ` Apologies and clarifications on DAMON-disruptions (was Re: [RFC PATCH v0 0/2] Batch migration for NUMA balancing) SeongJae Park
2025-05-26  5:20   ` [RFC PATCH v0 0/2] Batch migration for NUMA balancing Bharata B Rao
2025-05-27 18:50     ` SeongJae Park
2025-05-26  8:46 ` Huang, Ying
2025-05-27  8:53   ` Bharata B Rao
2025-05-27  9:05     ` Huang, Ying
