linux-mm.kvack.org archive mirror
* [RFC PATCH v1 0/4] Kernel thread based async batch migration
@ 2025-06-16 13:39 Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-16 13:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, bharata

Hi,

This is a continuation of the earlier post[1] that attempted to
convert migrations from NUMA Balancing to be async and batched.
In this version, per-node kernel threads are created to handle
migrations in an async manner.

This adds a few fields to the extended page flags that can be
used both by the sub-systems that request migrations and by
kmigrated, which migrates the pages. Some of the fields are defined
for potential use by a kpromoted-like subsystem to manage hot page
metrics, but are unused right now.

Currently only NUMA Balancing is changed to make use of the async
batched migration. It does so by recording the target NID and the
readiness of the page to be migrated in the extended page flags
fields.

Each kmigrated thread routinely scans its node's PFNs, identifies the
pages marked for migration and batch-migrates them. Unlike the previous
approach, the responsibility of isolating the pages now lies with
kmigrated.

The major difference between this approach and the way kpromoted[2]
tracked hot pages is the elimination of heavy synchronization points
between the producers (sub-systems that request migrations or report
a hot page) and the consumer (kmigrated or kpromoted).
Instead of tracking hot pages in a separate, orthogonal list,
this approach ties the hot page or migration information
to the struct page.
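
As a rough sketch of the intended flow (simplified from the patches
below; error handling and per-target-node batching are omitted), a
producer marks a page and kmigrated later drains the marked pages:

    /* Producer (e.g. the NUMA hint fault handler): note the target node. */
    kmigrated_add_pfn(folio_pfn(folio), target_nid);

    /*
     * Consumer (kmigrated): folios found marked during the PFN scan are
     * isolated with migrate_misplaced_folio_prepare() and collected on
     * migrate_list, then moved in one call.
     */
    migrate_misplaced_folios_batch(&migrate_list, target_nid);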

TODOs:

- Very lightly tested (only with NUMAB=1) and posted to get some
  feedback on the overall approach.
- Currently uses the flags field from the page extension sub-system.
  However, it needs to be checked whether it is preferable to allocate a
  separate 32-bit field exclusively for this purpose, within the page
  extension sub-system or outside of it.
- Benefit of async batch migration still needs to be measured.
- A few things still need tuning, like the number of pages to
  batch, the aggressiveness of the kthread, the kthread sleep interval etc.
- The logic to skip scanning of zones that don't have any pages
  marked for migration needs to be added.
- No separate kernel config is defined currently and dependency
  on PAGE_EXTENSION isn't cleanly laid out. Some added definitions
  currently sit in page_ext.h which may not be an ideal location
  for them.

[1] v0 - https://lore.kernel.org/linux-mm/20250521080238.209678-3-bharata@amd.com/
[2] kpromoted patchset - https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

Bharata B Rao (3):
  mm: migrate: Allow misplaced migration without VMA too
  mm: kmigrated - Async kernel migration thread
  mm: sched: Batch-migrate misplaced pages

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

 include/linux/migrate.h  |   6 ++
 include/linux/mmzone.h   |   5 +
 include/linux/page_ext.h |  17 +++
 mm/Makefile              |   3 +-
 mm/kmigrated.c           | 223 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c              |  30 +-----
 mm/migrate.c             |  36 ++++++-
 mm/mm_init.c             |   6 ++
 mm/page_ext.c            |  11 ++
 9 files changed, 309 insertions(+), 28 deletions(-)
 create mode 100644 mm/kmigrated.c

-- 
2.34.1



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v1 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
@ 2025-06-16 13:39 ` Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 2/4] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-16 13:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, bharata

We want isolation of misplaced folios to work in contexts
where a VMA isn't available. In order to prepare for that,
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8cf0f9c9599d..9fdc2cc3dd1c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2580,7 +2580,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2597,7 +2598,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_maybe_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+		if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 			return -EACCES;
 
 		/*
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v1 2/4] migrate: implement migrate_misplaced_folios_batch
  2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
@ 2025-06-16 13:39 ` Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Bharata B Rao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-16 13:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, bharata

From: Gregory Price <gourry@gourry.net>

A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio.  Expose a batch-variant of the same call for use
when doing batch migrations.
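
The expected calling pattern is roughly the following sketch (the folio,
list and node variables are whatever the call site provides):

    LIST_HEAD(folio_list);

    /* Isolate each candidate first; prepare() takes a folio reference. */
    if (!migrate_misplaced_folio_prepare(folio, NULL, nid))
            list_add(&folio->lru, &folio_list);

    /* One call migrates (or puts back) everything on the list. */
    migrate_misplaced_folios_batch(&folio_list, nid);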

Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/migrate.h |  6 ++++++
 mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aaa2114498d6..90baf5ef4660 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
+						 int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 9fdc2cc3dd1c..748b5cd48a1f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2675,5 +2675,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, drop the references taken
+ * during isolation, and remove them from the list before returning.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	BUG_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread
  2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
  2025-06-16 13:39 ` [RFC PATCH v1 2/4] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
@ 2025-06-16 13:39 ` Bharata B Rao
  2025-06-16 14:05   ` page_ext and memdescs Matthew Wilcox
  2025-07-07  9:36   ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Byungchul Park
  2025-06-16 13:39 ` [RFC PATCH v1 4/4] mm: sched: Batch-migrate misplaced pages Bharata B Rao
  2025-06-20  6:39 ` [RFC PATCH v1 0/4] Kernel thread based async batch migration Huang, Ying
  4 siblings, 2 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-16 13:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, bharata

kmigrated is a per-node kernel thread that migrates the
folios marked for migration in batches. Each kmigrated
thread walks the PFN range spanning its node and checks
for potential migration candidates.

It depends on the fields added to extended page flags
to determine the pages that need to be migrated and
the target NID.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/mmzone.h   |   5 +
 include/linux/page_ext.h |  17 +++
 mm/Makefile              |   3 +-
 mm/kmigrated.c           | 223 +++++++++++++++++++++++++++++++++++++++
 mm/mm_init.c             |   6 ++
 mm/page_ext.c            |  11 ++
 6 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 mm/kmigrated.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..5d7f0b8d3c91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -853,6 +853,8 @@ enum zone_type {
 
 };
 
+int kmigrated_add_pfn(unsigned long pfn, int nid);
+
 #ifndef __GENERATING_BOUNDS_H
 
 #define ASYNC_AND_SYNC 2
@@ -1049,6 +1051,7 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
 };
 
 enum zone_flags {
@@ -1493,6 +1496,8 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+	struct task_struct *kmigrated;
+	wait_queue_head_t kmigrated_wait;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 76c817162d2f..4300c9dbafec 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -40,8 +40,25 @@ enum page_ext_flags {
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
 #endif
+	/*
+	 * 32 bits following this are used by the migrator.
+	 * The next available bit position is 33.
+	 */
+	PAGE_EXT_MIGRATE_READY,
 };
 
+#define PAGE_EXT_MIG_NID_WIDTH	10
+#define PAGE_EXT_MIG_FREQ_WIDTH	3
+#define PAGE_EXT_MIG_TIME_WIDTH	18
+
+#define PAGE_EXT_MIG_NID_SHIFT	(PAGE_EXT_MIGRATE_READY + 1)
+#define PAGE_EXT_MIG_FREQ_SHIFT	(PAGE_EXT_MIG_NID_SHIFT + PAGE_EXT_MIG_NID_WIDTH)
+#define PAGE_EXT_MIG_TIME_SHIFT	(PAGE_EXT_MIG_FREQ_SHIFT + PAGE_EXT_MIG_FREQ_WIDTH)
+
+#define PAGE_EXT_MIG_NID_MASK	((1UL << PAGE_EXT_MIG_NID_SHIFT) - 1)
+#define PAGE_EXT_MIG_FREQ_MASK	((1UL << PAGE_EXT_MIG_FREQ_SHIFT) - 1)
+#define PAGE_EXT_MIG_TIME_MASK	((1UL << PAGE_EXT_MIG_TIME_SHIFT) - 1)
+
 /*
  * Page Extension can be considered as an extended mem_map.
  * A page_ext page is associated with every page descriptor. The
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..5a382f19105f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,8 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o
+			   pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o \
+			   kmigrated.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/kmigrated.c b/mm/kmigrated.c
new file mode 100644
index 000000000000..3caefe4be0e7
--- /dev/null
+++ b/mm/kmigrated.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * kmigrated is a kernel thread that runs for each node that has
+ * memory. It iterates over the node's PFNs and migrates pages
+ * marked for migration to their target nodes.
+ *
+ * kmigrated depends on PAGE_EXTENSION to find out the pages that
+ * need to be migrated. In addition to a few fields that could be
+ * used by hot page promotion logic to store and evaluate the page
+ * hotness information, the extended page flags field is extended
+ * to store the target NID for migration.
+ */
+#include <linux/mm.h>
+#include <linux/migrate.h>
+#include <linux/cpuhotplug.h>
+#include <linux/page_ext.h>
+
+#define KMIGRATE_DELAY	MSEC_PER_SEC
+#define KMIGRATE_BATCH	512
+
+static int page_ext_xchg_nid(struct page_ext *page_ext, int nid)
+{
+	unsigned long old_flags, flags;
+	int old_nid;
+
+	old_flags = READ_ONCE(page_ext->flags);
+	do {
+		flags = old_flags;
+		old_nid = (flags >> PAGE_EXT_MIG_NID_SHIFT) & PAGE_EXT_MIG_NID_MASK;
+
+		flags &= ~(PAGE_EXT_MIG_NID_MASK << PAGE_EXT_MIG_NID_SHIFT);
+		flags |= (nid & PAGE_EXT_MIG_NID_MASK) << PAGE_EXT_MIG_NID_SHIFT;
+	} while (unlikely(!try_cmpxchg(&page_ext->flags, &old_flags, flags)));
+
+	return old_nid;
+}
+
+/*
+ * Marks the page as ready for migration.
+ *
+ * @pfn: PFN of the page
+ * @nid: Target NID to where the page needs to be migrated
+ *
+ * The request for migration is noted by setting PAGE_EXT_MIGRATE_READY
+ * in the extended page flags which the kmigrated thread would check.
+ */
+int kmigrated_add_pfn(unsigned long pfn, int nid)
+{
+	struct page *page;
+	struct page_ext *page_ext;
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		return -EINVAL;
+
+	page_ext = page_ext_get(page);
+	if (unlikely(!page_ext))
+		return -EINVAL;
+
+	page_ext_xchg_nid(page_ext, nid);
+	test_and_set_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags);
+	page_ext_put(page_ext);
+
+	set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
+	return 0;
+}
+
+/*
+ * If the page has been marked ready for migration, return
+ * the NID to which it needs to be migrated.
+ *
+ * If not, return NUMA_NO_NODE.
+ */
+static int kmigrated_get_nid(struct page *page)
+{
+	struct page_ext *page_ext;
+	int nid = NUMA_NO_NODE;
+
+	page_ext = page_ext_get(page);
+	if (unlikely(!page_ext))
+		return nid;
+
+	if (!test_and_clear_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags))
+		goto out;
+
+	nid = page_ext_xchg_nid(page_ext, nid);
+out:
+	page_ext_put(page_ext);
+	return nid;
+}
+
+/*
+ * Walks the PFNs of the zone, isolates and migrates them in batches.
+ */
+static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
+				int src_nid)
+{
+	int nid, cur_nid = NUMA_NO_NODE;
+	LIST_HEAD(migrate_list);
+	int batch_count = 0;
+	struct folio *folio;
+	struct page *page;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_online_page(pfn);
+		if (!page)
+			continue;
+
+		if (page_to_nid(page) != src_nid)
+			continue;
+
+		/*
+		 * TODO: Take care of folio_nr_pages() increment
+		 * to pfn count.
+		 */
+		folio = page_folio(page);
+		if (!folio_test_lru(folio))
+			continue;
+
+		nid = kmigrated_get_nid(page);
+		if (nid == NUMA_NO_NODE)
+			continue;
+
+		if (page_to_nid(page) == nid)
+			continue;
+
+		if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+			continue;
+
+		if (cur_nid != NUMA_NO_NODE)
+			cur_nid = nid;
+
+		if (++batch_count >= KMIGRATE_BATCH || cur_nid != nid) {
+			migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+			cur_nid = nid;
+			batch_count = 0;
+			cond_resched();
+		}
+		list_add(&folio->lru, &migrate_list);
+	}
+	if (!list_empty(&migrate_list))
+		migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+}
+
+static void kmigrated_do_work(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int zone_idx;
+
+	clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+	for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
+		zone = &pgdat->node_zones[zone_idx];
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_is_zone_device(zone))
+			continue;
+
+		kmigrated_walk_zone(zone->zone_start_pfn, zone_end_pfn(zone),
+				    pgdat->node_id);
+	}
+}
+
+static inline bool kmigrated_work_requested(pg_data_t *pgdat)
+{
+	return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+}
+
+static void kmigrated_wait_work(pg_data_t *pgdat)
+{
+	long timeout = msecs_to_jiffies(KMIGRATE_DELAY);
+
+	wait_event_timeout(pgdat->kmigrated_wait,
+			   kmigrated_work_requested(pgdat), timeout);
+}
+
+/*
+ * Per-node kthread that iterates over its PFNs and migrates the
+ * pages that have been marked for migration.
+ */
+static int kmigrated(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t *)p;
+
+	while (!kthread_should_stop()) {
+		kmigrated_wait_work(pgdat);
+		kmigrated_do_work(pgdat);
+	}
+	return 0;
+}
+
+static void kmigrated_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (pgdat->kmigrated)
+		return;
+
+	pgdat->kmigrated = kthread_create(kmigrated, pgdat, "kmigrated%d", nid);
+	if (IS_ERR(pgdat->kmigrated)) {
+		pr_err("Failed to start kmigrated for node %d\n", nid);
+		pgdat->kmigrated = NULL;
+	} else {
+		wake_up_process(pgdat->kmigrated);
+	}
+}
+
+static int __init kmigrated_init(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		kmigrated_run(nid);
+
+	return 0;
+}
+
+subsys_initcall(kmigrated_init)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f2944748f526..3a9cfd175366 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1398,6 +1398,11 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1407,6 +1412,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kmigrated(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/page_ext.c b/mm/page_ext.c
index c351fdfe9e9a..546725fffddb 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -76,6 +76,16 @@ static struct page_ext_operations page_idle_ops __initdata = {
 };
 #endif
 
+static bool need_page_mig(void)
+{
+	return true;
+}
+
+static struct page_ext_operations page_mig_ops __initdata = {
+	.need = need_page_mig,
+	.need_shared_flags = true,
+};
+
 static struct page_ext_operations *page_ext_ops[] __initdata = {
 #ifdef CONFIG_PAGE_OWNER
 	&page_owner_ops,
@@ -89,6 +99,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
 #ifdef CONFIG_PAGE_TABLE_CHECK
 	&page_table_check_ops,
 #endif
+	&page_mig_ops,
 };
 
 unsigned long page_ext_size;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v1 4/4] mm: sched: Batch-migrate misplaced pages
  2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
                   ` (2 preceding siblings ...)
  2025-06-16 13:39 ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Bharata B Rao
@ 2025-06-16 13:39 ` Bharata B Rao
  2025-06-20  6:39 ` [RFC PATCH v1 0/4] Kernel thread based async batch migration Huang, Ying
  4 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-16 13:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, bharata

Currently the folios identified as misplaced by the NUMA
balancing sub-system are migrated one by one from the NUMA
hint fault handler, as and when they are identified.

Instead of such single-folio migrations, batch them and
migrate them at once. This is achieved by passing on the
information about misplaced folios to kmigrated, which will
batch and migrate the folios.

The failed migration count isn't fed back to the scan period
update heuristics currently.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/memory.c | 30 +++++-------------------------
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..b27054f6b4d5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5903,32 +5903,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 					writable, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
-	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
-		flags |= TNF_MIGRATE_FAIL;
-		goto out_map;
-	}
-	/* The folio is isolated and isolation code holds a folio reference. */
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
 	writable = false;
 	ignore_writable = true;
-
-	/* Migrate to the requested node */
-	if (!migrate_misplaced_folio(folio, target_nid)) {
-		nid = target_nid;
-		flags |= TNF_MIGRATED;
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
-		return 0;
-	}
-
-	flags |= TNF_MIGRATE_FAIL;
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
-	if (unlikely(!vmf->pte))
-		return 0;
-	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	nid = target_nid;
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
@@ -5942,8 +5920,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 					    writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE) {
+		kmigrated_add_pfn(folio_pfn(folio), nid);
 		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	}
 	return 0;
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* page_ext and memdescs
  2025-06-16 13:39 ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Bharata B Rao
@ 2025-06-16 14:05   ` Matthew Wilcox
  2025-06-17  8:28     ` Bharata B Rao
  2025-06-24  9:47     ` David Hildenbrand
  2025-07-07  9:36   ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Byungchul Park
  1 sibling, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2025-06-16 14:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, akpm, david

On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 76c817162d2f..4300c9dbafec 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -40,8 +40,25 @@ enum page_ext_flags {
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
>  #endif
> +	/*
> +	 * 32 bits following this are used by the migrator.
> +	 * The next available bit position is 33.
> +	 */
> +	PAGE_EXT_MIGRATE_READY,
>  };
>  
> +#define PAGE_EXT_MIG_NID_WIDTH	10
> +#define PAGE_EXT_MIG_FREQ_WIDTH	3
> +#define PAGE_EXT_MIG_TIME_WIDTH	18
> +
> +#define PAGE_EXT_MIG_NID_SHIFT	(PAGE_EXT_MIGRATE_READY + 1)
> +#define PAGE_EXT_MIG_FREQ_SHIFT	(PAGE_EXT_MIG_NID_SHIFT + PAGE_EXT_MIG_NID_WIDTH)
> +#define PAGE_EXT_MIG_TIME_SHIFT	(PAGE_EXT_MIG_FREQ_SHIFT + PAGE_EXT_MIG_FREQ_WIDTH)
> +
> +#define PAGE_EXT_MIG_NID_MASK	((1UL << PAGE_EXT_MIG_NID_SHIFT) - 1)
> +#define PAGE_EXT_MIG_FREQ_MASK	((1UL << PAGE_EXT_MIG_FREQ_SHIFT) - 1)
> +#define PAGE_EXT_MIG_TIME_MASK	((1UL << PAGE_EXT_MIG_TIME_SHIFT) - 1)

OK, so we need to have a conversation about page_ext.  Sorry this is
happening to you.  I've kind of skipped over page_ext when talking
about folios and memdescs up to now, so it's not that you've missed
anything.

As the comment says,

 * Page Extension can be considered as an extended mem_map.

and we need to do this because we don't want to grow struct page beyond
64 bytes.  But memdescs are dynamically allocated, so we don't need
page_ext any more, and all that code can go away.

lib/alloc_tag.c:struct page_ext_operations page_alloc_tagging_ops = {
mm/page_ext.c:static struct page_ext_operations page_idle_ops __initdata = {
mm/page_ext.c:static struct page_ext_operations *page_ext_ops[] __initdata = {
mm/page_owner.c:struct page_ext_operations page_owner_ops = {
mm/page_table_check.c:struct page_ext_operations page_table_check_ops = {

I think all of these are actually per-memdesc things and not per-page
things, so we can get rid of them all.  That means I don't want to see
new per-page data being added to page_ext.

So, what's this really used for?  It seems like it's really
per-allocation, not per-page.  Does it need to be preserved across
alloc/free or can it be reset at free time?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: page_ext and memdescs
  2025-06-16 14:05   ` page_ext and memdescs Matthew Wilcox
@ 2025-06-17  8:28     ` Bharata B Rao
  2025-06-24  9:47     ` David Hildenbrand
  1 sibling, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-06-17  8:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, akpm, david

On 16-Jun-25 7:35 PM, Matthew Wilcox wrote:
> On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
<snip>
>> +#define PAGE_EXT_MIG_NID_MASK	((1UL << PAGE_EXT_MIG_NID_SHIFT) - 1)
>> +#define PAGE_EXT_MIG_FREQ_MASK	((1UL << PAGE_EXT_MIG_FREQ_SHIFT) - 1)
>> +#define PAGE_EXT_MIG_TIME_MASK	((1UL << PAGE_EXT_MIG_TIME_SHIFT) - 1)
> 
> OK, so we need to have a conversation about page_ext.  Sorry this is
> happening to you.  I've kind of skipped over page_ext when talking
> about folios and memdescs up to now, so it's not that you've missed
> anything.
> 
> As the comment says,
> 
>   * Page Extension can be considered as an extended mem_map.
> 
> and we need to do this because we don't want to grow struct page beyond
> 64 bytes.  But memdescs are dynamically allocated, so we don't need
> page_ext any more, and all that code can go away.
> 
> lib/alloc_tag.c:struct page_ext_operations page_alloc_tagging_ops = {
> mm/page_ext.c:static struct page_ext_operations page_idle_ops __initdata = {
> mm/page_ext.c:static struct page_ext_operations *page_ext_ops[] __initdata = {
> mm/page_owner.c:struct page_ext_operations page_owner_ops = {
> mm/page_table_check.c:struct page_ext_operations page_table_check_ops = {
> 
> I think all of these are actually per-memdesc things and not per-page
> things, so we can get rid of them all.  That means I don't want to see
> new per-page data being added to page_ext.

Fair point.

> 
> So, what's this really used for?  It seems like it's really
> per-allocation, not per-page.  Does it need to be preserved across
> alloc/free or can it be reset at free time?

The context here is to track the pages that need to be migrated. Whether 
it is for NUMA Balancing or for any other sub-system that would need to 
migrate (or promote) pages across nodes, I am trying to come up with a 
kernel thread based migrator that would migrate the identified pages in 
an async and batched manner. For this, the basic information that is 
required for each such ready-to-be-migrated page is the target NID.
Since I have chosen to walk the zones and iterate over each page by
PFN, an additional piece of information I need per ready-to-be-migrated
page is an indication that the page is indeed ready to be migrated by
the migrator thread.

In addition to these two things, if we want to carve out a single system
(like the kpromoted approach) that handles inputs from multiple page
hotness sources and maintains heuristics to decide when exactly to
migrate/promote a page, then it would be good to store a few other
pieces of information for such pages (like access frequency, access
timestamp etc).

With that background, I am looking for an optimal place to store this
information. In my earlier approaches, I was maintaining a global list
of such hot pages and realized that such an approach will not scale;
hence, in the current approach, I am tying that information to the page
itself. With that, there is no overhead of maintaining such a list, no
synchronization between producers and the migrator thread, and no
allocation for each tracked page. Hence it appeared to me that
pre-allocated per-page info would be preferable, and page extension
appeared to be a good place to keep this information.
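
For concreteness, the per-page information described above fits in the
32 bits reserved in patch 3; expressed as a bitfield it would look
roughly like this (illustrative only, no such struct exists in the
series):

    struct page_mig_info {
            u32 ready : 1;  /* PAGE_EXT_MIGRATE_READY */
            u32 nid   : 10; /* target node for migration */
            u32 freq  : 3;  /* access frequency (currently unused) */
            u32 time  : 18; /* access timestamp (currently unused) */
    };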

Sorry for the long reply, but coming to your specific question now.
So I really need to maintain such data only for pages that can be 
migrated. Pages like most anonymous pages, file-backed pages, pages that 
are mapped into user page tables, THP pages etc. are candidates. I wonder 
which memdesc type/types would cover all such pages. Would "folio" as a 
memdesc (https://kernelnewbies.org/MatthewWilcox/FolioAlloc) be a broad 
enough type for this?

As you note, it appears to me that it could be per-allocation rather 
than per-page and the information needn't be preserved across alloc/free.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v1 0/4] Kernel thread based async batch migration
  2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
                   ` (3 preceding siblings ...)
  2025-06-16 13:39 ` [RFC PATCH v1 4/4] mm: sched: Batch-migrate misplaced pages Bharata B Rao
@ 2025-06-20  6:39 ` Huang, Ying
  2025-06-20  8:58   ` Bharata B Rao
  4 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2025-06-20  6:39 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
	akpm, david

Bharata B Rao <bharata@amd.com> writes:

> Hi,
>
> This is a continuation of the earlier post[1] that attempted to
> convert migrations from NUMA Balancing to be async and batched.
> In this version, per-node kernel threads are created to handle
> migrations in an async manner.
>
> This adds a few fields to the extended page flags that can be
> used both by the sub-systems that request migrations and by
> kmigrated, which migrates the pages. Some of the fields are defined
> for potential use by a kpromoted-like subsystem to manage hot page
> metrics, but are unused right now.
>
> Currently only NUMA Balancing is changed to make use of the async
> batched migration. It does so by recording the target NID and the
> readiness of the page to be migrated in the extended page flags
> fields.
>
> Each kmigrated thread routinely scans its node's PFNs, identifies the
> pages marked for migration and batch-migrates them. Unlike the previous
> approach, the responsibility of isolating the pages now lies with
> kmigrated.
>
> The major difference between this approach and the way kpromoted[2]
> tracked hot pages is the elimination of heavy synchronization points
> between the producers (sub-systems that request migrations or report
> a hot page) and the consumer (kmigrated or kpromoted).
> Instead of tracking hot pages in a separate, orthogonal list,
> this approach ties the hot page or migration information
> to the struct page.

I don't think page flag + scanning is a good idea.  If the
synchronization is really a problem for you (based on test results),
some per-CPU data structure can be used to record candidate pages.
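
(As an illustration of the suggestion, one possible shape for such a
per-CPU structure is sketched below; none of these names exist in the
series, and draining the buffers from the migrator thread would need
additional synchronization:)

    #include <linux/percpu.h>

    struct mig_candidate {
            unsigned long pfn;
            int nid;                        /* target node */
    };

    #define MIG_CAND_MAX 256

    struct mig_cand_buf {
            unsigned int nr;
            struct mig_candidate cand[MIG_CAND_MAX];
    };

    static DEFINE_PER_CPU(struct mig_cand_buf, mig_cand_buf);

    /* Producer, called from the hint-fault path on the local CPU. */
    static void record_candidate(unsigned long pfn, int nid)
    {
            struct mig_cand_buf *buf = get_cpu_ptr(&mig_cand_buf);

            /* Overflow is handled by simply dropping the candidate. */
            if (buf->nr < MIG_CAND_MAX) {
                    buf->cand[buf->nr].pfn = pfn;
                    buf->cand[buf->nr].nid = nid;
                    buf->nr++;
            }
            put_cpu_ptr(&mig_cand_buf);
    }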

[snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v1 0/4] Kernel thread based async batch migration
  2025-06-20  6:39 ` [RFC PATCH v1 0/4] Kernel thread based async batch migration Huang, Ying
@ 2025-06-20  8:58   ` Bharata B Rao
  2025-06-20  9:59     ` Huang, Ying
  0 siblings, 1 reply; 13+ messages in thread
From: Bharata B Rao @ 2025-06-20  8:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
	akpm, david

On 20-Jun-25 12:09 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> <snip>
> 
> I don't think page flag + scanning is a good idea. If the

If extended page flags is not the ideal location (I chose it in this
version only to get something going quickly), we can look at maintaining
per-pfn allocation for the required hot page metadata separately.

Or is your concern specifically with scanning? What problems do you see?
Is it the cost or the possibility of not identifying the migrate-ready
pages in time? Or something else?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v1 0/4] Kernel thread based async batch migration
  2025-06-20  8:58   ` Bharata B Rao
@ 2025-06-20  9:59     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2025-06-20  9:59 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
	akpm, david

Bharata B Rao <bharata@amd.com> writes:

> On 20-Jun-25 12:09 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> <snip>
>> 
>> I don't think page flag + scanning is a good idea. If the
>
> If extended page flags is not the ideal location (I chose it in this
> version only to get something going quickly), we can look at maintaining
> per-pfn allocation for the required hot page metadata separately.
>
> Or is your concern specifically with scanning? What problems do you
> see?
>
> Is it the cost or the possibility of not identifying the migrate-ready
> pages in time? Or something else?

We may need to scan a large number of pages to identify a page to
promote.  This will waste CPU cycles and pollute the cache.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: page_ext and memdescs
  2025-06-16 14:05   ` page_ext and memdescs Matthew Wilcox
  2025-06-17  8:28     ` Bharata B Rao
@ 2025-06-24  9:47     ` David Hildenbrand
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-06-24  9:47 UTC (permalink / raw)
  To: Matthew Wilcox, Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, akpm

On 16.06.25 16:05, Matthew Wilcox wrote:
> On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
>> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
>> index 76c817162d2f..4300c9dbafec 100644
>> --- a/include/linux/page_ext.h
>> +++ b/include/linux/page_ext.h
>> @@ -40,8 +40,25 @@ enum page_ext_flags {
>>   	PAGE_EXT_YOUNG,
>>   	PAGE_EXT_IDLE,
>>   #endif
>> +	/*
>> +	 * 32 bits following this are used by the migrator.
>> +	 * The next available bit position is 33.
>> +	 */
>> +	PAGE_EXT_MIGRATE_READY,
>>   };
>>   
>> +#define PAGE_EXT_MIG_NID_WIDTH	10
>> +#define PAGE_EXT_MIG_FREQ_WIDTH	3
>> +#define PAGE_EXT_MIG_TIME_WIDTH	18
>> +
>> +#define PAGE_EXT_MIG_NID_SHIFT	(PAGE_EXT_MIGRATE_READY + 1)
>> +#define PAGE_EXT_MIG_FREQ_SHIFT	(PAGE_EXT_MIG_NID_SHIFT + PAGE_EXT_MIG_NID_WIDTH)
>> +#define PAGE_EXT_MIG_TIME_SHIFT	(PAGE_EXT_MIG_FREQ_SHIFT + PAGE_EXT_MIG_FREQ_WIDTH)
>> +
>> +#define PAGE_EXT_MIG_NID_MASK	((1UL << PAGE_EXT_MIG_NID_SHIFT) - 1)
>> +#define PAGE_EXT_MIG_FREQ_MASK	((1UL << PAGE_EXT_MIG_FREQ_SHIFT) - 1)
>> +#define PAGE_EXT_MIG_TIME_MASK	((1UL << PAGE_EXT_MIG_TIME_SHIFT) - 1)
> 
> OK, so we need to have a conversation about page_ext.  Sorry this is
> happening to you.  I've kind of skipped over page_ext when talking
> about folios and memdescs up to now, so it's not that you've missed
> anything.
> 
> As the comment says,
> 
>   * Page Extension can be considered as an extended mem_map.
> 
> and we need to do this because we don't want to grow struct page beyond
> 64 bytes.  But memdescs are dynamically allocated, so we don't need
> page_ext any more, and all that code can go away.
> 
> lib/alloc_tag.c:struct page_ext_operations page_alloc_tagging_ops = {

In this case, we might not necessarily have an allocated memdesc for 
all allocations, though. Think of memory ballooning allocating "offline" 
pages in the future.

Of course, the easy solution is to not track these non-memdesc allocations.

> mm/page_ext.c:static struct page_ext_operations page_idle_ops __initdata = {

That should be per-folio.

> mm/page_ext.c:static struct page_ext_operations *page_ext_ops[] __initdata = {

That's just the lookup table for the others.

> mm/page_owner.c:struct page_ext_operations page_owner_ops = {

Hm, probably like tagging above.

> mm/page_table_check.c:struct page_ext_operations page_table_check_ops = {

That should be per-folio as well IIUC.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread
  2025-06-16 13:39 ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Bharata B Rao
  2025-06-16 14:05   ` page_ext and memdescs Matthew Wilcox
@ 2025-07-07  9:36   ` Byungchul Park
  2025-07-08  3:43     ` Bharata B Rao
  1 sibling, 1 reply; 13+ messages in thread
From: Byungchul Park @ 2025-07-07  9:36 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, akpm, david, kernel_team

On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
> 
> kmigrated is a per-node kernel thread that migrates the
> folios marked for migration in batches. Each kmigrated
> thread walks the PFN range spanning its node and checks
> for potential migration candidates.
> 
> It depends on the fields added to extended page flags
> to determine the pages that need to be migrated and
> the target NID.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  include/linux/mmzone.h   |   5 +
>  include/linux/page_ext.h |  17 +++
>  mm/Makefile              |   3 +-
>  mm/kmigrated.c           | 223 +++++++++++++++++++++++++++++++++++++++
>  mm/mm_init.c             |   6 ++
>  mm/page_ext.c            |  11 ++
>  6 files changed, 264 insertions(+), 1 deletion(-)
>  create mode 100644 mm/kmigrated.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 283913d42d7b..5d7f0b8d3c91 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -853,6 +853,8 @@ enum zone_type {
> 
>  };
> 
> +int kmigrated_add_pfn(unsigned long pfn, int nid);
> +
>  #ifndef __GENERATING_BOUNDS_H
> 
>  #define ASYNC_AND_SYNC 2
> @@ -1049,6 +1051,7 @@ enum pgdat_flags {
>                                          * many pages under writeback
>                                          */
>         PGDAT_RECLAIM_LOCKED,           /* prevents concurrent reclaim */
> +       PGDAT_KMIGRATED_ACTIVATE,       /* activates kmigrated */
>  };
> 
>  enum zone_flags {
> @@ -1493,6 +1496,8 @@ typedef struct pglist_data {
>  #ifdef CONFIG_MEMORY_FAILURE
>         struct memory_failure_stats mf_stats;
>  #endif
> +       struct task_struct *kmigrated;
> +       wait_queue_head_t kmigrated_wait;
>  } pg_data_t;
> 
>  #define node_present_pages(nid)        (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 76c817162d2f..4300c9dbafec 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -40,8 +40,25 @@ enum page_ext_flags {
>         PAGE_EXT_YOUNG,
>         PAGE_EXT_IDLE,
>  #endif
> +       /*
> +        * 32 bits following this are used by the migrator.
> +        * The next available bit position is 33.
> +        */
> +       PAGE_EXT_MIGRATE_READY,
>  };
> 
> +#define PAGE_EXT_MIG_NID_WIDTH 10
> +#define PAGE_EXT_MIG_FREQ_WIDTH        3
> +#define PAGE_EXT_MIG_TIME_WIDTH        18
> +
> +#define PAGE_EXT_MIG_NID_SHIFT (PAGE_EXT_MIGRATE_READY + 1)
> +#define PAGE_EXT_MIG_FREQ_SHIFT        (PAGE_EXT_MIG_NID_SHIFT + PAGE_EXT_MIG_NID_WIDTH)
> +#define PAGE_EXT_MIG_TIME_SHIFT        (PAGE_EXT_MIG_FREQ_SHIFT + PAGE_EXT_MIG_FREQ_WIDTH)
> +
> +#define PAGE_EXT_MIG_NID_MASK  ((1UL << PAGE_EXT_MIG_NID_SHIFT) - 1)
> +#define PAGE_EXT_MIG_FREQ_MASK ((1UL << PAGE_EXT_MIG_FREQ_SHIFT) - 1)
> +#define PAGE_EXT_MIG_TIME_MASK ((1UL << PAGE_EXT_MIG_TIME_SHIFT) - 1)
> +
>  /*
>   * Page Extension can be considered as an extended mem_map.
>   * A page_ext page is associated with every page descriptor. The
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..5a382f19105f 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,7 +37,8 @@ mmu-y                 := nommu.o
>  mmu-$(CONFIG_MMU)      := highmem.o memory.o mincore.o \
>                            mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>                            msync.o page_vma_mapped.o pagewalk.o \
> -                          pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o
> +                          pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o \
> +                          kmigrated.o
> 
> 
>  ifdef CONFIG_CROSS_MEMORY_ATTACH
> diff --git a/mm/kmigrated.c b/mm/kmigrated.c
> new file mode 100644
> index 000000000000..3caefe4be0e7
> --- /dev/null
> +++ b/mm/kmigrated.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * kmigrated is a kernel thread that runs for each node that has
> + * memory. It iterates over the node's PFNs and migrates pages
> + * marked for migration to their target nodes.
> + *
> + * kmigrated depends on PAGE_EXTENSION to find out the pages that
> + * need to be migrated. In addition to a few fields that could be
> + * used by hot page promotion logic to store and evaluate the page
> + * hotness information, the extended page flags field is extended
> + * to store the target NID for migration.
> + */
> +#include <linux/mm.h>
> +#include <linux/migrate.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/page_ext.h>
> +
> +#define KMIGRATE_DELAY MSEC_PER_SEC
> +#define KMIGRATE_BATCH 512
> +
> +static int page_ext_xchg_nid(struct page_ext *page_ext, int nid)
> +{
> +       unsigned long old_flags, flags;
> +       int old_nid;
> +
> +       old_flags = READ_ONCE(page_ext->flags);
> +       do {
> +               flags = old_flags;
> +               old_nid = (flags >> PAGE_EXT_MIG_NID_SHIFT) & PAGE_EXT_MIG_NID_MASK;
> +
> +               flags &= ~(PAGE_EXT_MIG_NID_MASK << PAGE_EXT_MIG_NID_SHIFT);
> +               flags |= (nid & PAGE_EXT_MIG_NID_MASK) << PAGE_EXT_MIG_NID_SHIFT;
> +       } while (unlikely(!try_cmpxchg(&page_ext->flags, &old_flags, flags)));
> +
> +       return old_nid;
> +}
> +
> +/*
> + * Marks the page as ready for migration.
> + *
> + * @pfn: PFN of the page
> + * @nid: Target NID to where the page needs to be migrated
> + *
> + * The request for migration is noted by setting PAGE_EXT_MIGRATE_READY
> + * in the extended page flags which the kmigrated thread would check.
> + */
> +int kmigrated_add_pfn(unsigned long pfn, int nid)
> +{
> +       struct page *page;
> +       struct page_ext *page_ext;
> +
> +       page = pfn_to_page(pfn);
> +       if (!page)
> +               return -EINVAL;
> +
> +       page_ext = page_ext_get(page);
> +       if (unlikely(!page_ext))
> +               return -EINVAL;
> +
> +       page_ext_xchg_nid(page_ext, nid);
> +       test_and_set_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags);
> +       page_ext_put(page_ext);
> +
> +       set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
> +       return 0;
> +}
> +
> +/*
> + * If the page has been marked ready for migration, return
> + * the NID to which it needs to be migrated.
> + *
> + * If not, return NUMA_NO_NODE.
> + */
> +static int kmigrated_get_nid(struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       int nid = NUMA_NO_NODE;
> +
> +       page_ext = page_ext_get(page);
> +       if (unlikely(!page_ext))
> +               return nid;
> +
> +       if (!test_and_clear_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags))
> +               goto out;
> +
> +       nid = page_ext_xchg_nid(page_ext, nid);
> +out:
> +       page_ext_put(page_ext);
> +       return nid;
> +}
> +
> +/*
> + * Walks the PFNs of the zone, isolates and migrates them in batches.
> + */
> +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
> +                               int src_nid)
> +{
> +       int nid, cur_nid = NUMA_NO_NODE;
> +       LIST_HEAD(migrate_list);
> +       int batch_count = 0;
> +       struct folio *folio;
> +       struct page *page;
> +       unsigned long pfn;
> +
> +       for (pfn = start_pfn; pfn < end_pfn; pfn++) {

Hi,

Is it feasible to scan all the pages in each zone?  I think we should
figure out a better way so as to reduce CPU time for this purpose.

Besides the opinion above, I was thinking of designing and implementing a
kthread for memory placement between different tiers - I have already
named it kmplaced - rather than relying on kswapd and hinting faults, lol ;)

Now that you've started, I'd like to think about it together and improve
it so that it works better.  Please cc me from the next spin.

	Byungchul

> +               if (!pfn_valid(pfn))
> +                       continue;
> +
> +               page = pfn_to_online_page(pfn);
> +               if (!page)
> +                       continue;
> +
> +               if (page_to_nid(page) != src_nid)
> +                       continue;
> +
> +               /*
> +                * TODO: Take care of folio_nr_pages() increment
> +                * to pfn count.
> +                */
> +               folio = page_folio(page);
> +               if (!folio_test_lru(folio))
> +                       continue;
> +
> +               nid = kmigrated_get_nid(page);
> +               if (nid == NUMA_NO_NODE)
> +                       continue;
> +
> +               if (page_to_nid(page) == nid)
> +                       continue;
> +
> +               if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> +                       continue;
> +
> +               if (cur_nid != NUMA_NO_NODE)
> +                       cur_nid = nid;
> +
> +               if (++batch_count >= KMIGRATE_BATCH || cur_nid != nid) {
> +                       migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +                       cur_nid = nid;
> +                       batch_count = 0;
> +                       cond_resched();
> +               }
> +               list_add(&folio->lru, &migrate_list);
> +       }
> +       if (!list_empty(&migrate_list))
> +               migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +}
> +
> +static void kmigrated_do_work(pg_data_t *pgdat)
> +{
> +       struct zone *zone;
> +       int zone_idx;
> +
> +       clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> +               zone = &pgdat->node_zones[zone_idx];
> +
> +               if (!populated_zone(zone))
> +                       continue;
> +
> +               if (zone_is_zone_device(zone))
> +                       continue;
> +
> +               kmigrated_walk_zone(zone->zone_start_pfn, zone_end_pfn(zone),
> +                                   pgdat->node_id);
> +       }
> +}
> +
> +static inline bool kmigrated_work_requested(pg_data_t *pgdat)
> +{
> +       return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +}
> +
> +static void kmigrated_wait_work(pg_data_t *pgdat)
> +{
> +       long timeout = msecs_to_jiffies(KMIGRATE_DELAY);
> +
> +       wait_event_timeout(pgdat->kmigrated_wait,
> +                          kmigrated_work_requested(pgdat), timeout);
> +}
> +
> +/*
> + * Per-node kthread that iterates over its PFNs and migrates the
> + * pages that have been marked for migration.
> + */
> +static int kmigrated(void *p)
> +{
> +       pg_data_t *pgdat = (pg_data_t *)p;
> +
> +       while (!kthread_should_stop()) {
> +               kmigrated_wait_work(pgdat);
> +               kmigrated_do_work(pgdat);
> +       }
> +       return 0;
> +}
> +
> +static void kmigrated_run(int nid)
> +{
> +       pg_data_t *pgdat = NODE_DATA(nid);
> +
> +       if (pgdat->kmigrated)
> +               return;
> +
> +       pgdat->kmigrated = kthread_create(kmigrated, pgdat, "kmigrated%d", nid);
> +       if (IS_ERR(pgdat->kmigrated)) {
> +               pr_err("Failed to start kmigrated for node %d\n", nid);
> +               pgdat->kmigrated = NULL;
> +       } else {
> +               wake_up_process(pgdat->kmigrated);
> +       }
> +}
> +
> +static int __init kmigrated_init(void)
> +{
> +       int nid;
> +
> +       for_each_node_state(nid, N_MEMORY)
> +               kmigrated_run(nid);
> +
> +       return 0;
> +}
> +
> +subsys_initcall(kmigrated_init)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index f2944748f526..3a9cfd175366 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1398,6 +1398,11 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
>  static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>  #endif
> 
> +static void pgdat_init_kmigrated(struct pglist_data *pgdat)
> +{
> +       init_waitqueue_head(&pgdat->kmigrated_wait);
> +}
> +
>  static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  {
>         int i;
> @@ -1407,6 +1412,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
> 
>         pgdat_init_split_queue(pgdat);
>         pgdat_init_kcompactd(pgdat);
> +       pgdat_init_kmigrated(pgdat);
> 
>         init_waitqueue_head(&pgdat->kswapd_wait);
>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index c351fdfe9e9a..546725fffddb 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -76,6 +76,16 @@ static struct page_ext_operations page_idle_ops __initdata = {
>  };
>  #endif
> 
> +static bool need_page_mig(void)
> +{
> +       return true;
> +}
> +
> +static struct page_ext_operations page_mig_ops __initdata = {
> +       .need = need_page_mig,
> +       .need_shared_flags = true,
> +};
> +
>  static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_OWNER
>         &page_owner_ops,
> @@ -89,6 +99,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_TABLE_CHECK
>         &page_table_check_ops,
>  #endif
> +       &page_mig_ops,
>  };
> 
>  unsigned long page_ext_size;
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread
  2025-07-07  9:36   ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Byungchul Park
@ 2025-07-08  3:43     ` Bharata B Rao
  0 siblings, 0 replies; 13+ messages in thread
From: Bharata B Rao @ 2025-07-08  3:43 UTC (permalink / raw)
  To: Byungchul Park
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
	sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, akpm, david, kernel_team

On 07-Jul-25 3:06 PM, Byungchul Park wrote:
> On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
>> +
>> +/*
>> + * Walks the PFNs of the zone, isolates and migrates them in batches.
>> + */
>> +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>> +                               int src_nid)
>> +{
>> +       int nid, cur_nid = NUMA_NO_NODE;
>> +       LIST_HEAD(migrate_list);
>> +       int batch_count = 0;
>> +       struct folio *folio;
>> +       struct page *page;
>> +       unsigned long pfn;
>> +
>> +       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 
> Hi,
> 
> Is it feasible to scan all the pages in each zone?  I think we should
> figure out a better way so as to reduce CPU time for this purpose.

I have incorporated a per-zone indicator that informs kmigrated whether
it can skip a whole zone when scanning, so that it looks only at those
zones which have migrate-ready pages.
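
One way such an indicator could look (a sketch; the zone flag below is
made up, along the lines of the existing PGDAT_KMIGRATED_ACTIVATE bit):

    /* In kmigrated_add_pfn(): also flag the zone of the marked page. */
    set_bit(ZONE_KMIGRATED_ACTIVATE, &page_zone(page)->flags);

    /* In kmigrated_do_work(): skip zones that have nothing marked. */
    for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
            zone = &pgdat->node_zones[zone_idx];

            if (!populated_zone(zone))
                    continue;
            if (!test_and_clear_bit(ZONE_KMIGRATED_ACTIVATE, &zone->flags))
                    continue;
            kmigrated_walk_zone(zone->zone_start_pfn, zone_end_pfn(zone),
                                pgdat->node_id);
    }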

CPU time spent is one aspect, but the other aspect I have observed is
the delay in identifying migrate-ready pages depending on where they
exist in the zone. I have been seeing both best-case and worst-case
behaviors, due to which the number of pages migrated for a given
workload can vary from run to run.

Hence scanning all pages without additional smarts to quickly arrive
at the pages of interest may not be ideal. I am working on approaches
to improve this situation.

> 
> Besides the opinion above, I was thinking to design and implement a
> kthread for memory placement between different tiers - I already named
> it e.g. kmplaced, rather than relying on kswapd and hinting fault, lol ;)
> 
> Now that you've started, I'd like to think about it together and improve
> it so that it works better.  Please cc me from the next spin.

Sure, will do from next post.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-07-08  3:43 UTC | newest]

Thread overview: 13+ messages
2025-06-16 13:39 [RFC PATCH v1 0/4] Kernel thread based async batch migration Bharata B Rao
2025-06-16 13:39 ` [RFC PATCH v1 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-06-16 13:39 ` [RFC PATCH v1 2/4] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-06-16 13:39 ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Bharata B Rao
2025-06-16 14:05   ` page_ext and memdescs Matthew Wilcox
2025-06-17  8:28     ` Bharata B Rao
2025-06-24  9:47     ` David Hildenbrand
2025-07-07  9:36   ` [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread Byungchul Park
2025-07-08  3:43     ` Bharata B Rao
2025-06-16 13:39 ` [RFC PATCH v1 4/4] mm: sched: Batch-migrate misplaced pages Bharata B Rao
2025-06-20  6:39 ` [RFC PATCH v1 0/4] Kernel thread based async batch migration Huang, Ying
2025-06-20  8:58   ` Bharata B Rao
2025-06-20  9:59     ` Huang, Ying
