linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit
@ 2025-08-14 15:32 Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 01/17] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

The current series adds further enhancements on top of RFC V2 and
incorporates the review comments received there.

This is an additional source of hot pages, alongside NUMAB, IBS [4], and KMGLRUD [5].

Introduction:
=============
In the current hot page promotion scheme, all the activities, including
process address space scanning, NUMA hint fault handling and page
migration, are performed in process context, i.e., the scanning overhead is
borne by applications.

This patch series does slow-tier page promotion by using PTE Accessed
bit scanning. Scanning is done by a global kernel thread which routinely
scans all the processes' address spaces and checks for accesses by reading
the PTE A bit.

A separate migration thread migrates/promotes the pages to the top-tier
node based on a simple heuristic that uses top-tier scan/access information
of the mm.

Additionally, based on feedback, a prctl knob with a scalar value is
provided to control per-task scanning.
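
For illustration, per-task control from userspace could look roughly like the
sketch below. The prctl name and value semantics here are placeholders only;
the actual interface is defined in patches 15-16:

    #include <sys/prctl.h>

    /* Placeholder; the real prctl command name/number comes from patch 15. */
    #ifndef PR_SET_MEM_SCAN
    #define PR_SET_MEM_SCAN 0
    #endif

    /* scale 0 = no per-task scanning; larger values scan more aggressively */
    static int set_task_scan_scale(unsigned long scale)
    {
            return prctl(PR_SET_MEM_SCAN, scale, 0, 0, 0);
    }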

Changes since RFC V2:
=====================
 - Enhanced logic to migrate on second access.

 - Use the prctl scalar value to further tune scanning efficiency.

 - Use of PFN instead of folio to record hot pages, for easy integration
with kpromoted/kmigrated [4].

 - Rebasing on top of fork/exec changes in v6.16.

 - Revisiting mm_walk logic and folio validation based on Harry's comments.

 - Feedback from the migration side to slow down scanning when more
 migration failures happen.

 - Addressed Masami's comment on the trace patch.

 - Bug fix for a crash on an overnight-idle system due to incorrect
 kmem_cache usage.

 - Enhanced target node finding logic to also obtain fallback nodes to migrate to.
(TBD: this needs a follow-up patch that actually does migration to the fallback
target nodes.)

Changes since RFC V1:
=====================
- Addressed the review comments from Jonathan (thank you for the close
 review).

- Per-mm migration list with a separate lock to resolve the race
conditions/softlockups reported by Davidlohr.

- Add one more filter before migration for LRU_GEN case to check whether
 folio is still hot.

- Renamed kmmscand ==> kscand, kmmmigrated ==> kmigrated (hopefully the
 latter gets merged into Bharata's upcoming migration thread)

Changes since RFC V0:
======================
- A separate migration thread is used for migration, thus alleviating the
  need for multi-threaded scanning (at least as per tracing).

- A simple heuristic for target node calculation is added.

- A prctl interface (suggested by David R) with a scalar value is added to
  control per-task scanning.

- Steve's comment on tracing incorporated.

- Fixed the bug reported by Davidlohr.

- Initial scan delay similar to NUMAB1 mode added.

- Got rid of migration lock during mm_walk.

A note on per mm migration list using mm_slot:
=============================================
Using a per-mm migration list (mm_slot) has helped to reduce contention,
 thus easing mm teardown during process exit.

It also helps to tie the PFN/folio to the mm so the heuristics work better,
 and it would further help to throttle migration per mm (or process) (TBD).
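
For reference, the per mm migration bookkeeping added later in the series
looks conceptually like the below (structures as introduced in patches 3
and 6, simplified here):

    /* Per mm migration list (patch 6) */
    struct kmigrated_mm_slot {
            struct mm_slot mm_slot;         /* ties the list to one mm */
            spinlock_t migrate_lock;        /* protects migrate_head */
            struct list_head migrate_head;  /* list of kscand_migrate_info */
    };

    /* One entry per hot page spotted by the scanner (patch 3) */
    struct kscand_migrate_info {
            struct list_head migrate_node;
            unsigned long pfn;
            unsigned long address;
    };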

A note on PTE A bit scanning:
============================
Major positive: the current patchset is able to cover all of the process
 address space scanning effectively, with simple algorithms to tune scan_size
 and scan_period.
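
A rough sketch of the direction of that tuning is below; the exact rules and
bounds live in patches 7 and 8, and the names/bounds here are illustrative
only:

    #define TUNE_PERCENT            15      /* step used in patch 7 */
    #define SCAN_PERIOD_MIN_MS      100     /* illustrative bounds */
    #define SCAN_PERIOD_MAX_MS      5000

    /*
     * Scan faster while consecutive scans keep finding hot slowtier pages;
     * back off when a scan finds nothing useful.
     */
    static unsigned int tune_scan_period(unsigned int period_ms,
                                         int last_useful, int cur_useful)
    {
            if (last_useful && cur_useful)
                    period_ms -= period_ms * TUNE_PERCENT / 100;
            else if (!cur_useful)
                    period_ms += period_ms * TUNE_PERCENT / 100;

            if (period_ms < SCAN_PERIOD_MIN_MS)
                    period_ms = SCAN_PERIOD_MIN_MS;
            if (period_ms > SCAN_PERIOD_MAX_MS)
                    period_ms = SCAN_PERIOD_MAX_MS;
            return period_ms;
    }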

Thanks to Jonathan, Davidlohr, David, Harry, Masami, and Steve for review feedback on the RFCs.

Future plans:
================
Evaluate how integration with a hotness monitoring subsystem works, or
integrate standalone with the kmigrated API of [4].

Results:
=======
Benchmark: Cbench (by Bharata), used to evaluate promotion performance on a
slowtier system.

The benchmark allocates memory on both a regular NUMA node and a slowtier
node, then accesses it continuously.
Goal: finish a fixed number of accesses in less time.
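
A minimal sketch of that access pattern is below. This is not Cbench itself,
only an approximation assuming libnuma; node numbers and sizes are passed in
by the caller:

    #include <stdlib.h>
    #include <string.h>
    #include <numa.h>

    /* Touch memory spread across a regular node and a slowtier node. */
    static void touch_loop(size_t sz, int fast_node, int slow_node,
                           unsigned long iterations)
    {
            char *fast = numa_alloc_onnode(sz, fast_node);
            char *slow = numa_alloc_onnode(sz, slow_node);

            if (!fast || !slow)
                    exit(1);

            for (unsigned long it = 0; it < iterations; it++) {
                    memset(fast, (int)it, sz);      /* toptier accesses */
                    memset(slow, (int)it, sz);      /* slowtier accesses */
            }

            numa_free(fast, sz);
            numa_free(slow, sz);
    }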

SUT: Genoa+ EPYC system

base:    6.16 with NUMAB2 (because this has the best performance)
patched: 6.16 + current series

Time taken in sec (lower is better)
               base           patched
8GB            228            206
32GB           547            534
128GB          1100           920

Links:
[1] RFC V0: https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[2] RFC V1: https://lore.kernel.org/linux-mm/20250319193028.29514-1-raghavendra.kt@amd.com/
[3] RFC V2: https://lore.kernel.org/linux-mm/20250624055617.1291159-1-raghavendra.kt@amd.com/
[4] Hotpage detection and promotion: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/T/#t
[5] MGLRU: https://lkml.org/lkml/2025/3/24/1458

Patch organization:
patch 1-5:   initial skeleton for scanning and migration
patch 6:     migration
patch 7-9:   scanning optimizations
patch 10:    target_node heuristic
patch 11:    migration failure feedback
patch 12-14: sysfs, vmstat and tracing
patch 15-16: prctl implementation and enhancements to scanning
patch 17:    fallback target node finding

Raghavendra K T (17):
  mm: Add kscand kthread for PTE A bit scan
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm/kscand: Add only hot pages to migration list
  mm: Create a separate kthread for migration
  mm/migration: migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  mm: Add initial scan delay
  mm: Add a heuristic to calculate target node
  mm/kscand: Implement migration failure feedback
  sysfs: Add sysfs support to tune scanning
  mm/vmstat: Add vmstat counters
  trace/kscand: Add tracing of scanning and migration
  prctl: Introduce new prctl to control scanning
  prctl: Fine tune scan_period with prctl scale param
  mm: Create a list of fallback target nodes

 Documentation/filesystems/proc.rst |    2 +
 fs/proc/task_mmu.c                 |    4 +
 include/linux/kscand.h             |   30 +
 include/linux/migrate.h            |    2 +
 include/linux/mm.h                 |   13 +
 include/linux/mm_types.h           |    7 +
 include/linux/vm_event_item.h      |   12 +
 include/trace/events/kmem.h        |   99 ++
 include/uapi/linux/prctl.h         |    7 +
 kernel/fork.c                      |    6 +
 kernel/sys.c                       |   25 +
 mm/Kconfig                         |    8 +
 mm/Makefile                        |    1 +
 mm/internal.h                      |    1 +
 mm/kscand.c                        | 1754 ++++++++++++++++++++++++++++
 mm/migrate.c                       |    2 +-
 mm/mmap.c                          |    2 +
 mm/vma_exec.c                      |    3 +
 mm/vmstat.c                        |   12 +
 19 files changed, 1989 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kscand.h
 create mode 100644 mm/kscand.c


base-commit: 038d61fd642278bab63ee8ef722c50d10ab01e8f
-- 
2.34.1



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 01/17] mm: Add kscand kthread for PTE A bit scan
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 02/17] mm: Maintain mm_struct list in the system Raghavendra K T
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Also add a config option for the same.

High level design:

While (1):
  Scan the slowtier pages belonging to VMAs of a task.
  Add to migration list.

A separate thread:
  Migrate scanned pages to a toptier node based on heuristics.

The overall code is influenced by khugepaged design.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/Kconfig  |   8 +++
 mm/Makefile |   1 +
 mm/kscand.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 172 insertions(+)
 create mode 100644 mm/kscand.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..d1e5be76a96e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,6 +750,14 @@ config KSM
 	  until a program has madvised that an area is MADV_MERGEABLE, and
 	  root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
 
+config KSCAND
+	bool "Enable PTE A bit scanning and Migration"
+	depends on NUMA_BALANCING
+	help
+	  Enable PTE A bit scanning of pages. The option creates a separate
+	  kthread for scanning and migration. Accessed slow-tier pages are
+	  migrated to a regular NUMA node to reduce hot page access latency.
+
 config DEFAULT_MMAP_MIN_ADDR
 	int "Low address space to protect from user allocation"
 	depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..a16ef2ff3da1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
+obj-$(CONFIG_KSCAND) += kscand.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/kscand.c b/mm/kscand.c
new file mode 100644
index 000000000000..f7bbbc70c86a
--- /dev/null
+++ b/mm/kscand.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include <linux/cleanup.h>
+
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+static struct task_struct *kscand_thread __read_mostly;
+static DEFINE_MUTEX(kscand_mutex);
+
+/* How long to pause between two scan cycles */
+static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
+
+/* Max number of mms to scan in one scan cycle */
+#define KSCAND_MMS_TO_SCAN	(4 * 1024UL)
+static unsigned long kscand_mms_to_scan __read_mostly = KSCAND_MMS_TO_SCAN;
+
+bool kscand_scan_enabled = true;
+static bool need_wakeup;
+
+static unsigned long kscand_sleep_expire;
+
+static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
+
+/* Data structure to keep track of current mm under scan */
+struct kscand_scan {
+	struct list_head mm_head;
+};
+
+struct kscand_scan kscand_scan = {
+	.mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
+};
+
+static inline int kscand_has_work(void)
+{
+	return !list_empty(&kscand_scan.mm_head);
+}
+
+static inline bool kscand_should_wakeup(void)
+{
+	bool wakeup = kthread_should_stop() || need_wakeup ||
+	       time_after_eq(jiffies, kscand_sleep_expire);
+
+	need_wakeup = false;
+
+	return wakeup;
+}
+
+static void kscand_wait_work(void)
+{
+	const unsigned long scan_sleep_jiffies =
+		msecs_to_jiffies(kscand_scan_sleep_ms);
+
+	if (!scan_sleep_jiffies)
+		return;
+
+	kscand_sleep_expire = jiffies + scan_sleep_jiffies;
+
+	/* Allows kthread to pause scanning */
+	wait_event_timeout(kscand_wait, kscand_should_wakeup(),
+			scan_sleep_jiffies);
+}
+static void kscand_do_scan(void)
+{
+	unsigned long iter = 0, mms_to_scan;
+
+	mms_to_scan = READ_ONCE(kscand_mms_to_scan);
+
+	while (true) {
+		if (unlikely(kthread_should_stop()) ||
+			!READ_ONCE(kscand_scan_enabled))
+			break;
+
+		if (kscand_has_work())
+			msleep(100);
+
+		iter++;
+
+		if (iter >= mms_to_scan)
+			break;
+		cond_resched();
+	}
+}
+
+static int kscand(void *none)
+{
+	while (true) {
+		if (unlikely(kthread_should_stop()))
+			break;
+
+		while (!READ_ONCE(kscand_scan_enabled)) {
+			cpu_relax();
+			kscand_wait_work();
+		}
+
+		kscand_do_scan();
+
+		kscand_wait_work();
+	}
+	return 0;
+}
+
+static int start_kscand(void)
+{
+	struct task_struct *kthread;
+
+	guard(mutex)(&kscand_mutex);
+
+	if (kscand_thread)
+		return 0;
+
+	kthread = kthread_run(kscand, NULL, "kscand");
+	if (IS_ERR(kthread)) {
+		pr_err("kscand: kthread_run(kscand) failed\n");
+		return PTR_ERR(kthread);
+	}
+
+	kscand_thread = kthread;
+	pr_info("kscand: Successfully started kscand\n");
+
+	if (!list_empty(&kscand_scan.mm_head))
+		wake_up_interruptible(&kscand_wait);
+
+	return 0;
+}
+
+static int stop_kscand(void)
+{
+	guard(mutex)(&kscand_mutex);
+
+	if (kscand_thread) {
+		kthread_stop(kscand_thread);
+		kscand_thread = NULL;
+	}
+
+	return 0;
+}
+
+static int __init kscand_init(void)
+{
+	int err;
+
+	err = start_kscand();
+	if (err)
+		goto err_kscand;
+
+	return 0;
+
+err_kscand:
+	stop_kscand();
+
+	return err;
+}
+subsys_initcall(kscand_init);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 02/17] mm: Maintain mm_struct list in the system
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 01/17] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 03/17] mm: Scan the mm and create a migration list Raghavendra K T
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

The list is used to iterate over all the mms and do PTE A bit scanning.
The mm_slot infrastructure is reused to aid insertion and lookup of the
mm_struct.

CC: linux-fsdevel@vger.kernel.org

Suggested-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/kscand.h | 30 +++++++++++++++
 kernel/fork.c          |  2 +
 mm/internal.h          |  1 +
 mm/kscand.c            | 86 ++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c              |  2 +
 mm/vma_exec.c          |  3 ++
 6 files changed, 124 insertions(+)
 create mode 100644 include/linux/kscand.h

diff --git a/include/linux/kscand.h b/include/linux/kscand.h
new file mode 100644
index 000000000000..ef9947a33ee5
--- /dev/null
+++ b/include/linux/kscand.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KSCAND_H_
+#define _LINUX_KSCAND_H_
+
+#ifdef CONFIG_KSCAND
+extern void __kscand_enter(struct mm_struct *mm);
+extern void __kscand_exit(struct mm_struct *mm);
+
+static inline void kscand_execve(struct mm_struct *mm)
+{
+	__kscand_enter(mm);
+}
+
+static inline void kscand_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	__kscand_enter(mm);
+}
+
+static inline void kscand_exit(struct mm_struct *mm)
+{
+	__kscand_exit(mm);
+}
+#else /* !CONFIG_KSCAND */
+static inline void __kscand_enter(struct mm_struct *mm) {}
+static inline void __kscand_exit(struct mm_struct *mm) {}
+static inline void kscand_execve(struct mm_struct *mm) {}
+static inline void kscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) {}
+static inline void kscand_exit(struct mm_struct *mm) {}
+#endif
+#endif /* _LINUX_KSCAND_H_ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..a13043de91b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -85,6 +85,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/kscand.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
 #include <linux/aio.h>
@@ -1116,6 +1117,7 @@ static inline void __mmput(struct mm_struct *mm)
 
 	uprobe_clear_state(mm);
 	exit_aio(mm);
+	kscand_exit(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
 	exit_mmap(mm);
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..dd86efc54885 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -8,6 +8,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/fs.h>
+#include <linux/kscand.h>
 #include <linux/khugepaged.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
diff --git a/mm/kscand.c b/mm/kscand.c
index f7bbbc70c86a..d5b0d3041b0f 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -7,12 +7,14 @@
 #include <linux/swap.h>
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
+#include <linux/kscand.h>
 #include <linux/string.h>
 #include <linux/delay.h>
 #include <linux/cleanup.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
+#include "mm_slot.h"
 
 static struct task_struct *kscand_thread __read_mostly;
 static DEFINE_MUTEX(kscand_mutex);
@@ -29,11 +31,23 @@ static bool need_wakeup;
 
 static unsigned long kscand_sleep_expire;
 
+static DEFINE_SPINLOCK(kscand_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
 
+#define KSCAND_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kscand_slots_hash, KSCAND_SLOT_HASH_BITS);
+
+static struct kmem_cache *kscand_slot_cache __read_mostly;
+
+/* Per mm information collected to control VMA scanning */
+struct kscand_mm_slot {
+	struct mm_slot slot;
+};
+
 /* Data structure to keep track of current mm under scan */
 struct kscand_scan {
 	struct list_head mm_head;
+	struct kscand_mm_slot *mm_slot;
 };
 
 struct kscand_scan kscand_scan = {
@@ -69,6 +83,12 @@ static void kscand_wait_work(void)
 	wait_event_timeout(kscand_wait, kscand_should_wakeup(),
 			scan_sleep_jiffies);
 }
+
+static inline int kscand_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
 static void kscand_do_scan(void)
 {
 	unsigned long iter = 0, mms_to_scan;
@@ -109,6 +129,65 @@ static int kscand(void *none)
 	return 0;
 }
 
+static inline void kscand_destroy(void)
+{
+	kmem_cache_destroy(kscand_slot_cache);
+}
+
+void __kscand_enter(struct mm_struct *mm)
+{
+	struct kscand_mm_slot *kscand_slot;
+	struct mm_slot *slot;
+	int wakeup;
+
+	/* __kscand_exit() must not run from under us */
+	VM_BUG_ON_MM(kscand_test_exit(mm), mm);
+
+	kscand_slot = mm_slot_alloc(kscand_slot_cache);
+
+	if (!kscand_slot)
+		return;
+
+	slot = &kscand_slot->slot;
+
+	spin_lock(&kscand_mm_lock);
+	mm_slot_insert(kscand_slots_hash, mm, slot);
+
+	wakeup = list_empty(&kscand_scan.mm_head);
+	list_add_tail(&slot->mm_node, &kscand_scan.mm_head);
+	spin_unlock(&kscand_mm_lock);
+
+	mmgrab(mm);
+	if (wakeup)
+		wake_up_interruptible(&kscand_wait);
+}
+
+void __kscand_exit(struct mm_struct *mm)
+{
+	struct kscand_mm_slot *mm_slot;
+	struct mm_slot *slot;
+	int free = 0;
+
+	spin_lock(&kscand_mm_lock);
+	slot = mm_slot_lookup(kscand_slots_hash, mm);
+	mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
+	if (mm_slot && kscand_scan.mm_slot != mm_slot) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		free = 1;
+	}
+
+	spin_unlock(&kscand_mm_lock);
+
+	if (free) {
+		mm_slot_free(kscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		mmap_write_lock(mm);
+		mmap_write_unlock(mm);
+	}
+}
+
 static int start_kscand(void)
 {
 	struct task_struct *kthread;
@@ -149,6 +228,12 @@ static int __init kscand_init(void)
 {
 	int err;
 
+	kscand_slot_cache = KMEM_CACHE(kscand_mm_slot, 0);
+
+	if (!kscand_slot_cache) {
+		pr_err("kscand: kmem_cache error");
+		return -ENOMEM;
+	}
 	err = start_kscand();
 	if (err)
 		goto err_kscand;
@@ -157,6 +242,7 @@ static int __init kscand_init(void)
 
 err_kscand:
 	stop_kscand();
+	kscand_destroy();
 
 	return err;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 09c563c95112..c9ffe65866de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -37,6 +37,7 @@
 #include <linux/perf_event.h>
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
+#include <linux/kscand.h>
 #include <linux/uprobes.h>
 #include <linux/notifier.h>
 #include <linux/memory.h>
@@ -1849,6 +1850,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	if (!retval) {
 		mt_set_in_rcu(vmi.mas.tree);
 		ksm_fork(mm, oldmm);
+		kscand_fork(mm, oldmm);
 		khugepaged_fork(mm, oldmm);
 	} else {
 
diff --git a/mm/vma_exec.c b/mm/vma_exec.c
index 2dffb02ed6a2..8576b377f7ad 100644
--- a/mm/vma_exec.c
+++ b/mm/vma_exec.c
@@ -128,6 +128,8 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
 	if (err)
 		goto err_ksm;
 
+	kscand_execve(mm);
+
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -151,6 +153,7 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
 	return 0;
 
 err:
+	kscand_exit(mm);
 	ksm_exit(mm);
 err_ksm:
 	mmap_write_unlock(mm);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 03/17] mm: Scan the mm and create a migration list
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 01/17] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 02/17] mm: Maintain mm_struct list in the system Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 04/17] mm/kscand: Add only hot pages to " Raghavendra K T
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Since we already have the list of mm_structs in the system, add code to
scan each mm: walk the VMAs of each mm_struct and scan all the pages
associated with them.

In the scan path: check for recently accessed pages (PFNs) belonging
to slowtier nodes, and add all of those to a list.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 321 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 320 insertions(+), 1 deletion(-)

diff --git a/mm/kscand.c b/mm/kscand.c
index d5b0d3041b0f..1d883d411664 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -4,10 +4,18 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/pagewalk.h>
+#include <linux/page_ext.h>
+#include <linux/page_idle.h>
+#include <linux/page_table_check.h>
+#include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
 #include <linux/kscand.h>
+#include <linux/memory-tiers.h>
+#include <linux/mempolicy.h>
 #include <linux/string.h>
 #include <linux/delay.h>
 #include <linux/cleanup.h>
@@ -18,6 +26,11 @@
 
 static struct task_struct *kscand_thread __read_mostly;
 static DEFINE_MUTEX(kscand_mutex);
+/*
+ * Total VMA size to cover during scan.
+ */
+#define KSCAND_SCAN_SIZE	(1 * 1024 * 1024 * 1024UL)
+static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
 
 /* How long to pause between two scan cycles */
 static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
@@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
 /* Per mm information collected to control VMA scanning */
 struct kscand_mm_slot {
 	struct mm_slot slot;
+	long address;
+	bool is_scanned;
 };
 
 /* Data structure to keep track of current mm under scan */
@@ -54,6 +69,29 @@ struct kscand_scan kscand_scan = {
 	.mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
 };
 
+/*
+ * Data structure passed to control scanning and also collect
+ * per memory node information
+ */
+struct kscand_scanctrl {
+	struct list_head scan_list;
+	unsigned long address;
+};
+
+struct kscand_scanctrl kscand_scanctrl;
+/* Per folio information used for migration */
+struct kscand_migrate_info {
+	struct list_head migrate_node;
+	unsigned long pfn;
+	unsigned long address;
+};
+
+static bool kscand_eligible_srcnid(int nid)
+{
+	/* Only promotion case is considered */
+	return !node_is_toptier(nid);
+}
+
 static inline int kscand_has_work(void)
 {
 	return !list_empty(&kscand_scan.mm_head);
@@ -84,11 +122,277 @@ static void kscand_wait_work(void)
 			scan_sleep_jiffies);
 }
 
+static inline bool is_valid_folio(struct folio *folio)
+{
+	if (!folio || !folio_mapped(folio) || !folio_raw_mapping(folio))
+		return false;
+
+	if (folio_test_unevictable(folio) || folio_is_zone_device(folio) ||
+		folio_maybe_mapped_shared(folio))
+		return false;
+
+	return true;
+}
+
+
+static bool folio_idle_clear_pte_refs_one(struct folio *folio,
+					 struct vm_area_struct *vma,
+					 unsigned long addr,
+					 pte_t *ptep)
+{
+	bool referenced = false;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd = pmd_off(mm, addr);
+
+	if (ptep) {
+		if (ptep_clear_young_notify(vma, addr, ptep))
+			referenced = true;
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (!pmd_present(*pmd))
+			WARN_ON_ONCE(1);
+		if (pmdp_clear_young_notify(vma, addr, pmd))
+			referenced = true;
+	} else {
+		WARN_ON_ONCE(1);
+	}
+
+	if (referenced) {
+		folio_clear_idle(folio);
+		folio_set_young(folio);
+	}
+
+	return true;
+}
+
+static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
+{
+	bool need_lock;
+	struct folio *folio =  page_folio(page);
+	unsigned long address;
+
+	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
+		return;
+
+	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
+	if (need_lock && !folio_trylock(folio))
+		return;
+	address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
+	VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
+	folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
+
+	if (need_lock)
+		folio_unlock(folio);
+}
+
+static int hot_vma_idle_pte_entry(pte_t *pte,
+				 unsigned long addr,
+				 unsigned long next,
+				 struct mm_walk *walk)
+{
+	struct page *page;
+	struct folio *folio;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct kscand_migrate_info *info;
+	struct kscand_scanctrl *scanctrl = walk->private;
+	int srcnid;
+
+	scanctrl->address = addr;
+	pte_t pteval = ptep_get(pte);
+
+	if (!pte_present(pteval))
+		return 0;
+
+	if (pte_none(pteval))
+		return 0;
+
+	vma = walk->vma;
+	mm = vma->vm_mm;
+
+	page = pte_page(*pte);
+
+
+	folio = page_folio(page);
+	folio_get(folio);
+
+	if (!is_valid_folio(folio)) {
+		folio_put(folio);
+		return 0;
+	}
+	folio_set_idle(folio);
+	page_idle_clear_pte_refs(page, pte, walk);
+	srcnid = folio_nid(folio);
+
+
+	if (!folio_test_lru(folio)) {
+		folio_put(folio);
+		return 0;
+	}
+
+	if (!kscand_eligible_srcnid(srcnid)) {
+		folio_put(folio);
+		return 0;
+	}
+	if (!folio_test_idle(folio) &&
+		(folio_test_young(folio) || folio_test_referenced(folio))) {
+
+		/* XXX: Leaking memory. TBD: consume info */
+
+		info = kzalloc(sizeof(struct kscand_migrate_info), GFP_NOWAIT);
+		if (info && scanctrl) {
+			info->pfn = folio_pfn(folio);
+			info->address = addr;
+			list_add_tail(&info->migrate_node, &scanctrl->scan_list);
+		}
+	}
+
+	folio_put(folio);
+	return 0;
+}
+
+static const struct mm_walk_ops hot_vma_set_idle_ops = {
+	.pte_entry = hot_vma_idle_pte_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static void kscand_walk_page_vma(struct vm_area_struct *vma, struct kscand_scanctrl *scanctrl)
+{
+	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+	    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+		return;
+	}
+	if (!vma->vm_mm ||
+	    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+		return;
+
+	if (!vma_is_accessible(vma))
+		return;
+
+	walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
+}
+
 static inline int kscand_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
+static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
+{
+	struct mm_slot *slot = &mm_slot->slot;
+	struct mm_struct *mm = slot->mm;
+
+	lockdep_assert_held(&kscand_mm_lock);
+
+	if (kscand_test_exit(mm)) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+
+		mm_slot_free(kscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static unsigned long kscand_scan_mm_slot(void)
+{
+	bool next_mm = false;
+	bool update_mmslot_info = false;
+
+	unsigned long vma_scanned_size = 0;
+	unsigned long address;
+
+	struct mm_slot *slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma = NULL;
+	struct kscand_mm_slot *mm_slot;
+
+
+	spin_lock(&kscand_mm_lock);
+
+	if (kscand_scan.mm_slot) {
+		mm_slot = kscand_scan.mm_slot;
+		slot = &mm_slot->slot;
+		address = mm_slot->address;
+	} else {
+		slot = list_entry(kscand_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
+		address = mm_slot->address;
+		kscand_scan.mm_slot = mm_slot;
+	}
+
+	mm = slot->mm;
+	mm_slot->is_scanned = true;
+	spin_unlock(&kscand_mm_lock);
+
+	if (unlikely(!mmap_read_trylock(mm)))
+		goto outerloop_mmap_lock;
+
+	if (unlikely(kscand_test_exit(mm))) {
+		next_mm = true;
+		goto outerloop;
+	}
+
+	VMA_ITERATOR(vmi, mm, address);
+
+	for_each_vma(vmi, vma) {
+		kscand_walk_page_vma(vma, &kscand_scanctrl);
+		vma_scanned_size += vma->vm_end - vma->vm_start;
+
+		if (vma_scanned_size >= kscand_scan_size) {
+			next_mm = true;
+			/* TBD: Add scanned folios to migration list */
+			break;
+		}
+	}
+
+	if (!vma)
+		address = 0;
+	else
+		address = kscand_scanctrl.address + PAGE_SIZE;
+
+	update_mmslot_info = true;
+
+	if (update_mmslot_info)
+		mm_slot->address = address;
+
+outerloop:
+	/* exit_mmap will destroy ptes after this */
+	mmap_read_unlock(mm);
+
+outerloop_mmap_lock:
+	spin_lock(&kscand_mm_lock);
+	WARN_ON(kscand_scan.mm_slot != mm_slot);
+
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (unlikely(kscand_test_exit(mm)) || !vma || next_mm) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * kscand runs here, kscand_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (slot->mm_node.next != &kscand_scan.mm_head) {
+			slot = list_entry(slot->mm_node.next,
+					struct mm_slot, mm_node);
+			kscand_scan.mm_slot =
+				mm_slot_entry(slot, struct kscand_mm_slot, slot);
+
+		} else
+			kscand_scan.mm_slot = NULL;
+
+		if (kscand_test_exit(mm)) {
+			kscand_collect_mm_slot(mm_slot);
+			goto end;
+		}
+	}
+	mm_slot->is_scanned = false;
+end:
+	spin_unlock(&kscand_mm_lock);
+	return 0;
+}
+
 static void kscand_do_scan(void)
 {
 	unsigned long iter = 0, mms_to_scan;
@@ -101,7 +405,7 @@ static void kscand_do_scan(void)
 			break;
 
 		if (kscand_has_work())
-			msleep(100);
+			kscand_scan_mm_slot();
 
 		iter++;
 
@@ -148,6 +452,7 @@ void __kscand_enter(struct mm_struct *mm)
 	if (!kscand_slot)
 		return;
 
+	kscand_slot->address = 0;
 	slot = &kscand_slot->slot;
 
 	spin_lock(&kscand_mm_lock);
@@ -175,6 +480,12 @@ void __kscand_exit(struct mm_struct *mm)
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
 		free = 1;
+	} else if (mm_slot && kscand_scan.mm_slot == mm_slot && !mm_slot->is_scanned) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		free = 1;
+		/* TBD: Set the actual next slot */
+		kscand_scan.mm_slot = NULL;
 	}
 
 	spin_unlock(&kscand_mm_lock);
@@ -224,6 +535,12 @@ static int stop_kscand(void)
 	return 0;
 }
 
+static inline void init_list(void)
+{
+	INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
+	init_waitqueue_head(&kscand_wait);
+}
+
 static int __init kscand_init(void)
 {
 	int err;
@@ -234,6 +551,8 @@ static int __init kscand_init(void)
 		pr_err("kscand: kmem_cache error");
 		return -ENOMEM;
 	}
+
+	init_list();
 	err = start_kscand();
 	if (err)
 		goto err_kscand;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 04/17] mm/kscand: Add only hot pages to migration list
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (2 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 03/17] mm: Scan the mm and create a migration list Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 05/17] mm: Create a separate kthread for migration Raghavendra K T
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Previously, all pages accessed once were added.
Improve this by adding only those that are accessed a second time.

This logic is closer to the current NUMAB implementation's way
of spotting hot pages.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/kscand.c b/mm/kscand.c
index 1d883d411664..7552ce32beea 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -196,6 +196,7 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 	struct kscand_migrate_info *info;
 	struct kscand_scanctrl *scanctrl = walk->private;
 	int srcnid;
+	bool prev_idle;
 
 	scanctrl->address = addr;
 	pte_t pteval = ptep_get(pte);
@@ -219,6 +220,7 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 		folio_put(folio);
 		return 0;
 	}
+	prev_idle = folio_test_idle(folio);
 	folio_set_idle(folio);
 	page_idle_clear_pte_refs(page, pte, walk);
 	srcnid = folio_nid(folio);
@@ -233,7 +235,7 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 		folio_put(folio);
 		return 0;
 	}
-	if (!folio_test_idle(folio) &&
+	if (!folio_test_idle(folio) && !prev_idle &&
 		(folio_test_young(folio) || folio_test_referenced(folio))) {
 
 		/* XXX: Leaking memory. TBD: consume info */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 05/17] mm: Create a separate kthread for migration
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (3 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 04/17] mm/kscand: Add only hot pages to " Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 06/17] mm/migration: migrate accessed folios to toptier node Raghavendra K T
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Having an independent thread helps in:
 - Alleviating the need for multiple scanning threads
 - Controlling batch migration (TBD)
 - Migration throttling (TBD)

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/mm/kscand.c b/mm/kscand.c
index 7552ce32beea..55efd0a6e5ba 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -4,6 +4,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
 #include <linux/rmap.h>
 #include <linux/pagewalk.h>
 #include <linux/page_ext.h>
@@ -41,6 +42,15 @@ static unsigned long kscand_mms_to_scan __read_mostly = KSCAND_MMS_TO_SCAN;
 
 bool kscand_scan_enabled = true;
 static bool need_wakeup;
+static bool migrated_need_wakeup;
+
+/* How long to pause between two migration cycles */
+static unsigned int kmigrate_sleep_ms __read_mostly = 20;
+
+static struct task_struct *kmigrated_thread __read_mostly;
+static DEFINE_MUTEX(kmigrated_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(kmigrated_wait);
+static unsigned long kmigrated_sleep_expire;
 
 static unsigned long kscand_sleep_expire;
 
@@ -79,6 +89,7 @@ struct kscand_scanctrl {
 };
 
 struct kscand_scanctrl kscand_scanctrl;
+
 /* Per folio information used for migration */
 struct kscand_migrate_info {
 	struct list_head migrate_node;
@@ -134,6 +145,19 @@ static inline bool is_valid_folio(struct folio *folio)
 	return true;
 }
 
+static inline void kmigrated_wait_work(void)
+{
+	const unsigned long migrate_sleep_jiffies =
+		msecs_to_jiffies(kmigrate_sleep_ms);
+
+	if (!migrate_sleep_jiffies)
+		return;
+
+	kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
+	wait_event_timeout(kmigrated_wait,
+			true,
+			migrate_sleep_jiffies);
+}
 
 static bool folio_idle_clear_pte_refs_one(struct folio *folio,
 					 struct vm_area_struct *vma,
@@ -537,6 +561,49 @@ static int stop_kscand(void)
 	return 0;
 }
 
+static int kmigrated(void *arg)
+{
+	while (true) {
+		WRITE_ONCE(migrated_need_wakeup, false);
+		if (unlikely(kthread_should_stop()))
+			break;
+		msleep(20);
+		kmigrated_wait_work();
+	}
+	return 0;
+}
+
+static int start_kmigrated(void)
+{
+	struct task_struct *kthread;
+
+	guard(mutex)(&kmigrated_mutex);
+
+	/* Someone already succeeded in starting daemon */
+	if (kmigrated_thread)
+		return 0;
+
+	kthread = kthread_run(kmigrated, NULL, "kmigrated");
+	if (IS_ERR(kthread)) {
+		pr_err("kmigrated: kthread_run(kmigrated) failed\n");
+		return PTR_ERR(kthread);
+	}
+
+	kmigrated_thread = kthread;
+	pr_info("kmigrated: Successfully started kmigrated\n");
+
+	wake_up_interruptible(&kmigrated_wait);
+
+	return 0;
+}
+
+static int stop_kmigrated(void)
+{
+	guard(mutex)(&kmigrated_mutex);
+	if (kmigrated_thread) {
+		kthread_stop(kmigrated_thread);
+		kmigrated_thread = NULL;
+	}
+	return 0;
+}
+
 static inline void init_list(void)
 {
 	INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
@@ -559,8 +626,15 @@ static int __init kscand_init(void)
 	if (err)
 		goto err_kscand;
 
+	err = start_kmigrated();
+	if (err)
+		goto err_kmigrated;
+
 	return 0;
 
+err_kmigrated:
+	stop_kmigrated();
+
 err_kscand:
 	stop_kscand();
 	kscand_destroy();
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 06/17] mm/migration: migrate accessed folios to toptier node
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (4 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 05/17] mm: Create a separate kthread for migration Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 07/17] mm: Add throttling of mm scanning using scan_period Raghavendra K T
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

A per-mm migration list is added, and a kernel thread iterates over
each of them.

For each recently accessed slowtier folio in the migration list:
 - Isolate the LRU page
 - Migrate it to a regular node.

The rationale behind the whole migration is to speed up access to
recently accessed pages.

Currently, the PTE A bit scanning approach lacks information about the
exact destination node to migrate to.

Reason: PROT_NONE hint-fault based scanning is done in process context.
There, when the fault occurs, the source CPU of the faulting task is
known, and the time of the page access is also accurate.
Lacking that information, migration is done to node 0 by default.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/migrate.h |   2 +
 mm/kscand.c             | 448 +++++++++++++++++++++++++++++++++++++++-
 mm/migrate.c            |   2 +-
 3 files changed, 440 insertions(+), 12 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aaa2114498d6..59547f72d150 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -142,6 +142,8 @@ const struct movable_operations *page_movable_ops(struct page *page)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   unsigned long nr_migrate_pages);
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
diff --git a/mm/kscand.c b/mm/kscand.c
index 55efd0a6e5ba..5cd2764114df 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -52,9 +52,18 @@ static DEFINE_MUTEX(kmigrated_mutex);
 static DECLARE_WAIT_QUEUE_HEAD(kmigrated_wait);
 static unsigned long kmigrated_sleep_expire;
 
+/* mm of the migrating folio entry */
+static struct mm_struct *kmigrated_cur_mm;
+
+/* Migration list is manipulated underneath because of mm_exit */
+static bool  kmigrated_clean_list;
+
 static unsigned long kscand_sleep_expire;
+#define KSCAND_DEFAULT_TARGET_NODE	(0)
+static int kscand_target_node = KSCAND_DEFAULT_TARGET_NODE;
 
 static DEFINE_SPINLOCK(kscand_mm_lock);
+static DEFINE_SPINLOCK(kscand_migrate_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
 
 #define KSCAND_SLOT_HASH_BITS 10
@@ -62,6 +71,10 @@ static DEFINE_READ_MOSTLY_HASHTABLE(kscand_slots_hash, KSCAND_SLOT_HASH_BITS);
 
 static struct kmem_cache *kscand_slot_cache __read_mostly;
 
+#define KMIGRATED_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kmigrated_slots_hash, KMIGRATED_SLOT_HASH_BITS);
+static struct kmem_cache *kmigrated_slot_cache __read_mostly;
+
 /* Per mm information collected to control VMA scanning */
 struct kscand_mm_slot {
 	struct mm_slot slot;
@@ -90,6 +103,26 @@ struct kscand_scanctrl {
 
 struct kscand_scanctrl kscand_scanctrl;
 
+/* Per mm migration list */
+struct kmigrated_mm_slot {
+	/* Tracks mm that has non empty migration list */
+	struct mm_slot mm_slot;
+	/* Per mm lock used to synchronize migration list */
+	spinlock_t migrate_lock;
+	/* Head of per mm migration list */
+	struct list_head migrate_head;
+};
+
+/* System wide list of mms that maintain migration list */
+struct kmigrated_daemon {
+	struct list_head mm_head;
+	struct kmigrated_mm_slot *mm_slot;
+};
+
+struct kmigrated_daemon kmigrated_daemon = {
+	.mm_head = LIST_HEAD_INIT(kmigrated_daemon.mm_head),
+};
+
 /* Per folio information used for migration */
 struct kscand_migrate_info {
 	struct list_head migrate_node;
@@ -108,6 +141,11 @@ static inline int kscand_has_work(void)
 	return !list_empty(&kscand_scan.mm_head);
 }
 
+static inline int kmigrated_has_work(void)
+{
+	return !list_empty(&kmigrated_daemon.mm_head);
+}
+
 static inline bool kscand_should_wakeup(void)
 {
 	bool wakeup = kthread_should_stop() || need_wakeup ||
@@ -118,6 +156,16 @@ static inline bool kscand_should_wakeup(void)
 	return wakeup;
 }
 
+static inline bool kmigrated_should_wakeup(void)
+{
+	bool wakeup = kthread_should_stop() || migrated_need_wakeup ||
+	       time_after_eq(jiffies, kmigrated_sleep_expire);
+
+	migrated_need_wakeup = false;
+
+	return wakeup;
+}
+
 static void kscand_wait_work(void)
 {
 	const unsigned long scan_sleep_jiffies =
@@ -133,6 +181,85 @@ static void kscand_wait_work(void)
 			scan_sleep_jiffies);
 }
 
+static void kmigrated_wait_work(void)
+{
+	const unsigned long migrate_sleep_jiffies =
+		msecs_to_jiffies(kmigrate_sleep_ms);
+
+	if (!migrate_sleep_jiffies)
+		return;
+
+	kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
+	wait_event_timeout(kmigrated_wait, kmigrated_should_wakeup(),
+			migrate_sleep_jiffies);
+}
+
+/*
+ * Do not yet know what info to pass in the future to make a
+ * decision on the target node. Keep it void * for now.
+ */
+static int kscand_get_target_node(void *data)
+{
+	return kscand_target_node;
+}
+
+extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+					unsigned long nr_migrate_pages);
+
+/*XXX: Taken from migrate.c to avoid NUMAB mode=2 and NULL vma checks*/
+static int kscand_migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
+{
+	int nr_pages = folio_nr_pages(folio);
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (folio_is_file_lru(folio)) {
+		/*
+		 * Do not migrate file folios that are mapped in multiple
+		 * processes with execute permissions as they are probably
+		 * shared libraries.
+		 *
+		 * See folio_maybe_mapped_shared() on possible imprecision
+		 * when we cannot easily detect if a folio is shared.
+		 */
+		if (vma && (vma->vm_flags & VM_EXEC) &&
+		    folio_maybe_mapped_shared(folio))
+			return -EACCES;
+		/*
+		 * Do not migrate dirty folios as not all filesystems can move
+		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
+		 * cycles.
+		 */
+		if (folio_test_dirty(folio))
+			return -EAGAIN;
+	}
+
+	/* Avoid migrating to a node that is nearly full */
+	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
+		int z;
+
+		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+			if (managed_zone(pgdat->node_zones + z))
+				break;
+		}
+
+		if (z < 0)
+			return -EAGAIN;
+
+		wakeup_kswapd(pgdat->node_zones + z, 0,
+			      folio_order(folio), ZONE_MOVABLE);
+		return -EAGAIN;
+	}
+
+	if (!folio_isolate_lru(folio))
+		return -EAGAIN;
+
+	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
+			    nr_pages);
+
+	return 0;
+}
+
 static inline bool is_valid_folio(struct folio *folio)
 {
 	if (!folio || !folio_mapped(folio) || !folio_raw_mapping(folio))
@@ -145,18 +272,113 @@ static inline bool is_valid_folio(struct folio *folio)
 	return true;
 }
 
-static inline void kmigrated_wait_work(void)
+enum kscand_migration_err {
+	KSCAND_NULL_MM = 1,
+	KSCAND_EXITING_MM,
+	KSCAND_INVALID_FOLIO,
+	KSCAND_NONLRU_FOLIO,
+	KSCAND_INELIGIBLE_SRC_NODE,
+	KSCAND_SAME_SRC_DEST_NODE,
+	KSCAND_PTE_NOT_PRESENT,
+	KSCAND_PMD_NOT_PRESENT,
+	KSCAND_NO_PTE_OFFSET_MAP_LOCK,
+	KSCAND_NOT_HOT_PAGE,
+	KSCAND_LRU_ISOLATION_ERR,
+};
+
+
+static bool is_hot_page(struct folio *folio)
 {
-	const unsigned long migrate_sleep_jiffies =
-		msecs_to_jiffies(kmigrate_sleep_ms);
+	bool ret = false;
 
-	if (!migrate_sleep_jiffies)
-		return;
+	if (!folio_test_idle(folio))
+		ret = folio_test_referenced(folio) || folio_test_young(folio);
 
-	kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
-	wait_event_timeout(kmigrated_wait,
-			true,
-			migrate_sleep_jiffies);
+	return ret;
+}
+
+static int kmigrated_promote_folio(struct kscand_migrate_info *info,
+					struct mm_struct *mm,
+					int destnid)
+{
+	unsigned long pfn;
+	unsigned long address;
+	struct page *page;
+	struct folio *folio = NULL;
+	int ret;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+	pmd_t pmde;
+	int srcnid;
+
+	if (mm == NULL)
+		return KSCAND_NULL_MM;
+
+	if (mm == READ_ONCE(kmigrated_cur_mm) &&
+		READ_ONCE(kmigrated_clean_list)) {
+		WARN_ON_ONCE(mm);
+		return KSCAND_EXITING_MM;
+	}
+
+	pfn = info->pfn;
+	address = info->address;
+	page = pfn_to_online_page(pfn);
+
+	if (page)
+		folio = page_folio(page);
+
+	if (!page || PageTail(page) || !is_valid_folio(folio))
+		return KSCAND_INVALID_FOLIO;
+
+	if (!folio_test_lru(folio))
+		return KSCAND_NONLRU_FOLIO;
+
+	if (!is_hot_page(folio))
+		return KSCAND_NOT_HOT_PAGE;
+
+	folio_get(folio);
+
+	srcnid = folio_nid(folio);
+
+	/* Do not try to promote pages from regular nodes */
+	if (!kscand_eligible_srcnid(srcnid)) {
+		folio_put(folio);
+		return KSCAND_INELIGIBLE_SRC_NODE;
+	}
+
+	/* Also happen when it is already migrated */
+	if (srcnid == destnid) {
+		folio_put(folio);
+		return KSCAND_SAME_SRC_DEST_NODE;
+	}
+
+	address = info->address;
+	pmd = pmd_off(mm, address);
+	pmde = pmdp_get(pmd);
+
+	if (!pmd_present(pmde)) {
+		folio_put(folio);
+		return KSCAND_PMD_NOT_PRESENT;
+	}
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte) {
+		folio_put(folio);
+		WARN_ON_ONCE(!pte);
+		return KSCAND_NO_PTE_OFFSET_MAP_LOCK;
+	}
+
+	ret = kscand_migrate_misplaced_folio_prepare(folio, NULL, destnid);
+
+	folio_put(folio);
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret)
+		return KSCAND_LRU_ISOLATION_ERR;
+
+
+	return  migrate_misplaced_folio(folio, destnid);
 }
 
 static bool folio_idle_clear_pte_refs_one(struct folio *folio,
@@ -302,6 +524,115 @@ static inline int kscand_test_exit(struct mm_struct *mm)
 	return atomic_read(&mm->mm_users) == 0;
 }
 
+struct destroy_list_work {
+	struct list_head migrate_head;
+	struct work_struct dwork;
+};
+
+static void kmigrated_destroy_list_fn(struct work_struct *work)
+{
+	struct destroy_list_work *dlw;
+	struct kscand_migrate_info *info, *tmp;
+
+	dlw = container_of(work, struct destroy_list_work, dwork);
+
+	if (!list_empty(&dlw->migrate_head)) {
+		list_for_each_entry_safe(info, tmp, &dlw->migrate_head,	migrate_node) {
+			list_del(&info->migrate_node);
+			kfree(info);
+		}
+	}
+
+	kfree(dlw);
+}
+
+static void kmigrated_destroy_list(struct list_head *list_head)
+{
+	struct destroy_list_work *destroy_list_work;
+
+
+	destroy_list_work = kmalloc(sizeof(*destroy_list_work), GFP_KERNEL);
+	if (!destroy_list_work)
+		return;
+
+	INIT_LIST_HEAD(&destroy_list_work->migrate_head);
+	list_splice_tail_init(list_head, &destroy_list_work->migrate_head);
+	INIT_WORK(&destroy_list_work->dwork, kmigrated_destroy_list_fn);
+	schedule_work(&destroy_list_work->dwork);
+}
+
+static struct kmigrated_mm_slot *kmigrated_get_mm_slot(struct mm_struct *mm, bool alloc)
+{
+	struct kmigrated_mm_slot *mm_slot = NULL;
+	struct mm_slot *slot;
+
+	guard(spinlock)(&kscand_migrate_lock);
+
+	slot = mm_slot_lookup(kmigrated_slots_hash, mm);
+	mm_slot = mm_slot_entry(slot, struct kmigrated_mm_slot, mm_slot);
+
+	if (!mm_slot && alloc) {
+		mm_slot = mm_slot_alloc(kmigrated_slot_cache);
+		if (!mm_slot) {
+			/* kscand_migrate_lock is dropped by guard() on return */
+			return NULL;
+		}
+
+		slot = &mm_slot->mm_slot;
+		INIT_LIST_HEAD(&mm_slot->migrate_head);
+		spin_lock_init(&mm_slot->migrate_lock);
+		mm_slot_insert(kmigrated_slots_hash, mm, slot);
+		list_add_tail(&slot->mm_node, &kmigrated_daemon.mm_head);
+	}
+
+	return mm_slot;
+}
+
+static void kscand_cleanup_migration_list(struct mm_struct *mm)
+{
+	struct kmigrated_mm_slot *mm_slot;
+	struct mm_slot *slot;
+
+	mm_slot = kmigrated_get_mm_slot(mm, false);
+
+	slot = &mm_slot->mm_slot;
+
+	if (mm_slot && slot && slot->mm == mm) {
+		spin_lock(&mm_slot->migrate_lock);
+
+		if (!list_empty(&mm_slot->migrate_head)) {
+			if (mm == READ_ONCE(kmigrated_cur_mm)) {
+				/* A folio in this mm is being migrated. wait */
+				WRITE_ONCE(kmigrated_clean_list, true);
+			}
+
+			kmigrated_destroy_list(&mm_slot->migrate_head);
+			spin_unlock(&mm_slot->migrate_lock);
+retry:
+			if (!spin_trylock(&mm_slot->migrate_lock)) {
+				cpu_relax();
+				goto retry;
+			}
+
+			if (mm == READ_ONCE(kmigrated_cur_mm)) {
+				spin_unlock(&mm_slot->migrate_lock);
+				goto retry;
+			}
+		}
+		/* Reset migrated mm_slot if it was pointing to us */
+		if (kmigrated_daemon.mm_slot == mm_slot)
+			kmigrated_daemon.mm_slot = NULL;
+
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		mm_slot_free(kmigrated_slot_cache, mm_slot);
+
+		WRITE_ONCE(kmigrated_clean_list, false);
+
+		spin_unlock(&mm_slot->migrate_lock);
+		}
+}
+
 static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
 {
 	struct mm_slot *slot = &mm_slot->slot;
@@ -313,11 +644,77 @@ static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
 
+		kscand_cleanup_migration_list(mm);
+
 		mm_slot_free(kscand_slot_cache, mm_slot);
 		mmdrop(mm);
 	}
 }
 
+static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
+{
+	int ret = 0, dest = -1;
+	struct mm_slot *slot;
+	struct mm_struct *mm;
+	struct kscand_migrate_info *info, *tmp;
+
+	spin_lock(&mm_slot->migrate_lock);
+
+	slot = &mm_slot->mm_slot;
+	mm = slot->mm;
+
+	if (!list_empty(&mm_slot->migrate_head)) {
+		list_for_each_entry_safe(info, tmp, &mm_slot->migrate_head,
+				migrate_node) {
+			if (READ_ONCE(kmigrated_clean_list))
+				goto clean_list_handled;
+
+			list_del(&info->migrate_node);
+
+			spin_unlock(&mm_slot->migrate_lock);
+
+			dest = kscand_get_target_node(NULL);
+			ret = kmigrated_promote_folio(info, mm, dest);
+
+			kfree(info);
+
+			cond_resched();
+			spin_lock(&mm_slot->migrate_lock);
+		}
+	}
+clean_list_handled:
+	/* Reset mm of the folio entry we are migrating */
+	WRITE_ONCE(kmigrated_cur_mm, NULL);
+	spin_unlock(&mm_slot->migrate_lock);
+}
+
+static void kmigrated_migrate_folio(void)
+{
+	/* for each mm do migrate */
+	struct kmigrated_mm_slot *kmigrated_mm_slot = NULL;
+	struct mm_slot *slot;
+
+	if (!list_empty(&kmigrated_daemon.mm_head)) {
+
+		scoped_guard (spinlock, &kscand_migrate_lock) {
+			if (kmigrated_daemon.mm_slot) {
+				kmigrated_mm_slot = kmigrated_daemon.mm_slot;
+			} else {
+				slot = list_entry(kmigrated_daemon.mm_head.next,
+						struct mm_slot, mm_node);
+
+				kmigrated_mm_slot = mm_slot_entry(slot,
+						struct kmigrated_mm_slot, mm_slot);
+				kmigrated_daemon.mm_slot = kmigrated_mm_slot;
+			}
+			WRITE_ONCE(kmigrated_cur_mm, kmigrated_mm_slot->mm_slot.mm);
+		}
+
+		if (kmigrated_mm_slot)
+			kmigrated_migrate_mm(kmigrated_mm_slot);
+	}
+}
+
 static unsigned long kscand_scan_mm_slot(void)
 {
 	bool next_mm = false;
@@ -331,6 +728,7 @@ static unsigned long kscand_scan_mm_slot(void)
 	struct vm_area_struct *vma = NULL;
 	struct kscand_mm_slot *mm_slot;
 
+	struct kmigrated_mm_slot *kmigrated_mm_slot = NULL;
 
 	spin_lock(&kscand_mm_lock);
 
@@ -360,13 +758,23 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	VMA_ITERATOR(vmi, mm, address);
 
+	kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
+
 	for_each_vma(vmi, vma) {
 		kscand_walk_page_vma(vma, &kscand_scanctrl);
 		vma_scanned_size += vma->vm_end - vma->vm_start;
 
 		if (vma_scanned_size >= kscand_scan_size) {
 			next_mm = true;
-			/* TBD: Add scanned folios to migration list */
+
+			if (!list_empty(&kscand_scanctrl.scan_list)) {
+				if (!kmigrated_mm_slot)
+					kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+				spin_lock(&kmigrated_mm_slot->migrate_lock);
+				list_splice_tail_init(&kscand_scanctrl.scan_list,
+						&kmigrated_mm_slot->migrate_head);
+				spin_unlock(&kmigrated_mm_slot->migrate_lock);
+			}
 			break;
 		}
 	}
@@ -462,6 +870,8 @@ static int kscand(void *none)
 static inline void kscand_destroy(void)
 {
 	kmem_cache_destroy(kscand_slot_cache);
+	/* XXX: move below to kmigrated thread */
+	kmem_cache_destroy(kmigrated_slot_cache);
 }
 
 void __kscand_enter(struct mm_struct *mm)
@@ -497,7 +907,7 @@ void __kscand_exit(struct mm_struct *mm)
 {
 	struct kscand_mm_slot *mm_slot;
 	struct mm_slot *slot;
-	int free = 0;
+	int free = 0, serialize = 1;
 
 	spin_lock(&kscand_mm_lock);
 	slot = mm_slot_lookup(kscand_slots_hash, mm);
@@ -512,10 +922,15 @@ void __kscand_exit(struct mm_struct *mm)
 		free = 1;
 		/* TBD: Set the actual next slot */
 		kscand_scan.mm_slot = NULL;
+	} else if (mm_slot && kscand_scan.mm_slot == mm_slot && mm_slot->is_scanned) {
+		serialize = 0;
 	}
 
 	spin_unlock(&kscand_mm_lock);
 
+	if (serialize)
+		kscand_cleanup_migration_list(mm);
+
 	if (free) {
 		mm_slot_free(kscand_slot_cache, mm_slot);
 		mmdrop(mm);
@@ -567,6 +982,8 @@ static int kmigrated(void *arg)
 		WRITE_ONCE(migrated_need_wakeup, false);
 		if (unlikely(kthread_should_stop()))
 			break;
+		if (kmigrated_has_work())
+			kmigrated_migrate_folio();
 		msleep(20);
 		kmigrated_wait_work();
 	}
@@ -607,7 +1024,9 @@ static int stop_kmigrated(void)
 static inline void init_list(void)
 {
 	INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
+	spin_lock_init(&kscand_migrate_lock);
 	init_waitqueue_head(&kscand_wait);
+	init_waitqueue_head(&kmigrated_wait);
 }
 
 static int __init kscand_init(void)
@@ -621,6 +1040,13 @@ static int __init kscand_init(void)
 		return -ENOMEM;
 	}
 
+	kmigrated_slot_cache = KMEM_CACHE(kmigrated_mm_slot, 0);
+
+	if (!kmigrated_slot_cache) {
+		pr_err("kmigrated: kmem_cache error\n");
+		return -ENOMEM;
+	}
+
 	init_list();
 	err = start_kscand();
 	if (err)
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c88f3b33833..1f74dd5e6776 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2541,7 +2541,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is crude.
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 				   unsigned long nr_migrate_pages)
 {
 	int z;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 07/17] mm: Add throttling of mm scanning using scan_period
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (5 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 06/17] mm/migration: migrate accessed folios to toptier node Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 08/17] mm: Add throttling of mm scanning using scan_size Raghavendra K T
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Before this patch, each task's mm is scanned continuously and at the
same rate.

Improve that by adding throttling logic:
1) If useful pages were found in both the last scan and the current
scan, decrease scan_period (to increase the scan rate) by
SCAN_PERIOD_TUNE_PERCENT (15%).

2) If no useful pages were found in the last scan but there are
candidate migration pages in the current scan, decrease scan_period
aggressively by a factor of 2^SCAN_PERIOD_CHANGE_SCALE (2^3 = 8 for
now).

The opposite adjustments are made in the reverse cases.
The scan period is clamped between MIN (600ms) and MAX (5sec).
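
For illustration, a minimal sketch of the tuning rule described above
(constants mirror this patch; the helper and variable names here are
made up, the in-kernel implementation is kscand_update_mmslot_info()
in the diff below):

#define SCAN_PERIOD_MIN_MS	 600U
#define SCAN_PERIOD_MAX_MS	5000U
#define TUNE_PERCENT		  15U
#define CHANGE_SCALE		   3U

/* last_delta/cur_delta: useful pages found in the last/current scan */
static unsigned int tune_scan_period(unsigned int period_ms,
				     unsigned long last_delta,
				     unsigned long cur_delta)
{
	if (!last_delta && !cur_delta)
		period_ms = period_ms * (100 + TUNE_PERCENT) / 100; /* slow down */
	else if (last_delta && cur_delta)
		period_ms = period_ms * (100 - TUNE_PERCENT) / 100; /* speed up */
	else if (last_delta && !cur_delta)
		period_ms <<= CHANGE_SCALE;	/* back off aggressively (x8) */
	else
		period_ms >>= CHANGE_SCALE;	/* ramp up aggressively (/8) */

	if (period_ms < SCAN_PERIOD_MIN_MS)
		period_ms = SCAN_PERIOD_MIN_MS;
	if (period_ms > SCAN_PERIOD_MAX_MS)
		period_ms = SCAN_PERIOD_MAX_MS;

	return period_ms;
}

Starting from the 2sec default, repeated useful scans step the period
down 2000 -> 1700 -> 1445 -> ... until the 600ms floor is reached.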

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/mm/kscand.c b/mm/kscand.c
index 5cd2764114df..843069048c61 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -20,6 +20,7 @@
 #include <linux/string.h>
 #include <linux/delay.h>
 #include <linux/cleanup.h>
+#include <linux/minmax.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -33,6 +34,16 @@ static DEFINE_MUTEX(kscand_mutex);
 #define KSCAND_SCAN_SIZE	(1 * 1024 * 1024 * 1024UL)
 static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
 
+/*
+ * Scan period for each mm.
+ * Min: 600ms default: 2sec Max: 5sec
+ */
+#define KSCAND_SCAN_PERIOD_MAX	5000U
+#define KSCAND_SCAN_PERIOD_MIN	600U
+#define KSCAND_SCAN_PERIOD		2000U
+
+static unsigned int kscand_mm_scan_period_ms __read_mostly = KSCAND_SCAN_PERIOD;
+
 /* How long to pause between two scan cycles */
 static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
 
@@ -78,6 +89,11 @@ static struct kmem_cache *kmigrated_slot_cache __read_mostly;
 /* Per mm information collected to control VMA scanning */
 struct kscand_mm_slot {
 	struct mm_slot slot;
+	/* Unit: ms. Determines how often the mm scan should happen. */
+	unsigned int scan_period;
+	unsigned long next_scan;
+	/* Tracks how many useful pages obtained for migration in the last scan */
+	unsigned long scan_delta;
 	long address;
 	bool is_scanned;
 };
@@ -715,13 +731,92 @@ static void kmigrated_migrate_folio(void)
 	}
 }
 
+/*
+ * This is the normal change percentage when old and new delta remain same.
+ * i.e., either both positive or both zero.
+ */
+#define SCAN_PERIOD_TUNE_PERCENT	15
+
+/* This is to change the scan_period aggressively when deltas are different */
+#define SCAN_PERIOD_CHANGE_SCALE	3
+/*
+ * XXX: Hack to prevent unmigrated pages coming again and again while scanning.
+ * Actual fix needs to identify the type of unmigrated pages OR consider migration
+ * failures in next scan.
+ */
+#define KSCAND_IGNORE_SCAN_THR	256
+
+/* Maintains stability of scan_period by decaying last time accessed pages */
+#define SCAN_DECAY_SHIFT	4
+/*
+ * X : Number of useful pages in the last scan.
+ * Y : Number of useful pages found in current scan.
+ * Tuning scan_period:
+ *	Initial scan_period is 2s.
+ *	case 1: (X = 0, Y = 0)
+ *		Increase scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ *	case 2: (X = 0, Y > 0)
+ *		Decrease scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 3: (X > 0, Y = 0)
+ *		Increase scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 4: (X > 0, Y > 0)
+ *		Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ */
+static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
+				unsigned long total)
+{
+	unsigned int scan_period;
+	unsigned long now;
+	unsigned long old_scan_delta;
+
+	scan_period = mm_slot->scan_period;
+	old_scan_delta = mm_slot->scan_delta;
+
+	/* decay old value */
+	total = (old_scan_delta >> SCAN_DECAY_SHIFT) + total;
+
+	/* XXX: Hack to get rid of continuously failing/unmigrateable pages */
+	if (total < KSCAND_IGNORE_SCAN_THR)
+		total = 0;
+
+	/*
+	 * case 1: old_scan_delta and new delta are similar, (slow) TUNE_PERCENT used.
+	 * case 2: old_scan_delta and new delta are different. (fast) CHANGE_SCALE used.
+	 * TBD:
+	 * 1. Further tune scan_period based on delta between last and current scan delta.
+	 * 2. Optimize calculation
+	 */
+	if (!old_scan_delta && !total) {
+		scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+	} else if (old_scan_delta && total) {
+		scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+	} else if (old_scan_delta && !total) {
+		scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
+	} else {
+		scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+	}
+
+	scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+
+	now = jiffies;
+	mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
+	mm_slot->scan_period = scan_period;
+	mm_slot->scan_delta = total;
+}
+
 static unsigned long kscand_scan_mm_slot(void)
 {
 	bool next_mm = false;
 	bool update_mmslot_info = false;
 
+	unsigned int mm_slot_scan_period;
+	unsigned long now;
+	unsigned long mm_slot_next_scan;
 	unsigned long vma_scanned_size = 0;
 	unsigned long address;
+	unsigned long total = 0;
 
 	struct mm_slot *slot;
 	struct mm_struct *mm;
@@ -746,6 +841,8 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	mm = slot->mm;
 	mm_slot->is_scanned = true;
+	mm_slot_next_scan = mm_slot->next_scan;
+	mm_slot_scan_period = mm_slot->scan_period;
 	spin_unlock(&kscand_mm_lock);
 
 	if (unlikely(!mmap_read_trylock(mm)))
@@ -756,6 +853,11 @@ static unsigned long kscand_scan_mm_slot(void)
 		goto outerloop;
 	}
 
+	now = jiffies;
+
+	if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
+		goto outerloop;
+
 	VMA_ITERATOR(vmi, mm, address);
 
 	kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
@@ -786,8 +888,10 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	update_mmslot_info = true;
 
-	if (update_mmslot_info)
+	if (update_mmslot_info) {
 		mm_slot->address = address;
+		kscand_update_mmslot_info(mm_slot, total);
+	}
 
 outerloop:
 	/* exit_mmap will destroy ptes after this */
@@ -889,6 +993,10 @@ void __kscand_enter(struct mm_struct *mm)
 		return;
 
 	kscand_slot->address = 0;
+	kscand_slot->scan_period = kscand_mm_scan_period_ms;
+	kscand_slot->next_scan = 0;
+	kscand_slot->scan_delta = 0;
+
 	slot = &kscand_slot->slot;
 
 	spin_lock(&kscand_mm_lock);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 08/17] mm: Add throttling of mm scanning using scan_size
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (6 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 07/17] mm: Add throttling of mm scanning using scan_period Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:32 ` [RFC PATCH V3 09/17] mm: Add initial scan delay Raghavendra K T
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Before this patch, the entire virtual address space of every task is
scanned. Now the scan size is shrunk or expanded based on the number
of useful pages found in the last scan.

This helps to quickly back off from unnecessary scanning and thus burn
less CPU.

Drawback: if a useful chunk is at the far end of the VMA space, its
scanning and migration will be delayed.

Shrink/expand algorithm for scan_size:
X : Number of useful pages in the last scan.
Y : Number of useful pages found in current scan.
Initial scan_size is 1GB.
 case 1: (X = 0, Y = 0)
  Decrease scan_size by a factor of 2
 case 2: (X = 0, Y > 0)
  Aggressively change to MAX (4GB)
 case 3: (X > 0, Y = 0)
  No change
 case 4: (X > 0, Y > 0)
  Increase scan_size by a factor of 2

Scan size is clamped between MIN (256MB) and MAX (4GB).
TBD: tuning based on real workloads.
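
As an illustration, a minimal sketch of the scan_size rule above
(constants mirror this patch and assume a 64-bit unsigned long; names
here are illustrative only, the real update sits next to the
scan_period tuning in kscand_update_mmslot_info()):

#define SCAN_SIZE_MIN	(256UL << 20)	/* 256MB */
#define SCAN_SIZE_MAX	(4UL << 30)	/*   4GB */
#define SIZE_SHIFT	1

static unsigned long tune_scan_size(unsigned long scan_size,
				    unsigned long last_delta,
				    unsigned long cur_delta)
{
	if (!last_delta && !cur_delta)
		scan_size >>= SIZE_SHIFT;	/* case 1: halve */
	else if (last_delta && cur_delta)
		scan_size <<= SIZE_SHIFT;	/* case 4: double */
	else if (!last_delta && cur_delta)
		scan_size = SCAN_SIZE_MAX;	/* case 2: go aggressive */
	/* case 3 (last_delta && !cur_delta): no change */

	if (scan_size < SCAN_SIZE_MIN)
		scan_size = SCAN_SIZE_MIN;
	if (scan_size > SCAN_SIZE_MAX)
		scan_size = SCAN_SIZE_MAX;

	return scan_size;
}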

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/mm/kscand.c b/mm/kscand.c
index 843069048c61..39a7fcef7de8 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -28,10 +28,15 @@
 
 static struct task_struct *kscand_thread __read_mostly;
 static DEFINE_MUTEX(kscand_mutex);
+
 /*
  * Total VMA size to cover during scan.
+ * Min: 256MB default: 1GB max: 4GB
  */
+#define KSCAND_SCAN_SIZE_MIN	(256 * 1024 * 1024UL)
+#define KSCAND_SCAN_SIZE_MAX	(4 * 1024 * 1024 * 1024UL)
 #define KSCAND_SCAN_SIZE	(1 * 1024 * 1024 * 1024UL)
+
 static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
 
 /*
@@ -94,6 +99,8 @@ struct kscand_mm_slot {
 	unsigned long next_scan;
 	/* Tracks how many useful pages obtained for migration in the last scan */
 	unsigned long scan_delta;
+	/* Determines how much VMA address space to be covered in the scanning */
+	unsigned long scan_size;
 	long address;
 	bool is_scanned;
 };
@@ -746,6 +753,8 @@ static void kmigrated_migrate_folio(void)
  */
 #define KSCAND_IGNORE_SCAN_THR	256
 
+#define SCAN_SIZE_CHANGE_SHIFT	1
+
 /* Maintains stability of scan_period by decaying last time accessed pages */
 #define SCAN_DECAY_SHIFT	4
 /*
@@ -761,14 +770,26 @@ static void kmigrated_migrate_folio(void)
  *		Increase scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
  *		Increase scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
  *		Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ * Tuning scan_size:
+ * Initial scan_size is 1GB
+ *	case 1: (X = 0, Y = 0)
+ *		Decrease scan_size by (1 << SCAN_SIZE_CHANGE_SHIFT).
+ *	case 2: (X = 0, Y > 0)
+ *		scan_size = KSCAND_SCAN_SIZE_MAX
+ *	case 3: (X > 0, Y = 0)
+ *		No change
+ *	case 4: (X > 0, Y > 0)
+ *		Increase scan_size by (1 << SCAN_SIZE_CHANGE_SHIFT).
  */
 static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 				unsigned long total)
 {
 	unsigned int scan_period;
 	unsigned long now;
+	unsigned long scan_size;
 	unsigned long old_scan_delta;
 
+	scan_size = mm_slot->scan_size;
 	scan_period = mm_slot->scan_period;
 	old_scan_delta = mm_slot->scan_delta;
 
@@ -789,20 +810,25 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 	if (!old_scan_delta && !total) {
 		scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
 		scan_period /= 100;
+		scan_size = scan_size >> SCAN_SIZE_CHANGE_SHIFT;
 	} else if (old_scan_delta && total) {
 		scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
 		scan_period /= 100;
+		scan_size = scan_size << SCAN_SIZE_CHANGE_SHIFT;
 	} else if (old_scan_delta && !total) {
 		scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
 	} else {
 		scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+		scan_size = KSCAND_SCAN_SIZE_MAX;
 	}
 
 	scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+	scan_size = clamp(scan_size, KSCAND_SCAN_SIZE_MIN, KSCAND_SCAN_SIZE_MAX);
 
 	now = jiffies;
 	mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
 	mm_slot->scan_period = scan_period;
+	mm_slot->scan_size = scan_size;
 	mm_slot->scan_delta = total;
 }
 
@@ -814,6 +840,7 @@ static unsigned long kscand_scan_mm_slot(void)
 	unsigned int mm_slot_scan_period;
 	unsigned long now;
 	unsigned long mm_slot_next_scan;
+	unsigned long mm_slot_scan_size;
 	unsigned long vma_scanned_size = 0;
 	unsigned long address;
 	unsigned long total = 0;
@@ -843,6 +870,7 @@ static unsigned long kscand_scan_mm_slot(void)
 	mm_slot->is_scanned = true;
 	mm_slot_next_scan = mm_slot->next_scan;
 	mm_slot_scan_period = mm_slot->scan_period;
+	mm_slot_scan_size = mm_slot->scan_size;
 	spin_unlock(&kscand_mm_lock);
 
 	if (unlikely(!mmap_read_trylock(mm)))
@@ -994,6 +1022,7 @@ void __kscand_enter(struct mm_struct *mm)
 
 	kscand_slot->address = 0;
 	kscand_slot->scan_period = kscand_mm_scan_period_ms;
+	kscand_slot->scan_size = kscand_scan_size;
 	kscand_slot->next_scan = 0;
 	kscand_slot->scan_delta = 0;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 09/17] mm: Add initial scan delay
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (7 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 08/17] mm: Add throttling of mm scanning using scan_size Raghavendra K T
@ 2025-08-14 15:32 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node Raghavendra K T
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:32 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

This is to prevent unnecessary scanning of short-lived tasks and
thereby reduce CPU consumption.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/kscand.c b/mm/kscand.c
index 39a7fcef7de8..880c3693866d 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -28,6 +28,7 @@
 
 static struct task_struct *kscand_thread __read_mostly;
 static DEFINE_MUTEX(kscand_mutex);
+extern unsigned int sysctl_numa_balancing_scan_delay;
 
 /*
  * Total VMA size to cover during scan.
@@ -1010,6 +1011,7 @@ void __kscand_enter(struct mm_struct *mm)
 {
 	struct kscand_mm_slot *kscand_slot;
 	struct mm_slot *slot;
+	unsigned long now;
 	int wakeup;
 
 	/* __kscand_exit() must not run from under us */
@@ -1020,10 +1022,12 @@ void __kscand_enter(struct mm_struct *mm)
 	if (!kscand_slot)
 		return;
 
+	now = jiffies;
 	kscand_slot->address = 0;
 	kscand_slot->scan_period = kscand_mm_scan_period_ms;
 	kscand_slot->scan_size = kscand_scan_size;
-	kscand_slot->next_scan = 0;
+	kscand_slot->next_scan = now +
+			2 * msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 	kscand_slot->scan_delta = 0;
 
 	slot = &kscand_slot->slot;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (8 preceding siblings ...)
  2025-08-14 15:32 ` [RFC PATCH V3 09/17] mm: Add initial scan delay Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 11/17] mm/kscand: Implement migration failure feedback Raghavendra K T
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

One of the key challenges in PTE A bit based scanning is to find the
right target node to promote to.

Here is a simple heuristic-based approach:
 1. While scanning the pages of an mm, also scan the top-tier pages
that belong to that mm.
 2. Accumulate insight into the distribution of active pages across
the top-tier nodes.
 3. Walk all the top-tier nodes and pick the one with the highest
number of accesses.

This method tries to consolidate the application onto a single node.

TBD: Create a list of preferred fallback nodes for when the
highest-access node is nearly full.
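
For example (illustrative numbers only): if, during one scan window,
the accessed-page counters for an mm end up as node0 = 1200 and
node1 = 150 (both top tier), node0 is recorded as mm->target_node and
subsequent promotions of that mm's slow-tier pages are directed to
node0. If no top-tier accesses were observed at all, the default
kscand_target_node is used as a fallback.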

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |   4 +
 mm/kscand.c              | 198 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 192 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6..e3d8f11a5a04 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1148,6 +1148,10 @@ struct mm_struct {
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
 #endif
+#ifdef CONFIG_KSCAND
+		/* Tracks promotion node. XXX: use nodemask */
+		int target_node;
+ #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
 		 * that can move process memory needs to flush the TLB when
diff --git a/mm/kscand.c b/mm/kscand.c
index 880c3693866d..bf975e82357d 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -104,6 +104,7 @@ struct kscand_mm_slot {
 	unsigned long scan_size;
 	long address;
 	bool is_scanned;
+	int target_node;
 };
 
 /* Data structure to keep track of current mm under scan */
@@ -116,13 +117,23 @@ struct kscand_scan kscand_scan = {
 	.mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
 };
 
+/* Per memory node information used to calculate target_node for migration */
+struct kscand_nodeinfo {
+	unsigned long nr_scanned;
+	unsigned long nr_accessed;
+	int node;
+	bool is_toptier;
+};
+
 /*
  * Data structure passed to control scanning and also collect
  * per memory node information
  */
 struct kscand_scanctrl {
 	struct list_head scan_list;
+	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
 	unsigned long address;
+	unsigned long nr_to_scan;
 };
 
 struct kscand_scanctrl kscand_scanctrl;
@@ -218,15 +229,129 @@ static void kmigrated_wait_work(void)
 			migrate_sleep_jiffies);
 }
 
-/*
- * Do not know what info to pass in the future to make
- * decision on taget node. Keep it void * now.
- */
+static unsigned long get_slowtier_accesed(struct kscand_scanctrl *scanctrl)
+{
+	int node;
+	unsigned long accessed = 0;
+
+	for_each_node_state(node, N_MEMORY) {
+		if (!node_is_toptier(node) && scanctrl->nodeinfo[node])
+			accessed += scanctrl->nodeinfo[node]->nr_accessed;
+	}
+	return accessed;
+}
+
+static inline unsigned long get_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni)
+{
+	return ni->nr_accessed;
+}
+
+static inline void set_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni, unsigned long val)
+{
+	ni->nr_accessed = val;
+}
+
+static inline unsigned long get_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
+{
+	return ni->nr_scanned;
+}
+
+static inline void set_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni, unsigned long val)
+{
+	ni->nr_scanned = val;
+}
+
+static inline void reset_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
+{
+	set_nodeinfo_nr_scanned(ni, 0);
+}
+
+static inline void reset_nodeinfo(struct kscand_nodeinfo *ni)
+{
+	set_nodeinfo_nr_scanned(ni, 0);
+	set_nodeinfo_nr_accessed(ni, 0);
+}
+
+static void init_one_nodeinfo(struct kscand_nodeinfo *ni, int node)
+{
+	ni->nr_scanned = 0;
+	ni->nr_accessed = 0;
+	ni->node = node;
+	ni->is_toptier = node_is_toptier(node) ? true : false;
+}
+
+static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
+{
+	struct kscand_nodeinfo *ni;
+
+	ni = kzalloc(sizeof(*ni), GFP_KERNEL);
+
+	if (!ni)
+		return NULL;
+
+	init_one_nodeinfo(ni, node);
+
+	return ni;
+}
+
+/* TBD: Handle errors */
+static void init_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+	struct kscand_nodeinfo *ni;
+	int node;
+
+	for_each_node(node) {
+		ni = alloc_one_nodeinfo(node);
+		WARN_ON_ONCE(!ni);
+		scanctrl->nodeinfo[node] = ni;
+	}
+}
+
+static void reset_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		reset_nodeinfo(scanctrl->nodeinfo[node]);
+
+	/* XXX: Not really required? */
+	scanctrl->nr_to_scan = kscand_scan_size;
+}
+
+static void free_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+	int node;
+
+	for_each_node(node)
+		kfree(scanctrl->nodeinfo[node]);
+}
+
 static int kscand_get_target_node(void *data)
 {
 	return kscand_target_node;
 }
 
+static int get_target_node(struct kscand_scanctrl *scanctrl)
+{
+	int node, target_node = NUMA_NO_NODE;
+	unsigned long prev = 0;
+
+	for_each_node(node) {
+			/* TBD: also build a fallback migration node list here */
+			/* This creates a fallback migration node list */
+			if (get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]) > prev) {
+				prev = get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]);
+				target_node = node;
+			}
+		}
+	}
+	if (target_node == NUMA_NO_NODE)
+		target_node = kscand_get_target_node(NULL);
+
+	return target_node;
+}
+
 extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 					unsigned long nr_migrate_pages);
 
@@ -495,6 +620,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 	page_idle_clear_pte_refs(page, pte, walk);
 	srcnid = folio_nid(folio);
 
+	scanctrl->nodeinfo[srcnid]->nr_scanned++;
+	if (scanctrl->nr_to_scan)
+		scanctrl->nr_to_scan--;
+
+	if (!scanctrl->nr_to_scan) {
+		folio_put(folio);
+		return 1;
+	}
 
 	if (!folio_test_lru(folio)) {
 		folio_put(folio);
@@ -502,13 +635,17 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 	}
 
 	if (!kscand_eligible_srcnid(srcnid)) {
+		if (folio_test_young(folio) || folio_test_referenced(folio)
+				|| pte_young(pteval)) {
+			scanctrl->nodeinfo[srcnid]->nr_accessed++;
+		}
 		folio_put(folio);
 		return 0;
 	}
 	if (!folio_test_idle(folio) && !prev_idle &&
 		(folio_test_young(folio) || folio_test_referenced(folio))) {
 
-		/* XXX: Leaking memory. TBD: consume info */
+		scanctrl->nodeinfo[srcnid]->nr_accessed++;
 
 		info = kzalloc(sizeof(struct kscand_migrate_info), GFP_NOWAIT);
 		if (info && scanctrl) {
@@ -697,7 +834,13 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
 
 			spin_unlock(&mm_slot->migrate_lock);
 
-			dest = kscand_get_target_node(NULL);
+			if (!mmap_read_trylock(mm)) {
+				dest = kscand_get_target_node(NULL);
+			} else {
+				dest = READ_ONCE(mm->target_node);
+				mmap_read_unlock(mm);
+			}
+
 			ret = kmigrated_promote_folio(info, mm, dest);
 
 			kfree(info);
@@ -783,7 +926,7 @@ static void kmigrated_migrate_folio(void)
  *		Increase scan_size by (1 << SCAN_SIZE_CHANGE_SHIFT).
  */
 static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
-				unsigned long total)
+				unsigned long total, int target_node)
 {
 	unsigned int scan_period;
 	unsigned long now;
@@ -831,6 +974,7 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 	mm_slot->scan_period = scan_period;
 	mm_slot->scan_size = scan_size;
 	mm_slot->scan_delta = total;
+	mm_slot->target_node = target_node;
 }
 
 static unsigned long kscand_scan_mm_slot(void)
@@ -839,6 +983,7 @@ static unsigned long kscand_scan_mm_slot(void)
 	bool update_mmslot_info = false;
 
 	unsigned int mm_slot_scan_period;
+	int target_node, mm_slot_target_node, mm_target_node;
 	unsigned long now;
 	unsigned long mm_slot_next_scan;
 	unsigned long mm_slot_scan_size;
@@ -872,6 +1017,7 @@ static unsigned long kscand_scan_mm_slot(void)
 	mm_slot_next_scan = mm_slot->next_scan;
 	mm_slot_scan_period = mm_slot->scan_period;
 	mm_slot_scan_size = mm_slot->scan_size;
+	mm_slot_target_node = mm_slot->target_node;
 	spin_unlock(&kscand_mm_lock);
 
 	if (unlikely(!mmap_read_trylock(mm)))
@@ -882,6 +1028,9 @@ static unsigned long kscand_scan_mm_slot(void)
 		goto outerloop;
 	}
 
+	mm_target_node = READ_ONCE(mm->target_node);
+	if (mm_target_node != mm_slot_target_node)
+		WRITE_ONCE(mm->target_node, mm_slot_target_node);
 	now = jiffies;
 
 	if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
@@ -889,24 +1038,41 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	VMA_ITERATOR(vmi, mm, address);
 
+	/* Scan at most scan_size worth of pages or until vma coverage reaches scan_size */
+	kscand_scanctrl.nr_to_scan = mm_slot_scan_size >> PAGE_SHIFT;
+	/* Reduce the actual number of pages scanned */
+	kscand_scanctrl.nr_to_scan >>= 1;
+
+	/* XXX: skip scanning to avoid duplicates until all migrations done? */
 	kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
 
 	for_each_vma(vmi, vma) {
 		kscand_walk_page_vma(vma, &kscand_scanctrl);
 		vma_scanned_size += vma->vm_end - vma->vm_start;
 
-		if (vma_scanned_size >= kscand_scan_size) {
+		if (vma_scanned_size >= mm_slot_scan_size ||
+					!kscand_scanctrl.nr_to_scan) {
 			next_mm = true;
 
 			if (!list_empty(&kscand_scanctrl.scan_list)) {
 				if (!kmigrated_mm_slot)
 					kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+				/* Add scanned folios to migration list */
 				spin_lock(&kmigrated_mm_slot->migrate_lock);
+
 				list_splice_tail_init(&kscand_scanctrl.scan_list,
 						&kmigrated_mm_slot->migrate_head);
 				spin_unlock(&kmigrated_mm_slot->migrate_lock);
+				break;
 			}
-			break;
+		}
+		if (!list_empty(&kscand_scanctrl.scan_list)) {
+			if (!kmigrated_mm_slot)
+				kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+			spin_lock(&kmigrated_mm_slot->migrate_lock);
+			list_splice_tail_init(&kscand_scanctrl.scan_list,
+					&kmigrated_mm_slot->migrate_head);
+			spin_unlock(&kmigrated_mm_slot->migrate_lock);
 		}
 	}
 
@@ -917,9 +1083,19 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	update_mmslot_info = true;
 
+	total = get_slowtier_accesed(&kscand_scanctrl);
+	target_node = get_target_node(&kscand_scanctrl);
+
+	mm_target_node = READ_ONCE(mm->target_node);
+
+	/* XXX: Do we need write lock? */
+	if (mm_target_node != target_node)
+		WRITE_ONCE(mm->target_node, target_node);
+	reset_scanctrl(&kscand_scanctrl);
+
 	if (update_mmslot_info) {
 		mm_slot->address = address;
-		kscand_update_mmslot_info(mm_slot, total);
+		kscand_update_mmslot_info(mm_slot, total, target_node);
 	}
 
 outerloop:
@@ -1113,6 +1289,7 @@ static int stop_kscand(void)
 		kthread_stop(kscand_thread);
 		kscand_thread = NULL;
 	}
+	free_scanctrl(&kscand_scanctrl);
 
 	return 0;
 }
@@ -1168,6 +1345,7 @@ static inline void init_list(void)
 	spin_lock_init(&kscand_migrate_lock);
 	init_waitqueue_head(&kscand_wait);
 	init_waitqueue_head(&kmigrated_wait);
+	init_scanctrl(&kscand_scanctrl);
 }
 
 static int __init kscand_init(void)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 11/17] mm/kscand: Implement migration failure feedback
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (9 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 12/17] sysfs: Add sysfs support to tune scanning Raghavendra K T
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Before this, the scanning kthread continues to scan at the same rate
even after migrations fail. To control this, scanning is slowed down
based on the failure/success ratio obtained from the migration
thread.

A decaying failure ratio is maintained over a 1024-migration window.
The ratio then contributes an approximately 10% scaling of
scan_period.
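
A minimal sketch of the feedback described above (names here are
illustrative; the in-kernel helpers are update_mstat_ratio() and
kscand_mstat_scan_period() in the diff below):

/* Called once every MSTAT_UPDATE_FREQ (1024) migration attempts. */
static int update_failure_ratio(int *wsuccess, int *wfailed,
				int msuccess, int mfailed)
{
	/* decay the previous window to 1/4 before adding the new counts */
	*wsuccess = (*wsuccess >> 2) + msuccess;
	*wfailed  = (*wfailed  >> 2) + mfailed;

	return *wfailed * 100 / (*wsuccess + *wfailed);
}

static unsigned int scale_scan_period(unsigned int scan_period, int fratio)
{
	/* a higher failure ratio stretches scan_period, slowing the scanner */
	return scan_period * (1 + fratio / 10);
}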

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/mm/kscand.c b/mm/kscand.c
index bf975e82357d..41321d373be7 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -146,6 +146,8 @@ struct kmigrated_mm_slot {
 	spinlock_t migrate_lock;
 	/* Head of per mm migration list */
 	struct list_head migrate_head;
+	/* Indicates weighted success, failure */
+	int msuccess, mfailed, fratio;
 };
 
 /* System wide list of mms that maintain migration list */
@@ -812,13 +814,45 @@ static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
 	}
 }
 
+static int kmigrated_get_mstat_fratio(struct mm_struct *mm)
+{
+	int fratio = 0;
+	struct kmigrated_mm_slot *mm_slot = NULL;
+	struct mm_slot *slot;
+
+	guard(spinlock)(&kscand_migrate_lock);
+
+	slot = mm_slot_lookup(kmigrated_slots_hash, mm);
+	mm_slot = mm_slot_entry(slot, struct kmigrated_mm_slot, mm_slot);
+
+	if (mm_slot)
+		fratio =  mm_slot->fratio;
+
+	return fratio;
+}
+
+static void update_mstat_ratio(struct kmigrated_mm_slot *mm_slot,
+				int msuccess, int mfailed)
+{
+	mm_slot->msuccess = (mm_slot->msuccess >> 2) + msuccess;
+	mm_slot->mfailed = (mm_slot->mfailed >> 2) + mfailed;
+	mm_slot->fratio = mm_slot->mfailed * 100;
+	mm_slot->fratio /=  (mm_slot->msuccess + mm_slot->mfailed);
+}
+
+#define MSTAT_UPDATE_FREQ	1024
+
 static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
 {
+	int mfailed = 0;
+	int msuccess = 0;
+	int mstat_counter;
 	int ret = 0, dest = -1;
 	struct mm_slot *slot;
 	struct mm_struct *mm;
 	struct kscand_migrate_info *info, *tmp;
 
+	mstat_counter = MSTAT_UPDATE_FREQ;
 	spin_lock(&mm_slot->migrate_lock);
 
 	slot = &mm_slot->mm_slot;
@@ -842,11 +876,23 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
 			}
 
 			ret = kmigrated_promote_folio(info, mm, dest);
+			mstat_counter--;
+
+			/* TBD: encode migrated count here, currently assume folio_nr_pages */
+			if (!ret)
+				msuccess++;
+			else
+				mfailed++;
 
 			kfree(info);
 
 			cond_resched();
 			spin_lock(&mm_slot->migrate_lock);
+			if (!mstat_counter) {
+				update_mstat_ratio(mm_slot, msuccess, mfailed);
+				msuccess  = mfailed = 0;
+				mstat_counter = MSTAT_UPDATE_FREQ;
+			}
 		}
 	}
 clean_list_handled:
@@ -882,6 +928,12 @@ static void kmigrated_migrate_folio(void)
 	}
 }
 
+/* Get scan_period based on migration failure statistics */
+static int kscand_mstat_scan_period(unsigned int scan_period, int fratio)
+{
+	return scan_period * (1 + fratio / 10);
+}
+
 /*
  * This is the normal change percentage when old and new delta remain same.
  * i.e., either both positive or both zero.
@@ -928,6 +980,7 @@ static void kmigrated_migrate_folio(void)
 static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 				unsigned long total, int target_node)
 {
+	int fratio;
 	unsigned int scan_period;
 	unsigned long now;
 	unsigned long scan_size;
@@ -967,6 +1020,8 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 	}
 
 	scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+	fratio = kmigrated_get_mstat_fratio((&mm_slot->slot)->mm);
+	scan_period = kscand_mstat_scan_period(scan_period, fratio);
 	scan_size = clamp(scan_size, KSCAND_SCAN_SIZE_MIN, KSCAND_SCAN_SIZE_MAX);
 
 	now = jiffies;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 12/17] sysfs: Add sysfs support to tune scanning
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (10 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 11/17] mm/kscand: Implement migration failure feedback Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 13/17] mm/vmstat: Add vmstat counters Raghavendra K T
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Support the following tunables:
scan_enabled: turn mm_struct scanning on or off
mm_scan_period_ms: initial scan_period (default: 2sec)
scan_sleep_ms: sleep time between two successive rounds of scanning
and migration.
mms_to_scan: total number of mm_structs to scan before taking a pause.
target_node: default regular node to which accessed pages are migrated
(this is only a fallback mechanism; otherwise the target_node
heuristic is used).
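
With this patch the knobs should appear under /sys/kernel/mm/kscand/
(the attribute group is registered on mm_kobj with the group name
"kscand"): scan_enabled, mm_scan_period_ms, scan_sleep_ms, mms_to_scan
and target_node.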

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)

diff --git a/mm/kscand.c b/mm/kscand.c
index 41321d373be7..a73606f7ca3c 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/cleanup.h>
 #include <linux/minmax.h>
+#include <trace/events/kmem.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -173,6 +174,171 @@ static bool kscand_eligible_srcnid(int nid)
 	return  !node_is_toptier(nid);
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t scan_sleep_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kscand_scan_sleep_ms);
+}
+
+static ssize_t scan_sleep_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	kscand_scan_sleep_ms = msecs;
+	kscand_sleep_expire = 0;
+	wake_up_interruptible(&kscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute scan_sleep_ms_attr =
+	__ATTR_RW(scan_sleep_ms);
+
+static ssize_t mm_scan_period_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kscand_mm_scan_period_ms);
+}
+
+/* If a value below MIN or above MAX is requested, the stored value is clamped */
+static ssize_t mm_scan_period_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs, stored_msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	stored_msecs = clamp(msecs, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+
+	kscand_mm_scan_period_ms = stored_msecs;
+	kscand_sleep_expire = 0;
+	wake_up_interruptible(&kscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mm_scan_period_ms_attr =
+	__ATTR_RW(mm_scan_period_ms);
+
+static ssize_t mms_to_scan_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", kscand_mms_to_scan);
+}
+
+static ssize_t mms_to_scan_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err)
+		return -EINVAL;
+
+	kscand_mms_to_scan = val;
+	kscand_sleep_expire = 0;
+	wake_up_interruptible(&kscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mms_to_scan_attr =
+	__ATTR_RW(mms_to_scan);
+
+static ssize_t scan_enabled_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kscand_scan_enabled ? 1 : 0);
+}
+
+static ssize_t scan_enabled_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 10, &val);
+	if (err || val > 1)
+		return -EINVAL;
+
+	if (val) {
+		kscand_scan_enabled = true;
+		need_wakeup = true;
+	} else
+		kscand_scan_enabled = false;
+
+	kscand_sleep_expire = 0;
+	wake_up_interruptible(&kscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute scan_enabled_attr =
+	__ATTR_RW(scan_enabled);
+
+static ssize_t target_node_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kscand_target_node);
+}
+
+static ssize_t target_node_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int err, node;
+
+	err = kstrtoint(buf, 10, &node);
+	if (err)
+		return -EINVAL;
+
+	kscand_sleep_expire = 0;
+	if (!node_is_toptier(node))
+		return -EINVAL;
+
+	kscand_target_node = node;
+	wake_up_interruptible(&kscand_wait);
+
+	return count;
+}
+static struct kobj_attribute target_node_attr =
+	__ATTR_RW(target_node);
+
+static struct attribute *kscand_attr[] = {
+	&scan_sleep_ms_attr.attr,
+	&mm_scan_period_ms_attr.attr,
+	&mms_to_scan_attr.attr,
+	&scan_enabled_attr.attr,
+	&target_node_attr.attr,
+	NULL,
+};
+
+struct attribute_group kscand_attr_group = {
+	.attrs = kscand_attr,
+	.name = "kscand",
+};
+#endif
+
 static inline int kscand_has_work(void)
 {
 	return !list_empty(&kscand_scan.mm_head);
@@ -1231,11 +1397,45 @@ static int kscand(void *none)
 	return 0;
 }
 
+#ifdef CONFIG_SYSFS
+extern struct kobject *mm_kobj;
+static int __init kscand_init_sysfs(struct kobject **kobj)
+{
+	int err;
+
+	err = sysfs_create_group(*kobj, &kscand_attr_group);
+	if (err)
+		pr_err("failed to register kscand group\n");
+
+	return err;
+}
+
+static void __init kscand_exit_sysfs(struct kobject *kobj)
+{
+		sysfs_remove_group(kobj, &kscand_attr_group);
+}
+#else
+static inline int __init kscand_init_sysfs(struct kobject **kobj)
+{
+	return 0;
+}
+static inline void __init kscand_exit_sysfs(struct kobject *kobj)
+{
+}
+#endif
+
 static inline void kscand_destroy(void)
 {
 	kmem_cache_destroy(kscand_slot_cache);
 	/* XXX: move below to kmigrated thread */
 	kmem_cache_destroy(kmigrated_slot_cache);
+	kscand_exit_sysfs(mm_kobj);
 }
 
 void __kscand_enter(struct mm_struct *mm)
@@ -1421,6 +1621,10 @@ static int __init kscand_init(void)
 		return -ENOMEM;
 	}
 
+	err = kscand_init_sysfs(&mm_kobj);
+	if (err)
+		goto err_init_sysfs;
+
 	init_list();
 	err = start_kscand();
 	if (err)
@@ -1437,6 +1641,7 @@ static int __init kscand_init(void)
 
 err_kscand:
 	stop_kscand();
+err_init_sysfs:
 	kscand_destroy();
 
 	return err;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 13/17] mm/vmstat: Add vmstat counters
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (11 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 12/17] sysfs: Add sysfs support to tune scanning Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 14/17] trace/kscand: Add tracing of scanning and migration Raghavendra K T
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Add vmstat counters to track scanning, migration and the type of
pages encountered.
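
The counters show up in /proc/vmstat as the nr_kscand_* entries added
to vmstat_text below.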

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h            | 13 ++++++++
 include/linux/vm_event_item.h | 12 +++++++
 mm/kscand.c                   | 63 +++++++++++++++++++++++++++++++++--
 mm/vmstat.c                   | 12 +++++++
 4 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa538feaa8d9..0d579d0294bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -655,6 +655,19 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_KSCAND
+void count_kscand_mm_scans(void);
+void count_kscand_vma_scans(void);
+void count_kscand_migadded(void);
+void count_kscand_migrated(void);
+void count_kscand_migrate_failed(void);
+void count_kscand_slowtier(void);
+void count_kscand_toptier(void);
+void count_kscand_idlepage(void);
+void count_kscand_hotpage(void);
+void count_kscand_coldpage(void);
+#endif
+
 #ifdef CONFIG_NUMA_BALANCING
 static inline void vma_numab_state_init(struct vm_area_struct *vma)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..b5643be5dd94 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -67,6 +67,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
 #endif
+#ifdef CONFIG_KSCAND
+		KSCAND_MM_SCANS,
+		KSCAND_VMA_SCANS,
+		KSCAND_MIGADDED,
+		KSCAND_MIGRATED,
+		KSCAND_MIGRATE_FAILED,
+		KSCAND_SLOWTIER,
+		KSCAND_TOPTIER,
+		KSCAND_IDLEPAGE,
+		KSCAND_HOTPAGE,
+		KSCAND_COLDPAGE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 		THP_MIGRATION_SUCCESS,
diff --git a/mm/kscand.c b/mm/kscand.c
index a73606f7ca3c..e14645565ba7 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -339,6 +339,47 @@ struct attribute_group kscand_attr_group = {
 };
 #endif
 
+void count_kscand_mm_scans(void)
+{
+	count_vm_numa_event(KSCAND_MM_SCANS);
+}
+void count_kscand_vma_scans(void)
+{
+	count_vm_numa_event(KSCAND_VMA_SCANS);
+}
+void count_kscand_migadded(void)
+{
+	count_vm_numa_event(KSCAND_MIGADDED);
+}
+void count_kscand_migrated(void)
+{
+	count_vm_numa_event(KSCAND_MIGRATED);
+}
+void count_kscand_migrate_failed(void)
+{
+	count_vm_numa_event(KSCAND_MIGRATE_FAILED);
+}
+void count_kscand_slowtier(void)
+{
+	count_vm_numa_event(KSCAND_SLOWTIER);
+}
+void count_kscand_toptier(void)
+{
+	count_vm_numa_event(KSCAND_TOPTIER);
+}
+void count_kscand_idlepage(void)
+{
+	count_vm_numa_event(KSCAND_IDLEPAGE);
+}
+void count_kscand_hotpage(void)
+{
+	count_vm_numa_event(KSCAND_HOTPAGE);
+}
+void count_kscand_coldpage(void)
+{
+	count_vm_numa_event(KSCAND_COLDPAGE);
+}
+
 static inline int kscand_has_work(void)
 {
 	return !list_empty(&kscand_scan.mm_head);
@@ -653,6 +694,8 @@ static int kmigrated_promote_folio(struct kscand_migrate_info *info,
 
 	if (!is_hot_page(folio))
 		return KSCAND_NOT_HOT_PAGE;
+	else
+		count_kscand_hotpage();
 
 	folio_get(folio);
 
@@ -803,12 +846,15 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 	}
 
 	if (!kscand_eligible_srcnid(srcnid)) {
+		count_kscand_toptier();
 		if (folio_test_young(folio) || folio_test_referenced(folio)
 				|| pte_young(pteval)) {
 			scanctrl->nodeinfo[srcnid]->nr_accessed++;
 		}
 		folio_put(folio);
 		return 0;
+	} else {
+		count_kscand_slowtier();
 	}
 	if (!folio_test_idle(folio) && !prev_idle &&
 		(folio_test_young(folio) || folio_test_referenced(folio))) {
@@ -820,7 +866,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 			info->pfn = folio_pfn(folio);
 			info->address = addr;
 			list_add_tail(&info->migrate_node, &scanctrl->scan_list);
+			count_kscand_migadded();
 		}
+		folio_put(folio);
+		return 0;
+	} else {
+		if (prev_idle)
+			count_kscand_coldpage();
+		count_kscand_idlepage();
 	}
 
 	folio_put(folio);
@@ -1045,10 +1098,13 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
 			mstat_counter--;
 
 			/* TBD: encode migrated count here, currently assume folio_nr_pages */
-			if (!ret)
+			if (!ret) {
+				count_kscand_migrated();
 				msuccess++;
-			else
+			} else {
+				count_kscand_migrate_failed();
 				mfailed++;
+			}
 
 			kfree(info);
 
@@ -1269,6 +1325,7 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	for_each_vma(vmi, vma) {
 		kscand_walk_page_vma(vma, &kscand_scanctrl);
+		count_kscand_vma_scans();
 		vma_scanned_size += vma->vm_end - vma->vm_start;
 
 		if (vma_scanned_size >= mm_slot_scan_size ||
@@ -1304,6 +1361,8 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	update_mmslot_info = true;
 
+	count_kscand_mm_scans();
+
 	total = get_slowtier_accesed(&kscand_scanctrl);
 	target_node = get_target_node(&kscand_scanctrl);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a78d70ddeacd..5ba82d2ffe71 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1347,6 +1347,18 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
 #endif
+#ifdef CONFIG_KSCAND
+	"nr_kscand_mm_scans",
+	"nr_kscand_vma_scans",
+	"nr_kscand_migadded",
+	"nr_kscand_migrated",
+	"nr_kscand_migrate_failed",
+	"nr_kscand_slowtier",
+	"nr_kscand_toptier",
+	"nr_kscand_idlepage",
+	"nr_kscand_hotpage",
+	"nr_kscand_coldpage",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 14/17] trace/kscand: Add tracing of scanning and migration
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (12 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 13/17] mm/vmstat: Add vmstat counters Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 15/17] prctl: Introduce new prctl to control scanning Raghavendra K T
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

Add tracing support to track
 - start and end of scanning.
 - migration.
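
Once enabled via tracefs (the events are added to the existing kmem
trace event group, e.g. events/kmem/kmem_scan_mm_start), the trace
points report the mm pointer, the scan window parameters, the chosen
target node and the per-batch migration success/failure counts.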

CC: Steven Rostedt <rostedt@goodmis.org>
CC: Masami Hiramatsu <mhiramat@kernel.org>
CC: linux-trace-kernel@vger.kernel.org

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/trace/events/kmem.h | 99 +++++++++++++++++++++++++++++++++++++
 mm/kscand.c                 |  9 ++++
 2 files changed, 108 insertions(+)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index f74925a6cf69..d6e544b067b9 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -9,6 +9,105 @@
 #include <linux/tracepoint.h>
 #include <trace/events/mmflags.h>
 
+#ifdef CONFIG_KSCAND
+DECLARE_EVENT_CLASS(kmem_mm_class,
+
+	TP_PROTO(struct mm_struct *mm),
+
+	TP_ARGS(mm),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm		)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+	),
+
+	TP_printk("mm = %p", __entry->mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_enter,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_exit,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+TRACE_EVENT(kmem_scan_mm_end,
+
+	TP_PROTO( struct mm_struct *mm,
+		 unsigned long start,
+		 unsigned long total,
+		 unsigned long scan_period,
+		 unsigned long scan_size,
+		 int target_node),
+
+	TP_ARGS(mm, start, total, scan_period, scan_size, target_node),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm		)
+		__field(	unsigned long,   start		)
+		__field(	unsigned long,   total		)
+		__field(	unsigned long,   scan_period	)
+		__field(	unsigned long,   scan_size	)
+		__field(	int,		 target_node	)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->start = start;
+		__entry->total = total;
+		__entry->scan_period  = scan_period;
+		__entry->scan_size    = scan_size;
+		__entry->target_node  = target_node;
+	),
+
+	TP_printk("mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld node = %d",
+		__entry->mm, __entry->start, __entry->total, __entry->scan_period,
+		__entry->scan_size, __entry->target_node)
+);
+
+TRACE_EVENT(kmem_scan_mm_migrate,
+
+	TP_PROTO(struct mm_struct *mm,
+		 int rc,
+		 int target_node,
+		 int msuccess,
+		 int mfailed),
+
+	TP_ARGS(mm, rc, target_node, msuccess, mfailed),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm	)
+		__field(	int,   rc		)
+		__field(	int,   target_node	)
+		__field(	int,   msuccess		)
+		__field(	int,   mfailed		)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->rc = rc;
+		__entry->target_node = target_node;
+		__entry->msuccess = msuccess;
+		__entry->mfailed = mfailed;
+	),
+
+	TP_printk("mm = %p rc = %d node = %d msuccess = %d mfailed = %d ",
+		__entry->mm, __entry->rc, __entry->target_node,
+		__entry->msuccess, __entry->mfailed)
+);
+#endif
+
 TRACE_EVENT(kmem_cache_alloc,
 
 	TP_PROTO(unsigned long call_site,
diff --git a/mm/kscand.c b/mm/kscand.c
index e14645565ba7..273306f47553 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -1105,6 +1105,7 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
 				count_kscand_migrate_failed();
 				mfailed++;
 			}
+			trace_kmem_scan_mm_migrate(mm, ret, dest, msuccess, mfailed);
 
 			kfree(info);
 
@@ -1308,6 +1309,9 @@ static unsigned long kscand_scan_mm_slot(void)
 	mm_target_node = READ_ONCE(mm->target_node);
 	if (mm_target_node != mm_slot_target_node)
 		WRITE_ONCE(mm->target_node, mm_slot_target_node);
+
+	trace_kmem_scan_mm_start(mm);
+
 	now = jiffies;
 
 	if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
@@ -1378,6 +1382,9 @@ static unsigned long kscand_scan_mm_slot(void)
 		kscand_update_mmslot_info(mm_slot, total, target_node);
 	}
 
+	trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period,
+			mm_slot_scan_size, target_node);
+
 outerloop:
 	/* exit_mmap will destroy ptes after this */
 	mmap_read_unlock(mm);
@@ -1530,6 +1537,7 @@ void __kscand_enter(struct mm_struct *mm)
 	spin_unlock(&kscand_mm_lock);
 
 	mmgrab(mm);
+	trace_kmem_mm_enter(mm);
 	if (wakeup)
 		wake_up_interruptible(&kscand_wait);
 }
@@ -1540,6 +1548,7 @@ void __kscand_exit(struct mm_struct *mm)
 	struct mm_slot *slot;
 	int free = 0, serialize = 1;
 
+	trace_kmem_mm_exit(mm);
 	spin_lock(&kscand_mm_lock);
 	slot = mm_slot_lookup(kscand_slots_hash, mm);
 	mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V3 15/17] prctl: Introduce new prctl to control scanning
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (13 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 14/17] trace/kscand: Add tracing of scanning and migration Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 16/17] prctl: Fine tune scan_period with prctl scale param Raghavendra K T
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

A new scalar value (PTEAScanScale) is introduced to control per-task
PTE A bit scanning.

0    : scanning disabled
1-10 : scanning enabled.

In the future, PTEAScanScale could be used to control the
aggressiveness of scanning.
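
An illustrative userspace usage sketch (not part of the patch; the
prctl number comes from the uapi addition below, and the unused
arguments must be zero):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_PTE_A_SCAN_SCALE
#define PR_SET_PTE_A_SCAN_SCALE	79
#endif

int main(void)
{
	/* Request a moderately aggressive PTE A bit scan for this task's mm */
	if (prctl(PR_SET_PTE_A_SCAN_SCALE, 5, 0, 0, 0))
		perror("PR_SET_PTE_A_SCAN_SCALE");

	return 0;
}

The resulting value is visible as PTEAScanScale in /proc/<pid>/status.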

CC: linux-doc@vger.kernel.org
CC: Jonathan Corbet <corbet@lwn.net>
CC: linux-fsdevel@vger.kernel.org

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 Documentation/filesystems/proc.rst |  2 ++
 fs/proc/task_mmu.c                 |  4 ++++
 include/linux/mm_types.h           |  3 +++
 include/uapi/linux/prctl.h         |  7 +++++++
 kernel/fork.c                      |  4 ++++
 kernel/sys.c                       | 25 +++++++++++++++++++++++++
 mm/kscand.c                        |  5 +++++
 7 files changed, 50 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5236cb52e357..0e99d1ca229a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -205,6 +205,7 @@ read the file /proc/PID/status::
   VmLib:      1412 kB
   VmPTE:        20 kb
   VmSwap:        0 kB
+  PTEAScanScale: 0
   HugetlbPages:          0 kB
   CoreDumping:    0
   THP_enabled:	  1
@@ -288,6 +289,7 @@ It's slow but very precise.
  VmPTE                       size of page table entries
  VmSwap                      amount of swap used by anonymous private data
                              (shmem swap usage is not included)
+ PTEAScanScale               Integer representing async PTE A bit scan aggression
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 751479eb128f..05be24e4bc4f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -79,6 +79,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
 	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
 	seq_puts(m, " kB\n");
+#ifdef CONFIG_KSCAND
+	seq_put_decimal_ull_width(m, "PTEAScanScale:\t", mm->pte_scan_scale, 8);
+	seq_puts(m, "\n");
+#endif
 	hugetlb_report_usage(m, mm);
 }
 #undef SEQ_PUT_DEC
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e3d8f11a5a04..798e6053eebe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1151,6 +1151,9 @@ struct mm_struct {
 #ifdef CONFIG_KSCAND
 		/* Tracks promotion node. XXX: use nodemask */
 		int target_node;
+
+		/* Integer representing PTE A bit scan aggression (0-10) */
+		unsigned int pte_scan_scale;
  #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559..6b5877865e08 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -371,4 +371,11 @@ struct prctl_mm_map {
 # define PR_FUTEX_HASH_GET_SLOTS	2
 # define PR_FUTEX_HASH_GET_IMMUTABLE	3
 
+/* Set/get PTE A bit scan scale */
+#define PR_SET_PTE_A_SCAN_SCALE		79
+#define PR_GET_PTE_A_SCAN_SCALE		80
+# define PR_PTE_A_SCAN_SCALE_MIN	0
+# define PR_PTE_A_SCAN_SCALE_MAX	10
+# define PR_PTE_A_SCAN_SCALE_DEFAULT	8
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a13043de91b0..bb780215024c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -106,6 +106,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/prctl.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1050,6 +1051,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	futex_mm_init(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
 	mm->pmd_huge_pte = NULL;
+#endif
+#ifdef CONFIG_KSCAND
+	mm->pte_scan_scale = PR_PTE_A_SCAN_SCALE_DEFAULT;
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
diff --git a/kernel/sys.c b/kernel/sys.c
index adc0de0aa364..f6c893b22bc6 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2147,6 +2147,19 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr,
 
 	return 0;
 }
+#ifdef CONFIG_KSCAND
+static int prctl_pte_scan_scale_write(unsigned int scale)
+{
+	scale = clamp(scale, PR_PTE_A_SCAN_SCALE_MIN, PR_PTE_A_SCAN_SCALE_MAX);
+	current->mm->pte_scan_scale = scale;
+	return 0;
+}
+
+static unsigned int prctl_pte_scan_scale_read(void)
+{
+	return current->mm->pte_scan_scale;
+}
+#endif
 
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
@@ -2824,6 +2837,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_FUTEX_HASH:
 		error = futex_hash_prctl(arg2, arg3, arg4);
 		break;
+#ifdef CONFIG_KSCAND
+	case PR_SET_PTE_A_SCAN_SCALE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = prctl_pte_scan_scale_write((unsigned int) arg2);
+		break;
+	case PR_GET_PTE_A_SCAN_SCALE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = prctl_pte_scan_scale_read();
+		break;
+#endif
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
diff --git a/mm/kscand.c b/mm/kscand.c
index 273306f47553..8aef6021c6ba 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -1306,6 +1306,11 @@ static unsigned long kscand_scan_mm_slot(void)
 		goto outerloop;
 	}
 
+	if (!mm->pte_scan_scale) {
+		next_mm = true;
+		goto outerloop;
+	}
+
 	mm_target_node = READ_ONCE(mm->target_node);
 	if (mm_target_node != mm_slot_target_node)
 		WRITE_ONCE(mm->target_node, mm_slot_target_node);
-- 
2.34.1




* [RFC PATCH V3 16/17] prctl: Fine tune scan_period with prctl scale param
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (14 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 15/17] prctl: Introduce new prctl to control scanning Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-14 15:33 ` [RFC PATCH V3 17/17] mm: Create a list of fallback target nodes Raghavendra K T
  2025-08-21 15:24 ` [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

The absolute value of the pte_scan_scale prctl parameter further tunes
scan_period by 20%.
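
As a worked example of the helper added below (kscand_get_scaled_scan_period()):
delta = scan_period * (scale - 1) / (PR_PTE_A_SCAN_SCALE_MAX - 1), so for a
clamped scan_period of 2000, scale = 1 adds nothing, scale = 4 adds 666, and
scale = 10 adds 2000.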

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/mm/kscand.c b/mm/kscand.c
index 8aef6021c6ba..641150755517 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/cleanup.h>
 #include <linux/minmax.h>
+#include <linux/prctl.h>
 #include <trace/events/kmem.h>
 
 #include <asm/pgalloc.h>
@@ -1157,6 +1158,26 @@ static int kscand_mstat_scan_period(unsigned int scan_period, int fratio)
 	return scan_period * (1 + fratio / 10);
 }
 
+/*
+ * Scanning aggression is further controlled by prctl. pte_scan_scale value
+ * further tunes the scan period by 20%.
+ * 0 => scanning disabled.
+ * 1 => the current (clamped) scan period is retained.
+ * 2..10 => scale the scan period by 20% * scale factor.
+ */
+static unsigned long kscand_get_scaled_scan_period(unsigned int scan_period,
+								unsigned int scale)
+{
+	int delta = 0;
+
+	if (scale) {
+		delta = scan_period * (scale - 1);
+		delta /= (PR_PTE_A_SCAN_SCALE_MAX - 1);
+	}
+
+	return scan_period + delta;
+}
+
 /*
  * This is the normal change percentage when old and new delta remain same.
  * i.e., either both positive or both zero.
@@ -1201,7 +1222,7 @@ static int kscand_mstat_scan_period(unsigned int scan_period, int fratio)
  *		Increase scan_size by (1 << SCAN_SIZE_CHANGE_SHIFT).
  */
 static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
-				unsigned long total, int target_node)
+				unsigned long total, int target_node, unsigned int scale)
 {
 	int fratio;
 	unsigned int scan_period;
@@ -1243,6 +1264,7 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
 	}
 
 	scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+	scan_period = kscand_get_scaled_scan_period(scan_period, scale);
 	fratio = kmigrated_get_mstat_fratio((&mm_slot->slot)->mm);
 	scan_period = kscand_mstat_scan_period(scan_period, fratio);
 	scan_size = clamp(scan_size, KSCAND_SCAN_SIZE_MIN, KSCAND_SCAN_SIZE_MAX);
@@ -1384,7 +1406,8 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	if (update_mmslot_info) {
 		mm_slot->address = address;
-		kscand_update_mmslot_info(mm_slot, total, target_node);
+		kscand_update_mmslot_info(mm_slot, total,
+					target_node, mm->pte_scan_scale);
 	}
 
 	trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period,
-- 
2.34.1




* [RFC PATCH V3 17/17] mm: Create a list of fallback target nodes
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (15 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 16/17] prctl: Fine tune scan_period with prctl scale param Raghavendra K T
@ 2025-08-14 15:33 ` Raghavendra K T
  2025-08-21 15:24 ` [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-14 15:33 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Michael.Day, akpm, bharata,
	dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch,
	kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel,
	linux-mm, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka,
	weixugc, willy, ying.huang, ziy, Jonathan.Cameron, dave, yuanchu,
	kinseyho, hdanton, harry.yoo

These fallback target nodes are used as hints for migration when the current
target node is nearly full.

TBD: implement the actual migration to the fallback nodes.
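
A purely hypothetical sketch (not part of this series) of how the recorded
nodemask might later be consumed when picking a fallback target; the helper
name and the capacity check noted in the comment are assumptions:

/*
 * Hypothetical: pick a fallback promotion target from the per-mm
 * migration_nmask recorded by this patch, skipping the primary node.
 * A real implementation would also verify free capacity on the
 * candidate node (e.g. a watermark check) before selecting it.
 */
static int kmigrated_pick_fallback_node(struct kmigrated_mm_slot *mm_slot,
					int primary_node)
{
	int node;

	for_each_node_mask(node, mm_slot->migration_nmask) {
		if (node == primary_node)
			continue;
		return node;
	}

	return NUMA_NO_NODE;
}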

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kscand.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/kscand.c b/mm/kscand.c
index 641150755517..a88df9ac2eaa 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -136,6 +136,7 @@ struct kscand_scanctrl {
 	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
 	unsigned long address;
 	unsigned long nr_to_scan;
+	nodemask_t nmask;
 };
 
 struct kscand_scanctrl kscand_scanctrl;
@@ -148,6 +149,8 @@ struct kmigrated_mm_slot {
 	spinlock_t migrate_lock;
 	/* Head of per mm migration list */
 	struct list_head migrate_head;
+	/* Indicates set of fallback nodes to migrate. */
+	nodemask_t migration_nmask;
 	/* Indicates weighted success, failure */
 	int msuccess, mfailed, fratio;
 };
@@ -522,6 +525,7 @@ static void reset_scanctrl(struct kscand_scanctrl *scanctrl)
 {
 	int node;
 
+	nodes_clear(scanctrl->nmask);
 	for_each_node_state(node, N_MEMORY)
 		reset_nodeinfo(scanctrl->nodeinfo[node]);
 
@@ -547,9 +551,11 @@ static int get_target_node(struct kscand_scanctrl *scanctrl)
 	int node, target_node = NUMA_NO_NODE;
 	unsigned long prev = 0;
 
+	nodes_clear(scanctrl->nmask);
 	for_each_node(node) {
 		if (node_is_toptier(node) && scanctrl->nodeinfo[node]) {
 			/* This creates a fallback migration node list */
+			node_set(node, scanctrl->nmask);
 			if (get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]) > prev) {
 				prev = get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]);
 				target_node = node;
@@ -1396,6 +1402,9 @@ static unsigned long kscand_scan_mm_slot(void)
 
 	total = get_slowtier_accesed(&kscand_scanctrl);
 	target_node = get_target_node(&kscand_scanctrl);
+	if (kmigrated_mm_slot)
+		nodes_copy(kmigrated_mm_slot->migration_nmask,
+						kscand_scanctrl.nmask);
 
 	mm_target_node = READ_ONCE(mm->target_node);
 
-- 
2.34.1




* [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit
  2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (16 preceding siblings ...)
  2025-08-14 15:33 ` [RFC PATCH V3 17/17] mm: Create a list of fallback target nodes Raghavendra K T
@ 2025-08-21 15:24 ` Raghavendra K T
  17 siblings, 0 replies; 19+ messages in thread
From: Raghavendra K T @ 2025-08-21 15:24 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Jonathan.Cameron, Michael.Day, akpm,
	bharata, dave.hansen, dave, david, dongjoo.linux.dev, feng.tang,
	gourry, hannes, harry.yoo, hdanton, honggyu.kim, hughd, jhubbard,
	jon.grimm, k.shutemov, kbusch, kinseyho, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, linux-kernel, linux-mm,
	mgorman, mingo, nadav.amit, nphamcs, peterz, riel, rientjes, rppt,
	santosh.shukla, shivankg, shy828301, sj, vbabka, weixugc, willy,
	ying.huang, yuanchu, ziy

On 8/14/2025 9:02 PM, Raghavendra K T wrote:
> The current series has additional enhancements and comments' incorporation on top of
> RFC V2.
> 
> This is an additional source of hot page generator to NUMAB, IBS [4], KMGLRUD [5].
> 
> Introduction:
> =============
> In the current hot page promotion, all the activities including the
> process address space scanning, NUMA hint fault handling and page
> migration is performed in the process context. i.e., scanning overhead is
> borne by applications.
> 
> This RFC V2 patch series does slow-tier page promotion by using PTE Accessed
> bit scanning. Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading
> the PTE A bit.
> 
> A separate migration thread migrates/promotes the pages to the top-tier
> node based on a simple heuristic that uses top-tier scan/access information
> of the mm.
> 
> Additionally based on the feedback, a prctl knob with a scalar value is
> provided to control per task scanning.

Patches are also available in the branch:

https://github.com/AMDESE/linux-mm/tree/rkt/kscand_rfc_v3

[...]

> 
> 
> base-commit: 038d61fd642278bab63ee8ef722c50d10ab01e8f



Thread overview: 19+ messages
2025-08-14 15:32 [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 01/17] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 02/17] mm: Maintain mm_struct list in the system Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 03/17] mm: Scan the mm and create a migration list Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 04/17] mm/kscand: Add only hot pages to " Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 05/17] mm: Create a separate kthread for migration Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 06/17] mm/migration: migrate accessed folios to toptier node Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 07/17] mm: Add throttling of mm scanning using scan_period Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 08/17] mm: Add throttling of mm scanning using scan_size Raghavendra K T
2025-08-14 15:32 ` [RFC PATCH V3 09/17] mm: Add initial scan delay Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 11/17] mm/kscand: Implement migration failure feedback Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 12/17] sysfs: Add sysfs support to tune scanning Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 13/17] mm/vmstat: Add vmstat counters Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 14/17] trace/kscand: Add tracing of scanning and migration Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 15/17] prctl: Introduce new prctl to control scanning Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 16/17] prctl: Fine tune scan_period with prctl scale param Raghavendra K T
2025-08-14 15:33 ` [RFC PATCH V3 17/17] mm: Create a list of fallback target nodes Raghavendra K T
2025-08-21 15:24 ` [RFC PATCH V3 00/17] mm: slowtier page promotion based on PTE A bit Raghavendra K T
