* [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit
@ 2025-06-24 5:56 Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 01/13] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
` (12 more replies)
0 siblings, 13 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
The current series is:
a) RFC V1 [2] with review comments incorporated.
b) Design change to stabilize the solution.
c) Rebased to v6.15.
Introduction:
=============
In the current hot page promotion, all the activities, including
process address space scanning, NUMA hint fault handling and page
migration, are performed in the process context, i.e., the scanning
overhead is borne by applications.
This RFC V2 patch series does slow-tier page promotion by using PTE Accessed
bit scanning. Scanning is done by a global kernel thread which routinely
scans all the processes' address spaces and checks for accesses by reading
the PTE A bit.
A separate migration thread migrates/promotes the pages to the top-tier
node based on a simple heuristic that uses top-tier scan/access information
of the mm.
Additionally, based on the feedback, a prctl knob with a scalar value is
provided to control per-task scanning.
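As an illustration only (the prctl name, request number and value semantics
below are placeholders; the real interface is defined in patch 13 and may
differ), per-task control from userspace would look roughly like:

    /* Hypothetical sketch; PR_SET_MEM_SCAN is a placeholder, not the real name. */
    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEM_SCAN
    #define PR_SET_MEM_SCAN 0x4b534341      /* placeholder request number */
    #endif

    int main(void)
    {
            /* Assumed semantics: 0 disables scanning for the calling task,
             * larger scalar values make scanning more aggressive. */
            if (prctl(PR_SET_MEM_SCAN, 2, 0, 0, 0))
                    perror("prctl");
            return 0;
    }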
Changes since RFC V1:
=====================
- Addressed the review comments from Jonathan (thank you for the closer
reviews).
- Per-mm migration list with a separate lock to resolve the race
conditions/softlockups reported by Davidlohr.
- Added one more filter before migration for the LRU_GEN case to check whether
the folio is still hot.
- Renamed kmmscand ==> kscand and kmmmigrated ==> kmigrated (hopefully this
gets merged into Bharata's upcoming migration thread).
Changes since RFC V0:
======================
- A separate migration thread is used for migration, thus alleviating the need for
multi-threaded scanning (at least as per tracing).
- A simple heuristic for target node calculation is added.
- A prctl interface with a scalar value (suggested by David R) is added to control
per-task scanning.
- Steve's comment on tracing is incorporated.
- Bugfix for the issue reported by Davidlohr.
- Initial scan delay, similar to NUMAB1 mode, is added.
- Got rid of the migration lock during mm_walk.
What is not addressed yet:
===========================
- The patchset can use a PFN-based list instead of folios for migration.
This will be done in the next iteration or so when integrating with
the kpromoted [3] / enhanced kmigrated APIs.
- Still using the old target node heuristic that chooses only a single
target. Hillf had suggested promotion to next-tier nodes. We are not
there yet.
- Jonathan and Davidlohr had raised comments on the microbenchmark and on
migration based on first access. Now, folios are filtered by checking
whether they are still hot using lru_gen_is_active() (for the LRU_GEN case), but it
is still first access for the !LRU_GEN config.
- Davidlohr suggested using NUMAB2 along with scanning. This needs
more thought/implementation (without relying on NUMAB2
timestamps??).
A note on per mm migration list using mm_slot:
=============================================
Using a per-mm migration list (mm_slot) has helped to reduce contention
and thus eases mm teardown during process exit.
It also helps to tie the PFN/folio to its mm to make the heuristics work better,
and further it would help to throttle migration per mm (or process) (TBD).
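For reference, the per-mm migration list introduced later in the series
(patch 5) boils down to the following shape (condensed from mm/kscand.c in
that patch):

    /* Condensed from patch 5 (mm/kscand.c): one migration list per mm. */
    struct kmigrated_mm_slot {
            struct mm_slot mm_slot;         /* ties the list to its mm_struct */
            spinlock_t migrate_lock;        /* protects migrate_head */
            struct list_head migrate_head;  /* list of kscand_migrate_info entries */
    };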
A note on PTE A bit scanning:
============================
Major positive: the current patchset is able to cover the entire process address
space scanning effectively, with simple algorithms that tune scan_size and
scan_period.
Thanks to Jonathan and Davidlohr for the review feedback on RFC V1.
Results:
=======
The microbenchmark gave similar improvements (8%+) as in RFC V1, but more benchmarking
is TBD with redis, memtier, etc. (and perhaps tuning based on that).
The patchset is also available here
link: https://github.com/RaghavendraKT80/linux-mm/tree/kmmscand_rfc_v2
Links:
[1] RFC V0: https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[2] RFC V1: https://lore.kernel.org/linux-mm/20250319193028.29514-1-raghavendra.kt@amd.com/
[3] Kpromoted: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
Patch organization:
patches 1-4: initial skeleton for scanning and migration
patch 5: migration
patches 6-8: scanning optimizations
patch 9: target_node heuristic
patches 10-12: sysfs, vmstat and tracing
patch 13: a basic prctl implementation
Raghavendra K T (13):
mm: Add kscand kthread for PTE A bit scan
mm: Maintain mm_struct list in the system
mm: Scan the mm and create a migration list
mm: Create a separate kthread for migration
mm/migration: Migrate accessed folios to toptier node
mm: Add throttling of mm scanning using scan_period
mm: Add throttling of mm scanning using scan_size
mm: Add initial scan delay
mm: Add a heuristic to calculate target node
sysfs: Add sysfs support to tune scanning
vmstat: Add vmstat counters
trace/kscand: Add tracing of scanning and migration
prctl: Introduce new prctl to control scanning
Documentation/filesystems/proc.rst | 2 +
fs/exec.c | 4 +
fs/proc/task_mmu.c | 4 +
include/linux/kscand.h | 30 +
include/linux/migrate.h | 2 +
include/linux/mm.h | 11 +
include/linux/mm_types.h | 7 +
include/linux/vm_event_item.h | 10 +
include/trace/events/kmem.h | 90 ++
include/uapi/linux/prctl.h | 7 +
kernel/fork.c | 8 +
kernel/sys.c | 25 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/kscand.c | 1644 ++++++++++++++++++++++++++++
mm/migrate.c | 2 +-
mm/vmstat.c | 10 +
17 files changed, 1864 insertions(+), 1 deletion(-)
create mode 100644 include/linux/kscand.h
create mode 100644 mm/kscand.c
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.34.1
^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH V2 01/13] mm: Add kscand kthread for PTE A bit scan
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 02/13] mm: Maintain mm_struct list in the system Raghavendra K T
` (11 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Also add a config option (CONFIG_KSCAND) for it.
High level design:

    While (1):
        Scan the slowtier pages belonging to the VMAs of a task.
        Add them to a migration list.

    A separate thread:
        Migrate the scanned pages to a toptier node based on heuristics.

The overall code is influenced by the khugepaged design.
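A condensed C sketch of the two loops above, as they end up looking once the
later patches fill in the scan and migration work (error handling and
throttling omitted):

    /* Sketch only; mirrors the kscand/kmigrated split built up by this series. */
    static int kscand(void *none)           /* added in this patch */
    {
            while (!kthread_should_stop()) {
                    if (kscand_has_work())
                            kscand_do_scan();       /* VMA walk is added in patch 3 */
                    kscand_wait_work();             /* sleep kscand_scan_sleep_ms */
            }
            return 0;
    }

    static int kmigrated(void *arg)         /* added in patch 4 */
    {
            while (!kthread_should_stop()) {
                    kmigrated_migrate_folio();      /* promotion is added in patch 5 */
                    kmigrated_wait_work();
            }
            return 0;
    }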
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/Kconfig | 8 +++
mm/Makefile | 1 +
mm/kscand.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 172 insertions(+)
create mode 100644 mm/kscand.c
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..062fa6bdf3fb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -746,6 +746,14 @@ config KSM
until a program has madvised that an area is MADV_MERGEABLE, and
root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
+config KSCAND
+ bool "Enable PTE A bit scanning and Migration"
+ depends on NUMA_BALANCING
+ help
+ Enable PTE A bit scanning of pages. The option creates a separate
+ kthread for scanning and migration. Accessed slow-tier pages are
+ migrated to a regular NUMA node to reduce hot page access latency.
+
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..f5e02d6fb1bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
+obj-$(CONFIG_KSCAND) += kscand.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/kscand.c b/mm/kscand.c
new file mode 100644
index 000000000000..f7bbbc70c86a
--- /dev/null
+++ b/mm/kscand.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include <linux/cleanup.h>
+
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+static struct task_struct *kscand_thread __read_mostly;
+static DEFINE_MUTEX(kscand_mutex);
+
+/* How long to pause between two scan cycles */
+static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
+
+/* Max number of mms to scan in one scan cycle */
+#define KSCAND_MMS_TO_SCAN (4 * 1024UL)
+static unsigned long kscand_mms_to_scan __read_mostly = KSCAND_MMS_TO_SCAN;
+
+bool kscand_scan_enabled = true;
+static bool need_wakeup;
+
+static unsigned long kscand_sleep_expire;
+
+static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
+
+/* Data structure to keep track of current mm under scan */
+struct kscand_scan {
+ struct list_head mm_head;
+};
+
+struct kscand_scan kscand_scan = {
+ .mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
+};
+
+static inline int kscand_has_work(void)
+{
+ return !list_empty(&kscand_scan.mm_head);
+}
+
+static inline bool kscand_should_wakeup(void)
+{
+ bool wakeup = kthread_should_stop() || need_wakeup ||
+ time_after_eq(jiffies, kscand_sleep_expire);
+
+ need_wakeup = false;
+
+ return wakeup;
+}
+
+static void kscand_wait_work(void)
+{
+ const unsigned long scan_sleep_jiffies =
+ msecs_to_jiffies(kscand_scan_sleep_ms);
+
+ if (!scan_sleep_jiffies)
+ return;
+
+ kscand_sleep_expire = jiffies + scan_sleep_jiffies;
+
+ /* Allows kthread to pause scanning */
+ wait_event_timeout(kscand_wait, kscand_should_wakeup(),
+ scan_sleep_jiffies);
+}
+static void kscand_do_scan(void)
+{
+ unsigned long iter = 0, mms_to_scan;
+
+ mms_to_scan = READ_ONCE(kscand_mms_to_scan);
+
+ while (true) {
+ if (unlikely(kthread_should_stop()) ||
+ !READ_ONCE(kscand_scan_enabled))
+ break;
+
+ if (kscand_has_work())
+ msleep(100);
+
+ iter++;
+
+ if (iter >= mms_to_scan)
+ break;
+ cond_resched();
+ }
+}
+
+static int kscand(void *none)
+{
+ while (true) {
+ if (unlikely(kthread_should_stop()))
+ break;
+
+ while (!READ_ONCE(kscand_scan_enabled)) {
+ cpu_relax();
+ kscand_wait_work();
+ }
+
+ kscand_do_scan();
+
+ kscand_wait_work();
+ }
+ return 0;
+}
+
+static int start_kscand(void)
+{
+ struct task_struct *kthread;
+
+ guard(mutex)(&kscand_mutex);
+
+ if (kscand_thread)
+ return 0;
+
+ kthread = kthread_run(kscand, NULL, "kscand");
+ if (IS_ERR(kthread)) {
+ pr_err("kscand: kthread_run(kscand) failed\n");
+ return PTR_ERR(kthread);
+ }
+
+ kscand_thread = kthread;
+ pr_info("kscand: Successfully started kscand");
+
+ if (!list_empty(&kscand_scan.mm_head))
+ wake_up_interruptible(&kscand_wait);
+
+ return 0;
+}
+
+static int stop_kscand(void)
+{
+ guard(mutex)(&kscand_mutex);
+
+ if (kscand_thread) {
+ kthread_stop(kscand_thread);
+ kscand_thread = NULL;
+ }
+
+ return 0;
+}
+
+static int __init kscand_init(void)
+{
+ int err;
+
+ err = start_kscand();
+ if (err)
+ goto err_kscand;
+
+ return 0;
+
+err_kscand:
+ stop_kscand();
+
+ return err;
+}
+subsys_initcall(kscand_init);
--
2.34.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [RFC PATCH V2 02/13] mm: Maintain mm_struct list in the system
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 01/13] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list Raghavendra K T
` (10 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
The list is used to iterate over all the mms and do PTE A bit scanning.
The mm_slot infrastructure is reused to aid insertion and lookup of mm_structs.
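Condensed, the registration path added below is (see __kscand_enter() in the
diff; error handling trimmed):

    /* Condensed from __kscand_enter() below. */
    kscand_slot = mm_slot_alloc(kscand_slot_cache);
    if (!kscand_slot)
            return;
    slot = &kscand_slot->slot;

    spin_lock(&kscand_mm_lock);
    mm_slot_insert(kscand_slots_hash, mm, slot);            /* hash: mm -> slot */
    list_add_tail(&slot->mm_node, &kscand_scan.mm_head);    /* global scan list */
    spin_unlock(&kscand_mm_lock);
    mmgrab(mm);                                             /* pin mm for the scanner */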
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
fs/exec.c | 4 ++
include/linux/kscand.h | 30 +++++++++++++++
kernel/fork.c | 4 ++
mm/kscand.c | 86 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 124 insertions(+)
create mode 100644 include/linux/kscand.h
diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..e21c590bfdfc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
#include <linux/user_events.h>
#include <linux/rseq.h>
#include <linux/ksm.h>
+#include <linux/kscand.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -266,6 +267,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
if (err)
goto err_ksm;
+ kscand_execve(mm);
+
/*
* Place the stack at the largest stack address the architecture
* supports. Later, we'll move this to an appropriate place. We don't
@@ -288,6 +291,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
return 0;
err:
ksm_exit(mm);
+ kscand_exit(mm);
err_ksm:
mmap_write_unlock(mm);
err_free:
diff --git a/include/linux/kscand.h b/include/linux/kscand.h
new file mode 100644
index 000000000000..ef9947a33ee5
--- /dev/null
+++ b/include/linux/kscand.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KSCAND_H_
+#define _LINUX_KSCAND_H_
+
+#ifdef CONFIG_KSCAND
+extern void __kscand_enter(struct mm_struct *mm);
+extern void __kscand_exit(struct mm_struct *mm);
+
+static inline void kscand_execve(struct mm_struct *mm)
+{
+ __kscand_enter(mm);
+}
+
+static inline void kscand_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ __kscand_enter(mm);
+}
+
+static inline void kscand_exit(struct mm_struct *mm)
+{
+ __kscand_exit(mm);
+}
+#else /* !CONFIG_KSCAND */
+static inline void __kscand_enter(struct mm_struct *mm) {}
+static inline void __kscand_exit(struct mm_struct *mm) {}
+static inline void kscand_execve(struct mm_struct *mm) {}
+static inline void kscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) {}
+static inline void kscand_exit(struct mm_struct *mm) {}
+#endif
+#endif /* _LINUX_KSCAND_H_ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25..af6dd315b106 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -85,6 +85,7 @@
#include <linux/user-return-notifier.h>
#include <linux/oom.h>
#include <linux/khugepaged.h>
+#include <linux/kscand.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
#include <linux/aio.h>
@@ -630,6 +631,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mm->exec_vm = oldmm->exec_vm;
mm->stack_vm = oldmm->stack_vm;
+ kscand_fork(mm, oldmm);
+
/* Use __mt_dup() to efficiently build an identical maple tree. */
retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
if (unlikely(retval))
@@ -1377,6 +1380,7 @@ static inline void __mmput(struct mm_struct *mm)
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
+ kscand_exit(mm);
exit_mmap(mm);
mm_put_huge_zero_folio(mm);
set_mm_exe_file(mm, NULL);
diff --git a/mm/kscand.c b/mm/kscand.c
index f7bbbc70c86a..d5b0d3041b0f 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -7,12 +7,14 @@
#include <linux/swap.h>
#include <linux/mm_inline.h>
#include <linux/kthread.h>
+#include <linux/kscand.h>
#include <linux/string.h>
#include <linux/delay.h>
#include <linux/cleanup.h>
#include <asm/pgalloc.h>
#include "internal.h"
+#include "mm_slot.h"
static struct task_struct *kscand_thread __read_mostly;
static DEFINE_MUTEX(kscand_mutex);
@@ -29,11 +31,23 @@ static bool need_wakeup;
static unsigned long kscand_sleep_expire;
+static DEFINE_SPINLOCK(kscand_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
+#define KSCAND_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kscand_slots_hash, KSCAND_SLOT_HASH_BITS);
+
+static struct kmem_cache *kscand_slot_cache __read_mostly;
+
+/* Per mm information collected to control VMA scanning */
+struct kscand_mm_slot {
+ struct mm_slot slot;
+};
+
/* Data structure to keep track of current mm under scan */
struct kscand_scan {
struct list_head mm_head;
+ struct kscand_mm_slot *mm_slot;
};
struct kscand_scan kscand_scan = {
@@ -69,6 +83,12 @@ static void kscand_wait_work(void)
wait_event_timeout(kscand_wait, kscand_should_wakeup(),
scan_sleep_jiffies);
}
+
+static inline int kscand_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
static void kscand_do_scan(void)
{
unsigned long iter = 0, mms_to_scan;
@@ -109,6 +129,65 @@ static int kscand(void *none)
return 0;
}
+static inline void kscand_destroy(void)
+{
+ kmem_cache_destroy(kscand_slot_cache);
+}
+
+void __kscand_enter(struct mm_struct *mm)
+{
+ struct kscand_mm_slot *kscand_slot;
+ struct mm_slot *slot;
+ int wakeup;
+
+ /* __kscand_exit() must not run from under us */
+ VM_BUG_ON_MM(kscand_test_exit(mm), mm);
+
+ kscand_slot = mm_slot_alloc(kscand_slot_cache);
+
+ if (!kscand_slot)
+ return;
+
+ slot = &kscand_slot->slot;
+
+ spin_lock(&kscand_mm_lock);
+ mm_slot_insert(kscand_slots_hash, mm, slot);
+
+ wakeup = list_empty(&kscand_scan.mm_head);
+ list_add_tail(&slot->mm_node, &kscand_scan.mm_head);
+ spin_unlock(&kscand_mm_lock);
+
+ mmgrab(mm);
+ if (wakeup)
+ wake_up_interruptible(&kscand_wait);
+}
+
+void __kscand_exit(struct mm_struct *mm)
+{
+ struct kscand_mm_slot *mm_slot;
+ struct mm_slot *slot;
+ int free = 0;
+
+ spin_lock(&kscand_mm_lock);
+ slot = mm_slot_lookup(kscand_slots_hash, mm);
+ mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
+ if (mm_slot && kscand_scan.mm_slot != mm_slot) {
+ hash_del(&slot->hash);
+ list_del(&slot->mm_node);
+ free = 1;
+ }
+
+ spin_unlock(&kscand_mm_lock);
+
+ if (free) {
+ mm_slot_free(kscand_slot_cache, mm_slot);
+ mmdrop(mm);
+ } else if (mm_slot) {
+ mmap_write_lock(mm);
+ mmap_write_unlock(mm);
+ }
+}
+
static int start_kscand(void)
{
struct task_struct *kthread;
@@ -149,6 +228,12 @@ static int __init kscand_init(void)
{
int err;
+ kscand_slot_cache = KMEM_CACHE(kscand_mm_slot, 0);
+
+ if (!kscand_slot_cache) {
+ pr_err("kscand: kmem_cache error");
+ return -ENOMEM;
+ }
err = start_kscand();
if (err)
goto err_kscand;
@@ -157,6 +242,7 @@ static int __init kscand_init(void)
err_kscand:
stop_kscand();
+ kscand_destroy();
return err;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 01/13] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 02/13] mm: Maintain mm_struct list in the system Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-25 22:07 ` Harry Yoo
2025-06-24 5:56 ` [RFC PATCH V2 04/13] mm: Create a separate kthread for migration Raghavendra K T
` (9 subsequent siblings)
12 siblings, 1 reply; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Since we already have the list of mm_structs in the system, add the scanning
logic that walks the VMAs of each mm_struct and scans all the pages
associated with them.
In the scan path: check for recently accessed pages (folios) belonging
to slowtier nodes, and add all those folios to a list.
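Condensed, the per-PTE check added below treats a folio as recently accessed
(and hence a migration candidate) when any of the usual reference signals
fire; see hot_vma_idle_pte_entry() in the diff:

    /* Condensed from hot_vma_idle_pte_entry() below. */
    if (!folio_test_idle(folio) || folio_test_young(folio) ||
        mmu_notifier_test_young(mm, addr) ||
        folio_test_referenced(folio) || pte_young(pteval)) {
            if (kscand_eligible_srcnid(folio_nid(folio))) {     /* slowtier only */
                    info = kzalloc(sizeof(*info), GFP_NOWAIT);
                    if (info) {
                            info->folio = folio;
                            info->address = addr;
                            list_add_tail(&info->migrate_node,
                                          &scanctrl->scan_list);
                    }
            }
    }
    folio_set_idle(folio);      /* re-arm the idle state for the next scan pass */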
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 318 insertions(+), 1 deletion(-)
diff --git a/mm/kscand.c b/mm/kscand.c
index d5b0d3041b0f..0edec1b7730d 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -4,10 +4,18 @@
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/pagewalk.h>
+#include <linux/page_ext.h>
+#include <linux/page_idle.h>
+#include <linux/page_table_check.h>
+#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/mm_inline.h>
#include <linux/kthread.h>
#include <linux/kscand.h>
+#include <linux/memory-tiers.h>
+#include <linux/mempolicy.h>
#include <linux/string.h>
#include <linux/delay.h>
#include <linux/cleanup.h>
@@ -18,6 +26,11 @@
static struct task_struct *kscand_thread __read_mostly;
static DEFINE_MUTEX(kscand_mutex);
+/*
+ * Total VMA size to cover during scan.
+ */
+#define KSCAND_SCAN_SIZE (1 * 1024 * 1024 * 1024UL)
+static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
/* How long to pause between two scan cycles */
static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
@@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
/* Per mm information collected to control VMA scanning */
struct kscand_mm_slot {
struct mm_slot slot;
+ long address;
+ bool is_scanned;
};
/* Data structure to keep track of current mm under scan */
@@ -54,6 +69,29 @@ struct kscand_scan kscand_scan = {
.mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
};
+/*
+ * Data structure passed to control scanning and also collect
+ * per memory node information
+ */
+struct kscand_scanctrl {
+ struct list_head scan_list;
+ unsigned long address;
+};
+
+struct kscand_scanctrl kscand_scanctrl;
+/* Per folio information used for migration */
+struct kscand_migrate_info {
+ struct list_head migrate_node;
+ struct folio *folio;
+ unsigned long address;
+};
+
+static bool kscand_eligible_srcnid(int nid)
+{
+ /* Only promotion case is considered */
+ return !node_is_toptier(nid);
+}
+
static inline int kscand_has_work(void)
{
return !list_empty(&kscand_scan.mm_head);
@@ -84,11 +122,275 @@ static void kscand_wait_work(void)
scan_sleep_jiffies);
}
+static inline bool is_valid_folio(struct folio *folio)
+{
+ if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
+ folio_is_zone_device(folio) || folio_maybe_mapped_shared(folio))
+ return false;
+
+ return true;
+}
+
+
+static bool folio_idle_clear_pte_refs_one(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pte_t *ptep)
+{
+ bool referenced = false;
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd = pmd_off(mm, addr);
+
+ if (ptep) {
+ if (ptep_clear_young_notify(vma, addr, ptep))
+ referenced = true;
+ } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ if (!pmd_present(*pmd))
+ WARN_ON_ONCE(1);
+ if (pmdp_clear_young_notify(vma, addr, pmd))
+ referenced = true;
+ } else {
+ WARN_ON_ONCE(1);
+ }
+
+ if (referenced) {
+ folio_clear_idle(folio);
+ folio_set_young(folio);
+ }
+
+ return true;
+}
+
+static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
+{
+ bool need_lock;
+ struct folio *folio = page_folio(page);
+ unsigned long address;
+
+ if (!folio_mapped(folio) || !folio_raw_mapping(folio))
+ return;
+
+ need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
+ if (need_lock && !folio_trylock(folio))
+ return;
+ address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
+ VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
+ folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
+
+ if (need_lock)
+ folio_unlock(folio);
+}
+
+static int hot_vma_idle_pte_entry(pte_t *pte,
+ unsigned long addr,
+ unsigned long next,
+ struct mm_walk *walk)
+{
+ struct page *page;
+ struct folio *folio;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ struct kscand_migrate_info *info;
+ struct kscand_scanctrl *scanctrl = walk->private;
+ int srcnid;
+
+ scanctrl->address = addr;
+ pte_t pteval = ptep_get(pte);
+
+ if (!pte_present(pteval))
+ return 0;
+
+ if (pte_none(pteval))
+ return 0;
+
+ vma = walk->vma;
+ mm = vma->vm_mm;
+
+ page = pte_page(*pte);
+
+ page_idle_clear_pte_refs(page, pte, walk);
+
+ folio = page_folio(page);
+ folio_get(folio);
+
+ if (!is_valid_folio(folio)) {
+ folio_put(folio);
+ return 0;
+ }
+ srcnid = folio_nid(folio);
+
+
+ if (!folio_test_lru(folio)) {
+ folio_put(folio);
+ return 0;
+ }
+
+ if (!folio_test_idle(folio) || folio_test_young(folio) ||
+ mmu_notifier_test_young(mm, addr) ||
+ folio_test_referenced(folio) || pte_young(pteval)) {
+
+ if (!kscand_eligible_srcnid(srcnid)) {
+ folio_put(folio);
+ return 0;
+ }
+ /* XXX: Leaking memory. TBD: consume info */
+
+ info = kzalloc(sizeof(struct kscand_migrate_info), GFP_NOWAIT);
+ if (info && scanctrl) {
+ info->address = addr;
+ info->folio = folio;
+ list_add_tail(&info->migrate_node, &scanctrl->scan_list);
+ }
+ }
+
+ folio_set_idle(folio);
+ folio_put(folio);
+ return 0;
+}
+
+static const struct mm_walk_ops hot_vma_set_idle_ops = {
+ .pte_entry = hot_vma_idle_pte_entry,
+ .walk_lock = PGWALK_RDLOCK,
+};
+
+static void kscand_walk_page_vma(struct vm_area_struct *vma, struct kscand_scanctrl *scanctrl)
+{
+ if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+ is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+ return;
+ }
+ if (!vma->vm_mm ||
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ return;
+
+ if (!vma_is_accessible(vma))
+ return;
+
+ walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
+}
+
static inline int kscand_test_exit(struct mm_struct *mm)
{
return atomic_read(&mm->mm_users) == 0;
}
+static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
+{
+ struct mm_slot *slot = &mm_slot->slot;
+ struct mm_struct *mm = slot->mm;
+
+ lockdep_assert_held(&kscand_mm_lock);
+
+ if (kscand_test_exit(mm)) {
+ hash_del(&slot->hash);
+ list_del(&slot->mm_node);
+
+ mm_slot_free(kscand_slot_cache, mm_slot);
+ mmdrop(mm);
+ }
+}
+
+static unsigned long kscand_scan_mm_slot(void)
+{
+ bool next_mm = false;
+ bool update_mmslot_info = false;
+
+ unsigned long vma_scanned_size = 0;
+ unsigned long address;
+
+ struct mm_slot *slot;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma = NULL;
+ struct kscand_mm_slot *mm_slot;
+
+
+ spin_lock(&kscand_mm_lock);
+
+ if (kscand_scan.mm_slot) {
+ mm_slot = kscand_scan.mm_slot;
+ slot = &mm_slot->slot;
+ address = mm_slot->address;
+ } else {
+ slot = list_entry(kscand_scan.mm_head.next,
+ struct mm_slot, mm_node);
+ mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
+ address = mm_slot->address;
+ kscand_scan.mm_slot = mm_slot;
+ }
+
+ mm = slot->mm;
+ mm_slot->is_scanned = true;
+ spin_unlock(&kscand_mm_lock);
+
+ if (unlikely(!mmap_read_trylock(mm)))
+ goto outerloop_mmap_lock;
+
+ if (unlikely(kscand_test_exit(mm))) {
+ next_mm = true;
+ goto outerloop;
+ }
+
+ VMA_ITERATOR(vmi, mm, address);
+
+ for_each_vma(vmi, vma) {
+ kscand_walk_page_vma(vma, &kscand_scanctrl);
+ vma_scanned_size += vma->vm_end - vma->vm_start;
+
+ if (vma_scanned_size >= kscand_scan_size) {
+ next_mm = true;
+ /* TBD: Add scanned folios to migration list */
+ break;
+ }
+ }
+
+ if (!vma)
+ address = 0;
+ else
+ address = kscand_scanctrl.address + PAGE_SIZE;
+
+ update_mmslot_info = true;
+
+ if (update_mmslot_info)
+ mm_slot->address = address;
+
+outerloop:
+ /* exit_mmap will destroy ptes after this */
+ mmap_read_unlock(mm);
+
+outerloop_mmap_lock:
+ spin_lock(&kscand_mm_lock);
+ WARN_ON(kscand_scan.mm_slot != mm_slot);
+
+ /*
+ * Release the current mm_slot if this mm is about to die, or
+ * if we scanned all vmas of this mm.
+ */
+ if (unlikely(kscand_test_exit(mm)) || !vma || next_mm) {
+ /*
+ * Make sure that if mm_users is reaching zero while
+ * kscand runs here, kscand_exit will find
+ * mm_slot not pointing to the exiting mm.
+ */
+ if (slot->mm_node.next != &kscand_scan.mm_head) {
+ slot = list_entry(slot->mm_node.next,
+ struct mm_slot, mm_node);
+ kscand_scan.mm_slot =
+ mm_slot_entry(slot, struct kscand_mm_slot, slot);
+
+ } else
+ kscand_scan.mm_slot = NULL;
+
+ if (kscand_test_exit(mm)) {
+ kscand_collect_mm_slot(mm_slot);
+ goto end;
+ }
+ }
+ mm_slot->is_scanned = false;
+end:
+ spin_unlock(&kscand_mm_lock);
+ return 0;
+}
+
static void kscand_do_scan(void)
{
unsigned long iter = 0, mms_to_scan;
@@ -101,7 +403,7 @@ static void kscand_do_scan(void)
break;
if (kscand_has_work())
- msleep(100);
+ kscand_scan_mm_slot();
iter++;
@@ -148,6 +450,7 @@ void __kscand_enter(struct mm_struct *mm)
if (!kscand_slot)
return;
+ kscand_slot->address = 0;
slot = &kscand_slot->slot;
spin_lock(&kscand_mm_lock);
@@ -175,6 +478,12 @@ void __kscand_exit(struct mm_struct *mm)
hash_del(&slot->hash);
list_del(&slot->mm_node);
free = 1;
+ } else if (mm_slot && kscand_scan.mm_slot == mm_slot && !mm_slot->is_scanned) {
+ hash_del(&slot->hash);
+ list_del(&slot->mm_node);
+ free = 1;
+ /* TBD: Set the actual next slot */
+ kscand_scan.mm_slot = NULL;
}
spin_unlock(&kscand_mm_lock);
@@ -224,6 +533,12 @@ static int stop_kscand(void)
return 0;
}
+static inline void init_list(void)
+{
+ INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
+ init_waitqueue_head(&kscand_wait);
+}
+
static int __init kscand_init(void)
{
int err;
@@ -234,6 +549,8 @@ static int __init kscand_init(void)
pr_err("kscand: kmem_cache error");
return -ENOMEM;
}
+
+ init_list();
err = start_kscand();
if (err)
goto err_kscand;
--
2.34.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [RFC PATCH V2 04/13] mm: Create a separate kthread for migration
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (2 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 05/13] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
` (8 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Having an independent thread helps in:
- Alleviating the need for multiple scanning threads
- Controlling batch migration (TBD)
- Migration throttling (TBD)
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/mm/kscand.c b/mm/kscand.c
index 0edec1b7730d..e4d8452b0c50 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -4,6 +4,7 @@
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
#include <linux/rmap.h>
#include <linux/pagewalk.h>
#include <linux/page_ext.h>
@@ -41,6 +42,15 @@ static unsigned long kscand_mms_to_scan __read_mostly = KSCAND_MMS_TO_SCAN;
bool kscand_scan_enabled = true;
static bool need_wakeup;
+static bool migrated_need_wakeup;
+
+/* How long to pause between two migration cycles */
+static unsigned int kmigrate_sleep_ms __read_mostly = 20;
+
+static struct task_struct *kmigrated_thread __read_mostly;
+static DEFINE_MUTEX(kmigrated_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(kmigrated_wait);
+static unsigned long kmigrated_sleep_expire;
static unsigned long kscand_sleep_expire;
@@ -79,6 +89,7 @@ struct kscand_scanctrl {
};
struct kscand_scanctrl kscand_scanctrl;
+
/* Per folio information used for migration */
struct kscand_migrate_info {
struct list_head migrate_node;
@@ -131,6 +142,19 @@ static inline bool is_valid_folio(struct folio *folio)
return true;
}
+static inline void kmigrated_wait_work(void)
+{
+ const unsigned long migrate_sleep_jiffies =
+ msecs_to_jiffies(kmigrate_sleep_ms);
+
+ if (!migrate_sleep_jiffies)
+ return;
+
+ kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
+ wait_event_timeout(kmigrated_wait,
+ true,
+ migrate_sleep_jiffies);
+}
static bool folio_idle_clear_pte_refs_one(struct folio *folio,
struct vm_area_struct *vma,
@@ -533,6 +557,49 @@ static int stop_kscand(void)
return 0;
}
+static int kmigrated(void *arg)
+{
+ while (true) {
+ WRITE_ONCE(migrated_need_wakeup, false);
+ if (unlikely(kthread_should_stop()))
+ break;
+ msleep(20);
+ kmigrated_wait_work();
+ }
+ return 0;
+}
+
+static int start_kmigrated(void)
+{
+ struct task_struct *kthread;
+
+ guard(mutex)(&kmigrated_mutex);
+
+ /* Someone already succeeded in starting daemon */
+ if (kmigrated_thread)
+ return 0;
+
+ kthread = kthread_run(kmigrated, NULL, "kmigrated");
+ if (IS_ERR(kthread)) {
+ pr_err("kmigrated: kthread_run(kmigrated) failed\n");
+ return PTR_ERR(kthread);
+ }
+
+ kmigrated_thread = kthread;
+ pr_info("kmigrated: Successfully started kmigrated");
+
+ wake_up_interruptible(&kmigrated_wait);
+
+ return 0;
+}
+
+static int stop_kmigrated(void)
+{
+ guard(mutex)(&kmigrated_mutex);
+ if (kmigrated_thread) {
+ kthread_stop(kmigrated_thread);
+ kmigrated_thread = NULL;
+ }
+ return 0;
+}
+
static inline void init_list(void)
{
INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
@@ -555,8 +622,15 @@ static int __init kscand_init(void)
if (err)
goto err_kscand;
+ err = start_kmigrated();
+ if (err)
+ goto err_kmigrated;
+
return 0;
+err_kmigrated:
+ stop_kmigrated();
+
err_kscand:
stop_kscand();
kscand_destroy();
--
2.34.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [RFC PATCH V2 05/13] mm/migration: Migrate accessed folios to toptier node
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (3 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 04/13] mm: Create a separate kthread for migration Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 06/13] mm: Add throttling of mm scanning using scan_period Raghavendra K T
` (7 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
A per-mm migration list is added, and a kernel thread iterates over
each of them.
For each recently accessed slowtier folio in the migration list:
- Isolate it from the LRU.
- Migrate it to a regular node.
The rationale behind the whole migration is to speed up access to
recently accessed pages.
Currently, the PTE A bit scanning approach lacks information about the exact
destination node to migrate to.
Reason:
PROT_NONE hint fault based scanning is done in process context. There, when
the fault occurs, the source CPU of the faulting task is known, and the
time of the page access is also accurate.
Lacking the above information, migration is done to node 0 by default.
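Condensed, the per-folio promotion step below boils down to (see
kmigrated_promote_folio() in the diff):

    /* Condensed from kmigrated_promote_folio() below. */
    dest = kscand_get_target_node(NULL);                    /* node 0 for now */
    ret = kscand_migrate_misplaced_folio_prepare(folio, NULL, dest);
    if (ret)
            return KSCAND_LRU_ISOLATION_ERR;                /* could not isolate from LRU */
    return migrate_misplaced_folio(folio, dest);            /* NUMA balancing migrate path */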
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/migrate.h | 2 +
mm/kscand.c | 452 ++++++++++++++++++++++++++++++++++++++--
mm/migrate.c | 2 +-
3 files changed, 443 insertions(+), 13 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aaa2114498d6..59547f72d150 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -142,6 +142,8 @@ const struct movable_operations *page_movable_ops(struct page *page)
}
#ifdef CONFIG_NUMA_BALANCING
+bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+ unsigned long nr_migrate_pages);
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
diff --git a/mm/kscand.c b/mm/kscand.c
index e4d8452b0c50..f79e05590cde 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -52,9 +52,18 @@ static DEFINE_MUTEX(kmigrated_mutex);
static DECLARE_WAIT_QUEUE_HEAD(kmigrated_wait);
static unsigned long kmigrated_sleep_expire;
+/* mm of the migrating folio entry */
+static struct mm_struct *kmigrated_cur_mm;
+
+/* Migration list is manipulated underneath because of mm_exit */
+static bool kmigrated_clean_list;
+
static unsigned long kscand_sleep_expire;
+#define KSCAND_DEFAULT_TARGET_NODE (0)
+static int kscand_target_node = KSCAND_DEFAULT_TARGET_NODE;
static DEFINE_SPINLOCK(kscand_mm_lock);
+static DEFINE_SPINLOCK(kscand_migrate_lock);
static DECLARE_WAIT_QUEUE_HEAD(kscand_wait);
#define KSCAND_SLOT_HASH_BITS 10
@@ -62,6 +71,10 @@ static DEFINE_READ_MOSTLY_HASHTABLE(kscand_slots_hash, KSCAND_SLOT_HASH_BITS);
static struct kmem_cache *kscand_slot_cache __read_mostly;
+#define KMIGRATED_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kmigrated_slots_hash, KMIGRATED_SLOT_HASH_BITS);
+static struct kmem_cache *kmigrated_slot_cache __read_mostly;
+
/* Per mm information collected to control VMA scanning */
struct kscand_mm_slot {
struct mm_slot slot;
@@ -90,6 +103,26 @@ struct kscand_scanctrl {
struct kscand_scanctrl kscand_scanctrl;
+/* Per mm migration list */
+struct kmigrated_mm_slot {
+ /* Tracks mm that has non empty migration list */
+ struct mm_slot mm_slot;
+ /* Per mm lock used to synchronize migration list */
+ spinlock_t migrate_lock;
+ /* Head of per mm migration list */
+ struct list_head migrate_head;
+};
+
+/* System wide list of mms that maintain migration list */
+struct kmigrated_daemon {
+ struct list_head mm_head;
+ struct kmigrated_mm_slot *mm_slot;
+};
+
+struct kmigrated_daemon kmigrated_daemon = {
+ .mm_head = LIST_HEAD_INIT(kmigrated_daemon.mm_head),
+};
+
/* Per folio information used for migration */
struct kscand_migrate_info {
struct list_head migrate_node;
@@ -108,6 +141,11 @@ static inline int kscand_has_work(void)
return !list_empty(&kscand_scan.mm_head);
}
+static inline int kmigrated_has_work(void)
+{
+ return !list_empty(&kmigrated_daemon.mm_head);
+}
+
static inline bool kscand_should_wakeup(void)
{
bool wakeup = kthread_should_stop() || need_wakeup ||
@@ -118,6 +156,16 @@ static inline bool kscand_should_wakeup(void)
return wakeup;
}
+static inline bool kmigrated_should_wakeup(void)
+{
+ bool wakeup = kthread_should_stop() || migrated_need_wakeup ||
+ time_after_eq(jiffies, kmigrated_sleep_expire);
+
+ migrated_need_wakeup = false;
+
+ return wakeup;
+}
+
static void kscand_wait_work(void)
{
const unsigned long scan_sleep_jiffies =
@@ -133,6 +181,85 @@ static void kscand_wait_work(void)
scan_sleep_jiffies);
}
+static void kmigrated_wait_work(void)
+{
+ const unsigned long migrate_sleep_jiffies =
+ msecs_to_jiffies(kmigrate_sleep_ms);
+
+ if (!migrate_sleep_jiffies)
+ return;
+
+ kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
+ wait_event_timeout(kmigrated_wait, kmigrated_should_wakeup(),
+ migrate_sleep_jiffies);
+}
+
+/*
+ * Do not know what info to pass in the future to make
+ * decision on target node. Keep it void * now.
+ */
+static int kscand_get_target_node(void *data)
+{
+ return kscand_target_node;
+}
+
+extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+ unsigned long nr_migrate_pages);
+
+/*XXX: Taken from migrate.c to avoid NUMAB mode=2 and NULL vma checks*/
+static int kscand_migrate_misplaced_folio_prepare(struct folio *folio,
+ struct vm_area_struct *vma, int node)
+{
+ int nr_pages = folio_nr_pages(folio);
+ pg_data_t *pgdat = NODE_DATA(node);
+
+ if (folio_is_file_lru(folio)) {
+ /*
+ * Do not migrate file folios that are mapped in multiple
+ * processes with execute permissions as they are probably
+ * shared libraries.
+ *
+ * See folio_maybe_mapped_shared() on possible imprecision
+ * when we cannot easily detect if a folio is shared.
+ */
+ if (vma && (vma->vm_flags & VM_EXEC) &&
+ folio_maybe_mapped_shared(folio))
+ return -EACCES;
+ /*
+ * Do not migrate dirty folios as not all filesystems can move
+ * dirty folios in MIGRATE_ASYNC mode which is a waste of
+ * cycles.
+ */
+ if (folio_test_dirty(folio))
+ return -EAGAIN;
+ }
+
+ /* Avoid migrating to a node that is nearly full */
+ if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
+ int z;
+
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ if (managed_zone(pgdat->node_zones + z))
+ break;
+ }
+
+ if (z < 0)
+ return -EAGAIN;
+
+ wakeup_kswapd(pgdat->node_zones + z, 0,
+ folio_order(folio), ZONE_MOVABLE);
+ return -EAGAIN;
+ }
+
+ if (!folio_isolate_lru(folio))
+ return -EAGAIN;
+
+ node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
+ nr_pages);
+
+ return 0;
+}
+
static inline bool is_valid_folio(struct folio *folio)
{
if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
@@ -142,18 +269,116 @@ static inline bool is_valid_folio(struct folio *folio)
return true;
}
-static inline void kmigrated_wait_work(void)
+enum kscand_migration_err {
+ KSCAND_NULL_MM = 1,
+ KSCAND_EXITING_MM,
+ KSCAND_INVALID_FOLIO,
+ KSCAND_NONLRU_FOLIO,
+ KSCAND_INELIGIBLE_SRC_NODE,
+ KSCAND_SAME_SRC_DEST_NODE,
+ KSCAND_PTE_NOT_PRESENT,
+ KSCAND_PMD_NOT_PRESENT,
+ KSCAND_NO_PTE_OFFSET_MAP_LOCK,
+ KSCAND_NOT_HOT_PAGE,
+ KSCAND_LRU_ISOLATION_ERR,
+};
+
+static bool is_hot_page(struct folio *folio)
{
- const unsigned long migrate_sleep_jiffies =
- msecs_to_jiffies(kmigrate_sleep_ms);
+#ifdef CONFIG_LRU_GEN
+ struct lruvec *lruvec;
+ int gen = folio_lru_gen(folio);
+
+ lruvec = folio_lruvec(folio);
+ return lru_gen_is_active(lruvec, gen);
+#else
+ return folio_test_active(folio);
+#endif
+}
- if (!migrate_sleep_jiffies)
- return;
+static int kmigrated_promote_folio(struct kscand_migrate_info *info,
+ struct mm_struct *mm,
+ int destnid)
+{
+ unsigned long pfn;
+ unsigned long address;
+ struct page *page = NULL;
+ struct folio *folio;
+ int ret;
+ pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;
+ pmd_t pmde;
+ int srcnid;
- kmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
- wait_event_timeout(kmigrated_wait,
- true,
- migrate_sleep_jiffies);
+ if (mm == NULL)
+ return KSCAND_NULL_MM;
+
+ if (mm == READ_ONCE(kmigrated_cur_mm) &&
+ READ_ONCE(kmigrated_clean_list)) {
+ WARN_ON_ONCE(mm);
+ return KSCAND_EXITING_MM;
+ }
+
+ folio = info->folio;
+
+ /* Check again if the folio is really valid now */
+ if (folio) {
+ pfn = folio_pfn(folio);
+ page = pfn_to_online_page(pfn);
+ }
+
+ if (!page || PageTail(page) || !is_valid_folio(folio))
+ return KSCAND_INVALID_FOLIO;
+
+ if (!folio_test_lru(folio))
+ return KSCAND_NONLRU_FOLIO;
+
+ if (!is_hot_page(folio))
+ return KSCAND_NOT_HOT_PAGE;
+
+ folio_get(folio);
+
+ srcnid = folio_nid(folio);
+
+ /* Do not try to promote pages from regular nodes */
+ if (!kscand_eligible_srcnid(srcnid)) {
+ folio_put(folio);
+ return KSCAND_INELIGIBLE_SRC_NODE;
+ }
+
+ /* Also happen when it is already migrated */
+ if (srcnid == destnid) {
+ folio_put(folio);
+ return KSCAND_SAME_SRC_DEST_NODE;
+ }
+ address = info->address;
+ pmd = pmd_off(mm, address);
+ pmde = pmdp_get(pmd);
+
+ if (!pmd_present(pmde)) {
+ folio_put(folio);
+ return KSCAND_PMD_NOT_PRESENT;
+ }
+
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte) {
+ folio_put(folio);
+ WARN_ON_ONCE(!pte);
+ return KSCAND_NO_PTE_OFFSET_MAP_LOCK;
+ }
+
+ ret = kscand_migrate_misplaced_folio_prepare(folio, NULL, destnid);
+ if (ret) {
+ folio_put(folio);
+ pte_unmap_unlock(pte, ptl);
+ return KSCAND_LRU_ISOLATION_ERR;
+ }
+
+ folio_put(folio);
+ pte_unmap_unlock(pte, ptl);
+
+ return migrate_misplaced_folio(folio, destnid);
}
static bool folio_idle_clear_pte_refs_one(struct folio *folio,
@@ -257,7 +482,6 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
folio_put(folio);
return 0;
}
- /* XXX: Leaking memory. TBD: consume info */
info = kzalloc(sizeof(struct kscand_migrate_info), GFP_NOWAIT);
if (info && scanctrl) {
@@ -298,6 +522,115 @@ static inline int kscand_test_exit(struct mm_struct *mm)
return atomic_read(&mm->mm_users) == 0;
}
+struct destroy_list_work {
+ struct list_head migrate_head;
+ struct work_struct dwork;
+};
+
+static void kmigrated_destroy_list_fn(struct work_struct *work)
+{
+ struct destroy_list_work *dlw;
+ struct kscand_migrate_info *info, *tmp;
+
+ dlw = container_of(work, struct destroy_list_work, dwork);
+
+ if (!list_empty(&dlw->migrate_head)) {
+ list_for_each_entry_safe(info, tmp, &dlw->migrate_head, migrate_node) {
+ list_del(&info->migrate_node);
+ kfree(info);
+ }
+ }
+
+ kfree(dlw);
+}
+
+static void kmigrated_destroy_list(struct list_head *list_head)
+{
+ struct destroy_list_work *destroy_list_work;
+
+
+ destroy_list_work = kmalloc(sizeof(*destroy_list_work), GFP_KERNEL);
+ if (!destroy_list_work)
+ return;
+
+ INIT_LIST_HEAD(&destroy_list_work->migrate_head);
+ list_splice_tail_init(list_head, &destroy_list_work->migrate_head);
+ INIT_WORK(&destroy_list_work->dwork, kmigrated_destroy_list_fn);
+ schedule_work(&destroy_list_work->dwork);
+}
+
+static struct kmigrated_mm_slot *kmigrated_get_mm_slot(struct mm_struct *mm, bool alloc)
+{
+ struct kmigrated_mm_slot *mm_slot = NULL;
+ struct mm_slot *slot;
+
+ guard(spinlock)(&kscand_migrate_lock);
+
+ slot = mm_slot_lookup(kmigrated_slots_hash, mm);
+ mm_slot = mm_slot_entry(slot, struct kmigrated_mm_slot, mm_slot);
+
+ if (!mm_slot && alloc) {
+ mm_slot = mm_slot_alloc(kmigrated_slot_cache);
+ if (!mm_slot)
+ return NULL;
+
+ slot = &mm_slot->mm_slot;
+ INIT_LIST_HEAD(&mm_slot->migrate_head);
+ spin_lock_init(&mm_slot->migrate_lock);
+ mm_slot_insert(kmigrated_slots_hash, mm, slot);
+ list_add_tail(&slot->mm_node, &kmigrated_daemon.mm_head);
+ }
+
+ return mm_slot;
+}
+
+static void kscand_cleanup_migration_list(struct mm_struct *mm)
+{
+ struct kmigrated_mm_slot *mm_slot;
+ struct mm_slot *slot;
+
+ mm_slot = kmigrated_get_mm_slot(mm, false);
+ if (!mm_slot)
+ return;
+
+ slot = &mm_slot->mm_slot;
+
+ if (slot->mm == mm) {
+ spin_lock(&mm_slot->migrate_lock);
+
+ if (!list_empty(&mm_slot->migrate_head)) {
+ if (mm == READ_ONCE(kmigrated_cur_mm)) {
+ /* A folio in this mm is being migrated. wait */
+ WRITE_ONCE(kmigrated_clean_list, true);
+ }
+
+ kmigrated_destroy_list(&mm_slot->migrate_head);
+ spin_unlock(&mm_slot->migrate_lock);
+retry:
+ if (!spin_trylock(&mm_slot->migrate_lock)) {
+ cpu_relax();
+ goto retry;
+ }
+
+ if (mm == READ_ONCE(kmigrated_cur_mm)) {
+ spin_unlock(&mm_slot->migrate_lock);
+ goto retry;
+ }
+ }
+ /* Reset migrated mm_slot if it was pointing to us */
+ if (kmigrated_daemon.mm_slot == mm_slot)
+ kmigrated_daemon.mm_slot = NULL;
+
+ hash_del(&slot->hash);
+ list_del(&slot->mm_node);
+ mm_slot_free(kmigrated_slot_cache, mm_slot);
+
+ WRITE_ONCE(kmigrated_clean_list, false);
+
+ spin_unlock(&mm_slot->migrate_lock);
+ }
+}
+
static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
{
struct mm_slot *slot = &mm_slot->slot;
@@ -309,11 +642,77 @@ static void kscand_collect_mm_slot(struct kscand_mm_slot *mm_slot)
hash_del(&slot->hash);
list_del(&slot->mm_node);
+ kscand_cleanup_migration_list(mm);
+
mm_slot_free(kscand_slot_cache, mm_slot);
mmdrop(mm);
}
}
+static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
+{
+ int ret = 0, dest = -1;
+ struct mm_slot *slot;
+ struct mm_struct *mm;
+ struct kscand_migrate_info *info, *tmp;
+
+ spin_lock(&mm_slot->migrate_lock);
+
+ slot = &mm_slot->mm_slot;
+ mm = slot->mm;
+
+ if (!list_empty(&mm_slot->migrate_head)) {
+ list_for_each_entry_safe(info, tmp, &mm_slot->migrate_head,
+ migrate_node) {
+ if (READ_ONCE(kmigrated_clean_list))
+ goto clean_list_handled;
+
+ list_del(&info->migrate_node);
+
+ spin_unlock(&mm_slot->migrate_lock);
+
+ dest = kscand_get_target_node(NULL);
+ ret = kmigrated_promote_folio(info, mm, dest);
+
+ kfree(info);
+
+ cond_resched();
+ spin_lock(&mm_slot->migrate_lock);
+ }
+ }
+clean_list_handled:
+ /* Reset mm of folio entry we are migrating */
+ WRITE_ONCE(kmigrated_cur_mm, NULL);
+ spin_unlock(&mm_slot->migrate_lock);
+}
+
+static void kmigrated_migrate_folio(void)
+{
+ /* for each mm do migrate */
+ struct kmigrated_mm_slot *kmigrated_mm_slot = NULL;
+ struct mm_slot *slot;
+
+ if (!list_empty(&kmigrated_daemon.mm_head)) {
+
+ scoped_guard (spinlock, &kscand_migrate_lock) {
+ if (kmigrated_daemon.mm_slot) {
+ kmigrated_mm_slot = kmigrated_daemon.mm_slot;
+ } else {
+ slot = list_entry(kmigrated_daemon.mm_head.next,
+ struct mm_slot, mm_node);
+
+ kmigrated_mm_slot = mm_slot_entry(slot,
+ struct kmigrated_mm_slot, mm_slot);
+ kmigrated_daemon.mm_slot = kmigrated_mm_slot;
+ }
+ WRITE_ONCE(kmigrated_cur_mm, kmigrated_mm_slot->mm_slot.mm);
+ }
+
+ if (kmigrated_mm_slot)
+ kmigrated_migrate_mm(kmigrated_mm_slot);
+ }
+}
+
static unsigned long kscand_scan_mm_slot(void)
{
bool next_mm = false;
@@ -327,6 +726,7 @@ static unsigned long kscand_scan_mm_slot(void)
struct vm_area_struct *vma = NULL;
struct kscand_mm_slot *mm_slot;
+ struct kmigrated_mm_slot *kmigrated_mm_slot = NULL;
spin_lock(&kscand_mm_lock);
@@ -356,13 +756,23 @@ static unsigned long kscand_scan_mm_slot(void)
VMA_ITERATOR(vmi, mm, address);
+ kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
+
for_each_vma(vmi, vma) {
kscand_walk_page_vma(vma, &kscand_scanctrl);
vma_scanned_size += vma->vm_end - vma->vm_start;
if (vma_scanned_size >= kscand_scan_size) {
next_mm = true;
- /* TBD: Add scanned folios to migration list */
+
+ if (!list_empty(&kscand_scanctrl.scan_list)) {
+ if (!kmigrated_mm_slot)
+ kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+ spin_lock(&kmigrated_mm_slot->migrate_lock);
+ list_splice_tail_init(&kscand_scanctrl.scan_list,
+ &kmigrated_mm_slot->migrate_head);
+ spin_unlock(&kmigrated_mm_slot->migrate_lock);
+ }
break;
}
}
@@ -458,6 +868,8 @@ static int kscand(void *none)
static inline void kscand_destroy(void)
{
kmem_cache_destroy(kscand_slot_cache);
+ /* XXX: move below to kmigrated thread */
+ kmem_cache_destroy(kmigrated_slot_cache);
}
void __kscand_enter(struct mm_struct *mm)
@@ -493,7 +905,7 @@ void __kscand_exit(struct mm_struct *mm)
{
struct kscand_mm_slot *mm_slot;
struct mm_slot *slot;
- int free = 0;
+ int free = 0, serialize = 1;
spin_lock(&kscand_mm_lock);
slot = mm_slot_lookup(kscand_slots_hash, mm);
@@ -508,10 +920,15 @@ void __kscand_exit(struct mm_struct *mm)
free = 1;
/* TBD: Set the actual next slot */
kscand_scan.mm_slot = NULL;
+ } else if (mm_slot && kscand_scan.mm_slot == mm_slot && mm_slot->is_scanned) {
+ serialize = 0;
}
spin_unlock(&kscand_mm_lock);
+ if (serialize)
+ kscand_cleanup_migration_list(mm);
+
if (free) {
mm_slot_free(kscand_slot_cache, mm_slot);
mmdrop(mm);
@@ -563,6 +980,8 @@ static int kmigrated(void *arg)
WRITE_ONCE(migrated_need_wakeup, false);
if (unlikely(kthread_should_stop()))
break;
+ if (kmigrated_has_work())
+ kmigrated_migrate_folio();
msleep(20);
kmigrated_wait_work();
}
@@ -603,7 +1022,9 @@ static int stop_kmigrated(void)
static inline void init_list(void)
{
INIT_LIST_HEAD(&kscand_scanctrl.scan_list);
+ spin_lock_init(&kscand_migrate_lock);
init_waitqueue_head(&kscand_wait);
+ init_waitqueue_head(&kmigrated_wait);
}
static int __init kscand_init(void)
@@ -617,6 +1038,13 @@ static int __init kscand_init(void)
return -ENOMEM;
}
+ kmigrated_slot_cache = KMEM_CACHE(kmigrated_mm_slot, 0);
+
+ if (!kmigrated_slot_cache) {
+ pr_err("kmigrated: kmem_cache error");
+ return -ENOMEM;
+ }
+
init_list();
err = start_kscand();
if (err)
diff --git a/mm/migrate.c b/mm/migrate.c
index 676d9cfc7059..94448062d009 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2598,7 +2598,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
* Returns true if this is a safe migration target node for misplaced NUMA
* pages. Currently it only checks the watermarks which is crude.
*/
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+bool migrate_balanced_pgdat(struct pglist_data *pgdat,
unsigned long nr_migrate_pages)
{
int z;
--
2.34.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [RFC PATCH V2 06/13] mm: Add throttling of mm scanning using scan_period
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (4 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 05/13] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 07/13] mm: Add throttling of mm scanning using scan_size Raghavendra K T
` (6 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Before this patch, scanning of each task's mm is done continuously and
at the same rate.
Improve that by adding throttling logic:
1) If useful pages were found during both the last scan and the current scan,
decrease the scan_period (to increase the scan rate) by SCAN_PERIOD_TUNE_PERCENT (15%).
2) If no useful pages were found in the last scan, but there are
candidate migration pages in the current scan, decrease the scan_period
aggressively by a factor of 2^SCAN_PERIOD_CHANGE_SCALE (2^3 = 8 now).
The reverse cases increase the scan_period correspondingly.
The scan period is clamped between MIN (600ms) and MAX (5sec).
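Plugging the default 2s starting period into the rules above (before
clamping to the [600ms, 5000ms] range):

    /* Worked example, starting from scan_period = 2000ms:                        */
    /* useful pages in both scans:  2000 * (100 - 15) / 100 = 1700ms              */
    /* useful pages only now:       2000 >> 3 =   250ms -> clamped up to 600ms    */
    /* useful pages only last scan: 2000 << 3 = 16000ms -> clamped down to 5000ms */
    /* no useful pages in either:   2000 * (100 + 15) / 100 = 2300ms              */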
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 1 deletion(-)
diff --git a/mm/kscand.c b/mm/kscand.c
index f79e05590cde..fca4b7b4a81f 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -20,6 +20,7 @@
#include <linux/string.h>
#include <linux/delay.h>
#include <linux/cleanup.h>
+#include <linux/minmax.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -33,6 +34,16 @@ static DEFINE_MUTEX(kscand_mutex);
#define KSCAND_SCAN_SIZE (1 * 1024 * 1024 * 1024UL)
static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
+/*
+ * Scan period for each mm.
+ * Min: 600ms default: 2sec Max: 5sec
+ */
+#define KSCAND_SCAN_PERIOD_MAX 5000U
+#define KSCAND_SCAN_PERIOD_MIN 600U
+#define KSCAND_SCAN_PERIOD 2000U
+
+static unsigned int kscand_mm_scan_period_ms __read_mostly = KSCAND_SCAN_PERIOD;
+
/* How long to pause between two scan cycles */
static unsigned int kscand_scan_sleep_ms __read_mostly = 20;
@@ -78,6 +89,11 @@ static struct kmem_cache *kmigrated_slot_cache __read_mostly;
/* Per mm information collected to control VMA scanning */
struct kscand_mm_slot {
struct mm_slot slot;
+ /* Unit: ms. Determines how often the mm scan should happen. */
+ unsigned int scan_period;
+ unsigned long next_scan;
+ /* Tracks how many useful pages obtained for migration in the last scan */
+ unsigned long scan_delta;
long address;
bool is_scanned;
};
@@ -713,13 +729,92 @@ static void kmigrated_migrate_folio(void)
}
}
+/*
+ * This is the normal change percentage when old and new delta remain same.
+ * i.e., either both positive or both zero.
+ */
+#define SCAN_PERIOD_TUNE_PERCENT 15
+
+/* This is to change the scan_period aggressively when deltas are different */
+#define SCAN_PERIOD_CHANGE_SCALE 3
+/*
+ * XXX: Hack to prevent unmigrated pages from showing up again and again while
+ * scanning. The actual fix needs to identify the type of unmigrated pages OR
+ * account for migration failures in the next scan.
+ */
+#define KSCAND_IGNORE_SCAN_THR 256
+
+/* Maintains stability of scan_period by decaying the accessed-page count from the last scan */
+#define SCAN_DECAY_SHIFT 4
+/*
+ * X : Number of useful pages in the last scan.
+ * Y : Number of useful pages found in current scan.
+ * Tuning scan_period:
+ * Initial scan_period is 2s.
+ * case 1: (X = 0, Y = 0)
+ * Increase scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ * case 2: (X = 0, Y > 0)
+ * Decrease scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
+ * case 3: (X > 0, Y = 0)
+ * Increase scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
+ * case 4: (X > 0, Y > 0)
+ * Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ */
+static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
+ unsigned long total)
+{
+ unsigned int scan_period;
+ unsigned long now;
+ unsigned long old_scan_delta;
+
+ scan_period = mm_slot->scan_period;
+ old_scan_delta = mm_slot->scan_delta;
+
+ /* decay old value */
+ total = (old_scan_delta >> SCAN_DECAY_SHIFT) + total;
+
+ /* XXX: Hack to get rid of continuously failing/unmigrateable pages */
+ if (total < KSCAND_IGNORE_SCAN_THR)
+ total = 0;
+
+ /*
+ * case 1: old_scan_delta and new delta are similar, (slow) TUNE_PERCENT used.
+ * case 2: old_scan_delta and new delta are different. (fast) CHANGE_SCALE used.
+ * TBD:
+ * 1. Further tune scan_period based on delta between last and current scan delta.
+ * 2. Optimize calculation
+ */
+ if (!old_scan_delta && !total) {
+ scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+ scan_period /= 100;
+ } else if (old_scan_delta && total) {
+ scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+ scan_period /= 100;
+ } else if (old_scan_delta && !total) {
+ scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
+ } else {
+ scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+ }
+
+ scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+
+ now = jiffies;
+ mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
+ mm_slot->scan_period = scan_period;
+ mm_slot->scan_delta = total;
+}
+
static unsigned long kscand_scan_mm_slot(void)
{
bool next_mm = false;
bool update_mmslot_info = false;
+ unsigned int mm_slot_scan_period;
+ unsigned long now;
+ unsigned long mm_slot_next_scan;
unsigned long vma_scanned_size = 0;
unsigned long address;
+ unsigned long total = 0;
struct mm_slot *slot;
struct mm_struct *mm;
@@ -744,6 +839,8 @@ static unsigned long kscand_scan_mm_slot(void)
mm = slot->mm;
mm_slot->is_scanned = true;
+ mm_slot_next_scan = mm_slot->next_scan;
+ mm_slot_scan_period = mm_slot->scan_period;
spin_unlock(&kscand_mm_lock);
if (unlikely(!mmap_read_trylock(mm)))
@@ -754,6 +851,11 @@ static unsigned long kscand_scan_mm_slot(void)
goto outerloop;
}
+ now = jiffies;
+
+ if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
+ goto outerloop;
+
VMA_ITERATOR(vmi, mm, address);
kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
@@ -784,8 +886,10 @@ static unsigned long kscand_scan_mm_slot(void)
update_mmslot_info = true;
- if (update_mmslot_info)
+ if (update_mmslot_info) {
mm_slot->address = address;
+ kscand_update_mmslot_info(mm_slot, total);
+ }
outerloop:
/* exit_mmap will destroy ptes after this */
@@ -887,6 +991,10 @@ void __kscand_enter(struct mm_struct *mm)
return;
kscand_slot->address = 0;
+ kscand_slot->scan_period = kscand_mm_scan_period_ms;
+ kscand_slot->next_scan = 0;
+ kscand_slot->scan_delta = 0;
+
slot = &kscand_slot->slot;
spin_lock(&kscand_mm_lock);
--
2.34.1
* [RFC PATCH V2 07/13] mm: Add throttling of mm scanning using scan_size
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (5 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 06/13] mm: Add throttling of mm scanning using scan_period Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 08/13] mm: Add initial scan delay Raghavendra K T
` (5 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Before this patch, scanning covers the entire virtual address space of
all the tasks. Now the scan size is shrunk or expanded based on the
number of useful pages found in the last scan.
This helps to get out of unnecessary scanning quickly and thus burn
less CPU.
Drawback: If a useful chunk is at the other end of the VMA space, its
scanning and migration will be delayed.
Shrink/expand algorithm for scan_size (a small illustrative model follows
the list below):
X : Number of useful pages found in the last scan.
Y : Number of useful pages found in the current scan.
Initial scan_size is 1GB.
case 1: (X = 0, Y = 0)
Decrease scan_size by a factor of 2
case 2: (X = 0, Y > 0)
Aggressively change to MAX (4GB)
case 3: (X > 0, Y = 0)
No change
case 4: (X > 0, Y > 0)
Increase scan_size by a factor of 2
Scan size is clamped between MIN (256MB) and MAX (4GB).
TBD: Tuning based on real workloads
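For illustration, a minimal userspace model of the shrink/expand rule above
(not part of the patch; next_scan_size() and the main() driver are made-up
names, and the kernel applies this together with the scan_period update in
kscand_update_mmslot_info()):

#include <stdio.h>

#define SCAN_SIZE_CHANGE_SHIFT	1
#define KSCAND_SCAN_SIZE_MIN	(256ULL << 20)	/* 256MB */
#define KSCAND_SCAN_SIZE_MAX	(4ULL << 30)	/* 4GB */

/* prev/curr: useful pages found in the previous/current scan */
static unsigned long long next_scan_size(unsigned long long size,
					 unsigned long prev, unsigned long curr)
{
	if (!prev && !curr)		/* case 1: shrink */
		size >>= SCAN_SIZE_CHANGE_SHIFT;
	else if (prev && curr)		/* case 4: grow */
		size <<= SCAN_SIZE_CHANGE_SHIFT;
	else if (!prev && curr)		/* case 2: jump straight to MAX */
		size = KSCAND_SCAN_SIZE_MAX;
	/* case 3 (prev && !curr): no change */

	if (size < KSCAND_SCAN_SIZE_MIN)
		size = KSCAND_SCAN_SIZE_MIN;
	if (size > KSCAND_SCAN_SIZE_MAX)
		size = KSCAND_SCAN_SIZE_MAX;
	return size;
}

int main(void)
{
	unsigned long long sz = 1ULL << 30;	/* 1GB default */

	sz = next_scan_size(sz, 0, 0);		/* idle scan: 1GB -> 512MB */
	sz = next_scan_size(sz, 0, 0);		/* idle scan: 512MB -> 256MB */
	sz = next_scan_size(sz, 0, 128);	/* activity again: -> 4GB */
	printf("scan_size = %lluMB\n", sz >> 20);
	return 0;
}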
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/mm/kscand.c b/mm/kscand.c
index fca4b7b4a81f..26b40865d3e5 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -28,10 +28,15 @@
static struct task_struct *kscand_thread __read_mostly;
static DEFINE_MUTEX(kscand_mutex);
+
/*
* Total VMA size to cover during scan.
+ * Min: 256MB default: 1GB max: 4GB
*/
+#define KSCAND_SCAN_SIZE_MIN (256 * 1024 * 1024UL)
+#define KSCAND_SCAN_SIZE_MAX (4 * 1024 * 1024 * 1024UL)
#define KSCAND_SCAN_SIZE (1 * 1024 * 1024 * 1024UL)
+
static unsigned long kscand_scan_size __read_mostly = KSCAND_SCAN_SIZE;
/*
@@ -94,6 +99,8 @@ struct kscand_mm_slot {
unsigned long next_scan;
/* Tracks how many useful pages obtained for migration in the last scan */
unsigned long scan_delta;
+ /* Determines how much VMA address space to be covered in the scanning */
+ unsigned long scan_size;
long address;
bool is_scanned;
};
@@ -744,6 +751,8 @@ static void kmigrated_migrate_folio(void)
*/
#define KSCAND_IGNORE_SCAN_THR 256
+#define SCAN_SIZE_CHANGE_SHIFT 1
+
/* Maintains stability of scan_period by decaying last time accessed pages */
#define SCAN_DECAY_SHIFT 4
/*
@@ -759,14 +768,26 @@ static void kmigrated_migrate_folio(void)
 * Increase scan_period by a factor of (1 << SCAN_PERIOD_CHANGE_SCALE).
* case 4: (X > 0, Y > 0)
* Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ * Tuning scan_size:
+ * Initial scan_size is 1GB
+ * case 1: (X = 0, Y = 0)
+ * Decrease scan_size by a factor of (1 << SCAN_SIZE_CHANGE_SHIFT).
+ * case 2: (X = 0, Y > 0)
+ * scan_size = KSCAND_SCAN_SIZE_MAX
+ * case 3: (X > 0, Y = 0)
+ * No change
+ * case 4: (X > 0, Y > 0)
+ * Increase scan_size by a factor of (1 << SCAN_SIZE_CHANGE_SHIFT).
*/
static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
unsigned long total)
{
unsigned int scan_period;
unsigned long now;
+ unsigned long scan_size;
unsigned long old_scan_delta;
+ scan_size = mm_slot->scan_size;
scan_period = mm_slot->scan_period;
old_scan_delta = mm_slot->scan_delta;
@@ -787,20 +808,25 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
if (!old_scan_delta && !total) {
scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
scan_period /= 100;
+ scan_size = scan_size >> SCAN_SIZE_CHANGE_SHIFT;
} else if (old_scan_delta && total) {
scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
scan_period /= 100;
+ scan_size = scan_size << SCAN_SIZE_CHANGE_SHIFT;
} else if (old_scan_delta && !total) {
scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
} else {
scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+ scan_size = KSCAND_SCAN_SIZE_MAX;
}
scan_period = clamp(scan_period, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+ scan_size = clamp(scan_size, KSCAND_SCAN_SIZE_MIN, KSCAND_SCAN_SIZE_MAX);
now = jiffies;
mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
mm_slot->scan_period = scan_period;
+ mm_slot->scan_size = scan_size;
mm_slot->scan_delta = total;
}
@@ -812,6 +838,7 @@ static unsigned long kscand_scan_mm_slot(void)
unsigned int mm_slot_scan_period;
unsigned long now;
unsigned long mm_slot_next_scan;
+ unsigned long mm_slot_scan_size;
unsigned long vma_scanned_size = 0;
unsigned long address;
unsigned long total = 0;
@@ -841,6 +868,7 @@ static unsigned long kscand_scan_mm_slot(void)
mm_slot->is_scanned = true;
mm_slot_next_scan = mm_slot->next_scan;
mm_slot_scan_period = mm_slot->scan_period;
+ mm_slot_scan_size = mm_slot->scan_size;
spin_unlock(&kscand_mm_lock);
if (unlikely(!mmap_read_trylock(mm)))
@@ -992,6 +1020,7 @@ void __kscand_enter(struct mm_struct *mm)
kscand_slot->address = 0;
kscand_slot->scan_period = kscand_mm_scan_period_ms;
+ kscand_slot->scan_size = kscand_scan_size;
kscand_slot->next_scan = 0;
kscand_slot->scan_delta = 0;
--
2.34.1
* [RFC PATCH V2 08/13] mm: Add initial scan delay
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (6 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 07/13] mm: Add throttling of mm scanning using scan_size Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 09/13] mm: Add a heuristic to calculate target node Raghavendra K T
` (4 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Add an initial scan delay to avoid unnecessary scanning of short-lived
tasks and thereby reduce CPU overhead.
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/kscand.c b/mm/kscand.c
index 26b40865d3e5..8fbe70faea4e 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -28,6 +28,7 @@
static struct task_struct *kscand_thread __read_mostly;
static DEFINE_MUTEX(kscand_mutex);
+extern unsigned int sysctl_numa_balancing_scan_delay;
/*
* Total VMA size to cover during scan.
@@ -1008,6 +1009,7 @@ void __kscand_enter(struct mm_struct *mm)
{
struct kscand_mm_slot *kscand_slot;
struct mm_slot *slot;
+ unsigned long now;
int wakeup;
/* __kscand_exit() must not run from under us */
@@ -1018,10 +1020,12 @@ void __kscand_enter(struct mm_struct *mm)
if (!kscand_slot)
return;
+ now = jiffies;
kscand_slot->address = 0;
kscand_slot->scan_period = kscand_mm_scan_period_ms;
kscand_slot->scan_size = kscand_scan_size;
- kscand_slot->next_scan = 0;
+ kscand_slot->next_scan = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
kscand_slot->scan_delta = 0;
slot = &kscand_slot->slot;
--
2.34.1
* [RFC PATCH V2 09/13] mm: Add a heuristic to calculate target node
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (7 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 08/13] mm: Add initial scan delay Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 10/13] sysfs: Add sysfs support to tune scanning Raghavendra K T
` (3 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
One of the key challenges in PTE A bit based scanning is to find the
right target node to promote to.
Here is a simple heuristic-based approach (a small illustrative sketch
follows the list below):
1. While scanning pages of any mm, also scan the top-tier pages that belong
to that mm.
2. Accumulate information about the distribution of active pages across the
top-tier nodes.
3. Walk all the top-tier nodes and pick the one with the highest access count.
This method tries to consolidate the application onto a single node.
TBD: Create a list of preferred nodes as a fallback for when the
highest-access node is nearly full.
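A minimal sketch of step 3 above (a userspace model, not the kernel code).
In the patch this corresponds to get_target_node(), which compares the
per-node counters collected in struct kscand_nodeinfo during the scan and
falls back to the sysfs default when nothing was observed on a top-tier
node; the array parameters and names below are made up for the example.

#include <stdio.h>

#define NUMA_NO_NODE	(-1)

static int pick_target_node(int nr_nodes, const int is_toptier[],
			    const unsigned long nr_pages[], int fallback)
{
	unsigned long best = 0;
	int node, target = NUMA_NO_NODE;

	for (node = 0; node < nr_nodes; node++) {
		if (is_toptier[node] && nr_pages[node] > best) {
			best = nr_pages[node];
			target = node;
		}
	}
	/* nothing seen on any top-tier node: use the default target node */
	return target == NUMA_NO_NODE ? fallback : target;
}

int main(void)
{
	int is_toptier[4] = { 1, 1, 0, 0 };		/* nodes 2,3: slow tier */
	unsigned long pages[4] = { 40, 180, 900, 20 };	/* this mm's pages seen per node */

	/* node 1 wins; node 2's count is ignored because it is not top tier */
	printf("target node = %d\n", pick_target_node(4, is_toptier, pages, 0));
	return 0;
}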
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/mm_types.h | 4 +
mm/kscand.c | 186 +++++++++++++++++++++++++++++++++++++--
2 files changed, 181 insertions(+), 9 deletions(-)
TBD: Also maintain a nodemask instead of a single target node to handle
failed migrations?
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..571be1ad12ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1109,6 +1109,10 @@ struct mm_struct {
/* numa_scan_seq prevents two threads remapping PTEs. */
int numa_scan_seq;
#endif
+#ifdef CONFIG_KSCAND
+ /* Tracks promotion node. XXX: use nodemask */
+ int target_node;
+ #endif
/*
* An operation with batched TLB flushing is going on. Anything
* that can move process memory needs to flush the TLB when
diff --git a/mm/kscand.c b/mm/kscand.c
index 8fbe70faea4e..2996aaad65d6 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -104,6 +104,7 @@ struct kscand_mm_slot {
unsigned long scan_size;
long address;
bool is_scanned;
+ int target_node;
};
/* Data structure to keep track of current mm under scan */
@@ -116,13 +117,23 @@ struct kscand_scan kscand_scan = {
.mm_head = LIST_HEAD_INIT(kscand_scan.mm_head),
};
+/* Per memory node information used to calculate target_node for migration */
+struct kscand_nodeinfo {
+ unsigned long nr_scanned;
+ unsigned long nr_accessed;
+ int node;
+ bool is_toptier;
+};
+
/*
* Data structure passed to control scanning and also collect
* per memory node information
*/
struct kscand_scanctrl {
struct list_head scan_list;
+ struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
unsigned long address;
+ unsigned long nr_to_scan;
};
struct kscand_scanctrl kscand_scanctrl;
@@ -218,15 +229,121 @@ static void kmigrated_wait_work(void)
migrate_sleep_jiffies);
}
-/*
- * Do not know what info to pass in the future to make
- * decision on taget node. Keep it void * now.
- */
+static unsigned long get_slowtier_accesed(struct kscand_scanctrl *scanctrl)
+{
+ int node;
+ unsigned long accessed = 0;
+
+ for_each_node_state(node, N_MEMORY) {
+ if (!node_is_toptier(node) && scanctrl->nodeinfo[node])
+ accessed += scanctrl->nodeinfo[node]->nr_accessed;
+ }
+ return accessed;
+}
+
+static inline void set_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni, unsigned long val)
+{
+ ni->nr_accessed = val;
+}
+static inline unsigned long get_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
+{
+ return ni->nr_scanned;
+}
+
+static inline void set_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni, unsigned long val)
+{
+ ni->nr_scanned = val;
+}
+
+static inline void reset_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
+{
+ set_nodeinfo_nr_scanned(ni, 0);
+}
+
+static inline void reset_nodeinfo(struct kscand_nodeinfo *ni)
+{
+ set_nodeinfo_nr_scanned(ni, 0);
+ set_nodeinfo_nr_accessed(ni, 0);
+}
+
+static void init_one_nodeinfo(struct kscand_nodeinfo *ni, int node)
+{
+ ni->nr_scanned = 0;
+ ni->nr_accessed = 0;
+ ni->node = node;
+ ni->is_toptier = node_is_toptier(node) ? true : false;
+}
+
+static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
+{
+ struct kscand_nodeinfo *ni;
+
+ ni = kzalloc(sizeof(*ni), GFP_KERNEL);
+
+ if (!ni)
+ return NULL;
+
+ init_one_nodeinfo(ni, node);
+
+ return ni;
+}
+
+/* TBD: Handle errors */
+static void init_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+ struct kscand_nodeinfo *ni;
+ int node;
+
+ for_each_node(node) {
+ ni = alloc_one_nodeinfo(node);
+ if (!ni)
+ WARN_ON_ONCE(1);
+ scanctrl->nodeinfo[node] = ni;
+ }
+}
+
+static void reset_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+ int node;
+
+ for_each_node_state(node, N_MEMORY)
+ reset_nodeinfo(scanctrl->nodeinfo[node]);
+
+ /* XXX: Not really required? */
+ scanctrl->nr_to_scan = kscand_scan_size;
+}
+
+static void free_scanctrl(struct kscand_scanctrl *scanctrl)
+{
+ int node;
+
+ for_each_node(node)
+ kfree(scanctrl->nodeinfo[node]);
+}
+
static int kscand_get_target_node(void *data)
{
return kscand_target_node;
}
+static int get_target_node(struct kscand_scanctrl *scanctrl)
+{
+ int node, target_node = NUMA_NO_NODE;
+ unsigned long prev = 0;
+
+ for_each_node(node) {
+ if (node_is_toptier(node) && scanctrl->nodeinfo[node] &&
+ get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]) > prev) {
+ prev = get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]);
+ target_node = node;
+ }
+ }
+ if (target_node == NUMA_NO_NODE)
+ target_node = kscand_get_target_node(NULL);
+
+ return target_node;
+}
+
extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
unsigned long nr_migrate_pages);
@@ -492,6 +609,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
}
srcnid = folio_nid(folio);
+ scanctrl->nodeinfo[srcnid]->nr_scanned++;
+ if (scanctrl->nr_to_scan)
+ scanctrl->nr_to_scan--;
+
+ if (!scanctrl->nr_to_scan) {
+ folio_put(folio);
+ return 1;
+ }
if (!folio_test_lru(folio)) {
folio_put(folio);
@@ -502,6 +627,8 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
mmu_notifier_test_young(mm, addr) ||
folio_test_referenced(folio) || pte_young(pteval)) {
+ scanctrl->nodeinfo[srcnid]->nr_accessed++;
+
if (!kscand_eligible_srcnid(srcnid)) {
folio_put(folio);
return 0;
@@ -695,7 +822,13 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
spin_unlock(&mm_slot->migrate_lock);
- dest = kscand_get_target_node(NULL);
+ if (!mmap_read_trylock(mm)) {
+ dest = kscand_get_target_node(NULL);
+ } else {
+ dest = READ_ONCE(mm->target_node);
+ mmap_read_unlock(mm);
+ }
+
ret = kmigrated_promote_folio(info, mm, dest);
kfree(info);
@@ -781,7 +914,7 @@ static void kmigrated_migrate_folio(void)
* Increase scan_size by (1 << SCAN_SIZE_CHANGE_SHIFT).
*/
static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
- unsigned long total)
+ unsigned long total, int target_node)
{
unsigned int scan_period;
unsigned long now;
@@ -829,6 +962,7 @@ static inline void kscand_update_mmslot_info(struct kscand_mm_slot *mm_slot,
mm_slot->scan_period = scan_period;
mm_slot->scan_size = scan_size;
mm_slot->scan_delta = total;
+ mm_slot->target_node = target_node;
}
static unsigned long kscand_scan_mm_slot(void)
@@ -837,6 +971,7 @@ static unsigned long kscand_scan_mm_slot(void)
bool update_mmslot_info = false;
unsigned int mm_slot_scan_period;
+ int target_node, mm_slot_target_node, mm_target_node;
unsigned long now;
unsigned long mm_slot_next_scan;
unsigned long mm_slot_scan_size;
@@ -870,6 +1005,7 @@ static unsigned long kscand_scan_mm_slot(void)
mm_slot_next_scan = mm_slot->next_scan;
mm_slot_scan_period = mm_slot->scan_period;
mm_slot_scan_size = mm_slot->scan_size;
+ mm_slot_target_node = mm_slot->target_node;
spin_unlock(&kscand_mm_lock);
if (unlikely(!mmap_read_trylock(mm)))
@@ -880,6 +1016,9 @@ static unsigned long kscand_scan_mm_slot(void)
goto outerloop;
}
+ mm_target_node = READ_ONCE(mm->target_node);
+ if (mm_target_node != mm_slot_target_node)
+ WRITE_ONCE(mm->target_node, mm_slot_target_node);
now = jiffies;
if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
@@ -887,24 +1026,41 @@ static unsigned long kscand_scan_mm_slot(void)
VMA_ITERATOR(vmi, mm, address);
+ /* Either Scan 25% of scan_size or cover vma size of scan_size */
+ kscand_scanctrl.nr_to_scan = mm_slot_scan_size >> PAGE_SHIFT;
+ /* Reduce actual amount of pages scanned */
+ kscand_scanctrl.nr_to_scan = mm_slot_scan_size >> 1;
+
+ /* XXX: skip scanning to avoid duplicates until all migrations done? */
kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
for_each_vma(vmi, vma) {
kscand_walk_page_vma(vma, &kscand_scanctrl);
vma_scanned_size += vma->vm_end - vma->vm_start;
- if (vma_scanned_size >= kscand_scan_size) {
+ if (vma_scanned_size >= mm_slot_scan_size ||
+ !kscand_scanctrl.nr_to_scan) {
next_mm = true;
if (!list_empty(&kscand_scanctrl.scan_list)) {
if (!kmigrated_mm_slot)
kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+ /* Add scanned folios to migration list */
spin_lock(&kmigrated_mm_slot->migrate_lock);
+
list_splice_tail_init(&kscand_scanctrl.scan_list,
&kmigrated_mm_slot->migrate_head);
spin_unlock(&kmigrated_mm_slot->migrate_lock);
+ break;
}
- break;
+ }
+ if (!list_empty(&kscand_scanctrl.scan_list)) {
+ if (!kmigrated_mm_slot)
+ kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
+ spin_lock(&kmigrated_mm_slot->migrate_lock);
+ list_splice_tail_init(&kscand_scanctrl.scan_list,
+ &kmigrated_mm_slot->migrate_head);
+ spin_unlock(&kmigrated_mm_slot->migrate_lock);
}
}
@@ -915,9 +1071,19 @@ static unsigned long kscand_scan_mm_slot(void)
update_mmslot_info = true;
+ total = get_slowtier_accesed(&kscand_scanctrl);
+ target_node = get_target_node(&kscand_scanctrl);
+
+ mm_target_node = READ_ONCE(mm->target_node);
+
+ /* XXX: Do we need write lock? */
+ if (mm_target_node != target_node)
+ WRITE_ONCE(mm->target_node, target_node);
+ reset_scanctrl(&kscand_scanctrl);
+
if (update_mmslot_info) {
mm_slot->address = address;
- kscand_update_mmslot_info(mm_slot, total);
+ kscand_update_mmslot_info(mm_slot, total, target_node);
}
outerloop:
@@ -1111,6 +1277,7 @@ static int stop_kscand(void)
kthread_stop(kscand_thread);
kscand_thread = NULL;
}
+ free_scanctrl(&kscand_scanctrl);
return 0;
}
@@ -1166,6 +1333,7 @@ static inline void init_list(void)
spin_lock_init(&kscand_migrate_lock);
init_waitqueue_head(&kscand_wait);
init_waitqueue_head(&kmigrated_wait);
+ init_scanctrl(&kscand_scanctrl);
}
static int __init kscand_init(void)
--
2.34.1
* [RFC PATCH V2 10/13] sysfs: Add sysfs support to tune scanning
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (8 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 09/13] mm: Add a heuristic to calculate target node Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 11/13] vmstat: Add vmstat counters Raghavendra K T
` (2 subsequent siblings)
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Support the below tunables (an example of setting them from userspace
follows below):
scan_enabled: turn mm_struct scanning on or off
mm_scan_period_ms: initial scan_period (default: 2sec)
scan_sleep_ms: sleep time between two successive rounds of scanning and
migration.
mms_to_scan: total number of mm_structs to scan before taking a pause.
target_node: default node to which accessed pages are migrated (this is
only a fallback mechanism; otherwise the target_node heuristic
is used).
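For example, since the attribute group is registered on mm_kobj with
.name = "kscand", the knobs should show up under /sys/kernel/mm/kscand/
and can be poked from userspace roughly like below (illustrative sketch,
needs root; write_tunable() is a made-up helper):

#include <stdio.h>

static int write_tunable(const char *name, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/kscand/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	write_tunable("scan_enabled", "1");		/* turn scanning on */
	write_tunable("mm_scan_period_ms", "1000");	/* initial per-mm scan period */
	write_tunable("target_node", "0");		/* fallback promotion node */
	return 0;
}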
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/kscand.c | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 205 insertions(+)
diff --git a/mm/kscand.c b/mm/kscand.c
index 2996aaad65d6..abffcb868447 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -21,6 +21,7 @@
#include <linux/delay.h>
#include <linux/cleanup.h>
#include <linux/minmax.h>
+#include <trace/events/kmem.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -171,6 +172,171 @@ static bool kscand_eligible_srcnid(int nid)
return !node_is_toptier(nid);
}
+#ifdef CONFIG_SYSFS
+static ssize_t scan_sleep_ms_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", kscand_scan_sleep_ms);
+}
+
+static ssize_t scan_sleep_ms_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int msecs;
+ int err;
+
+ err = kstrtouint(buf, 10, &msecs);
+ if (err)
+ return -EINVAL;
+
+ kscand_scan_sleep_ms = msecs;
+ kscand_sleep_expire = 0;
+ wake_up_interruptible(&kscand_wait);
+
+ return count;
+}
+
+static struct kobj_attribute scan_sleep_ms_attr =
+ __ATTR_RW(scan_sleep_ms);
+
+static ssize_t mm_scan_period_ms_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", kscand_mm_scan_period_ms);
+}
+
+/* If a value below MIN or above MAX is requested, the stored value is clamped */
+static ssize_t mm_scan_period_ms_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int msecs, stored_msecs;
+ int err;
+
+ err = kstrtouint(buf, 10, &msecs);
+ if (err)
+ return -EINVAL;
+
+ stored_msecs = clamp(msecs, KSCAND_SCAN_PERIOD_MIN, KSCAND_SCAN_PERIOD_MAX);
+
+ kscand_mm_scan_period_ms = stored_msecs;
+ kscand_sleep_expire = 0;
+ wake_up_interruptible(&kscand_wait);
+
+ return count;
+}
+
+static struct kobj_attribute mm_scan_period_ms_attr =
+ __ATTR_RW(mm_scan_period_ms);
+
+static ssize_t mms_to_scan_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%lu\n", kscand_mms_to_scan);
+}
+
+static ssize_t mms_to_scan_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long val;
+ int err;
+
+ err = kstrtoul(buf, 10, &val);
+ if (err)
+ return -EINVAL;
+
+ kscand_mms_to_scan = val;
+ kscand_sleep_expire = 0;
+ wake_up_interruptible(&kscand_wait);
+
+ return count;
+}
+
+static struct kobj_attribute mms_to_scan_attr =
+ __ATTR_RW(mms_to_scan);
+
+static ssize_t scan_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", kscand_scan_enabled ? 1 : 0);
+}
+
+static ssize_t scan_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int val;
+ int err;
+
+ err = kstrtouint(buf, 10, &val);
+ if (err || val > 1)
+ return -EINVAL;
+
+ if (val) {
+ kscand_scan_enabled = true;
+ need_wakeup = true;
+ } else
+ kscand_scan_enabled = false;
+
+ kscand_sleep_expire = 0;
+ wake_up_interruptible(&kscand_wait);
+
+ return count;
+}
+
+static struct kobj_attribute scan_enabled_attr =
+ __ATTR_RW(scan_enabled);
+
+static ssize_t target_node_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", kscand_target_node);
+}
+
+static ssize_t target_node_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err, node;
+
+ err = kstrtoint(buf, 10, &node);
+ if (err)
+ return -EINVAL;
+
+ kscand_sleep_expire = 0;
+ if (!node_is_toptier(node))
+ return -EINVAL;
+
+ kscand_target_node = node;
+ wake_up_interruptible(&kscand_wait);
+
+ return count;
+}
+static struct kobj_attribute target_node_attr =
+ __ATTR_RW(target_node);
+
+static struct attribute *kscand_attr[] = {
+ &scan_sleep_ms_attr.attr,
+ &mm_scan_period_ms_attr.attr,
+ &mms_to_scan_attr.attr,
+ &scan_enabled_attr.attr,
+ &target_node_attr.attr,
+ NULL,
+};
+
+struct attribute_group kscand_attr_group = {
+ .attrs = kscand_attr,
+ .name = "kscand",
+};
+#endif
+
static inline int kscand_has_work(void)
{
return !list_empty(&kscand_scan.mm_head);
@@ -1164,11 +1330,45 @@ static int kscand(void *none)
return 0;
}
+#ifdef CONFIG_SYSFS
+extern struct kobject *mm_kobj;
+static int __init kscand_init_sysfs(struct kobject **kobj)
+{
+ int err;
+
+ err = sysfs_create_group(*kobj, &kscand_attr_group);
+ if (err) {
+ pr_err("failed to register kscand group\n");
+ goto err_kscand_attr;
+ }
+
+ return 0;
+
+err_kscand_attr:
+ sysfs_remove_group(*kobj, &kscand_attr_group);
+ return err;
+}
+
+static void __init kscand_exit_sysfs(struct kobject *kobj)
+{
+ sysfs_remove_group(kobj, &kscand_attr_group);
+}
+#else
+static inline int __init kscand_init_sysfs(struct kobject **kobj)
+{
+ return 0;
+}
+static inline void __init kscand_exit_sysfs(struct kobject *kobj)
+{
+}
+#endif
+
static inline void kscand_destroy(void)
{
kmem_cache_destroy(kscand_slot_cache);
/* XXX: move below to kmigrated thread */
kmem_cache_destroy(kmigrated_slot_cache);
+ kscand_exit_sysfs(mm_kobj);
}
void __kscand_enter(struct mm_struct *mm)
@@ -1354,6 +1554,10 @@ static int __init kscand_init(void)
return -ENOMEM;
}
+ err = kscand_init_sysfs(&mm_kobj);
+ if (err)
+ goto err_init_sysfs;
+
init_list();
err = start_kscand();
if (err)
@@ -1370,6 +1574,7 @@ static int __init kscand_init(void)
err_kscand:
stop_kscand();
+err_init_sysfs:
kscand_destroy();
return err;
--
2.34.1
* [RFC PATCH V2 11/13] vmstat: Add vmstat counters
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (9 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 10/13] sysfs: Add sysfs support to tune scanning Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 13/13] prctl: Introduce new prctl to control scanning Raghavendra K T
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Add vmstat counters to track scanning, migration, and the types of pages
encountered (an example of reading them is shown below).
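Since these are vm_event counters, they show up in /proc/vmstat; a small
illustrative helper to dump just the new entries (names as added to
vmstat_text below):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "nr_kscand_", strlen("nr_kscand_")))
			fputs(line, stdout);
	fclose(f);
	return 0;
}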
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/mm.h | 11 ++++++++
include/linux/vm_event_item.h | 10 +++++++
mm/kscand.c | 51 ++++++++++++++++++++++++++++++++++-
mm/vmstat.c | 10 +++++++
4 files changed, 81 insertions(+), 1 deletion(-)
This implementation will change with upcoming changes in vmstat.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fdda6b16263b..b67d06cbc2ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -656,6 +656,17 @@ struct vm_operations_struct {
unsigned long addr);
};
+#ifdef CONFIG_KSCAND
+void count_kscand_mm_scans(void);
+void count_kscand_vma_scans(void);
+void count_kscand_migadded(void);
+void count_kscand_migrated(void);
+void count_kscand_migrate_failed(void);
+void count_kscand_slowtier(void);
+void count_kscand_toptier(void);
+void count_kscand_idlepage(void);
+#endif
+
#ifdef CONFIG_NUMA_BALANCING
static inline void vma_numab_state_init(struct vm_area_struct *vma)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..8f324ad73821 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -67,6 +67,16 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HINT_FAULTS_LOCAL,
NUMA_PAGE_MIGRATE,
#endif
+#ifdef CONFIG_KSCAND
+ KSCAND_MM_SCANS,
+ KSCAND_VMA_SCANS,
+ KSCAND_MIGADDED,
+ KSCAND_MIGRATED,
+ KSCAND_MIGRATE_FAILED,
+ KSCAND_SLOWTIER,
+ KSCAND_TOPTIER,
+ KSCAND_IDLEPAGE,
+#endif
#ifdef CONFIG_MIGRATION
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
THP_MIGRATION_SUCCESS,
diff --git a/mm/kscand.c b/mm/kscand.c
index abffcb868447..db7b2f940f36 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -337,6 +337,39 @@ struct attribute_group kscand_attr_group = {
};
#endif
+void count_kscand_mm_scans(void)
+{
+ count_vm_numa_event(KSCAND_MM_SCANS);
+}
+void count_kscand_vma_scans(void)
+{
+ count_vm_numa_event(KSCAND_VMA_SCANS);
+}
+void count_kscand_migadded(void)
+{
+ count_vm_numa_event(KSCAND_MIGADDED);
+}
+void count_kscand_migrated(void)
+{
+ count_vm_numa_event(KSCAND_MIGRATED);
+}
+void count_kscand_migrate_failed(void)
+{
+ count_vm_numa_event(KSCAND_MIGRATE_FAILED);
+}
+void count_kscand_slowtier(void)
+{
+ count_vm_numa_event(KSCAND_SLOWTIER);
+}
+void count_kscand_toptier(void)
+{
+ count_vm_numa_event(KSCAND_TOPTIER);
+}
+void count_kscand_idlepage(void)
+{
+ count_vm_numa_event(KSCAND_IDLEPAGE);
+}
+
static inline int kscand_has_work(void)
{
return !list_empty(&kscand_scan.mm_head);
@@ -789,6 +822,9 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
return 0;
}
+ if (node_is_toptier(srcnid))
+ count_kscand_toptier();
+
if (!folio_test_idle(folio) || folio_test_young(folio) ||
mmu_notifier_test_young(mm, addr) ||
folio_test_referenced(folio) || pte_young(pteval)) {
@@ -802,11 +838,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
info = kzalloc(sizeof(struct kscand_migrate_info), GFP_NOWAIT);
if (info && scanctrl) {
+ count_kscand_slowtier();
info->address = addr;
info->folio = folio;
list_add_tail(&info->migrate_node, &scanctrl->scan_list);
+ count_kscand_migadded();
}
- }
+ } else
+ count_kscand_idlepage();
folio_set_idle(folio);
folio_put(folio);
@@ -997,6 +1036,12 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
ret = kmigrated_promote_folio(info, mm, dest);
+ /* TBD: encode migrated count here, currently assume folio_nr_pages */
+ if (!ret)
+ count_kscand_migrated();
+ else
+ count_kscand_migrate_failed();
+
kfree(info);
cond_resched();
@@ -1202,6 +1247,7 @@ static unsigned long kscand_scan_mm_slot(void)
for_each_vma(vmi, vma) {
kscand_walk_page_vma(vma, &kscand_scanctrl);
+ count_kscand_vma_scans();
vma_scanned_size += vma->vm_end - vma->vm_start;
if (vma_scanned_size >= mm_slot_scan_size ||
@@ -1237,6 +1283,8 @@ static unsigned long kscand_scan_mm_slot(void)
update_mmslot_info = true;
+ count_kscand_mm_scans();
+
total = get_slowtier_accesed(&kscand_scanctrl);
target_node = get_target_node(&kscand_scanctrl);
@@ -1252,6 +1300,7 @@ static unsigned long kscand_scan_mm_slot(void)
kscand_update_mmslot_info(mm_slot, total, target_node);
}
+
outerloop:
/* exit_mmap will destroy ptes after this */
mmap_read_unlock(mm);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4c268ce39ff2..d32e88e4153d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1348,6 +1348,16 @@ const char * const vmstat_text[] = {
"numa_hint_faults_local",
"numa_pages_migrated",
#endif
+#ifdef CONFIG_KSCAND
+ "nr_kscand_mm_scans",
+ "nr_kscand_vma_scans",
+ "nr_kscand_migadded",
+ "nr_kscand_migrated",
+ "nr_kscand_migrate_failed",
+ "nr_kscand_slowtier",
+ "nr_kscand_toptier",
+ "nr_kscand_idlepage",
+#endif
#ifdef CONFIG_MIGRATION
"pgmigrate_success",
"pgmigrate_fail",
--
2.34.1
* [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (10 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 11/13] vmstat: Add vmstat counters Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
2025-06-24 7:09 ` Masami Hiramatsu
2025-06-24 5:56 ` [RFC PATCH V2 13/13] prctl: Introduce new prctl to control scanning Raghavendra K T
12 siblings, 1 reply; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
Add tracing support to track:
- start and end of scanning.
- migration.
(An example of enabling these events from userspace follows below.)
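The events can be enabled at runtime through tracefs; a minimal sketch,
assuming the default mount at /sys/kernel/tracing and that the events stay
in the existing "kmem" group (trace_event_enable() is a made-up helper):

#include <stdio.h>

static int trace_event_enable(const char *event)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/tracing/events/kmem/%s/enable", event);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}

int main(void)
{
	trace_event_enable("kmem_scan_mm_start");
	trace_event_enable("kmem_scan_mm_end");
	trace_event_enable("kmem_scan_mm_migrate");
	/* then read the output from /sys/kernel/tracing/trace_pipe */
	return 0;
}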
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Masami Hiramatsu <mhiramat@kernel.org>
CC: linux-trace-kernel@vger.kernel.org
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/trace/events/kmem.h | 90 +++++++++++++++++++++++++++++++++++++
mm/kscand.c | 8 ++++
2 files changed, 98 insertions(+)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index f74925a6cf69..682c4015414f 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -9,6 +9,96 @@
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>
+DECLARE_EVENT_CLASS(kmem_mm_class,
+
+ TP_PROTO(struct mm_struct *mm),
+
+ TP_ARGS(mm),
+
+ TP_STRUCT__entry(
+ __field( struct mm_struct *, mm )
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ ),
+
+ TP_printk("mm = %p", __entry->mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_enter,
+ TP_PROTO(struct mm_struct *mm),
+ TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_exit,
+ TP_PROTO(struct mm_struct *mm),
+ TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start,
+ TP_PROTO(struct mm_struct *mm),
+ TP_ARGS(mm)
+);
+
+TRACE_EVENT(kmem_scan_mm_end,
+
+ TP_PROTO( struct mm_struct *mm,
+ unsigned long start,
+ unsigned long total,
+ unsigned long scan_period,
+ unsigned long scan_size,
+ int target_node),
+
+ TP_ARGS(mm, start, total, scan_period, scan_size, target_node),
+
+ TP_STRUCT__entry(
+ __field( struct mm_struct *, mm )
+ __field( unsigned long, start )
+ __field( unsigned long, total )
+ __field( unsigned long, scan_period )
+ __field( unsigned long, scan_size )
+ __field( int, target_node )
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->start = start;
+ __entry->total = total;
+ __entry->scan_period = scan_period;
+ __entry->scan_size = scan_size;
+ __entry->target_node = target_node;
+ ),
+
+ TP_printk("mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld node = %d",
+ __entry->mm, __entry->start, __entry->total, __entry->scan_period,
+ __entry->scan_size, __entry->target_node)
+);
+
+TRACE_EVENT(kmem_scan_mm_migrate,
+
+ TP_PROTO(struct mm_struct *mm,
+ int rc,
+ int target_node),
+
+ TP_ARGS(mm, rc, target_node),
+
+ TP_STRUCT__entry(
+ __field( struct mm_struct *, mm )
+ __field( int, rc )
+ __field( int, target_node )
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->rc = rc;
+ __entry->target_node = target_node;
+ ),
+
+ TP_printk("mm = %p rc = %d node = %d",
+ __entry->mm, __entry->rc, __entry->target_node)
+);
+
TRACE_EVENT(kmem_cache_alloc,
TP_PROTO(unsigned long call_site,
diff --git a/mm/kscand.c b/mm/kscand.c
index db7b2f940f36..029d6d2bedc3 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -1035,6 +1035,7 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
}
ret = kmigrated_promote_folio(info, mm, dest);
+ trace_kmem_scan_mm_migrate(mm, ret, dest);
/* TBD: encode migrated count here, currently assume folio_nr_pages */
if (!ret)
@@ -1230,6 +1231,9 @@ static unsigned long kscand_scan_mm_slot(void)
mm_target_node = READ_ONCE(mm->target_node);
if (mm_target_node != mm_slot_target_node)
WRITE_ONCE(mm->target_node, mm_slot_target_node);
+
+ trace_kmem_scan_mm_start(mm);
+
now = jiffies;
if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
@@ -1300,6 +1304,8 @@ static unsigned long kscand_scan_mm_slot(void)
kscand_update_mmslot_info(mm_slot, total, target_node);
}
+ trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period,
+ mm_slot_scan_size, target_node);
outerloop:
/* exit_mmap will destroy ptes after this */
@@ -1453,6 +1459,7 @@ void __kscand_enter(struct mm_struct *mm)
spin_unlock(&kscand_mm_lock);
mmgrab(mm);
+ trace_kmem_mm_enter(mm);
if (wakeup)
wake_up_interruptible(&kscand_wait);
}
@@ -1463,6 +1470,7 @@ void __kscand_exit(struct mm_struct *mm)
struct mm_slot *slot;
int free = 0, serialize = 1;
+ trace_kmem_mm_exit(mm);
spin_lock(&kscand_mm_lock);
slot = mm_slot_lookup(kscand_slots_hash, mm);
mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
--
2.34.1
* [RFC PATCH V2 13/13] prctl: Introduce new prctl to control scanning
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
` (11 preceding siblings ...)
2025-06-24 5:56 ` [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration Raghavendra K T
@ 2025-06-24 5:56 ` Raghavendra K T
12 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 5:56 UTC (permalink / raw)
To: raghavendra.kt
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
A new per-task scalar value (PTEAScanScale) to control PTE A bit scanning
is introduced:
0 : scanning disabled
1-10 : scanning enabled.
In the future, PTEAScanScale could be used to control the aggressiveness
of scanning. An example prctl() call is shown below.
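Example usage from userspace (illustrative; the option numbers below match
this patch and, as noted further down, may change on rebase):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_PTE_A_SCAN_SCALE
#define PR_SET_PTE_A_SCAN_SCALE 78
#define PR_GET_PTE_A_SCAN_SCALE 79
#endif

int main(void)
{
	int scale;

	/* disable PTE A bit scanning for this task's mm */
	if (prctl(PR_SET_PTE_A_SCAN_SCALE, 0, 0, 0, 0))
		perror("PR_SET_PTE_A_SCAN_SCALE");

	scale = prctl(PR_GET_PTE_A_SCAN_SCALE, 0, 0, 0, 0);
	printf("PTEAScanScale = %d\n", scale);
	return 0;
}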
CC: linux-doc@vger.kernel.org
CC: Jonathan Corbet <corbet@lwn.net>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
Documentation/filesystems/proc.rst | 2 ++
fs/proc/task_mmu.c | 4 ++++
include/linux/mm_types.h | 3 +++
include/uapi/linux/prctl.h | 7 +++++++
kernel/fork.c | 4 ++++
kernel/sys.c | 25 +++++++++++++++++++++++++
mm/kscand.c | 5 +++++
7 files changed, 50 insertions(+)
Rebasing to the upstream tree will change the prctl numbers.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2a17865dfe39..429409c341ac 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -205,6 +205,7 @@ read the file /proc/PID/status::
VmLib: 1412 kB
VmPTE: 20 kb
VmSwap: 0 kB
+ PTEAScanScale: 0
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
@@ -288,6 +289,7 @@ It's slow but very precise.
VmPTE size of page table entries
VmSwap amount of swap used by anonymous private data
(shmem swap usage is not included)
+ PTEAScanScale Integer representing async PTE A bit scan aggression
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 994cde10e3f4..6a1a660d9824 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -79,6 +79,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
" kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
seq_puts(m, " kB\n");
+#ifdef CONFIG_KSCAND
+ seq_put_decimal_ull_width(m, "PTEAScanScale:\t", mm->pte_scan_scale, 8);
+ seq_puts(m, "\n");
+#endif
hugetlb_report_usage(m, mm);
}
#undef SEQ_PUT_DEC
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 571be1ad12ab..ffdb9207cc4f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1112,6 +1112,9 @@ struct mm_struct {
#ifdef CONFIG_KSCAND
/* Tracks promotion node. XXX: use nodemask */
int target_node;
+
+ /* Integer representing PTE A bit scan aggression (0-10) */
+ unsigned int pte_scan_scale;
#endif
/*
* An operation with batched TLB flushing is going on. Anything
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..2f64a80e5cdf 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,11 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+/* Set/get PTE A bit scan scale */
+#define PR_SET_PTE_A_SCAN_SCALE 78
+#define PR_GET_PTE_A_SCAN_SCALE 79
+# define PR_PTE_A_SCAN_SCALE_MIN 0
+# define PR_PTE_A_SCAN_SCALE_MAX 10
+# define PR_PTE_A_SCAN_SCALE_DEFAULT 1
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index af6dd315b106..120ee2ba7d30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -106,6 +106,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/prctl.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -1311,6 +1312,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
init_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
+#endif
+#ifdef CONFIG_KSCAND
+ mm->pte_scan_scale = PR_PTE_A_SCAN_SCALE_DEFAULT;
#endif
mm_init_uprobes_state(mm);
hugetlb_count_init(mm);
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..aff92ff2c7dd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2146,6 +2146,19 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr,
return 0;
}
+#ifdef CONFIG_KSCAND
+static int prctl_pte_scan_scale_write(unsigned int scale)
+{
+ scale = clamp(scale, PR_PTE_A_SCAN_SCALE_MIN, PR_PTE_A_SCAN_SCALE_MAX);
+ current->mm->pte_scan_scale = scale;
+ return 0;
+}
+
+static unsigned int prctl_pte_scan_scale_read(void)
+{
+ return current->mm->pte_scan_scale;
+}
+#endif
static int prctl_set_mm(int opt, unsigned long addr,
unsigned long arg4, unsigned long arg5)
@@ -2820,6 +2833,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = posixtimer_create_prctl(arg2);
break;
+#ifdef CONFIG_KSCAND
+ case PR_SET_PTE_A_SCAN_SCALE:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = prctl_pte_scan_scale_write((unsigned int) arg2);
+ break;
+ case PR_GET_PTE_A_SCAN_SCALE:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = prctl_pte_scan_scale_read();
+ break;
+#endif
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
diff --git a/mm/kscand.c b/mm/kscand.c
index 029d6d2bedc3..2be7e71c2c8f 100644
--- a/mm/kscand.c
+++ b/mm/kscand.c
@@ -1228,6 +1228,11 @@ static unsigned long kscand_scan_mm_slot(void)
goto outerloop;
}
+ if (!mm->pte_scan_scale) {
+ next_mm = true;
+ goto outerloop;
+ }
+
mm_target_node = READ_ONCE(mm->target_node);
if (mm_target_node != mm_slot_target_node)
WRITE_ONCE(mm->target_node, mm_slot_target_node);
--
2.34.1
* Re: [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration
2025-06-24 5:56 ` [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration Raghavendra K T
@ 2025-06-24 7:09 ` Masami Hiramatsu
2025-06-24 7:50 ` Raghavendra K T
0 siblings, 1 reply; 20+ messages in thread
From: Masami Hiramatsu @ 2025-06-24 7:09 UTC (permalink / raw)
To: Raghavendra K T
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On Tue, 24 Jun 2025 05:56:16 +0000
Raghavendra K T <raghavendra.kt@amd.com> wrote:
> Add tracing support to track
> - start and end of scanning.
> - migration.
>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Masami Hiramatsu <mhiramat@kernel.org>
> CC: linux-trace-kernel@vger.kernel.org
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
> include/trace/events/kmem.h | 90 +++++++++++++++++++++++++++++++++++++
> mm/kscand.c | 8 ++++
> 2 files changed, 98 insertions(+)
>
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index f74925a6cf69..682c4015414f 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -9,6 +9,96 @@
> #include <linux/tracepoint.h>
> #include <trace/events/mmflags.h>
>
Please make sure the event is not exposed when it is not used.
#ifdef CONFIG_KSCAND
Thank you,
> +DECLARE_EVENT_CLASS(kmem_mm_class,
> +
> + TP_PROTO(struct mm_struct *mm),
> +
> + TP_ARGS(mm),
> +
> + TP_STRUCT__entry(
> + __field( struct mm_struct *, mm )
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + ),
> +
> + TP_printk("mm = %p", __entry->mm)
> +);
> +
> +DEFINE_EVENT(kmem_mm_class, kmem_mm_enter,
> + TP_PROTO(struct mm_struct *mm),
> + TP_ARGS(mm)
> +);
> +
> +DEFINE_EVENT(kmem_mm_class, kmem_mm_exit,
> + TP_PROTO(struct mm_struct *mm),
> + TP_ARGS(mm)
> +);
> +
> +DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start,
> + TP_PROTO(struct mm_struct *mm),
> + TP_ARGS(mm)
> +);
> +
> +TRACE_EVENT(kmem_scan_mm_end,
> +
> + TP_PROTO( struct mm_struct *mm,
> + unsigned long start,
> + unsigned long total,
> + unsigned long scan_period,
> + unsigned long scan_size,
> + int target_node),
> +
> + TP_ARGS(mm, start, total, scan_period, scan_size, target_node),
> +
> + TP_STRUCT__entry(
> + __field( struct mm_struct *, mm )
> + __field( unsigned long, start )
> + __field( unsigned long, total )
> + __field( unsigned long, scan_period )
> + __field( unsigned long, scan_size )
> + __field( int, target_node )
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->start = start;
> + __entry->total = total;
> + __entry->scan_period = scan_period;
> + __entry->scan_size = scan_size;
> + __entry->target_node = target_node;
> + ),
> +
> + TP_printk("mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld node = %d",
> + __entry->mm, __entry->start, __entry->total, __entry->scan_period,
> + __entry->scan_size, __entry->target_node)
> +);
> +
> +TRACE_EVENT(kmem_scan_mm_migrate,
> +
> + TP_PROTO(struct mm_struct *mm,
> + int rc,
> + int target_node),
> +
> + TP_ARGS(mm, rc, target_node),
> +
> + TP_STRUCT__entry(
> + __field( struct mm_struct *, mm )
> + __field( int, rc )
> + __field( int, target_node )
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->rc = rc;
> + __entry->target_node = target_node;
> + ),
> +
> + TP_printk("mm = %p rc = %d node = %d",
> + __entry->mm, __entry->rc, __entry->target_node)
> +);
> +
> TRACE_EVENT(kmem_cache_alloc,
>
> TP_PROTO(unsigned long call_site,
> diff --git a/mm/kscand.c b/mm/kscand.c
> index db7b2f940f36..029d6d2bedc3 100644
> --- a/mm/kscand.c
> +++ b/mm/kscand.c
> @@ -1035,6 +1035,7 @@ static void kmigrated_migrate_mm(struct kmigrated_mm_slot *mm_slot)
> }
>
> ret = kmigrated_promote_folio(info, mm, dest);
> + trace_kmem_scan_mm_migrate(mm, ret, dest);
>
> /* TBD: encode migrated count here, currently assume folio_nr_pages */
> if (!ret)
> @@ -1230,6 +1231,9 @@ static unsigned long kscand_scan_mm_slot(void)
> mm_target_node = READ_ONCE(mm->target_node);
> if (mm_target_node != mm_slot_target_node)
> WRITE_ONCE(mm->target_node, mm_slot_target_node);
> +
> + trace_kmem_scan_mm_start(mm);
> +
> now = jiffies;
>
> if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
> @@ -1300,6 +1304,8 @@ static unsigned long kscand_scan_mm_slot(void)
> kscand_update_mmslot_info(mm_slot, total, target_node);
> }
>
> + trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period,
> + mm_slot_scan_size, target_node);
>
> outerloop:
> /* exit_mmap will destroy ptes after this */
> @@ -1453,6 +1459,7 @@ void __kscand_enter(struct mm_struct *mm)
> spin_unlock(&kscand_mm_lock);
>
> mmgrab(mm);
> + trace_kmem_mm_enter(mm);
> if (wakeup)
> wake_up_interruptible(&kscand_wait);
> }
> @@ -1463,6 +1470,7 @@ void __kscand_exit(struct mm_struct *mm)
> struct mm_slot *slot;
> int free = 0, serialize = 1;
>
> + trace_kmem_mm_exit(mm);
> spin_lock(&kscand_mm_lock);
> slot = mm_slot_lookup(kscand_slots_hash, mm);
> mm_slot = mm_slot_entry(slot, struct kscand_mm_slot, slot);
> --
> 2.34.1
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration
2025-06-24 7:09 ` Masami Hiramatsu
@ 2025-06-24 7:50 ` Raghavendra K T
0 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-06-24 7:50 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On 6/24/2025 12:39 PM, Masami Hiramatsu (Google) wrote:
> On Tue, 24 Jun 2025 05:56:16 +0000
> Raghavendra K T <raghavendra.kt@amd.com> wrote:
>
>> Add tracing support to track
>> - start and end of scanning.
>> - migration.
>>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Masami Hiramatsu <mhiramat@kernel.org>
>> CC: linux-trace-kernel@vger.kernel.org
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>> include/trace/events/kmem.h | 90 +++++++++++++++++++++++++++++++++++++
>> mm/kscand.c | 8 ++++
>> 2 files changed, 98 insertions(+)
>>
>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>> index f74925a6cf69..682c4015414f 100644
>> --- a/include/trace/events/kmem.h
>> +++ b/include/trace/events/kmem.h
>> @@ -9,6 +9,96 @@
>> #include <linux/tracepoint.h>
>> #include <trace/events/mmflags.h>
>>
>
> Please make sure the event is not exposed when it is not used.
>
> #ifdef CONFIG_KSCAND
>
> Thank you,
>
[...]
Sure. Noted. Thank you :)
* Re: [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list
2025-06-24 5:56 ` [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list Raghavendra K T
@ 2025-06-25 22:07 ` Harry Yoo
2025-06-25 23:05 ` Harry Yoo
2025-06-26 6:27 ` Raghavendra K T
0 siblings, 2 replies; 20+ messages in thread
From: Harry Yoo @ 2025-06-25 22:07 UTC (permalink / raw)
To: Raghavendra K T
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On Tue, Jun 24, 2025 at 05:56:07AM +0000, Raghavendra K T wrote:
> Since we already have the list of mm_struct in the system, add a module to
> scan each mm that walks VMAs of each mm_struct and scan all the pages
> associated with that.
>
> In the scan path: Check for the recently acccessed pages (folios) belonging
> to slowtier nodes. Add all those folios to a list.
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
Hi, just taking a quick look...
> mm/kscand.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 318 insertions(+), 1 deletion(-)
>
> diff --git a/mm/kscand.c b/mm/kscand.c
> index d5b0d3041b0f..0edec1b7730d 100644
> --- a/mm/kscand.c
> +++ b/mm/kscand.c
> @@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
> @@ -84,11 +122,275 @@ static void kscand_wait_work(void)
> scan_sleep_jiffies);
> }
>
> +static inline bool is_valid_folio(struct folio *folio)
> +{
> + if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
> + folio_is_zone_device(folio) || folio_maybe_mapped_shared(folio))
> + return false;
> +
> + return true;
> +}
What makes it undesirable to migrate shared folios?
> +static bool folio_idle_clear_pte_refs_one(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + pte_t *ptep)
> +{
> + bool referenced = false;
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t *pmd = pmd_off(mm, addr);
> +
> + if (ptep) {
> + if (ptep_clear_young_notify(vma, addr, ptep))
> + referenced = true;
> + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + if (!pmd_present(*pmd))
> + WARN_ON_ONCE(1);
> + if (pmdp_clear_young_notify(vma, addr, pmd))
> + referenced = true;
> + } else {
> + WARN_ON_ONCE(1);
> + }
This does not look good.
I think pmd entry handling should be handled in
mm_walk_ops.pmd_entry callback?
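Roughly something like the below, perhaps. Completely untested sketch, and
the pmd handler name is made up; the point is only that the huge-pmd case
gets its own callback and ->pte_entry then only ever sees base pages:

static int hot_vma_idle_pmd_entry(pmd_t *pmd, unsigned long addr,
				  unsigned long next, struct mm_walk *walk)
{
	struct vm_area_struct *vma = walk->vma;
	spinlock_t *ptl;

	ptl = pmd_trans_huge_lock(pmd, vma);
	if (!ptl)
		return 0;	/* not a huge pmd, fall through to ->pte_entry */

	if (pmd_present(*pmd)) {
		struct folio *folio = pmd_folio(*pmd);

		if (pmdp_clear_young_notify(vma, addr, pmd)) {
			folio_clear_idle(folio);
			folio_set_young(folio);
		}
	}
	spin_unlock(ptl);

	/* The huge pmd was handled here, so don't descend to the pte level. */
	walk->action = ACTION_CONTINUE;
	return 0;
}

static const struct mm_walk_ops hot_vma_set_idle_ops = {
	.pmd_entry = hot_vma_idle_pmd_entry,
	.pte_entry = hot_vma_idle_pte_entry,
	.walk_lock = PGWALK_RDLOCK,
};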
> +
> + if (referenced) {
> + folio_clear_idle(folio);
> + folio_set_young(folio);
> + }
> +
> + return true;
> +}
> +
> +static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
> +{
> + bool need_lock;
> + struct folio *folio = page_folio(page);
> + unsigned long address;
> +
> + if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> + return;
> +
> + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> + if (need_lock && !folio_trylock(folio))
> + return;
Why acquire folio lock here?
And I'm not even sure if it's safe to acquire it?
The locking order is folio_lock -> pte_lock
page walk should have already acquired pte_lock before calling
->pte_entry() callback.
> + address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
> + VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
> + folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
> +
> + if (need_lock)
> + folio_unlock(folio);
> +}
> +
> +static const struct mm_walk_ops hot_vma_set_idle_ops = {
> + .pte_entry = hot_vma_idle_pte_entry,
> + .walk_lock = PGWALK_RDLOCK,
> +};
> +
> +static void kscand_walk_page_vma(struct vm_area_struct *vma, struct kscand_scanctrl *scanctrl)
> +{
> + if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
> + return;
> + }
> + if (!vma->vm_mm ||
> + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
> + return;
Why not walk writable file VMAs?
> + if (!vma_is_accessible(vma))
> + return;
> +
> + walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
> +}
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list
2025-06-25 22:07 ` Harry Yoo
@ 2025-06-25 23:05 ` Harry Yoo
2025-06-26 6:27 ` Raghavendra K T
1 sibling, 0 replies; 20+ messages in thread
From: Harry Yoo @ 2025-06-25 23:05 UTC (permalink / raw)
To: Raghavendra K T
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On Thu, Jun 26, 2025 at 07:07:12AM +0900, Harry Yoo wrote:
> On Tue, Jun 24, 2025 at 05:56:07AM +0000, Raghavendra K T wrote:
> > Since we already have the list of mm_struct in the system, add a module to
> > scan each mm that walks VMAs of each mm_struct and scan all the pages
> > associated with that.
> >
> > In the scan path: Check for the recently accessed pages (folios) belonging
> > to slowtier nodes. Add all those folios to a list.
> >
> > Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> > ---
>
> Hi, just taking a quick look...
>
> > mm/kscand.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 318 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/kscand.c b/mm/kscand.c
> > index d5b0d3041b0f..0edec1b7730d 100644
> > --- a/mm/kscand.c
> > +++ b/mm/kscand.c
> > @@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
> > @@ -84,11 +122,275 @@ static void kscand_wait_work(void)
> > scan_sleep_jiffies);
> > }
> >
> > +static inline bool is_valid_folio(struct folio *folio)
> > +{
> > + if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
> > + folio_is_zone_device(folio) || folio_maybe_mapped_shared(folio))
> > + return false;
> > +
> > + return true;
> > +}
>
> What makes it undesirable to migrate shared folios?
>
> > +static bool folio_idle_clear_pte_refs_one(struct folio *folio,
> > + struct vm_area_struct *vma,
> > + unsigned long addr,
> > + pte_t *ptep)
> > +{
> > + bool referenced = false;
> > + struct mm_struct *mm = vma->vm_mm;
> > + pmd_t *pmd = pmd_off(mm, addr);
> > +
> > + if (ptep) {
> > + if (ptep_clear_young_notify(vma, addr, ptep))
> > + referenced = true;
> > + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > + if (!pmd_present(*pmd))
> > + WARN_ON_ONCE(1);
> > + if (pmdp_clear_young_notify(vma, addr, pmd))
> > + referenced = true;
> > + } else {
> > + WARN_ON_ONCE(1);
> > + }
>
> This does not look good.
>
> I think pmd entry handling should be handled in
> mm_walk_ops.pmd_entry callback?
>
> > +
> > + if (referenced) {
> > + folio_clear_idle(folio);
> > + folio_set_young(folio);
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
> > +{
> > + bool need_lock;
> > + struct folio *folio = page_folio(page);
> > + unsigned long address;
> > +
> > + if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> > + return;
> > +
> > + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > + if (need_lock && !folio_trylock(folio))
> > + return;
>
> Why acquire folio lock here?
>
> And I'm not even sure if it's safe to acquire it?
> The locking order is folio_lock -> pte_lock
>
> page walk should have already acquired pte_lock before calling
> ->pte_entry() callback.
Oops, it's trylock. Nevermind.
Needed more coffee in the morning :)
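(To spell the worry out: ->pte_entry() runs with the pte spinlock held, so a
blocking lock there would both invert the documented folio_lock -> pte_lock
order and sleep under a spinlock, i.e.

	/* under pte_lock inside ->pte_entry() */
	folio_lock(folio);		/* can sleep and can deadlock */

whereas the trylock in the patch never blocks:

	if (need_lock && !folio_trylock(folio))
		return;			/* just give up on contention */

so it is fine.)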
> > + address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
> > + VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
> > + folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
> > +
> > + if (need_lock)
> > + folio_unlock(folio);
> > +}
> > +
> > +static const struct mm_walk_ops hot_vma_set_idle_ops = {
> > + .pte_entry = hot_vma_idle_pte_entry,
> > + .walk_lock = PGWALK_RDLOCK,
> > +};
> > +
> > +static void kscand_walk_page_vma(struct vm_area_struct *vma, struct kscand_scanctrl *scanctrl)
> > +{
> > + if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> > + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
> > + return;
> > + }
> > + if (!vma->vm_mm ||
> > + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
> > + return;
>
> Why not walk writable file VMAs?
>
> > + if (!vma_is_accessible(vma))
> > + return;
> > +
> > + walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
> > +}
>
> --
> Cheers,
> Harry / Hyeonggon
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list
2025-06-25 22:07 ` Harry Yoo
2025-06-25 23:05 ` Harry Yoo
@ 2025-06-26 6:27 ` Raghavendra K T
2025-07-08 11:17 ` Raghavendra K T
1 sibling, 1 reply; 20+ messages in thread
From: Raghavendra K T @ 2025-06-26 6:27 UTC (permalink / raw)
To: Harry Yoo
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On 6/26/2025 3:37 AM, Harry Yoo wrote:
> On Tue, Jun 24, 2025 at 05:56:07AM +0000, Raghavendra K T wrote:
>> Since we already have the list of mm_struct in the system, add a module to
>> scan each mm that walks VMAs of each mm_struct and scan all the pages
>> associated with that.
>>
>> In the scan path: Check for the recently accessed pages (folios) belonging
>> to slowtier nodes. Add all those folios to a list.
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>
> Hi, just taking a quick look...
Hello Harry,
Thanks for taking a look at the patches.
>
>> mm/kscand.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 318 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/kscand.c b/mm/kscand.c
>> index d5b0d3041b0f..0edec1b7730d 100644
>> --- a/mm/kscand.c
>> +++ b/mm/kscand.c
>> @@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
>> @@ -84,11 +122,275 @@ static void kscand_wait_work(void)
>> scan_sleep_jiffies);
>> }
>>
>> +static inline bool is_valid_folio(struct folio *folio)
>> +{
>> + if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
>> + folio_is_zone_device(folio) || folio_maybe_mapped_shared(folio))
>> + return false;
>> +
>> + return true;
>> +}
>
> What makes it undesirable to migrate shared folios?
This was mostly to avoid shared libraries, but yes, it should also
have been accompanied by an EXEC flag check to refine it further.
It also avoids moving shared data around. I will experiment more
and add additional filters, or remove the check.
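For example, a hypothetical refinement (untested, helper name made up) would
be to skip only folios that are both shared and backed by an executable file
mapping, i.e. most likely library text:

static inline bool kscand_skip_shared(struct folio *folio,
				      struct vm_area_struct *vma)
{
	if (!folio_maybe_mapped_shared(folio))
		return false;

	/* Shared and executable file-backed: almost certainly library text. */
	return vma->vm_file && (vma->vm_flags & VM_EXEC);
}

with is_valid_folio() then dropping the unconditional
folio_maybe_mapped_shared() check.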
>> +static bool folio_idle_clear_pte_refs_one(struct folio *folio,
>> + struct vm_area_struct *vma,
>> + unsigned long addr,
>> + pte_t *ptep)
>> +{
>> + bool referenced = false;
>> + struct mm_struct *mm = vma->vm_mm;
>> + pmd_t *pmd = pmd_off(mm, addr);
>> +
>> + if (ptep) {
>> + if (ptep_clear_young_notify(vma, addr, ptep))
>> + referenced = true;
>> + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> + if (!pmd_present(*pmd))
>> + WARN_ON_ONCE(1);
>> + if (pmdp_clear_young_notify(vma, addr, pmd))
>> + referenced = true;
>> + } else {
>> + WARN_ON_ONCE(1);
>> + }
>
> This does not look good.
>
> I think pmd entry handling should be handled in
> mm_walk_ops.pmd_entry callback?
>
Thanks, let me check on this. Part of this code came from referring to the
idle page tracking implementation.
>> +
>> + if (referenced) {
>> + folio_clear_idle(folio);
>> + folio_set_young(folio);
>> + }
>> +
>> + return true;
>> +}
>> +
>> +static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
>> +{
>> + bool need_lock;
>> + struct folio *folio = page_folio(page);
>> + unsigned long address;
>> +
>> + if (!folio_mapped(folio) || !folio_raw_mapping(folio))
>> + return;
>> +
>> + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
>> + if (need_lock && !folio_trylock(folio))
>> + return;
>
> Why acquire folio lock here?
>
> And I'm not even sure if it's safe to acquire it?
> The locking order is folio_lock -> pte_lock
>
> page walk should have already acquired pte_lock before calling
> ->pte_entry() callback.
>
I saw you clarified later.
>> + address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
>> + VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
>> + folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
>> +
>> + if (need_lock)
>> + folio_unlock(folio);
>> +}
>> +
>> +static const struct mm_walk_ops hot_vma_set_idle_ops = {
>> + .pte_entry = hot_vma_idle_pte_entry,
>> + .walk_lock = PGWALK_RDLOCK,
>> +};
>> +
>> +static void kscand_walk_page_vma(struct vm_area_struct *vma, struct kscand_scanctrl *scanctrl)
>> +{
>> + if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
>> + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
>> + return;
>> + }
>> + if (!vma->vm_mm ||
>> + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
>> + return;
>
> Why not walk writable file VMAs?
>
This mainly follows the NUMA balancing logic:
"Avoid hinting faults in read-only file-backed mappings or the vDSO
as migrating the pages will be of marginal benefit."
But I have not measured the benefit either way. Let me try it once.
>> + if (!vma_is_accessible(vma))
>> + return;
>> +
>> + walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
>> +}
>
Regards
- Raghu
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list
2025-06-26 6:27 ` Raghavendra K T
@ 2025-07-08 11:17 ` Raghavendra K T
0 siblings, 0 replies; 20+ messages in thread
From: Raghavendra K T @ 2025-07-08 11:17 UTC (permalink / raw)
To: Harry Yoo
Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, akpm,
bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry,
hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov,
kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301,
sj, vbabka, weixugc, willy, ying.huang, ziy, Jonathan.Cameron,
dave, yuanchu, kinseyho, hdanton
On 6/26/2025 11:57 AM, Raghavendra K T wrote:
>
> On 6/26/2025 3:37 AM, Harry Yoo wrote:
>> On Tue, Jun 24, 2025 at 05:56:07AM +0000, Raghavendra K T wrote:
>>> Since we already have the list of mm_struct in the system, add a module to
>>> scan each mm that walks VMAs of each mm_struct and scan all the pages
>>> associated with that.
>>>
>>> In the scan path: Check for the recently accessed pages (folios) belonging
>>> to slowtier nodes. Add all those folios to a list.
>>>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>>> ---
>>
>> Hi, just taking a quick look...
>
> Hello Harry,
> Thanks for taking a look at the patches.
>>
>>> mm/kscand.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 318 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/kscand.c b/mm/kscand.c
>>> index d5b0d3041b0f..0edec1b7730d 100644
>>> --- a/mm/kscand.c
>>> +++ b/mm/kscand.c
>>> @@ -42,6 +55,8 @@ static struct kmem_cache *kscand_slot_cache __read_mostly;
>>> @@ -84,11 +122,275 @@ static void kscand_wait_work(void)
>>> scan_sleep_jiffies);
>>> }
>>> +static inline bool is_valid_folio(struct folio *folio)
>>> +{
>>> + if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
>>> + folio_is_zone_device(folio) || folio_maybe_mapped_shared(folio))
>>> + return false;
>>> +
>>> + return true;
>>> +}
>>
>> What makes it undesirable to migrate shared folios?
>
> This was mostly to avoid shared libraries, but yes this also
> should have accompanied with EXEC flag to refine further.
> This also avoids moving around shared data. I will experiment more
> and add additional filters OR remove the check.
>
At least a microbenchmark experiment shows a positive benefit when I
remove the folio_maybe_mapped_shared() check. Will respin accordingly
(also after checking some real workloads).
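The relaxed filter for the respin would then reduce to roughly (sketch):

static inline bool is_valid_folio(struct folio *folio)
{
	if (!folio || folio_test_unevictable(folio) || !folio_mapped(folio) ||
	    folio_is_zone_device(folio))
		return false;

	return true;
}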
- Raghu
^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
2025-06-24 5:56 [RFC PATCH V2 00/13] mm: slowtier page promotion based on PTE A bit Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 01/13] mm: Add kscand kthread for PTE A bit scan Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 02/13] mm: Maintain mm_struct list in the system Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 03/13] mm: Scan the mm and create a migration list Raghavendra K T
2025-06-25 22:07 ` Harry Yoo
2025-06-25 23:05 ` Harry Yoo
2025-06-26 6:27 ` Raghavendra K T
2025-07-08 11:17 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 04/13] mm: Create a separate kthread for migration Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 05/13] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 06/13] mm: Add throttling of mm scanning using scan_period Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 07/13] mm: Add throttling of mm scanning using scan_size Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 08/13] mm: Add initial scan delay Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 09/13] mm: Add a heuristic to calculate target node Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 10/13] sysfs: Add sysfs support to tune scanning Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 11/13] vmstat: Add vmstat counters Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 12/13] trace/kscand: Add tracing of scanning and migration Raghavendra K T
2025-06-24 7:09 ` Masami Hiramatsu
2025-06-24 7:50 ` Raghavendra K T
2025-06-24 5:56 ` [RFC PATCH V2 13/13] prctl: Introduce new prctl to control scanning Raghavendra K T