* [PATCH 1/5] mm/khugepaged: add framework for khugepaged collapse hint
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
@ 2026-05-31 4:27 ` Luka Bai
2026-05-31 4:27 ` [PATCH 2/5] mm/khugepaged: use slab cache instead of normal kmalloc Luka Bai
` (5 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Luka Bai @ 2026-05-31 4:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
From: Luka Bai <lukabai@tencent.com>
Currently khugepaged simply scans all eligible mm_structs in
round-robin order when collapsing. This is inefficient when the
address space is huge, and it may waste precious large folios on
cold memory areas that are seldom accessed. At the same time,
khugepaged is a very useful tool for asynchronous large folio
merging.
So this patch introduces a collapse hint framework that lets
khugepaged prioritize hot memory areas when collapsing. An
indication of a hot area is called a "collapse hint". Each collapse
hint carries an address and a vma that together identify a specific
hot area that should preferably be collapsed. Hints are aggregated
both by priority and by the mm_struct they belong to. When
khugepaged collapses, it first scans the global priority queues that
store these hints, finds the first khugepaged_mm_slot that has hints
queued (this patch adds struct khugepaged_mm_slot, which wraps the
old per-mm mm_slot), and then tries to collapse at the address given
by each hint. An example is shown below (mm_slot stands for the
khugepaged_mm_slot mentioned above):
prio 0 ------()----------------------------------()---------------
    mm_slot0(process A)                 mm_slot1(process B)
             |                                   |
hint0---hint1---hint2---hint3          hint4---hint5---hint6
prio 1 ------()----------------------------------()---------------
    mm_slot0(process A)                 mm_slot1(process B)
             |                                   |
          -------                          hint7---hint8
khugepaged first scans the prio 0 queue (a lower prio number means a
higher priority). It walks the list, picks the first
khugepaged_mm_slot that has hints, which is mm_slot0 here, and goes
through all of its hints (hint0 ~ hint3 in the graph above). Each hint
is deleted after it is handled, whether the collapse succeeds or
fails. If a khugepaged_mm_slot has no hints, khugepaged moves on to
the next mm_slot; if no hints are left in prio 0, khugepaged scans
prio 1; if no hints are left in any priority queue, it falls back to
round-robin scanning as before.
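The drain order described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions, not the
kernel code: struct model, model_add() and model_next() are
hypothetical names, the lists are replaced by fixed-size arrays, and
the progress and failure limits as well as all locking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_PRIO   2	/* mirrors NR_KHUGEPAGED_PRIORITY_LEVEL */
#define MODEL_NR_SLOTS  2	/* hypothetical: two enrolled mms */
#define MODEL_MAX_HINTS 8	/* hypothetical per-queue capacity */

/* Pending hints of one mm at one priority, kept in FIFO order. */
struct model_hints {
	int id[MODEL_MAX_HINTS];
	size_t head, tail;
};

/* q[prio][slot] stands in for khugepaged_mm_slot.request[prio].hints. */
struct model {
	struct model_hints q[MODEL_NR_PRIO][MODEL_NR_SLOTS];
};

/* Append one hint to the given mm slot at the given priority. */
static void model_add(struct model *m, int prio, int slot, int hint_id)
{
	struct model_hints *h;

	assert(prio >= 0 && prio < MODEL_NR_PRIO);
	assert(slot >= 0 && slot < MODEL_NR_SLOTS);
	h = &m->q[prio][slot];
	assert(h->tail < MODEL_MAX_HINTS);
	h->id[h->tail++] = hint_id;
}

/*
 * Take the next hint: lowest priority index first, then the first mm
 * slot that still has hints, then FIFO within that slot. A return of
 * -1 means no hints are left, which is where the real scanner would
 * fall back to the round-robin scan.
 */
static int model_next(struct model *m)
{
	for (int prio = 0; prio < MODEL_NR_PRIO; prio++) {
		for (int slot = 0; slot < MODEL_NR_SLOTS; slot++) {
			struct model_hints *h = &m->q[prio][slot];

			if (h->head < h->tail)
				return h->id[h->head++];
		}
	}
	return -1;
}
```

Filling the model with the layout from the graph (hint0 ~ hint3 and
hint4 ~ hint6 at prio 0, hint7 and hint8 at prio 1 for mm_slot1) and
calling model_next() repeatedly yields the hints in numeric order and
then -1.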
Each struct khugepaged_mm_slot embeds NR_KHUGEPAGED_PRIORITY_LEVEL
(currently 2) struct khugepaged_collapse_requests. Each struct
khugepaged_collapse_requests is the node through which the mm_struct
is linked into one global priority queue. Every mm_struct gets a node
in every priority queue to allow for hint dispersion and balancing
that may be introduced in the future, and for a better locking
pattern. Currently the khugepaged_collapse_requests[] are linked into
the global queues in __khugepaged_enter() and stay there until the mm
exits khugepaged.
Callers can use khugepaged_add_collapse_hint() to add a new hint for a
specific mm_struct. This patch does not introduce any callers yet;
they are added in the following patches.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
include/linux/khugepaged.h | 13 ++
mm/khugepaged.c | 348 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 355 insertions(+), 6 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index d7a9053ff4fe..815ae87f0f8e 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -17,6 +17,10 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
+extern void khugepaged_add_collapse_hint(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ int priority, int max_order);
void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd);
@@ -31,6 +35,9 @@ static inline void khugepaged_exit(struct mm_struct *mm)
if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
__khugepaged_exit(mm);
}
+
+#define NR_KHUGEPAGED_PRIORITY_LEVEL 2
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
@@ -55,6 +62,12 @@ static inline bool current_is_khugepaged(void)
{
return false;
}
+static inline void khugepaged_add_collapse_hint(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ int priority, int max_order)
+{
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35a5f8c44c18..5090ffae73f3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+#define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10
+
#define KHUGEPAGED_MIN_MTHP_ORDER 2
/*
* mthp_collapse() does an iterative DFS over a binary tree, from
@@ -160,6 +162,53 @@ static struct khugepaged_scan khugepaged_scan = {
.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
};
+/**
+ * struct khugepaged_collapse_hint - one collapse hint for a specific address
+ * @node: list node on khugepaged_collapse_requests.hints
+ * @vma: hint pointer to the target VMA
+ * @address: PMD-aligned virtual address inside @vma to attempt collapsing on
+ */
+struct khugepaged_collapse_hint {
+ struct list_head node;
+ struct vm_area_struct *vma;
+ unsigned long address;
+};
+
+/**
+ * struct khugepaged_collapse_requests - per-mm, per-priority collapse hints list
+ * @node: list node on the matching khugepaged_priority_queue[] list
+ * @hints: list of pending struct khugepaged_collapse_hint for this mm at
+ * this priority level
+ *
+ * Each khugepaged_mm_slot embeds one request struct per priority level. At
+ * __khugepaged_enter() time, every request is added to the corresponding
+ * khugepaged_priority_queue[] list and stays on that list until the mm
+ * exits khugepaged. While queued, hints for the mm at a given priority are
+ * appended to that priority's @hints.
+ */
+struct khugepaged_collapse_requests {
+ struct list_head node;
+ struct list_head hints;
+};
+
+/**
+ * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
+ * @slot: hash lookup from mm to mm_slot
+ * @request: per-mm collapse requests, one per priority level, each linked
+ * into the corresponding khugepaged_priority_queue[] list
+ */
+struct khugepaged_mm_slot {
+ struct mm_slot slot;
+ struct khugepaged_collapse_requests request[NR_KHUGEPAGED_PRIORITY_LEVEL];
+};
+
+/*
+ * One queue per priority level. Lower index means higher priority. The
+ * scanner drains queues in ascending index order, so all hints at higher
+ * priority are processed before any hint at a lower priority.
+ */
+static struct list_head khugepaged_priority_queue[NR_KHUGEPAGED_PRIORITY_LEVEL];
+
#ifdef CONFIG_SYSFS
static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
struct kobj_attribute *attr,
@@ -500,10 +549,15 @@ int hugepage_madvise(struct vm_area_struct *vma,
int __init khugepaged_init(void)
{
- mm_slot_cache = KMEM_CACHE(mm_slot, 0);
+ int i;
+
+ mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
if (!mm_slot_cache)
return -ENOMEM;
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ INIT_LIST_HEAD(&khugepaged_priority_queue[i]);
+
khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
khugepaged_max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
@@ -560,21 +614,27 @@ static bool hugepage_enabled(void)
void __khugepaged_enter(struct mm_struct *mm)
{
+ struct khugepaged_mm_slot *khp_mm_slot;
struct mm_slot *slot;
int wakeup;
+ int i;
/* __khugepaged_exit() must not run from under us */
VM_BUG_ON_MM(collapse_test_exit(mm), mm);
- slot = mm_slot_alloc(mm_slot_cache);
- if (!slot)
+ khp_mm_slot = mm_slot_alloc(mm_slot_cache);
+ if (!khp_mm_slot)
return;
if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm))) {
- mm_slot_free(mm_slot_cache, slot);
+ mm_slot_free(mm_slot_cache, khp_mm_slot);
return;
}
+ slot = &khp_mm_slot->slot;
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ INIT_LIST_HEAD(&khp_mm_slot->request[i].hints);
+
spin_lock(&khugepaged_mm_lock);
mm_slot_insert(mm_slots_hash, mm, slot);
/*
@@ -583,6 +643,12 @@ void __khugepaged_enter(struct mm_struct *mm)
*/
wakeup = list_empty(&khugepaged_scan.mm_head);
list_add_tail(&slot->mm_node, &khugepaged_scan.mm_head);
+ /*
+ * Link this mm into every priority queue.
+ */
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ list_add_tail(&khp_mm_slot->request[i].node,
+ &khugepaged_priority_queue[i]);
spin_unlock(&khugepaged_mm_lock);
mmgrab(mm);
@@ -613,23 +679,59 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
__khugepaged_enter(vma->vm_mm);
}
+static void khugepaged_release_collapse_hints(
+ struct khugepaged_collapse_requests *req)
+{
+ struct khugepaged_collapse_hint *hint, *tmp;
+
+ list_for_each_entry_safe(hint, tmp, &req->hints, node) {
+ list_del(&hint->node);
+ kfree(hint);
+ }
+}
+
+/*
+ * Caller must hold khugepaged_mm_lock when removing the request nodes from
+ * the priority queues.
+ */
+static void khugepaged_remove_priority_requests(struct khugepaged_mm_slot *khp_mm_slot)
+{
+ int i;
+
+ lockdep_assert_held(&khugepaged_mm_lock);
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ list_del(&khp_mm_slot->request[i].node);
+}
+
+static void khugepaged_release_all_hints(struct khugepaged_mm_slot *khp_mm_slot)
+{
+ int i;
+
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ khugepaged_release_collapse_hints(&khp_mm_slot->request[i]);
+}
+
void __khugepaged_exit(struct mm_struct *mm)
{
+ struct khugepaged_mm_slot *khp_mm_slot = NULL;
struct mm_slot *slot;
int free = 0;
spin_lock(&khugepaged_mm_lock);
slot = mm_slot_lookup(mm_slots_hash, mm);
if (slot && khugepaged_scan.mm_slot != slot) {
+ khp_mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
hash_del(&slot->hash);
list_del(&slot->mm_node);
+ khugepaged_remove_priority_requests(khp_mm_slot);
free = 1;
}
spin_unlock(&khugepaged_mm_lock);
if (free) {
mm_flags_clear(MMF_VM_HUGEPAGE, mm);
- mm_slot_free(mm_slot_cache, slot);
+ khugepaged_release_all_hints(khp_mm_slot);
+ mm_slot_free(mm_slot_cache, khp_mm_slot);
mmdrop(mm);
} else if (slot) {
/*
@@ -1804,6 +1906,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
static void collect_mm_slot(struct mm_slot *slot)
{
+ struct khugepaged_mm_slot *khp_mm_slot =
+ mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
struct mm_struct *mm = slot->mm;
lockdep_assert_held(&khugepaged_mm_lock);
@@ -1812,6 +1916,7 @@ static void collect_mm_slot(struct mm_slot *slot)
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
+ khugepaged_remove_priority_requests(khp_mm_slot);
/*
* Not strictly needed because the mm exited already.
@@ -1820,7 +1925,8 @@ static void collect_mm_slot(struct mm_slot *slot)
*/
/* khugepaged_mm_lock actually not necessary for the below */
- mm_slot_free(mm_slot_cache, slot);
+ khugepaged_release_all_hints(khp_mm_slot);
+ mm_slot_free(mm_slot_cache, khp_mm_slot);
mmdrop(mm);
}
}
@@ -2848,6 +2954,211 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
return result;
}
+/*
+ * khugepaged_add_collapse_hint - enqueue a collapse hint
+ * @mm: target mm
+ * @vma: hint pointer to the VMA covering @address (treated as a hint)
+ * @address: virtual address; rounded down to HPAGE_PMD_SIZE
+ * @priority: priority bucket the hint should land in. Lower number == higher
+ * priority; must be in [0, NR_KHUGEPAGED_PRIORITY_LEVEL).
+ * @max_order: max order of continuous pt entries inside this target pmd, used
+ * to decide whether we need to collapse it.
+ *
+ * Tell khugepaged to prioritize collapsing the PMD covering @address in @mm.
+ * The next time collapse_scan_mm_slot() runs it will drain these entries
+ * before the regular round-robin scan, walking priority queues from
+ * highest priority (lowest index) to lowest.
+ *
+ * Hints are aggregated per-mm and per-priority: __khugepaged_enter()
+ * pre-installs one collapse_request per priority level on the matching
+ * khugepaged_priority_queue[] list, and this function appends a
+ * (vma, address) hint to the request that matches @priority.
+ *
+ * Caller must keep @vma alive across this call (mmap_lock, per-VMA lock,
+ * or a corresponding rmap-side lock such as anon_vma_lock_read /
+ * i_mmap_lock_read are all sufficient).
+ *
+ * @vma->vm_flags is read with collapse_allowable_orders(). When the
+ * caller does not hold mmap_lock or a per-VMA lock, the result is
+ * advisory; the real validation happens later in
+ * collapse_scan_one_priority_entry() under mmap_read_lock.
+ *
+ * Caller must also guarantee @mm is alive across this call so the underlying
+ * mm_slot cannot be freed while we append.
+ */
+void khugepaged_add_collapse_hint(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ int priority, int max_order)
+{
+ struct khugepaged_mm_slot *khp_mm_slot;
+ struct khugepaged_collapse_hint *hint;
+ struct mm_slot *slot;
+ int orders;
+
+ if (!mm || !vma)
+ return;
+ if (priority < 0 || priority >= NR_KHUGEPAGED_PRIORITY_LEVEL)
+ return;
+
+ orders = collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED);
+ if (highest_order(orders) <= max_order)
+ return;
+
+ /*
+ * Make sure the mm is enrolled in khugepaged so that its embedded
+ * collapse_request[] entries are on khugepaged_priority_queue[].
+ */
+ khugepaged_enter_vma(vma, vma->vm_flags);
+ if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
+ return;
+
+ hint = kmalloc_obj(struct khugepaged_collapse_hint);
+ if (!hint)
+ return;
+
+ hint->vma = vma;
+ hint->address = address & HPAGE_PMD_MASK;
+
+ /*
+ * Just use try lock to avoid lock contention because collapse hints are
+ * just "best-effort" optimization.
+ */
+ if (!spin_trylock(&khugepaged_mm_lock)) {
+ kfree(hint);
+ return;
+ }
+
+ slot = mm_slot_lookup(mm_slots_hash, mm);
+ if (!slot) {
+ spin_unlock(&khugepaged_mm_lock);
+ kfree(hint);
+ return;
+ }
+ khp_mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
+ list_add_tail(&hint->node, &khp_mm_slot->request[priority].hints);
+ spin_unlock(&khugepaged_mm_lock);
+
+ wake_up_interruptible(&khugepaged_wait);
+}
+
+/*
+ * Each enrolled mm owns one request struct per priority level, all of which
+ * live on the matching khugepaged_priority_queue[] list for the lifetime of
+ * the mm_slot. The caller iterates priorities from highest to lowest, and
+ * call collapse_scan_one_priority_entry() to process all mms at this priority,
+ * and handle pending collapse hints for each mm. Repeat until either
+ * @progress_max is reached, the per-mm-slot failure exceeds certain threshold,
+ * or no hints remain for this mm at this priority.
+ *
+ * Caller must hold khugepaged_mm_lock.
+ *
+ * Returns 1 if an mm was processed at this priority, 0 if no mm on
+ * khugepaged_priority_queue[@priority] had any pending hints.
+ */
+static int collapse_scan_one_priority_entry(unsigned int progress_max,
+ enum scan_result *result,
+ struct collapse_control *cc,
+ int priority,
+ int *fail_count)
+ __releases(&khugepaged_mm_lock)
+ __acquires(&khugepaged_mm_lock)
+{
+ struct khugepaged_collapse_requests *iter_req;
+ struct khugepaged_mm_slot *khp_mm_slot = NULL, *iter_slot;
+ struct mm_struct *mm = NULL;
+ bool lock_dropped = true;
+
+ /*
+ * We have to call mmget_not_zero() under khugepaged_mm_lock so that
+ * __khugepaged_exit() cannot free the embedding khugepaged_mm_slot from
+ * under us once we drop the spinlock.
+ */
+ list_for_each_entry(iter_req, &khugepaged_priority_queue[priority], node) {
+ if (list_empty(&iter_req->hints))
+ continue;
+ iter_slot = container_of(iter_req, struct khugepaged_mm_slot,
+ request[priority]);
+ if (mmget_not_zero(iter_slot->slot.mm)) {
+ khp_mm_slot = iter_slot;
+ mm = iter_slot->slot.mm;
+ break;
+ }
+ }
+ if (!khp_mm_slot)
+ return 0;
+
+ spin_unlock(&khugepaged_mm_lock);
+
+ /*
+ * Drain hints for this mm while we hold mmap_read_lock.
+ * collapse_single_pmd() may drop the mmap_lock; if so, try once to
+ * retake it for the next hint.
+ */
+ while (cc->progress < progress_max &&
+ *fail_count < KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL) {
+ struct khugepaged_collapse_hint *hint = NULL;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+
+ if (lock_dropped) {
+ if (!mmap_read_trylock(mm)) {
+ (*fail_count)++;
+ continue;
+ }
+ lock_dropped = false;
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ if (!list_empty(&khp_mm_slot->request[priority].hints)) {
+ hint = list_first_entry(&khp_mm_slot->request[priority].hints,
+ struct khugepaged_collapse_hint,
+ node);
+ list_del(&hint->node);
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ if (!hint)
+ break;
+
+ cc->progress++;
+ addr = hint->address;
+
+ if (unlikely(collapse_test_exit_or_disable(mm))) {
+ kfree(hint);
+ break;
+ }
+
+ /*
+ * Re-validate the cached VMA hint under mmap_read_lock. If the
+ * address is now covered by a different VMA, or no VMA at all,
+ * drop the entry. Note that the vma may be a different object
+ * than the one passed in at enqueue time, but that's a false
+ * positive that we can safely ignore.
+ */
+ vma = vma_lookup(mm, addr);
+ if (!vma || vma != hint->vma)
+ goto skip_hint;
+ if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED))
+ goto skip_hint;
+ if (addr < ALIGN(vma->vm_start, HPAGE_PMD_SIZE) ||
+ addr + HPAGE_PMD_SIZE > ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE))
+ goto skip_hint;
+
+ *result = collapse_single_pmd(addr, vma, &lock_dropped, cc);
+ if (*result != SCAN_SUCCEED)
+ (*fail_count)++;
+skip_hint:
+ kfree(hint);
+ }
+
+ if (!lock_dropped)
+ mmap_read_unlock(mm);
+ mmput(mm);
+ spin_lock(&khugepaged_mm_lock);
+ return 1;
+}
+
static void collapse_scan_mm_slot(unsigned int progress_max,
enum scan_result *result, struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
@@ -2858,10 +3169,35 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
struct mm_struct *mm;
struct vm_area_struct *vma;
unsigned int progress_prev = cc->progress;
+ int priority_queue_fail_times = 0;
+ int prio;
lockdep_assert_held(&khugepaged_mm_lock);
*result = SCAN_FAIL;
+ /*
+ * Drain explicit hints in priority order before the mm_slot scan.
+ * Iterate priorities from highest (lowest index) to lowest. For each
+ * priority, handle every mm with hints queued at that priority
+ * before we move on to the next, lower priority.
+ */
+ for (prio = 0; prio < NR_KHUGEPAGED_PRIORITY_LEVEL; prio++) {
+ while (priority_queue_fail_times < KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL &&
+ cc->progress < progress_max) {
+ if (collapse_scan_one_priority_entry(progress_max, result, cc,
+ prio, &priority_queue_fail_times) == 0)
+ break;
+ }
+
+ if (cc->progress >= progress_max ||
+ priority_queue_fail_times >= KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL)
+ break;
+ }
+
+ if (list_empty(&khugepaged_scan.mm_head) ||
+ cc->progress >= progress_max)
+ return;
+
if (khugepaged_scan.mm_slot) {
slot = khugepaged_scan.mm_slot;
} else {
--
2.52.0
* [PATCH 2/5] mm/khugepaged: use slab cache instead of normal kmalloc
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
2026-05-31 4:27 ` [PATCH 1/5] mm/khugepaged: add framework for khugepaged collapse hint Luka Bai
@ 2026-05-31 4:27 ` Luka Bai
2026-05-31 4:27 ` [PATCH 3/5] mm/khugepaged: add deduplication when adding new collapse hint Luka Bai
` (4 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Luka Bai @ 2026-05-31 4:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
From: Luka Bai <lukabai@tencent.com>
Add a slab cache called collapse_hint_cache for khugepaged collapse
hints, to speed up allocation and freeing of the hint structs.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
mm/khugepaged.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5090ffae73f3..04cf85ea5557 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -98,6 +98,7 @@ static unsigned int khugepaged_max_ptes_shared __read_mostly;
static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+static struct kmem_cache *collapse_hint_cache __ro_after_init;
#define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10
@@ -555,6 +556,13 @@ int __init khugepaged_init(void)
if (!mm_slot_cache)
return -ENOMEM;
+ collapse_hint_cache = KMEM_CACHE(khugepaged_collapse_hint, 0);
+ if (!collapse_hint_cache) {
+ kmem_cache_destroy(mm_slot_cache);
+ mm_slot_cache = NULL;
+ return -ENOMEM;
+ }
+
for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
INIT_LIST_HEAD(&khugepaged_priority_queue[i]);
@@ -569,6 +577,7 @@ int __init khugepaged_init(void)
void __init khugepaged_destroy(void)
{
kmem_cache_destroy(mm_slot_cache);
+ kmem_cache_destroy(collapse_hint_cache);
}
static inline int collapse_test_exit(struct mm_struct *mm)
@@ -686,7 +695,7 @@ static void khugepaged_release_collapse_hints(
list_for_each_entry_safe(hint, tmp, &req->hints, node) {
list_del(&hint->node);
- kfree(hint);
+ kmem_cache_free(collapse_hint_cache, hint);
}
}
@@ -3013,7 +3022,7 @@ void khugepaged_add_collapse_hint(struct mm_struct *mm,
if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
return;
- hint = kmalloc_obj(struct khugepaged_collapse_hint);
+ hint = kmem_cache_alloc(collapse_hint_cache, GFP_KERNEL);
if (!hint)
return;
@@ -3025,14 +3034,14 @@ void khugepaged_add_collapse_hint(struct mm_struct *mm,
* just "best-effort" optimization.
*/
if (!spin_trylock(&khugepaged_mm_lock)) {
- kfree(hint);
+ kmem_cache_free(collapse_hint_cache, hint);
return;
}
slot = mm_slot_lookup(mm_slots_hash, mm);
if (!slot) {
spin_unlock(&khugepaged_mm_lock);
- kfree(hint);
+ kmem_cache_free(collapse_hint_cache, hint);
return;
}
khp_mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
@@ -3125,7 +3134,7 @@ static int collapse_scan_one_priority_entry(unsigned int progress_max,
addr = hint->address;
if (unlikely(collapse_test_exit_or_disable(mm))) {
- kfree(hint);
+ kmem_cache_free(collapse_hint_cache, hint);
break;
}
@@ -3149,7 +3158,7 @@ static int collapse_scan_one_priority_entry(unsigned int progress_max,
if (*result != SCAN_SUCCEED)
(*fail_count)++;
skip_hint:
- kfree(hint);
+ kmem_cache_free(collapse_hint_cache, hint);
}
if (!lock_dropped)
--
2.52.0
* [PATCH 3/5] mm/khugepaged: add deduplication when adding new collapse hint
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
2026-05-31 4:27 ` [PATCH 1/5] mm/khugepaged: add framework for khugepaged collapse hint Luka Bai
2026-05-31 4:27 ` [PATCH 2/5] mm/khugepaged: use slab cache instead of normal kmalloc Luka Bai
@ 2026-05-31 4:27 ` Luka Bai
2026-05-31 4:27 ` [PATCH 4/5] mm/khugepaged: add accounting for successful hint or non-hint collapse Luka Bai
` (3 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Luka Bai @ 2026-05-31 4:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
From: Luka Bai <lukabai@tencent.com>
We need to check for duplicates before adding a new collapse hint,
and we want both the lookup and the insertion to be fast. There are
several options for doing that:
Option 1. Add a Bloom filter for the hint addresses, but that makes
it hard to delete a hint after it has been handled.
Option 2. Add a hashtable to each khugepaged_mm_slot. For an
efficient setup the hashtable should have maybe 16 ~ 32 buckets,
which costs 128 to 256 bytes for each mm_struct. That seems a
little wasteful.
Option 3. Add an xarray to each khugepaged_mm_slot, which only
takes 16 bytes for each mm_struct. However, adding a new entry to
the xarray may require a memory allocation. Collapse hints are
supposed to be a best-effort mechanism, so an xarray seems a little
too heavy for the calling function.
Option 4. Add a global hashtable for all the hints, keyed by their
address and mm_struct pointer. The key mixes the mm_struct pointer
and the address, but to save memory the deduplication only compares
the address. As a result, hints from different mms with the same
address may collide. But as stated above, collapse hints are only
best-effort, and collisions are rare because the lower PMD_SHIFT
bits of the address are always 0, which normally gives the mm_struct
pointer about 2M of space to scatter in (the key is calculated as
(ptr of mm ^ pmd aligned address)).
We chose option 4. Since the hashtable is global, it is protected by
a global lock (we directly reuse khugepaged_mm_lock here). To avoid
unnecessary lock spinning, we use trylock when adding a new hint and
give up when there is contention. Again, this is harmless for the
correctness of the mechanism.
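As an illustration of option 4, the key and the address-only
comparison can be modelled in userspace roughly as follows. This is a
sketch under assumptions, not the code in this patch: all model_*
names are hypothetical, the hlist chains are replaced by small
fixed-size arrays, the bucket function is a generic multiplicative
hash chosen for the example, and a 2M PMD (4K pages) is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PMD_SHIFT  21	/* assumes 2M PMDs, i.e. 4K pages */
#define MODEL_PMD_MASK   (~((UINT64_C(1) << MODEL_PMD_SHIFT) - 1))
#define MODEL_HASH_BITS  9	/* mirrors KHUGEPAGED_HINTS_HASH_BITS */
#define MODEL_NR_BUCKETS (1u << MODEL_HASH_BITS)
#define MODEL_BUCKET_CAP 8	/* hypothetical fixed chain length */

/* Each bucket stores only the aligned addresses of pending hints. */
struct model_table {
	uint64_t addr[MODEL_NR_BUCKETS][MODEL_BUCKET_CAP];
	size_t used[MODEL_NR_BUCKETS];
};

/* Same mixing as khugepaged_hint_key(): mm pointer XOR aligned address. */
static uint64_t model_key(uint64_t mm, uint64_t aligned_addr)
{
	return mm ^ aligned_addr;
}

/* Illustrative multiplicative hash that keeps the top MODEL_HASH_BITS bits. */
static unsigned int model_bucket(uint64_t key)
{
	return (unsigned int)((key * UINT64_C(0x61C8864680B583EB)) >>
			      (64 - MODEL_HASH_BITS));
}

/*
 * Try to record a hint. Returns false when a hint with the same aligned
 * address already sits in the same bucket (treated as a duplicate) or
 * when the model's bucket is full. The mm is not compared, so two mms
 * that land in the same bucket with the same address are deduplicated
 * against each other.
 */
static bool model_add_hint(struct model_table *t, uint64_t mm,
			   uint64_t address)
{
	uint64_t aligned = address & MODEL_PMD_MASK;
	unsigned int b;

	assert(t != NULL);
	b = model_bucket(model_key(mm, aligned));

	for (size_t i = 0; i < t->used[b]; i++)
		if (t->addr[b][i] == aligned)
			return false;
	if (t->used[b] == MODEL_BUCKET_CAP)
		return false;
	t->addr[b][t->used[b]++] = aligned;
	return true;
}
```

In this model a second hint for any address inside an already hinted
PMD of the same mm is rejected, while a hint from another mm for the
same address is rejected only if both keys happen to fall into the
same bucket.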
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
mm/khugepaged.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 78 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 04cf85ea5557..3f5eb8be06d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -100,6 +100,24 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
static struct kmem_cache *collapse_hint_cache __ro_after_init;
+/*
+ * Global lookup table used by khugepaged_add_collapse_hint() to deduplicate
+ * pending hints against an existing address. The key mixes mm and address
+ * but the dedup comparison only looks at @address. As a result, two
+ * different mms hinting the same address may collide. This is rare
+ * since the aligned_addr is always 0 for the lower PMD_SHIFT bits, which
+ * normally gives mm struct about 2M size for scattering (for 4K paging).
+ * And it's also harmless if the collision happens.
+ */
+#define KHUGEPAGED_HINTS_HASH_BITS 9
+static DEFINE_HASHTABLE(khugepaged_hint_lookup, KHUGEPAGED_HINTS_HASH_BITS);
+
+static inline unsigned long khugepaged_hint_key(struct mm_struct *mm,
+ unsigned long aligned_addr)
+{
+ return (unsigned long)mm ^ aligned_addr;
+}
+
#define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10
#define KHUGEPAGED_MIN_MTHP_ORDER 2
@@ -165,12 +183,15 @@ static struct khugepaged_scan khugepaged_scan = {
/**
* struct khugepaged_collapse_hint - one collapse hint for a specific address
- * @node: list node on khugepaged_collapse_requests.hints
- * @vma: hint pointer to the target VMA
- * @address: PMD-aligned virtual address inside @vma to attempt collapsing on
+ * @node: list node on khugepaged_collapse_requests.hints
+ * @hash_node: hlist node on the global khugepaged_hint_lookup table, used
+ * for deduplication.
+ * @vma: hint pointer to the target VMA
+ * @address: PMD-aligned virtual address inside @vma to attempt collapsing on
*/
struct khugepaged_collapse_hint {
struct list_head node;
+ struct hlist_node hash_node;
struct vm_area_struct *vma;
unsigned long address;
};
@@ -688,6 +709,29 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
__khugepaged_enter(vma->vm_mm);
}
+/*
+ * Unhash any hints still queued under @req. Caller must hold
+ * khugepaged_mm_lock so we can safely unhash each hint from the global
+ * khugepaged_hint_lookup table.
+ */
+static void khugepaged_unhash_collapse_hints(
+ struct khugepaged_collapse_requests *req)
+{
+ struct khugepaged_collapse_hint *hint, *tmp;
+
+ lockdep_assert_held(&khugepaged_mm_lock);
+
+ list_for_each_entry_safe(hint, tmp, &req->hints, node) {
+ hash_del(&hint->hash_node);
+ }
+}
+
+/*
+ * Free any hints still queued under @req. No lock need to be held. Caller
+ * must make sure the hints are already unhashed from the global
+ * khugepaged_hint_lookup table and the mm_slot is removed from the
+ * khugepaged_priority_queue[].
+ */
static void khugepaged_release_collapse_hints(
struct khugepaged_collapse_requests *req)
{
@@ -712,6 +756,14 @@ static void khugepaged_remove_priority_requests(struct khugepaged_mm_slot *khp_m
list_del(&khp_mm_slot->request[i].node);
}
+static void khugepaged_unhash_all_hints(struct khugepaged_mm_slot *khp_mm_slot)
+{
+ int i;
+
+ for (i = 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++)
+ khugepaged_unhash_collapse_hints(&khp_mm_slot->request[i]);
+}
+
static void khugepaged_release_all_hints(struct khugepaged_mm_slot *khp_mm_slot)
{
int i;
@@ -733,6 +785,7 @@ void __khugepaged_exit(struct mm_struct *mm)
hash_del(&slot->hash);
list_del(&slot->mm_node);
khugepaged_remove_priority_requests(khp_mm_slot);
+ khugepaged_unhash_all_hints(khp_mm_slot);
free = 1;
}
spin_unlock(&khugepaged_mm_lock);
@@ -1933,6 +1986,7 @@ static void collect_mm_slot(struct mm_slot *slot)
* mm_flags_clear(MMF_VM_HUGEPAGE, mm);
*/
+ khugepaged_unhash_all_hints(khp_mm_slot);
/* khugepaged_mm_lock actually not necessary for the below */
khugepaged_release_all_hints(khp_mm_slot);
mm_slot_free(mm_slot_cache, khp_mm_slot);
@@ -3001,8 +3055,9 @@ void khugepaged_add_collapse_hint(struct mm_struct *mm,
int priority, int max_order)
{
struct khugepaged_mm_slot *khp_mm_slot;
- struct khugepaged_collapse_hint *hint;
+ struct khugepaged_collapse_hint *hint, *existing;
struct mm_slot *slot;
+ unsigned long aligned_addr, key;
int orders;
if (!mm || !vma)
@@ -3022,12 +3077,15 @@ void khugepaged_add_collapse_hint(struct mm_struct *mm,
if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
return;
+ aligned_addr = address & HPAGE_PMD_MASK;
+ key = khugepaged_hint_key(mm, aligned_addr);
+
hint = kmem_cache_alloc(collapse_hint_cache, GFP_KERNEL);
if (!hint)
return;
hint->vma = vma;
- hint->address = address & HPAGE_PMD_MASK;
+ hint->address = aligned_addr;
/*
* Just use try lock to avoid lock contention because collapse hints are
@@ -3045,7 +3103,21 @@ void khugepaged_add_collapse_hint(struct mm_struct *mm,
return;
}
khp_mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
+
+ /*
+ * For deduplication. The comparison only checks @address here. See comments
+ * above khugepaged_hint_lookup definition for details.
+ */
+ hash_for_each_possible(khugepaged_hint_lookup, existing, hash_node, key) {
+ if (existing->address == aligned_addr) {
+ spin_unlock(&khugepaged_mm_lock);
+ kmem_cache_free(collapse_hint_cache, hint);
+ return;
+ }
+ }
+
list_add_tail(&hint->node, &khp_mm_slot->request[priority].hints);
+ hash_add(khugepaged_hint_lookup, &hint->hash_node, key);
spin_unlock(&khugepaged_mm_lock);
wake_up_interruptible(&khugepaged_wait);
@@ -3124,6 +3196,7 @@ static int collapse_scan_one_priority_entry(unsigned int progress_max,
struct khugepaged_collapse_hint,
node);
list_del(&hint->node);
+ hash_del(&hint->hash_node);
}
spin_unlock(&khugepaged_mm_lock);
--
2.52.0
* [PATCH 4/5] mm/khugepaged: add accounting for successful hint or non-hint collapse
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
` (2 preceding siblings ...)
2026-05-31 4:27 ` [PATCH 3/5] mm/khugepaged: add deduplication when adding new collapse hint Luka Bai
@ 2026-05-31 4:27 ` Luka Bai
2026-05-31 4:27 ` [PATCH 5/5] mm/khugepaged: add khugepaged collapse hint in mglru reference checking Luka Bai
` (2 subsequent siblings)
6 siblings, 0 replies; 11+ messages in thread
From: Luka Bai @ 2026-05-31 4:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
From: Luka Bai <lukabai@tencent.com>
Add two mTHP stat attributes that count successful khugepaged collapses,
one for collapses driven by a hint and one for collapses found by the
regular scan, so that both are easy to read from userspace. Note that
these two statistics only cover collapses initiated by khugepaged; they
do not count collapses triggered by MADV_COLLAPSE.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
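For reference, a minimal userspace sketch of how the two counters might be
consumed. It assumes the new files show up next to the existing mTHP stats,
i.e. under /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/ on a
system with 2048kB PMD-sized THP (the patch only counts HPAGE_PMD_ORDER
collapses). The helper names below are illustrative only and are not part of
the patch.

```c
#include <stdio.h>

/*
 * Assumed locations (2048kB PMD size, e.g. x86-64 with 4K base pages):
 *   .../hugepages-2048kB/stats/khugepaged_collapse_hint
 *   .../hugepages-2048kB/stats/khugepaged_collapse_non_hint
 */

/* Read one numeric sysfs counter. Returns 0 on success, -1 on failure. */
static int read_counter_file(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Percentage of khugepaged collapses that were driven by a hint. */
static double hint_share_percent(unsigned long long hint,
				 unsigned long long non_hint)
{
	unsigned long long total = hint + non_hint;

	/* No collapses yet: report 0 rather than dividing by zero. */
	return total ? 100.0 * (double)hint / (double)total : 0.0;
}
```

A caller would read both files with read_counter_file() and pass the two
values to hint_share_percent(); sampling before and after a benchmark run
gives the share for that run only.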
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 18 +++++++++++++++++-
3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index edece3e26985..9df0d7f71e95 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -147,6 +147,8 @@ enum mthp_stat_item {
MTHP_STAT_COLLAPSE_EXCEED_SWAP,
MTHP_STAT_COLLAPSE_EXCEED_NONE,
MTHP_STAT_COLLAPSE_EXCEED_SHARED,
+ MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT,
+ MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bf9b480bb3b0..0031fb4b0b09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -720,6 +720,8 @@ DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPP
DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+DEFINE_MTHP_STAT_ATTR(khugepaged_collapse_hint, MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT);
+DEFINE_MTHP_STAT_ATTR(khugepaged_collapse_non_hint, MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT);
static struct attribute *anon_stats_attrs[] = {
@@ -775,6 +777,8 @@ static struct attribute *any_stats_attrs[] = {
&split_failed_attr.attr,
&collapse_alloc_attr.attr,
&collapse_alloc_failed_attr.attr,
+ &khugepaged_collapse_hint_attr.attr,
+ &khugepaged_collapse_non_hint_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3f5eb8be06d1..2f21c0b6ab46 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -147,6 +147,15 @@ struct mthp_range {
struct collapse_control {
bool is_khugepaged;
+ /*
+ * True while khugepaged is draining a collapse hint queued via
+ * khugepaged_add_collapse_hint(). Used by collapse_single_pmd() to
+ * attribute a successful collapse to MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT
+ * or MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT. Only meaningful when the
+ * collapse is initiated by khugepaged (is_khugepaged == true).
+ */
+ bool from_priority_hint;
+
/* Num pages scanned per node */
u32 node_load[MAX_NUMNODES];
@@ -3012,8 +3021,13 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
mmap_read_unlock(mm);
}
end:
- if (cc->is_khugepaged && result == SCAN_SUCCEED)
+ if (cc->is_khugepaged && result == SCAN_SUCCEED) {
++khugepaged_pages_collapsed;
+ count_mthp_stat(HPAGE_PMD_ORDER,
+ cc->from_priority_hint ?
+ MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT :
+ MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT);
+ }
return result;
}
@@ -3227,7 +3241,9 @@ static int collapse_scan_one_priority_entry(unsigned int progress_max,
addr + HPAGE_PMD_SIZE > ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE))
goto skip_hint;
+ cc->from_priority_hint = true;
*result = collapse_single_pmd(addr, vma, &lock_dropped, cc);
+ cc->from_priority_hint = false;
if (*result != SCAN_SUCCEED)
(*fail_count)++;
skip_hint:
--
2.52.0
* [PATCH 5/5] mm/khugepaged: add khugepaged collapse hint in mglru reference checking
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
` (3 preceding siblings ...)
2026-05-31 4:27 ` [PATCH 4/5] mm/khugepaged: add accounting for successful hint or non-hint collapse Luka Bai
@ 2026-05-31 4:27 ` Luka Bai
2026-06-09 10:17 ` [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Nico Pache
2026-06-09 16:06 ` Lorenzo Stoakes
6 siblings, 0 replies; 11+ messages in thread
From: Luka Bai @ 2026-05-31 4:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
From: Luka Bai <lukabai@tencent.com>
lru_gen_look_around() is an MGLRU helper that reduces the amount of rmap
iteration. It is called from folio_referenced_one() when reclaim tries
to reclaim a cold page. Since it already holds the page table lock at
that point, it also checks the nearby PTEs and, relying on the locality
seen in most workloads, tries to update their generation if they were
accessed as well. It then puts a PMD that it considers full of hot pages
into a Bloom filter, for the page table walk in the next aging pass.
walk_mm() is used by MGLRU during aging. It walks the PMDs of an
mm_struct that are set in the Bloom filter, which is populated by
lru_gen_look_around() above and indicates that many pages under that
PMD are frequently accessed.
Since lru_gen_look_around() and walk_mm() already find hot PMD ranges,
their findings are a good source of khugepaged collapse hints, so
generate collapse hints from there.
Note that lru_gen_look_around() is called with the PTL held, so we do
not want to call khugepaged_add_collapse_hint() directly inside it,
because that function may allocate memory. Instead, introduce a new
struct area_access_info to carry the access information out of
lru_gen_look_around(), and add the collapse hint after the PTL is
released.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
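The record-then-submit pattern described above can be sketched in plain
userspace C as follows. This is only an illustration, not the kernel code:
NR_PRIORITY_LEVELS is an assumed stand-in for NR_KHUGEPAGED_PRIORITY_LEVEL
(taken as 2 here, matching the two-level diagram in the cover letter), and
struct access_info, struct hint, record_hot_area() and build_hints() are
hypothetical names. Only the priority rule (young * 2 >= total gives
priority 0) and the buffer size of 3 are taken from the patch.

```c
#define NR_PRIORITY_LEVELS	2	/* assumption, see note above */
#define NR_ACC_INFO		3	/* mirrors NR_ACC_INFO_EACH_ITER */

struct access_info {
	unsigned long address;
	int total;
	int young;
	int max_order;
};

struct hint {
	unsigned long address;
	int priority;
	int max_order;
};

/* Same rule as get_khp_collapse_priority() in this patch. */
static int collapse_priority(int total, int young)
{
	if (young * 2 >= total)
		return 0;
	return NR_PRIORITY_LEVELS - 1;
}

/*
 * Phase 1: would run with the lock held, so it only copies values into a
 * caller-provided fixed buffer and never allocates. Returns 1 if the area
 * was recorded, 0 if the buffer is full and the hint is simply dropped.
 */
static int record_hot_area(struct access_info *buf, int *count,
			   unsigned long address, int total, int young,
			   int max_order)
{
	if (*count >= NR_ACC_INFO)
		return 0;
	buf[*count].address = address;
	buf[*count].total = total;
	buf[*count].young = young;
	buf[*count].max_order = max_order;
	(*count)++;
	return 1;
}

/*
 * Phase 2: would run after the lock is dropped, where allocation is
 * allowed. Converts the records into hints, newest first, like the
 * loop added to folio_referenced_one(). Returns the number of hints.
 */
static int build_hints(const struct access_info *buf, int count,
		       struct hint *out)
{
	int n = 0;

	for (int i = count - 1; i >= 0; i--) {
		out[n].address = buf[i].address;
		out[n].priority = collapse_priority(buf[i].total, buf[i].young);
		out[n].max_order = buf[i].max_order;
		n++;
	}
	return n;
}
```

Dropping a record when the buffer is full is intentional in this sketch:
hints are best-effort, so losing one only means the area falls back to the
regular Round-Robin scan.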
include/linux/khugepaged.h | 7 +++++++
include/linux/mmzone.h | 17 +++++++++++++++--
mm/khugepaged.c | 12 ++++++++++++
mm/rmap.c | 27 ++++++++++++++++++++++++++-
mm/vmscan.c | 33 +++++++++++++++++++++++++++++----
5 files changed, 89 insertions(+), 7 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 815ae87f0f8e..e0793569a9f0 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -17,6 +17,7 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
+extern int get_khp_collapse_priority(int total, int young);
extern void khugepaged_add_collapse_hint(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -62,6 +63,12 @@ static inline bool current_is_khugepaged(void)
{
return false;
}
+
+static inline int get_khp_collapse_priority(int total, int young)
+{
+ return 0;
+}
+
static inline void khugepaged_add_collapse_hint(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1331a7b93f33..643dd500c121 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -441,6 +441,18 @@ enum lruvec_flags {
#endif /* !__GENERATING_BOUNDS_H */
+/*
+ * Used to get the young and total counts for a memory area,
+ * and also the maximum order of all the page table entries
+ * during scanning.
+ */
+struct area_access_info {
+ unsigned long address;
+ int total;
+ int young;
+ int max_order;
+};
+
/*
* Evictable folios are divided into multiple generations. The youngest and the
* oldest generation numbers, max_seq and min_seq, are monotonically increasing.
@@ -689,7 +701,8 @@ struct lru_gen_memcg {
void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr,
+ struct area_access_info **acc_info_ptr);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -712,7 +725,7 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
}
static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw,
- unsigned int nr)
+ unsigned int nr, struct area_access_info **acc_info_ptr)
{
return false;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2f21c0b6ab46..50c363846720 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -3031,6 +3031,18 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
return result;
}
+/*
+ * The caller needs to make sure the pmd is at least qualified for the
+ * lowest priority of collapsing since this function will always return
+ * a legal priority value.
+ */
+int get_khp_collapse_priority(int total, int young)
+{
+ if (young * 2 >= total)
+ return 0;
+ return NR_KHUGEPAGED_PRIORITY_LEVEL - 1;
+}
+
/*
* khugepaged_add_collapse_hint - enqueue a collapse hint
* @mm: target mm
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..1cd111e7b299 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -75,6 +75,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/mm_inline.h>
#include <linux/oom.h>
+#include <linux/khugepaged.h>
#include <asm/tlb.h>
@@ -911,6 +912,12 @@ struct folio_referenced_arg {
struct mem_cgroup *memcg;
};
+/*
+ * acc_info is currently only used to track access patterns for khugepaged
+ * collapse hints. 3 entries are enough for most cases, and it's totally
+ * safe if we missed some hints.
+ */
+#define NR_ACC_INFO_EACH_ITER 3
/*
* arg: folio_referenced_arg will be passed
*/
@@ -921,6 +928,8 @@ static bool folio_referenced_one(struct folio *folio,
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
int ptes = 0, referenced = 0;
unsigned int nr;
+ struct area_access_info acc_info[NR_ACC_INFO_EACH_ITER] = {0};
+ int acc_info_count = 0;
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;
@@ -979,8 +988,16 @@ static bool folio_referenced_one(struct folio *folio,
* simplest approach is to disable this look-around optimization.
*/
if (lru_gen_enabled() && !lru_gen_switching() && pvmw.pte) {
- if (lru_gen_look_around(&pvmw, nr))
+ struct area_access_info *acc_info_ptr = NULL;
+
+ /* If the acc_info is full, skip the remaining ones */
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ acc_info_count < NR_ACC_INFO_EACH_ITER)
+ acc_info_ptr = &acc_info[acc_info_count];
+ if (lru_gen_look_around(&pvmw, nr, &acc_info_ptr))
referenced++;
+ if (acc_info_ptr && acc_info_ptr != &acc_info[acc_info_count])
+ acc_info_count++;
} else if (pvmw.pte) {
if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
referenced++;
@@ -1019,6 +1036,14 @@ static bool folio_referenced_one(struct folio *folio,
pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
}
+ for (--acc_info_count; acc_info_count >= 0; acc_info_count--) {
+ khugepaged_add_collapse_hint(vma->vm_mm, vma,
+ acc_info[acc_info_count].address,
+ get_khp_collapse_priority(acc_info[acc_info_count].total,
+ acc_info[acc_info_count].young),
+ acc_info[acc_info_count].max_order);
+ }
+
if (!pra->mapcount)
return false; /* To break the loop */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8a90911bf88..a0caf5cac951 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3463,7 +3463,7 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
}
static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
- struct mm_walk *args)
+ struct mm_walk *args, struct area_access_info *acc_info)
{
int i;
bool dirty;
@@ -3472,6 +3472,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
unsigned long addr;
int total = 0;
int young = 0;
+ int max_order = 0;
struct folio *last = NULL;
struct lru_gen_mm_walk *walk = args->private;
struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3522,6 +3523,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
max_nr, FPB_MERGE_YOUNG_DIRTY);
total += nr - 1;
walk->mm_stats[MM_LEAF_TOTAL] += nr - 1;
+ max_order = max(max_order, folio_order(folio));
}
if (!test_and_clear_young_ptes_notify(args->vma, addr, cur_pte, nr))
@@ -3550,6 +3552,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
lazy_mmu_mode_disable();
pte_unmap_unlock(pte, ptl);
+ acc_info->young = young;
+ acc_info->max_order = max_order;
+ acc_info->total = total;
return suitable_to_scan(total, young);
}
@@ -3667,6 +3672,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
vma = args->vma;
for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
pmd_t val = pmdp_get_lockless(pmd + i);
+ struct area_access_info acc_info = {0};
next = pmd_addr_end(addr, end);
@@ -3699,11 +3705,16 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_NONLEAF_FOUND]++;
- if (!walk_pte_range(&val, addr, next, args))
+ if (!walk_pte_range(&val, addr, next, args, &acc_info))
continue;
walk->mm_stats[MM_NONLEAF_ADDED]++;
+ /* When acc_info has valid value */
+ if (acc_info.total > 0)
+ khugepaged_add_collapse_hint(vma->vm_mm, vma, addr,
+ get_khp_collapse_priority(acc_info.total, acc_info.young),
+ acc_info.max_order);
/* carry over to the next generation */
update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
}
@@ -4183,7 +4194,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr,
+ struct area_access_info **acc_info_ptr)
{
int i;
bool dirty;
@@ -4202,6 +4214,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
struct lru_gen_mm_state *mm_state;
unsigned long max_seq;
int gen;
+ unsigned int max_order = 0;
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4265,6 +4278,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
nr = folio_pte_batch_flags(folio, NULL, pte, &ptent,
max_nr, FPB_MERGE_YOUNG_DIRTY);
+ max_order = max(folio_order(folio), max_order);
}
if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr))
@@ -4288,8 +4302,19 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
lazy_mmu_mode_disable();
/* feedback from rmap walkers to page table walkers */
- if (mm_state && suitable_to_scan(i, young))
+ if (mm_state && suitable_to_scan(i, young)) {
+ if (*acc_info_ptr) {
+ struct area_access_info acc_info = {
+ .address = start,
+ .total = i,
+ .young = young,
+ .max_order = max_order
+ };
+ *(*acc_info_ptr) = acc_info;
+ (*acc_info_ptr)++;
+ }
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+ }
mem_cgroup_put(memcg);
--
2.52.0
* Re: [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
` (4 preceding siblings ...)
2026-05-31 4:27 ` [PATCH 5/5] mm/khugepaged: add khugepaged collapse hint in mglru reference checking Luka Bai
@ 2026-06-09 10:17 ` Nico Pache
2026-06-09 14:44 ` Lorenzo Stoakes
2026-06-09 16:06 ` Lorenzo Stoakes
6 siblings, 1 reply; 11+ messages in thread
From: Nico Pache @ 2026-06-09 10:17 UTC (permalink / raw)
To: Luka Bai, David Hildenbrand, David Rientjes
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Kairui Song, Qi Zheng, Shakeel Butt, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Rik van Riel, Harry Yoo, Jann Horn, Johannes Weiner,
linux-kernel, Luka Bai
On Sat, May 30, 2026 at 10:33 PM Luka Bai <lukafocus@icloud.com> wrote:
>
> Khugepaged is a background daemon for collapsing feasible pages together
> into a transparent hugepage in all sorts of orders up to PMD_ORDER. However,
> it doesn't have any preference in its collapsing and just iterates through
> all the qualified mm_structs, scanning their page tables from the beginning
> to the end. This is quite inefficient, especially for large address spaces
> considering how slow khugepaged can be, and it may waste many hugepage
> resources collapsing memory areas that are seldom accessed.
>
> We would like to give khugepaged some preference hints when we find that
> certain areas are good candidates for collapsing. For example, if some memory
> areas are frequently accessed, then we know that it's valuable to merge
> them into a bigger folio since it will reduce many TLB misses.
>
> For example, MGLRU has walk_mm() and lru_gen_look_around(), which scan
> frequently accessed areas to save some work on rmap walking and
> generation elevation. Since they find hot memory areas along the way,
> it should be valuable to merge these areas into large folios.
> MADV_COLLAPSE could be used, but it would cost too much time, hurt
> reclaim performance, and slow down any process that enters
> the slow path of memory allocation. So the better choice should be to
> tell khugepaged to do it asynchronously.
>
> We add a khugepaged collapse hint framework in this patchset. The caller can
> call khugepaged_add_collapse_hint() to add hints that make khugepaged
> prioritize collapsing these specific addresses before doing Round-Robin
> scanning. Each mm_slot that belonged to a mm_struct in the previous
> mm_slots_hash is now a khugepaged_mm_slot, which comprises the old mm_slot
> struct and NR_KHUGEPAGED_PRIORITY_LEVEL instances of struct
> khugepaged_collapse_requests. The request struct for each mm_struct is
> put in the global struct khugepaged_priority_queue matching its
> priority when __khugepaged_enter() is called on this mm (each mm gets its own
> request structs to allow hint dispersion and balancing across all the
> mm_structs, which will be added in future patches), and all the hints are put
> in these request structs. Each hint has a target address and a target vma
> struct. An example of the framework is shown below:
>
> global collapse hints queues:
> prio 0 ------()----------------------------------()---------------
> mm_slot0(process A) mm_slot1(process B)
> | |
> hint0---hint1---hint2---hint3 hint4---hint5---hint6
>
> prio 1 ------()----------------------------------()---------------
> mm_slot0(process A) mm_slot1(process B)
> | |
> ------- hint7---hint8
>
> khugepaged scans the queues from the highest priority (prio 0 in the graph
> above) to the lowest priority (prio 1 in the graph). Within a queue it walks
> the list of struct khugepaged_mm_slot (mm_slot0 and mm_slot1 in the graph
> above), so it starts from mm_slot0 in the queue of priority 0. Then
> khugepaged scans all the hints listed in the slot (hint0 ~ hint3 in the above
> graph). After handling one hint (no matter whether the collapse succeeds or
> fails), the hint is deleted. If a khugepaged_mm_slot doesn't have any hints in
> it, khugepaged skips it and scans the next mm_slot in the same priority; if
> there is no hint left in the queue of prio 0, khugepaged scans the ones of
> prio 1; if there is no hint in any prio queue, it falls back to Round-Robin
> scanning as before.
>
> khugepaged_add_collapse_hint() is for adding hints, and it only gets called
> by walk_mm() and lru_gen_look_around() right now. In the future we may
> call it in more scenarios where we find hot memory areas, for example in DAMON.
>
> We tested the performance by using valkey-server (based on redis) together with
> memtier_benchmark to simulate a Gaussian distribution of get/set operations on
> a 160G, 64-core x86 VM. The dataset is about 3G. After preloading the db, the
> test parameters were as below:
> memtier_benchmark -s 127.0.0.1 -p 6379 \
> --ratio=1:1 \
> --key-pattern=G:G \
> --key-minimum=1 --key-maximum=3000000 \
> --key-median=2000000 \
> --key-stddev=150000 \
> -d 1024 \
> -t 1 -c 10 \
> -n 2500000 \
> --pipeline=32 \
> --hide-histogram
>
> Since we wanted to see the influence of khugepaged collapse hints on the reduction of
> TLB misses, we made khugepaged scan every 1 second, and used the userspace
> interface to run walk_mm() every 2 seconds for the cgroup that valkey-server was placed in.
> We made sure the server used only 4k pages before we ran the test, and that only khugepaged
> could collapse them into large folios. We enabled anonymous THP of order 9, which is PMD
> size in most setups. We used perf stat to monitor the TLB miss statistics.
>
> After repeated tests, we saw a 13.50% reduction in dTLB-load-misses and
> a 5% reduction in dTLB-store-misses compared to the setup without any collapse
> hints. The final throughput of memtier_benchmark improved by about 2% to 5%
> on average, which is less pronounced than the TLB miss reduction. We believe
> this is because too many factors influence the final result of a random
> redis test, so the effect of TLB misses on the final throughput is diluted by
> other factors.
>
> Patch Details:
> ========
> * Patch 1 adds the basic khugepaged hint framework introduced
> above. Details can be seen in the commit itself and the comments in the
> code.
> * Patch 2 adds a slab cache for khugepaged_collapse_hint, which
> improves the performance of allocating and freeing the hints.
> * Patch 3 adds a deduplication mechanism for the hints so that we do
> not add a hint that points to an already queued address.
> * Patch 4 adds accounting for successful collapses initiated by
> hint or non-hint.
> * Patch 5 adds the collapse hint in lru_gen_look_around() and walk_mm()
> of mglru.
>
> Thanks for reading. Comments and suggestions are very welcome!
Hi Luka,
I haven't reviewed the code yet, but the overall concept is
interesting (it should probably have been an RFC first, but that's
fine).
I had future plans for something similar as part of the thp=auto work;
however that requires significant thought and investigation into how
we can properly gather hints for collapse/split THP candidates. From
my perspective we'd want a more global structure/system outside of
khugepaged, that would directly call khugepaged (and others like
split, etc). It would also tie into the allocator so that at fault
time it could leverage the hints to make better decisions. My fear
with this series is that making a decision now might complicate future
work by adding complexity we may eventually want to remove for a
better solution.
If you have the chance perhaps you can lead a discussion on your
proposal at the biweekly MM alignment session.
+David Rientjes as he leads those discussions. We could use that time
to lay out a plan for what needs to be done for this work, and for the
work surrounding thp=auto as I believe they will be interdependent :)
Cheers,
-- Nico
>
> Signed-off-by: Luka Bai <lukabai@tencent.com>
> ---
> Luka Bai (5):
> mm/khugepaged: add framework for khugepaged collapse hint
> mm/khugepaged: use slab cache instead of normal kmalloc
> mm/khugepaged: add deduplication when adding new collapse hint
> mm/khugepaged: add accounting for successful hint or non-hint collapse
> mm/khugepaged: add khugepaged collapse hint in mglru reference checking
>
> include/linux/huge_mm.h | 2 +
> include/linux/khugepaged.h | 20 ++
> include/linux/mmzone.h | 17 +-
> mm/huge_memory.c | 4 +
> mm/khugepaged.c | 460 ++++++++++++++++++++++++++++++++++++++++++++-
> mm/rmap.c | 27 ++-
> mm/vmscan.c | 33 +++-
> 7 files changed, 549 insertions(+), 14 deletions(-)
> ---
> base-commit: e1af79f3291a268adf4e149e1faba3052743e898
> change-id: 20260530-thp_collapse_hint-ec92bd943797
>
> Best regards,
> --
> Luka Bai <lukabai@tencent.com>
>
* Re: [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru
2026-06-09 10:17 ` [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Nico Pache
@ 2026-06-09 14:44 ` Lorenzo Stoakes
0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Stoakes @ 2026-06-09 14:44 UTC (permalink / raw)
To: Nico Pache
Cc: Luka Bai, David Hildenbrand, David Rientjes, linux-mm,
Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Rik van Riel,
Harry Yoo, Jann Horn, Johannes Weiner, linux-kernel, Luka Bai
On Tue, Jun 09, 2026 at 04:17:45AM -0600, Nico Pache wrote:
> I had future plans for something similar as part of the thp=auto work;
> however that requires significant thought and investigation into how
> we can properly gather hints for collapse/split THP candidates. From
> my perspective we'd want a more global structure/system outside of
> khugepaged, that would directly call khugepaged (and others like
> split, etc). It would also tie into the allocator so that at fault
> time it could leverage the hints to make better decisions. My fear
> with this series is that making a decision now might complicate future
> work by adding complexity we may eventually want to remove for a
> better solution.
I know this is future planning stuff, but I want to point out that we need
to see significant rework of the THP code base before accepting any further
major changes.
>
> If you have the chance perhaps you can lead a discussion on your
> proposal at the biweekly MM alignment session.
>
> +David Rientjes as he leads those discussions. We could use that time
> to lay out a plan for what needs to be done for this work, and for the
> work surrounding thp=auto as I believe they will be interdependent :)
We prefer to discuss THP topics in the THP cabal meeting, which both THP
maintainers regularly attend :) Having separate sessions that we might not be
aware of isn't really helpful.
And 'THP auto' is a broad topic rather than a new feature. I don't think it
should be seen as a topic 'owned' by anybody, but rather something that we
as a community should discuss.
Series that appear from nowhere trying to implement significant changes
along those lines without community discussion will not be hugely
appreciated :)
Thanks, Lorenzo
* Re: [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru
2026-05-31 4:27 [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Luka Bai
` (5 preceding siblings ...)
2026-06-09 10:17 ` [PATCH 0/5] mm/khugepaged: add collapse hint machanism for khugepaged and use in mglru Nico Pache
@ 2026-06-09 16:06 ` Lorenzo Stoakes
6 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Stoakes @ 2026-06-09 16:06 UTC (permalink / raw)
To: Luka Bai
Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Kairui Song, Qi Zheng, Shakeel Butt, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Rik van Riel, Harry Yoo, Jann Horn,
Johannes Weiner, linux-kernel, Luka Bai
Hi Luka,
This should have been an RFC (as should your other recent THP submission, [0]).
THP maintainer resource is highly constrained right now and we simply lack the
bandwidth for larger work at the moment.
In addition, I don't think we want to see any further major changes to THP
without significant work first being done to rework and improve
the existing code base.
We have a lot of technical debt and adding more on top is building on sand,
really.
In general, we expect newcomers to the community to become familiar with the
code base through smaller changes (and to help out with review) first, prior to
submitting larger changes.
So we don't really have the resource to review this series at the
moment, and I suggest you focus firstly on finding ways to refactor and improve
the khugepaged code.
Thanks, Lorenzo
[0]:https://lore.kernel.org/all/20260501-thp_cow-v1-0-005377483738@tencent.com/
On Sun, May 31, 2026 at 12:27:16PM +0800, Luka Bai wrote:
> Khugepaged is a background daemon for collapsing feasible pages together
> into a transparent hugepage in all sorts of orders up to PMD_ORDER. However,
> it doesn't have any preference in its collapsing and just iterates through
> all the qualified mm_structs, scanning their page tables from the beginning
> to the end. This is quite inefficient, especially for large address spaces
> considering how slow khugepaged can be, and it may waste many hugepage
> resources collapsing memory areas that are seldom accessed.
>
> We would like to give khugepaged some preference hints when we find that
> certain areas are good candidates for collapsing. For example, if some memory
> areas are frequently accessed, then we know that it's valuable to merge
> them into a bigger folio since it will reduce many TLB misses.
>
> For example, MGLRU has walk_mm() and lru_gen_look_around(), which scan
> frequently accessed areas to save some work on rmap walking and
> generation elevation. Since they find hot memory areas along the way,
> it should be valuable to merge these areas into large folios.
> MADV_COLLAPSE could be used, but it would cost too much time, hurt
> reclaim performance, and slow down any process that enters
> the slow path of memory allocation. So the better choice should be to
> tell khugepaged to do it asynchronously.
>
> We add a khugepaged collapse hint framework in this patchset. The caller can
> call khugepaged_add_collapse_hint() to add hints that make khugepaged
> prioritize collapsing these specific addresses before doing Round-Robin
> scanning. Each mm_slot that belonged to a mm_struct in the previous
> mm_slots_hash is now a khugepaged_mm_slot, which comprises the old mm_slot
> struct and NR_KHUGEPAGED_PRIORITY_LEVEL instances of struct
> khugepaged_collapse_requests. The request struct for each mm_struct is
> put in the global struct khugepaged_priority_queue matching its
> priority when __khugepaged_enter() is called on this mm (each mm gets its own
> request structs to allow hint dispersion and balancing across all the
> mm_structs, which will be added in future patches), and all the hints are put
> in these request structs. Each hint has a target address and a target vma
> struct. An example of the framework is shown below:
>
> global collapse hints queues:
> prio 0 ------()----------------------------------()---------------
>     mm_slot0(process A)                 mm_slot1(process B)
>              |                                   |
>            hint0---hint1---hint2---hint3       hint4---hint5---hint6
>
> prio 1 ------()----------------------------------()---------------
>     mm_slot0(process A)                 mm_slot1(process B)
>              |                                   |
>           -------                              hint7---hint8
>
> khugepaged scans the queues from the highest priority (prio 0 in the graph
> above) to the lowest priority (prio 1 in the graph). Within a queue it goes
> through the list and checks every struct khugepaged_mm_slot (mm_slot0 and
> mm_slot1 in the graph above), so it starts from mm_slot0 in the queue of
> priority 0. khugepaged then handles all the hints listed in that slot
> (hint0 ~ hint3 in the graph above). After a hint is handled (no matter
> whether the collapse succeeds or fails), the hint is deleted. If a
> khugepaged_mm_slot doesn't have any hints in it, khugepaged skips it and
> scans the next mm_slot in the same priority; if there is no hint left in the
> queue of prio 0, khugepaged scans the queue of prio 1; if there is no hint in
> any priority queue, it falls back to Round-Robin scanning as before.
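
The scan order described above can be sketched in userspace C. This is only
a minimal illustration under stated assumptions, not the code from the patch:
the names mm_slot_sketch, request, prio_queues, add_hint() and next_hint() are
hypothetical, NR_PRIO stands in for NR_KHUGEPAGED_PRIORITY_LEVEL, a fixed
array replaces the kernel lists, and locking and the vma pointer are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_PRIO   2	/* stand-in for NR_KHUGEPAGED_PRIORITY_LEVEL */
#define MAX_SLOTS 8	/* hypothetical bound, only for this sketch */

struct hint {
	unsigned long addr;	/* target address (vma omitted here) */
	struct hint *next;
};

struct request {		/* per-mm, per-priority FIFO of hints */
	struct hint *head, *tail;
};

struct mm_slot_sketch {		/* plays the role of khugepaged_mm_slot */
	int id;
	struct request req[NR_PRIO];
};

struct prio_queues {		/* every slot is visible at every priority */
	struct mm_slot_sketch *slots[MAX_SLOTS];
	size_t nr_slots;
};

/* Append a hint to the slot's request of the given priority. */
static int add_hint(struct mm_slot_sketch *slot, int prio, unsigned long addr)
{
	struct hint *h;

	assert(prio >= 0 && prio < NR_PRIO);
	h = malloc(sizeof(*h));
	if (!h)
		return -1;
	h->addr = addr;
	h->next = NULL;
	if (slot->req[prio].tail)
		slot->req[prio].tail->next = h;
	else
		slot->req[prio].head = h;
	slot->req[prio].tail = h;
	return 0;
}

/*
 * Pop the next hint: highest priority first, slots in list order, empty
 * slots skipped. The hint is freed once taken, whatever the caller does
 * with it. Returns 0 when no hint is left, which is where the caller
 * would fall back to Round-Robin scanning.
 */
static int next_hint(struct prio_queues *q, int *slot_id, unsigned long *addr)
{
	for (int prio = 0; prio < NR_PRIO; prio++) {
		for (size_t i = 0; i < q->nr_slots; i++) {
			struct request *r = &q->slots[i]->req[prio];
			struct hint *h = r->head;

			if (!h)
				continue;
			r->head = h->next;
			if (!r->head)
				r->tail = NULL;
			*slot_id = q->slots[i]->id;
			*addr = h->addr;
			free(h);
			return 1;
		}
	}
	return 0;
}
```

With the layout from the graph (hint0 ~ hint3 on mm_slot0 and hint4 ~ hint6 on
mm_slot1 at prio 0, hint7 ~ hint8 on mm_slot1 at prio 1), repeated calls to
next_hint() hand out hint0 through hint8 in that order and then return 0.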
>
> khugepaged_add_collapse_hint() is for adding hints, and it is only called
> by walk_mm() and lru_gen_look_around() right now. In the future we may
> call it in more scenarios where we find hot memory areas, for example in DAMON.
>
> We tested the performance by using valkey-server (based on redis) together with
> memtier_benchmark to simulate a Gaussian distribution of get/set operations on
> a 160G, 64-core x86 VM. The dataset is about 3G. After preloading the db, the
> test parameters were as below:
> memtier_benchmark -s 127.0.0.1 -p 6379 \
> --ratio=1:1 \
> --key-pattern=G:G \
> --key-minimum=1 --key-maximum=3000000 \
> --key-median=2000000 \
> --key-stddev=150000 \
> -d 1024 \
> -t 1 -c 10 \
> -n 2500000 \
> --pipeline=32 \
> --hide-histogram
>
> Since we wanted to see the influence of khugepaged collapse hints on the
> reduction of TLB misses, we made khugepaged scan every 1 second, and used the
> userspace interface to run walk_mm() every 2 seconds for the cgroup that
> valkey-server was placed in. We made sure the server used only 4k pages before
> we ran the test, and that only khugepaged could collapse them into large
> folios. We enabled anonymous THP of order 9, which is PMD size in most
> setups. We used perf stat to monitor the TLB miss statistics.
>
> After repeated tests, we saw a 13.50% reduction in dTLB-load-misses and a
> 5% reduction in dTLB-store-misses compared to the setup without any collapse
> hint. The final throughput of memtier_benchmark improved by about 2% to 5%
> on average, which is less pronounced than the TLB miss reduction. We believe
> this is because too many factors influence the final result of a random
> redis test, so the effect of TLB misses on the final throughput was diluted by
> other factors.
>
> Patch Details:
> ========
> * Patch 1 adds the basic khugepaged hint framework introduced
>   above. Details can be found in the commit itself and in the code
>   comments.
> * Patch 2 adds a slab cache for khugepaged_collapse_hint, which
>   improves the performance of allocating and freeing the hints.
> * Patch 3 adds a deduplication mechanism for the hints so that we do
>   not add a hint that points to a repeated address.
> * Patch 4 adds accounting for successful collapses initiated by
>   hint or non-hint.
> * Patch 5 adds the collapse hints in lru_gen_look_around() and walk_mm()
>   of MGLRU.
>
> Thanks for reading. Comments and suggestions are very welcome!
>
> Signed-off-by: Luka Bai <lukabai@tencent.com>
> ---
> Luka Bai (5):
> mm/khugepaged: add framework for khugepaged collapse hint
> mm/khugepaged: use slab cache instead of normal kmalloc
> mm/khugepaged: add deduplication when adding new collapse hint
> mm/khugepaged: add accounting for successful hint or non-hint collapse
> mm/khugepaged: add khugepaged collapse hint in mglru reference checking
>
> include/linux/huge_mm.h | 2 +
> include/linux/khugepaged.h | 20 ++
> include/linux/mmzone.h | 17 +-
> mm/huge_memory.c | 4 +
> mm/khugepaged.c | 460 ++++++++++++++++++++++++++++++++++++++++++++-
> mm/rmap.c | 27 ++-
> mm/vmscan.c | 33 +++-
> 7 files changed, 549 insertions(+), 14 deletions(-)
> ---
> base-commit: e1af79f3291a268adf4e149e1faba3052743e898
> change-id: 20260530-thp_collapse_hint-ec92bd943797
>
> Best regards,
> --
> Luka Bai <lukabai@tencent.com>
>