* Re: [PATCH 1/1] tools/rv: ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-06-04 11:08 UTC (permalink / raw)
To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com>
On Thu, 2026-04-23 at 17:19 +0300, unknownbbqrx wrote:
>
> ikm_fill_monitor_definition() copies monitor name and description
> with strncpy(), but does not guarantee NUL termination when source
> strings are equal to or longer than the destination buffers.
>
> Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both
> fields to keep them safe for later string operations.
>
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>
Contributions need to be attributed to real people using official
names. I'm going to re-send this patch with me as author and a
Suggested-by: unknownbbqrx <dev@unknownbbqr.xyz>, unless you answer
with an appropriate attribution (i.e. your real name) [1].
Thanks,
Gabriele
[1] -
https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> ---
> tools/verification/rv/src/in_kernel.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index 4bb746ea6..d32453824 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char
> *name, struct monitor *ikm, char *co
> return -1;
> }
>
> - strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
> + strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
> + ikm->name[sizeof(ikm->name) - 1] = '\0';
> ikm->enabled = enabled;
> - strncpy(ikm->desc, desc, MAX_DESCRIPTION);
> -
> + strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
> + ikm->desc[sizeof(ikm->desc) - 1] = '\0';
> free(desc);
>
> return 0;
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Tomas Glozar @ 2026-06-04 11:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: Gabriele Monaco, linux-kernel, linux-trace-kernel, unknownbbqrx,
Wen Yang
In-Reply-To: <20260603191627.5cb4ef6c@fedora>
Hi Steven,
čt 4. 6. 2026 v 1:19 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> Hi Gabriele,
>
> What is this? All commits need to be authored by and signed off by from
> a real person with their official name.
>
> https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
>
> -- Steve
>
Is this really still the case? Note that the document says:
"using a known identity (sorry, no anonymous contributions.)"
It really used to say "real name", but it was changed by Linus in 2023
[1]. Note especially this section by Linus:
" It was 2006, and nobody reacted to the wording, the whole Facebook 'real
name' controversy was a decade in the future, and nobody even thought
about it. And despite the language, we've always accepted nicknames and
that language was never meant to be any kind of exclusionary wording."
The wording sounds quite clear to me. And I'm certain that there were
contributions under pseudonymous identity that have been accepted
since then, most famously by Asahi Lina, a vtuber persona [2]. I don't
really see a difference between that and "unknownbbqrx", other than
the latter doesn't sound like a real name.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4563201f33a022fc0353033d9dfeb1606a88330
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e2f6b0ef8551bf3bd8255729d27e3ad9451e562
Tomas
^ permalink raw reply
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 11:38 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>
I will go review the thread about the cache maintenance separately and
respond about that.
On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
The logic LGTM generally, some questions for understanding below, and of
course as per above I want to review the Lance/David subthread.
Thanks!
> ---
> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> * while allocating a THP, as that could trigger direct reclaim/compaction.
> * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> + int referenced, int unmapped, struct collapse_control *cc,
> + unsigned int order)
> {
> + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
As mentioned elsewhere for some reason this was dropped in
mm-unstable. Maybe a bad conflict resolution?
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> enum scan_result result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + bool anon_vma_locked = false;
>
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> }
>
> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> + referenced, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> vma_start_write(vma);
> - result = check_pmd_still_valid(mm, address, pmd);
> + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> anon_vma_lock_write(vma->anon_vma);
> + anon_vma_locked = true;
I worry that we hold this lock a lot longer now? Maybe the algorithmic
change alters that, but Claude did suggest on the s390 bug that longer lock
hold might be an issue.
I wonder if we'll observe lock contention as a result?
Correct me if I'm wrong and we're not holding longer than previously,
however. Just appears that we do.
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> + end_addr);
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * Parallel GUP-fast is fine since GUP-fast will back off when
> * it detects PMD is changed.
> */
> - _pmd = pmdp_collapse_flush(vma, address, pmd);
> + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> + order, &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_NO_PTE_TABLE;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
OK I seem to remember this is because we're holding the anon_vma lock
longer. That does imply that on e.g. x86-64 the RCU lock is being held a
bit longer also as well as the anon_vma loc.
I guess it's also because we need to hold anon_vma and pte lock because
we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
THP did.
(Rememberings going on here :)
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> + WARN_ON_ONCE(!pmd_none(*pmd));
> /*
> * We can only use set_pmd_at when establishing
> * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> */
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> spin_unlock(pmd_ptl);
> - anon_vma_unlock_write(vma->anon_vma);
> goto out_up_write;
> }
>
> /*
> - * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> + * For PMD collapse all pages are isolated and locked so anon_vma
> + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> + * removed and not all pages are isolated and locked, so we must hold
Right because some PTE entries be unaffected by the change.
> + * the lock to prevent neighboring folios from attempting to access
> + * this PMD until its reinstalled.
OK. This is slightly annoying for my CoW context work as it means there's
another case where we need to explicitly hold an anon_vma lock for
correctness :)
Anyway I will think about that separately, is what it is. And in fact
motivates to want this merged earlier so I can work against it :)
> */
> - anon_vma_unlock_write(vma->anon_vma);
> + if (is_pmd_order(order)) {
> + anon_vma_unlock_write(vma->anon_vma);
> + anon_vma_locked = false;
> + }
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> - pte_unmap(pte);
> + vma, start_addr, pte_ptl,
> + order, &compound_pagelist);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + if (is_pmd_order(order)) {
> + pgtable = pmd_pgtable(_pmd);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> + } else {
> + /*
> + * set_ptes is called in map_anon_folio_pte_nopf with the
> + * pmd_ptl lock still held; this is safe as the PMD is expected
PMD entry you mean?
> + * to be none. The pmd entry is then repopulated below.
> + */
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
So here we populate entries in the existing PTE _table_ to point at the new
order>0 folio? With arm64 of course doing transparent contpte stuff?
> + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
And then we reinstall the pre-existing PMD _entry_ from none -> what it was
before?
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
> + if (pte)
> + pte_unmap(pte);
> mmap_write_unlock(mm);
> out_nolock:
> if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
> result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc);
> + unmapped, cc, HPAGE_PMD_ORDER);
> /* collapse_huge_page will return with the mmap_lock released */
> *lock_dropped = true;
> }
> --
> 2.54.0
>
^ permalink raw reply
* [PATCH 0/3] tracing/user_events: More efficient data output in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:05 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt
Cc: LKML, kernel-janitors
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 14:00:56 +0200
A few update suggestions were taken into account
from static source code analysis.
Markus Elfring (3):
Simplify data output
Use seq_putc()
Replace a seq_printf() call by seq_puts()
kernel/trace/trace_events_user.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--
2.54.0
^ permalink raw reply
* [PATCH 1/3] tracing/user_events: Simplify data output in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:07 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt
Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:33:43 +0200
Move the specification for a line break from a seq_puts() call
to a seq_printf() call.
The source code was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
kernel/trace/trace_events_user.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index c4ba484f7b38..a089ac30e407 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2800,8 +2800,7 @@ static int user_seq_show(struct seq_file *m, void *p)
mutex_unlock(&group->reg_mutex);
- seq_puts(m, "\n");
- seq_printf(m, "Active: %d\n", active);
+ seq_printf(m, "\nActive: %d\n", active);
seq_printf(m, "Busy: %d\n", busy);
return 0;
--
2.54.0
^ permalink raw reply related
* [PATCH 2/3] tracing/user_events: Use seq_putc() in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:10 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt
Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:40:38 +0200
A line break should be put into a sequence within a loop.
Thus use the corresponding function “seq_putc” for one selected call.
The source code was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
kernel/trace/trace_events_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index a089ac30e407..1a0569110bfd 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2794,7 +2794,7 @@ static int user_seq_show(struct seq_file *m, void *p)
busy++;
}
- seq_puts(m, "\n");
+ seq_putc(m, '\n');
active++;
}
--
2.54.0
^ permalink raw reply related
* [PATCH 3/3] tracing/user_events: Replace a seq_printf() call by seq_puts() in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:12 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt
Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:45:48 +0200
A single string should be put into a sequence within a loop.
Thus use the corresponding function “seq_puts” for one selected call.
The source code was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
kernel/trace/trace_events_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 1a0569110bfd..57a78de4dc98 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2781,7 +2781,7 @@ static int user_seq_show(struct seq_file *m, void *p)
hash_for_each(group->register_table, i, user, node) {
status = user->status;
- seq_printf(m, "%s", EVENT_TP_NAME(user));
+ seq_puts(m, EVENT_TP_NAME(user));
if (status != 0) {
seq_puts(m, " # Used by");
--
2.54.0
^ permalink raw reply related
* [PATCH] tools/rv: Ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-06-04 12:09 UTC (permalink / raw)
To: Steven Rostedt, Gabriele Monaco, linux-trace-kernel, linux-kernel
Cc: unknownbbqrx
ikm_fill_monitor_definition() copies monitor name and description with
strncpy(), but does not guarantee NUL termination when source strings are
equal to or longer than the destination buffers.
Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both fields
to keep them safe for later string operations.
Suggested-by: unknownbbqrx <dev@unknownbbqr.xyz>
Fixes: 6d60f89691fc9 ("tools/rv: Add in-kernel monitor interface")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
Patch was initially sent as [1], the original author's email address
doesn't seem to exist any longer and the author didn't provide a valid
name.
Reimplementing the fix and changing attribution.
[1] - https://lore.kernel.org/r/dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com
---
tools/verification/rv/src/in_kernel.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/tools/verification/rv/src/in_kernel.c b/tools/verification/rv/src/in_kernel.c
index 4bb746ea6..d32453824 100644
--- a/tools/verification/rv/src/in_kernel.c
+++ b/tools/verification/rv/src/in_kernel.c
@@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char *name, struct monitor *ikm, char *co
return -1;
}
- strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
+ strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
+ ikm->name[sizeof(ikm->name) - 1] = '\0';
ikm->enabled = enabled;
- strncpy(ikm->desc, desc, MAX_DESCRIPTION);
-
+ strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
+ ikm->desc[sizeof(ikm->desc) - 1] = '\0';
free(desc);
return 0;
base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.54.0
^ permalink raw reply related
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-04 12:18 UTC (permalink / raw)
To: Balbir Singh
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aiFSZfRlFPd7qlIw@parvat>
On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
>
> My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
> need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
> Very similar to how not all ZONE_DEVICE memory is homogenous.
>
Can you more precise about your definition of homogeneous here?
Are you saying not all memory on a private node will be homogeneous?
While possible, I would argue that you should not do this and
should instead prefer to use multiple nodes - 1 per memory class.
Are you saying not all private nodes will be homogenous?
I don't see the issue with this.
> >
> > Agreed, but also one which can be deferred and played with since it's
> > all kernel-internal. None of this should have UAPI implications, and we
> > need need to accept that we're going to get it wrong on the first try.
> >
>
> Agreed that we might get the design wrong, until we fix it up. I feel
> that __GFP_PRIVATE should be an evolution of the design to that point.
>
Possibly. If we can't guarantee isolation without __GFP_PRIVATE, then
we probably can't merge the baseline without it.
> > Because pagecache pages are associated with potentially many VMAs.
> >
> > The fault can be a soft fault or a hard fault. On soft fault - the page
> > was already present, and will simply fault into VMA without being
> > migrated.
> >
>
> Let's split this into two:
>
> 1. unmapped page cache is never impacted by mempolicy and should not
> end up on private memory nodes
> 2. For shared pages, mempolicy would be hard, but it would need to
> be on a set of nodes backed by private memory, depending on mbind()
> policy
>
... snip ...
>
> I'd need to think more about this. For now, my basic requirement would
> be that unmapped page cache should not come from/to private nodes.
>
This does not fully describe the problem.
A file can be opened and cached as unmapped page cache, and then mapped
at a later time - at which point the mapped copy would share the filemap
page cache page.
Worse, because it's file-backed, you can have the memory faulted onto
your remote node - reclaimed - and the faulted back in via the process
accessing the file via unmapped operations (read/write), at which point
you've had a silent migration occur.
Basically consider
Process A:
fd = open("myfile", ..., RO);
read(fd, ...); /* mm/filemap.c fills page cache */
Process B:
fd = open("myfile", ...);
mem = mmap(fd, ...);
mbind(mem, ..., private_node);
for page in mem:
int tmp = mem[page]; /* fault into vma */
The result of Process A running first is Process B thinks it has faulted
the memory onto private_node, but in reality it's taking soft faults and
just getting the filemap folio mapped in.
If you wanted mbind() support from the start, we would have to limit
applicability to anon memory only.
Shared anon memory is different, as there is a radix tree that deals
with a shared mempolicy state.
>
> I am open to this, I was coming from the blueprint approach of:
> - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> what features to change or make specific to the implementation
>
N_MEMORY essentially states:
"This is normal memory touch it however you like"
N_MEMORY_PRIVATE (_MANAGED, w/e) says
"This is NOT normal memory, there are special rules here"
So, no, lets not mimic N_MEMORY. This is a "closed by default" design,
while N_MEMORY is an "open by default" design. This design choice is
explicit to make reasoning about these nodes feasible.
> > This is informed by a single use case / device.
> >
> > There are users / devices that don't want any UAPI for their memory,
> > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > reclaim, etc).
> >
>
> But then, why do they need NUMA nodes? Do we have a list of use cases?
>
So far i have collected:
- Network accelerators carrying their own memory for message buffers
- GPUs with semi-general-purpose working memory across coherent links
- Acceptionally slow distributed memory that you do not want fallback
allocations to (so you want to deliberately tier what lands there)
- Compressed memory (just another form of accelerator really) which
has *special access rules* (i.e. writes need to be controlled)
In most if not all of these cases, the right abstraction to reason about
where memory *should come from* IS a NUMA node.
- the network stack can be taught to check if the target device has a
node with memory and prefer that node over local memory
- accelerators can be given private nodes to manage memory using
core mm/ components, without worrying that general kernel operation
will put unrelated memory on those nodes or do things like migrate
your pages out from under you (unless your driver/service requested
that).
the tiering application should be somewhat obvious / trivial.
> >
> > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > operations access private nodes removed from fallback lists are reached
> > via something like the possible / online nodemask.
> >
> > I remember, maybe a year ago, there were per-node allocations happening
> > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > I'm trying to re-collect that data now.
> >
>
> Thanks, I look forward to the next set of patches. Let me know if I
> can help test what's on the list or if you want me to wait for the next
> round
>
Really I want to get the minimized set out the door so we can start
breaking this up by feature (reclaim, mempolicy, etc), because trying to
reason about it as a whole is infeasible - and I cannot be the single
arbiter of every use case (I simply do not have sufficient context).
I'm reworking it all as we speak.
~Gregory
^ permalink raw reply
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lance Yang, npache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <f5d38f64-ab92-496d-afd3-29ccc17fec2b@kernel.org>
On Mon, Jun 01, 2026 at 08:54:24AM +0200, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
> >
> > On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/31/26 11:39, Lance Yang wrote:
> >>>
> >>>
> >>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> >>>
> >>> At this point pmdp_collapse_flush() has cleared the PMD from the page
> >>> tables. The PTE table we are updating is only reachable through the saved
> >>> old PMD value, _pmd, until pmd_populate() below.
> >>>
> >>> map_anon_folio_pte_nopf() does set_ptes() and then calls
> >>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> >>> that hook as:
> >>>
> >>> "
> >>> At the end of every page fault, this routine is invoked to tell
> >>> the architecture specific code that translations now exists
> >>> in the software page tables for address space "vma->vm_mm"
> >>> at virtual address "address" for "nr" consecutive pages.
> >>> "
> >>>
> >>> But that does not seem true here yet, since the PTE table is not
> >>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> >>>
> >>> Should we avoid calling update_mmu_cache_range() until after the PTE
> >>> table is reinstalled with pmd_populate()?
> >>
> >> I recall that update_mmu_cache* users mostly care about updating folios flags,
> >> for the folio derived from the PTE ... or flushing caches for the user address.
> >>
> >> So intuitively I would say "the architecture code doesn't care that the PMD
> >> table will only be visible to HW shortly after". The important thing should be
> >> that it will definetly happen, and that nothing else is curently there or can be
> >> there?
> >
> > Ah, fair point.
> >
> > I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> > than only using the pte pointer passed in. For example, mips does:
>
> Right, a re-walk would be the real problem.
>
> >
> > update_mmu_cache_range()
> > -> __update_tlb()
> > -> pgd_offset(vma->vm_mm, address)
> > -> pte_offset_map(...)
> >
> > and __update_tlb() has this assumption:
> >
> > /*
> > * update_mmu_cache() is called between pte_offset_map_lock()
> > * and pte_unmap_unlock(), so we can assume that ptep is not
> > * NULL here: and what should be done below if it were NULL?
> > */
> >
> > So if khugepaged happens to run with current->active_mm == vma->vm_mm
> > here, could __update_tlb() hit the none PMD, get NULL from
I really wish people would say Pxx _entry_ :) so confusing.
> > pte_offset_map(), and then dereference it?
>
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
>
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?
>
> >
> > Just wanted to raise it since some arch code may still have assumptions
> > like this, and the always-enable-mTHP work is getting closer ...
>
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>
> >
> > Probably very very very hard to hit, though :)
>
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?
You mean the cleared PTE entries that are to be updated with the collapsed
larger folio?
>
> Thinking out loud:
After staring at this long enough, this does seems like a viable solution yes.
I hate how subtle this is.
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> } else {
> /*
> - * set_ptes is called in map_anon_folio_pte_nopf with the
> - * pmd_ptl lock still held; this is safe as the PMD is expected
> - * to be none. The pmd entry is then repopulated below.
> + * Re-insert the page table with the cleared entries, but
> + * hold the PTL, such that no one can mess with the re-installed
> + * page table until we updated the temporarily-cleared entries
> + * through map_anon_folio_pte_nopf().
> */
You may say nit, but, I think we should be clearly stating the problem here. Yes
we want to hold the PTL to stop anybody else messing with it yet, but we're
really doing this because of:
map_anon_folio_pte_nopf
-> update_mmu_cache_range
-> rewalk
-> try to look up an entry that's not yet actually installed
-> bang
Right?
So maybe something like:
Re-insert the PMD entry pointing to the PTE page table with cleared
entries first, because map_anon_folio_pte_nopf() invokes
update_mmu_cache_range() which may cause a rewalk of the page tables and
blow up if the supplied PTE entry belongs to a PTE table that is not yet
present there.
We hold the PTE PTL to avoid anything else messing with this until we're
ready.
> - map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> - smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
(I guess better to comment on the smp_wmb() stuff in the other message about
this.)
> + if (pte_ptl != pmd_ptl)
> + spin_lock(pte_ptl);
(Obviously should be spin_lock_nested() as David says later)
It seems a bit weird to me that we acquire the PTE lock:
pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
Clear out the mTHP entries we're going to remove:
if (pte) {
result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
order, &compound_pagelist);
THen unlock the PTE:
spin_unlock(pte_ptl);
Before again reacquiring here, especially given this is an unreachable PTE
table.
But then again not doing that would require us to add some error handling logic
to unlock again so it's probably not vital.
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
So we're protecting against concurrent rmap and fault handlers with the PTL such
that installing this is safe right?
Are we good against GUP fast? I guess a race will be fine with that, or will it?
I suppose before it would have skipped the range entirely because of the missing
PMD entry anyway.
(in any case we also hold anon_vma write lock too.)
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> + if (pte_ptl != pmd_ptl)
> + spin_unlock(pte_ptl);
> }
> spin_unlock(pmd_ptl);
>
>
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:39 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <aiFTSLb0kkTR7I9A@lucifer>
On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> I will go review the thread about the cache maintenance separately and
> respond about that.
>
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates were in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> The logic LGTM generally, some questions for understanding below, and of
> course as per above I want to review the Lance/David subthread.
>
> Thanks!
>
> > ---
> > mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > 1 file changed, 55 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fab35d318641..d64f42f66236 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > * while allocating a THP, as that could trigger direct reclaim/compaction.
> > * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > */
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > - int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > + int referenced, int unmapped, struct collapse_control *cc,
> > + unsigned int order)
> > {
> > + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > LIST_HEAD(compound_pagelist);
> > pmd_t *pmd, _pmd;
> > - pte_t *pte;
> > + pte_t *pte = NULL;
>
> As mentioned elsewhere for some reason this was dropped in
> mm-unstable. Maybe a bad conflict resolution?
>
> > pgtable_t pgtable;
> > struct folio *folio;
> > spinlock_t *pmd_ptl, *pte_ptl;
> > enum scan_result result = SCAN_FAIL;
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > + bool anon_vma_locked = false;
> >
> > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -
> > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > + result = alloc_charge_folio(&folio, mm, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - HPAGE_PMD_ORDER);
> > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > + &vma, cc, order);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > }
> >
> > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * released when it fails. So we jump out_nolock directly in
> > * that case. Continuing to collapse causes inconsistency.
> > */
> > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > - referenced, HPAGE_PMD_ORDER);
> > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > + referenced, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> > }
> > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - HPAGE_PMD_ORDER);
> > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > + &vma, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > vma_start_write(vma);
Hmm actually I think we have another problem here.
For PMD THP this is fine. Only a single VMA can span the range we need, and it
will span the entire PMD.
But for mTHP we have an issue...
See below...
> > - result = check_pmd_still_valid(mm, address, pmd);
> > + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> >
> > anon_vma_lock_write(vma->anon_vma);
> > + anon_vma_locked = true;
>
> I worry that we hold this lock a lot longer now? Maybe the algorithmic
> change alters that, but Claude did suggest on the s390 bug that longer lock
> hold might be an issue.
>
> I wonder if we'll observe lock contention as a result?
>
> Correct me if I'm wrong and we're not holding longer than previously,
> however. Just appears that we do.
>
> >
> > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > - address + HPAGE_PMD_SIZE);
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > + end_addr);
> > mmu_notifier_invalidate_range_start(&range);
> >
> > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * Parallel GUP-fast is fine since GUP-fast will back off when
> > * it detects PMD is changed.
> > */
> > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
but for mTHP we might have _another_ VMA that spans another part of the range
mapped by the same PMD entry.
So we clear this, but we do not have a write lock on any other VMA, and so
racing VMA read locks can install a new PMD entry.
> > spin_unlock(pmd_ptl);
Especially since you unlock this :)
And...
> > mmu_notifier_invalidate_range_end(&range);
> > tlb_remove_table_sync_one();
> >
> > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > if (pte) {
> > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > - HPAGE_PMD_ORDER,
> > - &compound_pagelist);
> > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > + order, &compound_pagelist);
> > spin_unlock(pte_ptl);
> > } else {
> > result = SCAN_NO_PTE_TABLE;
> > }
> >
> > if (unlikely(result != SCAN_SUCCEED)) {
> > - if (pte)
> > - pte_unmap(pte);
>
> OK I seem to remember this is because we're holding the anon_vma lock
> longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> bit longer also as well as the anon_vma loc.
>
> I guess it's also because we need to hold anon_vma and pte lock because
> we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> THP did.
>
> (Rememberings going on here :)
>
> > spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > + WARN_ON_ONCE(!pmd_none(*pmd));
...this will get triggered.
I don't know whether we can safely hold the PMD lock across everything here for
mTHP?
Maybe the solution would have to be to scan through VMAs in the range of the PMD
and VMA write lock each of them?
That could cause some 'interesting' lock contention issues though? Then again,
we will be releasing the mmap write lock soon enough which will drop the VMA
write locks.
> > /*
> > * We can only use set_pmd_at when establishing
> > * hugepmds and never for establishing regular pmds that
> > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > */
> > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > spin_unlock(pmd_ptl);
> > - anon_vma_unlock_write(vma->anon_vma);
> > goto out_up_write;
> > }
> >
> > /*
> > - * All pages are isolated and locked so anon_vma rmap
> > - * can't run anymore.
> > + * For PMD collapse all pages are isolated and locked so anon_vma
> > + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > + * removed and not all pages are isolated and locked, so we must hold
>
> Right because some PTE entries be unaffected by the change.
>
> > + * the lock to prevent neighboring folios from attempting to access
> > + * this PMD until its reinstalled.
>
> OK. This is slightly annoying for my CoW context work as it means there's
> another case where we need to explicitly hold an anon_vma lock for
> correctness :)
>
> Anyway I will think about that separately, is what it is. And in fact
> motivates to want this merged earlier so I can work against it :)
>
>
> > */
> > - anon_vma_unlock_write(vma->anon_vma);
> > + if (is_pmd_order(order)) {
> > + anon_vma_unlock_write(vma->anon_vma);
> > + anon_vma_locked = false;
> > + }
> >
> > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > - vma, address, pte_ptl,
> > - HPAGE_PMD_ORDER,
> > - &compound_pagelist);
> > - pte_unmap(pte);
> > + vma, start_addr, pte_ptl,
> > + order, &compound_pagelist);
> > if (unlikely(result != SCAN_SUCCEED))
> > goto out_up_write;
> >
> > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * write.
> > */
> > __folio_mark_uptodate(folio);
> > - pgtable = pmd_pgtable(_pmd);
> > -
> > spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > + WARN_ON_ONCE(!pmd_none(*pmd));
> > + if (is_pmd_order(order)) {
> > + pgtable = pmd_pgtable(_pmd);
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > + } else {
> > + /*
> > + * set_ptes is called in map_anon_folio_pte_nopf with the
> > + * pmd_ptl lock still held; this is safe as the PMD is expected
>
> PMD entry you mean?
>
> > + * to be none. The pmd entry is then repopulated below.
> > + */
> > + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>
> So here we populate entries in the existing PTE _table_ to point at the new
> order>0 folio? With arm64 of course doing transparent contpte stuff?
>
> > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> before?
>
> > + }
> > spin_unlock(pmd_ptl);
> >
> > folio = NULL;
> >
> > result = SCAN_SUCCEED;
> > out_up_write:
> > + if (anon_vma_locked)
> > + anon_vma_unlock_write(vma->anon_vma);
> > + if (pte)
> > + pte_unmap(pte);
> > mmap_write_unlock(mm);
> > out_nolock:
> > if (folio)
> > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > /* collapse_huge_page expects the lock to be dropped before calling */
> > mmap_read_unlock(mm);
> > result = collapse_huge_page(mm, start_addr, referenced,
> > - unmapped, cc);
> > + unmapped, cc, HPAGE_PMD_ORDER);
> > /* collapse_huge_page will return with the mmap_lock released */
> > *lock_dropped = true;
> > }
> > --
> > 2.54.0
> >
Thanks, Lorenzo
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Gabriele Monaco @ 2026-06-04 12:42 UTC (permalink / raw)
To: Tomas Glozar, Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, unknownbbqrx, Wen Yang
In-Reply-To: <CAP4=nvQEF7BtsMSoDaVO+5MazsRGypUnxB8w0Uq=2sCy=g2HGw@mail.gmail.com>
On Thu, 2026-06-04 at 13:32 +0200, Tomas Glozar wrote:
> Hi Steven,
>
> čt 4. 6. 2026 v 1:19 odesílatel Steven Rostedt <rostedt@goodmis.org>
> napsal:
> > Hi Gabriele,
> >
> > What is this? All commits need to be authored by and signed off by
> > from
> > a real person with their official name.
> >
> > https://docs.kernel.org/process/submitting-patches.html#sign-your-
> > work-the-developer-s-certificate-of-origin
> >
> > -- Steve
> >
>
> Is this really still the case? Note that the document says:
>
> "using a known identity (sorry, no anonymous contributions.)"
Thanks Tomas for chipping in!
Just adding some information, the username unknownbbqr is in fact a
valid username for a Github account.
The user already sent a patch and updated it with a real name [1].
All this to say that, in my opinion unknownbbqrx <dev@unknownbbqr.xyz>
is NOT an anonymous contribution, just a nickname that differs from the
legal name of this person (which we wouldn't validate anyway), so I
would say it complies with the rules.
Thanks,
Gabriele
[1] -
https://lore.kernel.org/lkml/20260426150928.870914-1-srinivas.pandruvada@linux.intel.com/
>
> It really used to say "real name", but it was changed by Linus in
> 2023
> [1]. Note especially this section by Linus:
>
> " It was 2006, and nobody reacted to the wording, the whole Facebook
> 'real
> name' controversy was a decade in the future, and nobody even
> thought
> about it. And despite the language, we've always accepted
> nicknames and
> that language was never meant to be any kind of exclusionary
> wording."
>
> The wording sounds quite clear to me. And I'm certain that there were
> contributions under pseudonymous identity that have been accepted
> since then, most famously by Asahi Lina, a vtuber persona [2]. I
> don't
> really see a difference between that and "unknownbbqrx", other than
> the latter doesn't sound like a real name.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4563201f33a022fc0353033d9dfeb1606a88330
> [2]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e2f6b0ef8551bf3bd8255729d27e3ad9451e562
>
> Tomas
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-04 12:44 UTC (permalink / raw)
To: Tomas Glozar
Cc: Gabriele Monaco, linux-kernel, linux-trace-kernel, unknownbbqrx,
Wen Yang
In-Reply-To: <CAP4=nvQEF7BtsMSoDaVO+5MazsRGypUnxB8w0Uq=2sCy=g2HGw@mail.gmail.com>
On Thu, 4 Jun 2026 13:32:46 +0200
Tomas Glozar <tglozar@redhat.com> wrote:
> Is this really still the case? Note that the document says:
>
> "using a known identity (sorry, no anonymous contributions.)"
>
> It really used to say "real name", but it was changed by Linus in 2023
> [1]. Note especially this section by Linus:
>
> " It was 2006, and nobody reacted to the wording, the whole Facebook 'real
> name' controversy was a decade in the future, and nobody even thought
> about it. And despite the language, we've always accepted nicknames and
> that language was never meant to be any kind of exclusionary wording."
>
> The wording sounds quite clear to me. And I'm certain that there were
> contributions under pseudonymous identity that have been accepted
> since then, most famously by Asahi Lina, a vtuber persona [2]. I don't
> really see a difference between that and "unknownbbqrx", other than
> the latter doesn't sound like a real name.
It specifically says "using a known identity (sorry, no anonymous contributions.)"
As you said, Asahi Lina is well known and a very "known identity".
"unknownbbqrx" is unknown and even states it in the name.
I will not personally accept such a submission, as the Signed-off-by is
a legal statement that states you have the right to submit that code
and take all responsibility for it.
-- Steve
^ permalink raw reply
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 12:45 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <aiFw80oLty6F_-8m@lucifer>
On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > I will go review the thread about the cache maintenance separately and
> > respond about that.
> >
> > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates were in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > a pmd_none while the PMD entry is currently unavailable due to being
> > > temporarily removed during the collapse phase.
> > >
> > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > The logic LGTM generally, some questions for understanding below, and of
> > course as per above I want to review the Lance/David subthread.
> >
> > Thanks!
> >
> > > ---
> > > mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > 1 file changed, 55 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index fab35d318641..d64f42f66236 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > */
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > - int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > + int referenced, int unmapped, struct collapse_control *cc,
> > > + unsigned int order)
> > > {
> > > + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > LIST_HEAD(compound_pagelist);
> > > pmd_t *pmd, _pmd;
> > > - pte_t *pte;
> > > + pte_t *pte = NULL;
> >
> > As mentioned elsewhere for some reason this was dropped in
> > mm-unstable. Maybe a bad conflict resolution?
> >
> > > pgtable_t pgtable;
> > > struct folio *folio;
> > > spinlock_t *pmd_ptl, *pte_ptl;
> > > enum scan_result result = SCAN_FAIL;
> > > struct vm_area_struct *vma;
> > > struct mmu_notifier_range range;
> > > + bool anon_vma_locked = false;
> > >
> > > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > -
> > > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > + result = alloc_charge_folio(&folio, mm, cc, order);
> > > if (result != SCAN_SUCCEED)
> > > goto out_nolock;
> > >
> > > mmap_read_lock(mm);
> > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > - HPAGE_PMD_ORDER);
> > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > + &vma, cc, order);
> > > if (result != SCAN_SUCCEED) {
> > > mmap_read_unlock(mm);
> > > goto out_nolock;
> > > }
> > >
> > > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > if (result != SCAN_SUCCEED) {
> > > mmap_read_unlock(mm);
> > > goto out_nolock;
> > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * released when it fails. So we jump out_nolock directly in
> > > * that case. Continuing to collapse causes inconsistency.
> > > */
> > > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > - referenced, HPAGE_PMD_ORDER);
> > > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > + referenced, order);
> > > if (result != SCAN_SUCCEED)
> > > goto out_nolock;
> > > }
> > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * mmap_lock.
> > > */
> > > mmap_write_lock(mm);
> > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > - HPAGE_PMD_ORDER);
> > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > + &vma, cc, order);
> > > if (result != SCAN_SUCCEED)
> > > goto out_up_write;
> > > /* check if the pmd is still valid */
> > > vma_start_write(vma);
>
> Hmm actually I think we have another problem here.
>
> For PMD THP this is fine. Only a single VMA can span the range we need, and it
> will span the entire PMD.
>
> But for mTHP we have an issue...
>
> See below...
>
> > > - result = check_pmd_still_valid(mm, address, pmd);
> > > + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > if (result != SCAN_SUCCEED)
> > > goto out_up_write;
> > >
> > > anon_vma_lock_write(vma->anon_vma);
> > > + anon_vma_locked = true;
> >
> > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > change alters that, but Claude did suggest on the s390 bug that longer lock
> > hold might be an issue.
> >
> > I wonder if we'll observe lock contention as a result?
> >
> > Correct me if I'm wrong and we're not holding longer than previously,
> > however. Just appears that we do.
> >
> > >
> > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > - address + HPAGE_PMD_SIZE);
> > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > + end_addr);
> > > mmu_notifier_invalidate_range_start(&range);
> > >
> > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * Parallel GUP-fast is fine since GUP-fast will back off when
> > > * it detects PMD is changed.
> > > */
> > > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>
> ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> but for mTHP we might have _another_ VMA that spans another part of the range
> mapped by the same PMD entry.
>
> So we clear this, but we do not have a write lock on any other VMA, and so
> racing VMA read locks can install a new PMD entry.
>
> > > spin_unlock(pmd_ptl);
>
> Especially since you unlock this :)
>
> And...
>
> > > mmu_notifier_invalidate_range_end(&range);
> > > tlb_remove_table_sync_one();
> > >
> > > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > if (pte) {
> > > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > - HPAGE_PMD_ORDER,
> > > - &compound_pagelist);
> > > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > + order, &compound_pagelist);
> > > spin_unlock(pte_ptl);
> > > } else {
> > > result = SCAN_NO_PTE_TABLE;
> > > }
> > >
> > > if (unlikely(result != SCAN_SUCCEED)) {
> > > - if (pte)
> > > - pte_unmap(pte);
> >
> > OK I seem to remember this is because we're holding the anon_vma lock
> > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > bit longer also as well as the anon_vma loc.
> >
> > I guess it's also because we need to hold anon_vma and pte lock because
> > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > THP did.
> >
> > (Rememberings going on here :)
> >
> > > spin_lock(pmd_ptl);
> > > - BUG_ON(!pmd_none(*pmd));
> > > + WARN_ON_ONCE(!pmd_none(*pmd));
>
> ...this will get triggered.
>
> I don't know whether we can safely hold the PMD lock across everything here for
> mTHP?
>
> Maybe the solution would have to be to scan through VMAs in the range of the PMD
> and VMA write lock each of them?
I believe we've spoken about this before, but because we always make
sure the VMA spans the full PMD we won't ever hit this issue. If we
wanted to support mTHP collapse on regions smaller than a PMD, the
locking gets tricky (hence the design choice to not do that for now).
This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().
/* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
-- Nico
>
> That could cause some 'interesting' lock contention issues though? Then again,
> we will be releasing the mmap write lock soon enough which will drop the VMA
> write locks.
>
> > > /*
> > > * We can only use set_pmd_at when establishing
> > > * hugepmds and never for establishing regular pmds that
> > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > */
> > > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > spin_unlock(pmd_ptl);
> > > - anon_vma_unlock_write(vma->anon_vma);
> > > goto out_up_write;
> > > }
> > >
> > > /*
> > > - * All pages are isolated and locked so anon_vma rmap
> > > - * can't run anymore.
> > > + * For PMD collapse all pages are isolated and locked so anon_vma
> > > + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > + * removed and not all pages are isolated and locked, so we must hold
> >
> > Right because some PTE entries be unaffected by the change.
> >
> > > + * the lock to prevent neighboring folios from attempting to access
> > > + * this PMD until its reinstalled.
> >
> > OK. This is slightly annoying for my CoW context work as it means there's
> > another case where we need to explicitly hold an anon_vma lock for
> > correctness :)
> >
> > Anyway I will think about that separately, is what it is. And in fact
> > motivates to want this merged earlier so I can work against it :)
> >
> >
> > > */
> > > - anon_vma_unlock_write(vma->anon_vma);
> > > + if (is_pmd_order(order)) {
> > > + anon_vma_unlock_write(vma->anon_vma);
> > > + anon_vma_locked = false;
> > > + }
> > >
> > > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > - vma, address, pte_ptl,
> > > - HPAGE_PMD_ORDER,
> > > - &compound_pagelist);
> > > - pte_unmap(pte);
> > > + vma, start_addr, pte_ptl,
> > > + order, &compound_pagelist);
> > > if (unlikely(result != SCAN_SUCCEED))
> > > goto out_up_write;
> > >
> > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * write.
> > > */
> > > __folio_mark_uptodate(folio);
> > > - pgtable = pmd_pgtable(_pmd);
> > > -
> > > spin_lock(pmd_ptl);
> > > - BUG_ON(!pmd_none(*pmd));
> > > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > + if (is_pmd_order(order)) {
> > > + pgtable = pmd_pgtable(_pmd);
> > > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > + } else {
> > > + /*
> > > + * set_ptes is called in map_anon_folio_pte_nopf with the
> > > + * pmd_ptl lock still held; this is safe as the PMD is expected
> >
> > PMD entry you mean?
> >
> > > + * to be none. The pmd entry is then repopulated below.
> > > + */
> > > + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> >
> > So here we populate entries in the existing PTE _table_ to point at the new
> > order>0 folio? With arm64 of course doing transparent contpte stuff?
> >
> > > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >
> > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > before?
> >
> > > + }
> > > spin_unlock(pmd_ptl);
> > >
> > > folio = NULL;
> > >
> > > result = SCAN_SUCCEED;
> > > out_up_write:
> > > + if (anon_vma_locked)
> > > + anon_vma_unlock_write(vma->anon_vma);
> > > + if (pte)
> > > + pte_unmap(pte);
> > > mmap_write_unlock(mm);
> > > out_nolock:
> > > if (folio)
> > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > /* collapse_huge_page expects the lock to be dropped before calling */
> > > mmap_read_unlock(mm);
> > > result = collapse_huge_page(mm, start_addr, referenced,
> > > - unmapped, cc);
> > > + unmapped, cc, HPAGE_PMD_ORDER);
> > > /* collapse_huge_page will return with the mmap_lock released */
> > > *lock_dropped = true;
> > > }
> > > --
> > > 2.54.0
> > >
>
> Thanks, Lorenzo
>
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-04 12:54 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Tomas Glozar, linux-kernel, linux-trace-kernel, unknownbbqrx,
Wen Yang
In-Reply-To: <791528a4f69460ba1e9589361860bafeb1517237.camel@redhat.com>
On Thu, 04 Jun 2026 14:42:02 +0200
Gabriele Monaco <gmonaco@redhat.com> wrote:
> All this to say that, in my opinion unknownbbqrx <dev@unknownbbqr.xyz>
> is NOT an anonymous contribution, just a nickname that differs from the
> legal name of this person (which we wouldn't validate anyway), so I
> would say it complies with the rules.
It's a username on github and not a nickname. I did a search for
"unknownbbqr" and it doesn't come up anywhere but Google tries to find
similar matches which brings me to an OnlyFans account :-p
It *DOES NOT* qualify because there's no accountability for this. For
people who have a nickname as their entire internet persona, sure, I'll
take patches from them as there's an entity that exists behind it. But
I'm not going to take some username on github as a persona. To me,
that's still anonymous.
-- Steve
^ permalink raw reply
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:55 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <CAA1CXcDxZEmWtmGFiKDKSPSae8pN0at4vYV24FOs+t_GTGkZ6g@mail.gmail.com>
On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > I will go review the thread about the cache maintenance separately and
> > > respond about that.
> > >
> > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > are attempting to collapse to, and offset indicates were in the PMD to
> > > > start the collapse attempt.
> > > >
> > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > temporarily removed during the collapse phase.
> > > >
> > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > >
> > > The logic LGTM generally, some questions for understanding below, and of
> > > course as per above I want to review the Lance/David subthread.
> > >
> > > Thanks!
> > >
> > > > ---
> > > > mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > > 1 file changed, 55 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index fab35d318641..d64f42f66236 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > > * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > > * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > > */
> > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > - int referenced, int unmapped, struct collapse_control *cc)
> > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > + int referenced, int unmapped, struct collapse_control *cc,
> > > > + unsigned int order)
> > > > {
> > > > + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > > LIST_HEAD(compound_pagelist);
> > > > pmd_t *pmd, _pmd;
> > > > - pte_t *pte;
> > > > + pte_t *pte = NULL;
> > >
> > > As mentioned elsewhere for some reason this was dropped in
> > > mm-unstable. Maybe a bad conflict resolution?
> > >
> > > > pgtable_t pgtable;
> > > > struct folio *folio;
> > > > spinlock_t *pmd_ptl, *pte_ptl;
> > > > enum scan_result result = SCAN_FAIL;
> > > > struct vm_area_struct *vma;
> > > > struct mmu_notifier_range range;
> > > > + bool anon_vma_locked = false;
> > > >
> > > > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > -
> > > > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > + result = alloc_charge_folio(&folio, mm, cc, order);
> > > > if (result != SCAN_SUCCEED)
> > > > goto out_nolock;
> > > >
> > > > mmap_read_lock(mm);
> > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > - HPAGE_PMD_ORDER);
> > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > + &vma, cc, order);
> > > > if (result != SCAN_SUCCEED) {
> > > > mmap_read_unlock(mm);
> > > > goto out_nolock;
> > > > }
> > > >
> > > > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > > if (result != SCAN_SUCCEED) {
> > > > mmap_read_unlock(mm);
> > > > goto out_nolock;
> > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > * released when it fails. So we jump out_nolock directly in
> > > > * that case. Continuing to collapse causes inconsistency.
> > > > */
> > > > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > - referenced, HPAGE_PMD_ORDER);
> > > > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > + referenced, order);
> > > > if (result != SCAN_SUCCEED)
> > > > goto out_nolock;
> > > > }
> > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > * mmap_lock.
> > > > */
> > > > mmap_write_lock(mm);
> > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > - HPAGE_PMD_ORDER);
> > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > + &vma, cc, order);
> > > > if (result != SCAN_SUCCEED)
> > > > goto out_up_write;
> > > > /* check if the pmd is still valid */
> > > > vma_start_write(vma);
> >
> > Hmm actually I think we have another problem here.
> >
> > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > will span the entire PMD.
> >
> > But for mTHP we have an issue...
> >
> > See below...
> >
> > > > - result = check_pmd_still_valid(mm, address, pmd);
> > > > + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > > if (result != SCAN_SUCCEED)
> > > > goto out_up_write;
> > > >
> > > > anon_vma_lock_write(vma->anon_vma);
> > > > + anon_vma_locked = true;
> > >
> > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > hold might be an issue.
> > >
> > > I wonder if we'll observe lock contention as a result?
> > >
> > > Correct me if I'm wrong and we're not holding longer than previously,
> > > however. Just appears that we do.
> > >
> > > >
> > > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > - address + HPAGE_PMD_SIZE);
> > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > + end_addr);
> > > > mmu_notifier_invalidate_range_start(&range);
> > > >
> > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > * Parallel GUP-fast is fine since GUP-fast will back off when
> > > > * it detects PMD is changed.
> > > > */
> > > > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> >
> > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > but for mTHP we might have _another_ VMA that spans another part of the range
> > mapped by the same PMD entry.
> >
> > So we clear this, but we do not have a write lock on any other VMA, and so
> > racing VMA read locks can install a new PMD entry.
> >
> > > > spin_unlock(pmd_ptl);
> >
> > Especially since you unlock this :)
> >
> > And...
> >
> > > > mmu_notifier_invalidate_range_end(&range);
> > > > tlb_remove_table_sync_one();
> > > >
> > > > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > > if (pte) {
> > > > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > - HPAGE_PMD_ORDER,
> > > > - &compound_pagelist);
> > > > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > + order, &compound_pagelist);
> > > > spin_unlock(pte_ptl);
> > > > } else {
> > > > result = SCAN_NO_PTE_TABLE;
> > > > }
> > > >
> > > > if (unlikely(result != SCAN_SUCCEED)) {
> > > > - if (pte)
> > > > - pte_unmap(pte);
> > >
> > > OK I seem to remember this is because we're holding the anon_vma lock
> > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > bit longer also as well as the anon_vma loc.
> > >
> > > I guess it's also because we need to hold anon_vma and pte lock because
> > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > THP did.
> > >
> > > (Rememberings going on here :)
> > >
> > > > spin_lock(pmd_ptl);
> > > > - BUG_ON(!pmd_none(*pmd));
> > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> >
> > ...this will get triggered.
> >
> > I don't know whether we can safely hold the PMD lock across everything here for
> > mTHP?
> >
> > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > and VMA write lock each of them?
>
> I believe we've spoken about this before, but because we always make
Maybe worth a comment then...? Ah how rewarding review is :)
This is something that somebody else might very well wonder about and
forget that it happens to be covered there.
Also:
/* Always check the PMD order to ensure its not shared by another VMA */
Is pretty lightweight there. Something about avoiding racing page faults
would be helpful.
> sure the VMA spans the full PMD we won't ever hit this issue. If we
> wanted to support mTHP collapse on regions smaller than a PMD, the
> locking gets tricky (hence the design choice to not do that for now).
>
> This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().
The existing code is atrocious, and sticking this on top has added to the
pile of assumptions and conventions and having to go check a bunch of
functions to 'just know' you're safe for X, Y, Z.
We really need to see some cleanup series coming after this and I'm going
to get pretty grumpy(ier) if we don't.
>
> /* Always check the PMD order to ensure its not shared by another VMA */
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>
> -- Nico
>
> >
> > That could cause some 'interesting' lock contention issues though? Then again,
> > we will be releasing the mmap write lock soon enough which will drop the VMA
> > write locks.
> >
> > > > /*
> > > > * We can only use set_pmd_at when establishing
> > > > * hugepmds and never for establishing regular pmds that
> > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > */
> > > > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > > spin_unlock(pmd_ptl);
> > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > goto out_up_write;
> > > > }
> > > >
> > > > /*
> > > > - * All pages are isolated and locked so anon_vma rmap
> > > > - * can't run anymore.
> > > > + * For PMD collapse all pages are isolated and locked so anon_vma
> > > > + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > + * removed and not all pages are isolated and locked, so we must hold
> > >
> > > Right because some PTE entries be unaffected by the change.
> > >
> > > > + * the lock to prevent neighboring folios from attempting to access
> > > > + * this PMD until its reinstalled.
> > >
> > > OK. This is slightly annoying for my CoW context work as it means there's
> > > another case where we need to explicitly hold an anon_vma lock for
> > > correctness :)
> > >
> > > Anyway I will think about that separately, is what it is. And in fact
> > > motivates to want this merged earlier so I can work against it :)
> > >
> > >
> > > > */
> > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > + if (is_pmd_order(order)) {
> > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > + anon_vma_locked = false;
> > > > + }
> > > >
> > > > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > - vma, address, pte_ptl,
> > > > - HPAGE_PMD_ORDER,
> > > > - &compound_pagelist);
> > > > - pte_unmap(pte);
> > > > + vma, start_addr, pte_ptl,
> > > > + order, &compound_pagelist);
> > > > if (unlikely(result != SCAN_SUCCEED))
> > > > goto out_up_write;
> > > >
> > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > * write.
> > > > */
> > > > __folio_mark_uptodate(folio);
> > > > - pgtable = pmd_pgtable(_pmd);
> > > > -
> > > > spin_lock(pmd_ptl);
> > > > - BUG_ON(!pmd_none(*pmd));
> > > > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > > + if (is_pmd_order(order)) {
> > > > + pgtable = pmd_pgtable(_pmd);
> > > > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > + } else {
> > > > + /*
> > > > + * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > + * pmd_ptl lock still held; this is safe as the PMD is expected
> > >
> > > PMD entry you mean?
> > >
> > > > + * to be none. The pmd entry is then repopulated below.
> > > > + */
> > > > + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > >
> > > So here we populate entries in the existing PTE _table_ to point at the new
> > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > >
> > > > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >
> > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > before?
> > >
> > > > + }
> > > > spin_unlock(pmd_ptl);
> > > >
> > > > folio = NULL;
> > > >
> > > > result = SCAN_SUCCEED;
> > > > out_up_write:
> > > > + if (anon_vma_locked)
> > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > + if (pte)
> > > > + pte_unmap(pte);
> > > > mmap_write_unlock(mm);
> > > > out_nolock:
> > > > if (folio)
> > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > > /* collapse_huge_page expects the lock to be dropped before calling */
> > > > mmap_read_unlock(mm);
> > > > result = collapse_huge_page(mm, start_addr, referenced,
> > > > - unmapped, cc);
> > > > + unmapped, cc, HPAGE_PMD_ORDER);
> > > > /* collapse_huge_page will return with the mmap_lock released */
> > > > *lock_dropped = true;
> > > > }
> > > > --
> > > > 2.54.0
> > > >
> >
> > Thanks, Lorenzo
> >
>
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Gabriele Monaco @ 2026-06-04 13:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Tomas Glozar, linux-kernel, linux-trace-kernel, unknownbbqrx,
Wen Yang
In-Reply-To: <20260604085405.234d22eb@fedora>
On Thu, 2026-06-04 at 08:54 -0400, Steven Rostedt wrote:
> On Thu, 04 Jun 2026 14:42:02 +0200
> Gabriele Monaco <gmonaco@redhat.com> wrote:
>
> > All this to say that, in my opinion unknownbbqrx
> > <dev@unknownbbqr.xyz>
> > is NOT an anonymous contribution, just a nickname that differs from
> > the legal name of this person (which we wouldn't validate anyway),
> > so I would say it complies with the rules.
>
> It's a username on github and not a nickname. I did a search for
> "unknownbbqr" and it doesn't come up anywhere but Google tries to
> find similar matches which brings me to an OnlyFans account :-p
>
> It *DOES NOT* qualify because there's no accountability for this. For
> people who have a nickname as their entire internet persona, sure,
> I'll take patches from them as there's an entity that exists behind
> it.
> But I'm not going to take some username on github as a persona. To
> me, that's still anonymous.
Alright, fair. In the link I sent, the signoff got changed to Ali Ahmet
MEMIS <dev@unknownbbqr.xyz>, but I believe we cannot use that unless
the user themselves adds it (and they seem unreachable).
I posted the re-authored patch in [1], I'm not sure that's the proper
way though (the patch is so simple that is unmodified). But if you give
me a green light I can send you a pull request with that patch instead.
Thanks,
Gabriele
[1] -
https://lore.kernel.org/lkml/20260604120946.90302-2-gmonaco@redhat.com/
>
> -- Steve
^ permalink raw reply
* [GIT PULL] RTLA fixes for v7.1-rc7
From: Tomas Glozar @ 2026-06-04 13:25 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, linux-trace-kernel, Tomas Glozar
Steven,
The following changes since commit e43ffb69e0438cddd72aaa30898b4dc446f664f8:
Linux 7.1-rc6 (2026-05-31 15:14:24 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-fixes-v7.1-rc7
for you to fetch changes up to e9e41d3035032ed6053d8bad7b7077e1cb3a6540:
rtla: Fix parsing of multi-character short options (2026-06-04 10:53:25 +0200)
----------------------------------------------------------------
RTLA fixes for v7.1-rc7
- Fix multi-character short option parsing
Fix regression in parsing of multiple-character short options (e.g.
-p100 /= -p 100/, -un /= -u -n/) caused by getopt_long() internal state
corruption after a refactoring.
Build, runtime tests, unit tests pass. Extended runtime tests from next
also pass, except for timerlat hist --dump-tasks (expected).
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
----------------------------------------------------------------
Tomas Glozar (1):
rtla: Fix parsing of multi-character short options
tools/tracing/rtla/src/common.c | 28 +++++-----------------------
tools/tracing/rtla/src/common.h | 12 +++++++++++-
tools/tracing/rtla/src/osnoise_hist.c | 7 ++++---
tools/tracing/rtla/src/osnoise_top.c | 7 ++++---
tools/tracing/rtla/src/timerlat_hist.c | 7 ++++---
tools/tracing/rtla/src/timerlat_top.c | 7 ++++---
6 files changed, 32 insertions(+), 36 deletions(-)
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Xie Yuanbin @ 2026-06-04 13:42 UTC (permalink / raw)
To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, xieyuanbin1, yi1.lai
In-Reply-To: <4de75e51-025c-4926-871a-4b9da479cefa@kernel.org>
On Thu, 4 Jun 2026 08:42:37 +0200, David Hildenbrand (Arm) wrote:
> Yeah, if only I had known that we would break user space by changing trace
> events ... now we know :)
>
> Do you have capacity to send a fix?
Sure, with pleasure.
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 13:53 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <b8380eb3-096a-49f1-9ace-99c1e75888b4@kernel.org>
(Checking the algorithm here)
On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)
>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enabled mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then moved on the consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),
This is inaccurate, it's only consecutively smaller until you hit smallest then
it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.
More like consecutively smaller, then always trying for the smallest possible
fit?
Would be good to describe why we do this, presumably to get a best _fit_?
> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > 2) pop the stack
> > 3) check if the number of set bits in that (offset,order) pair
> > statisfy the max_ptes_none threshold for that order
> > 4) if yes, attempt collapse
> > 5) if no (or collapse fails), push two new stack items representing
> > the left and right halves of the current bitmap range, at the
> > next lower order
I notice the ordering is wrong here, you actualy push the mid_offset first then
the offset (e.g. 'right', then 'left'):
collapse_mthp_stack_push(cc, &stack_size, mid_offset,
next_order);
collapse_mthp_stack_push(cc, &stack_size, offset,
next_order);
So that way you are popping the 'left' first then the 'right'.
So seems you'll get:
stack={0, 9}
Pop (0, order=9):
|----------------------------------------|
|########################################|
|----------------------------------------|
stack={256, 8}, {0, 8}
Pop (0, order=8):
|--------------------|-------------------|
|####################| |
|--------------------|-------------------|
stack={256, 8}, {128, 7}, {0, 7}
Pop (0, order=7):
|----------|-----------------------------|
|##########| |
|----------|-----------------------------|
stack={256, 8}, {128, 7}, {64, 6}, {0, 6}
Pop (0, order=6):
|----|-----------------------------------|
|####| |
|----|-----------------------------------|
...
stack={256, 8}, ..., { 8, 3 }, {0, 2}
Pop (0, order=2):
|-|--------------------------------------|
|#| |
|-|--------------------------------------|
Then finally :) we get the offsets :)
stack={256, 8}, ..., {8, 3}, {4, 2}
Pop (4, order=2):
|-|-|------------------------------------|
| |#| |
|-|-|------------------------------------|
stack={256, 8}, ..., { 12, 2 }, {8, 3}
Pop (8, order=3):
|---|--|---------------------------------|
| |##| |
|---|--|---------------------------------|
stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}
Pop (8, order=2):
|---|-|----------------------------------|
| |#| |
|---|-|----------------------------------|
etc.
It seems to me that you're going to keep iterating down until you match an mTHP
when a larger mTHP could have been had?
So we're going:
order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2
I guess the point is to avoid only getting the largest possible
I guess if we did try to get the largest then we'd only get 2 of the largest
possible then exhaust the whole PMD, should a PMD-sized entry not be possble.
> > 6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> > offset mid_offset
> > | |
> > | |
> > v v
> > ____________________________________
> > | PTE Page Table |
> > --------------------------------------
> > <-------><------->
> > order-1 order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.
>
> Why can't you work with offset + cur_order?
>
> Initially,
>
> offset = 0;
> cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
> if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> /* Try next smaller order. */
> cur_order = cur_order - 1;
OK this matches the stack for the 0 offset entries...
> } else {
> /* Skip to next chunk. */
> offset += 1 << cur_order;
> cur_order = max_order_from_offset(offset);
Then 1 << 2 -> 4 so go to offset=4.
max_order_from_offset(4) = 2. so (4, offset=2) same as above.
Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3
And, yeah this seems equivalent.
> }
>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).
Something like?
static unsigned long max_order_from_offset(unsigned long offset)
{
if (!offset)
return HPAGE_PMD_ORDER;
return ilog2(offset);
}
>
> What's the benefit of the stack?
Yeah it seems equivalent. Good idea!
Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 13:59 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <aiF25dvH4qd_C4aj@lucifer>
On Thu, Jun 04, 2026 at 02:53:39PM +0100, Lorenzo Stoakes wrote:
> (Checking the algorithm here)
>
> On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> > On 5/22/26 17:00, Nico Pache wrote:
> >
> > Finally time for the core piece :)
> >
> > > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > > main scanning logic using a bitmap to track occupied pages and a stack
> > > structure that allows us to find optimal collapse sizes.
> > >
> > > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > > scanning phase (mmap_read_lock) that determines a potential PMD
> > > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > > phase (mmap_write_lock).
> > >
> > > To enabled mTHP collapse we make the following changes:
> > >
> > > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > > orders are enabled, we remove the restriction of max_ptes_none during the
> > > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > > have scanned the full PMD range and updated the bitmap to track occupied
> > > pages, we use the bitmap to find the optimal mTHP size.
> > >
> > > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > > and determine the best eligible order for the collapse. A stack structure
> > > is used instead of traditional recursion to manage the search. This also
> > > prevents a traditional recursive approach when the kernel stack struct is
> > > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > > find the highest order mTHPs that satisfy the collapse criteria. We start
> > > by attempting the PMD order, then moved on the consecutively lower orders
> > > (mTHP collapse). The stack maintains a pair of variables (offset, order),
>
> This is inaccurate, it's only consecutively smaller until you hit smallest then
> it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.
>
> More like consecutively smaller, then always trying for the smallest possible
> fit?
>
> Would be good to describe why we do this, presumably to get a best _fit_?
>
> > > indicating the number of PTEs from the start of the PMD, and the order of
> > > the potential collapse candidate.
> > >
> > > The algorithm for consuming the bitmap works as such:
> > > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > > 2) pop the stack
> > > 3) check if the number of set bits in that (offset,order) pair
> > > statisfy the max_ptes_none threshold for that order
> > > 4) if yes, attempt collapse
> > > 5) if no (or collapse fails), push two new stack items representing
> > > the left and right halves of the current bitmap range, at the
> > > next lower order
>
> I notice the ordering is wrong here, you actualy push the mid_offset first then
> the offset (e.g. 'right', then 'left'):
>
> collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> next_order);
> collapse_mthp_stack_push(cc, &stack_size, offset,
> next_order);
>
> So that way you are popping the 'left' first then the 'right'.
>
> So seems you'll get:
>
> stack={0, 9}
>
> Pop (0, order=9):
>
> |----------------------------------------|
> |########################################|
> |----------------------------------------|
>
> stack={256, 8}, {0, 8}
>
> Pop (0, order=8):
>
> |--------------------|-------------------|
> |####################| |
> |--------------------|-------------------|
>
>
> stack={256, 8}, {128, 7}, {0, 7}
>
> Pop (0, order=7):
>
> |----------|-----------------------------|
> |##########| |
> |----------|-----------------------------|
>
> stack={256, 8}, {128, 7}, {64, 6}, {0, 6}
>
> Pop (0, order=6):
>
> |----|-----------------------------------|
> |####| |
> |----|-----------------------------------|
>
> ...
>
> stack={256, 8}, ..., { 8, 3 }, {0, 2}
>
> Pop (0, order=2):
>
> |-|--------------------------------------|
> |#| |
> |-|--------------------------------------|
>
> Then finally :) we get the offsets :)
>
> stack={256, 8}, ..., {8, 3}, {4, 2}
>
> Pop (4, order=2):
>
> |-|-|------------------------------------|
> | |#| |
> |-|-|------------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {8, 3}
(Shouldn't be {12, 2} there :)
> Pop (8, order=3):
>
> |---|--|---------------------------------|
> | |##| |
> |---|--|---------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}
(Shouldn't duplicate {12, 2} there :)
>
> Pop (8, order=2):
>
> |---|-|----------------------------------|
> | |#| |
> |---|-|----------------------------------|
>
> etc.
>
>
> It seems to me that you're going to keep iterating down until you match an mTHP
> when a larger mTHP could have been had?
>
> So we're going:
>
> order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2
>
> I guess the point is to avoid only getting the largest possible
>
>
>
>
> I guess if we did try to get the largest then we'd only get 2 of the largest
> possible then exhaust the whole PMD, should a PMD-sized entry not be possble.
>
> > > 6) repeat at step (2) until stack is empty.
> > >
> > > Below is a diagram representing the algorithm and stack items:
> > >
> > > offset mid_offset
> > > | |
> > > | |
> > > v v
> > > ____________________________________
> > > | PTE Page Table |
> > > --------------------------------------
> > > <-------><------->
> > > order-1 order-1
> >
> >
> > Reading this, it is unclear why exactly do we need the stack.
> >
> > Why can't you work with offset + cur_order?
> >
> > Initially,
> >
> > offset = 0;
> > cur_order = HPAGE_PMD_ORDER;
> >
> > If collapse succeeded, advance to next range.
> > If collapse failed, try next smaller order, keeping offset unchanged.
> >
> > if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> > /* Try next smaller order. */
> > cur_order = cur_order - 1;
>
> OK this matches the stack for the 0 offset entries...
>
> > } else {
> > /* Skip to next chunk. */
> > offset += 1 << cur_order;
> > cur_order = max_order_from_offset(offset);
>
> Then 1 << 2 -> 4 so go to offset=4.
>
> max_order_from_offset(4) = 2. so (4, offset=2) same as above.
>
> Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3
>
> And, yeah this seems equivalent.
>
> > }
>
> >
> > Of course, handling disabled orders. max_order_from_offset() is rather trivial
> > (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> Something like?
>
> static unsigned long max_order_from_offset(unsigned long offset)
> {
> if (!offset)
> return HPAGE_PMD_ORDER;
>
> return ilog2(offset);
Oops, we need the LSB so ffs.
> }
>
> >
> > What's the benefit of the stack?
>
> Yeah it seems equivalent. Good idea!
>
> Thanks, Lorenzo
^ permalink raw reply
* [PATCH] rtla/tests: Fix pgrep filter in get_workload_pids.sh
From: Tomas Glozar @ 2026-06-04 14:05 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar
Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
Wander Lairson Costa, LKML, linux-trace-kernel
Multiple runtime tests in RTLA rely on the get_workload_pids() shell
helper function to get the PIDs of both kernel and user workloads.
On some systems (e.g. Fedora 43), pgrep matches kernel thread names
including square brackets: "[osnoise/0]"; on other systems (e.g.
RHEL 9.8), brackets are not included: "osnoise/0".
Accept both as valid workload PIDs rather that just the non-bracket form
to make the tests work on all systems.
Fixes: a98dad63cda3 ("rtla/tests: Add runtime test for -k and -u options")
Reported-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
Note: the file touched by this commit is included by .gitignore, that is
an error that will be fixed by [1].
[1] https://lore.kernel.org/linux-trace-kernel/20260601091835.3118094-1-tglozar@redhat.com/
tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh b/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
index 8aff98cd2c1f..d10a4e3b321d 100644
--- a/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
+++ b/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
@@ -5,7 +5,7 @@ get_workload_pids() {
local rtla_pid=$(ps -o ppid= $shell_pid)
# kernel threads
- pgrep -P $(pgrep ^kthreadd$) -f '^(osnoise|timerlat)/[0-9]+$'
+ pgrep -P $(pgrep ^kthreadd$) -f '^\[?(osnoise|timerlat)/[0-9]+\]?$'
# user threads
pgrep -P $rtla_pid | grep -v "^$shell_pid$"
}
--
2.54.0
^ permalink raw reply related
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:14 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
usamaarif642
In-Reply-To: <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com>
On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok heres what the diff looks like on top of my V19.
>
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.
>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <npache@redhat.com>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 96 ++++++++-----------------------------------------
> 1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> #define KHUGEPAGED_MIN_MTHP_ORDER 2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> - u16 offset;
> - u8 order;
> -};
>
> struct collapse_control {
> bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
> /* Each bit represents a single occupied (!none/zero) page. */
> DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> - u16 offset, u8 order)
> -{
> - const int size = *stack_size;
> - struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> - stack->order = order;
> - stack->offset = offset;
> - (*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> - int *stack_size)
> -{
> - const int size = *stack_size;
> -
> - VM_WARN_ON_ONCE(size <= 0);
> - (*stack_size)--;
> - return cc->mthp_bitmap_stack[size - 1];
> -}
> -
> /*
> * mthp_collapse() consumes the bitmap that is generated during
> * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> *
> * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - * offset mid_offset
> - * | |
> - * | |
> - * v v
> - * --------------------------------------
> - * | cc->mthp_present_ptes |
> - * --------------------------------------
> - * <-------><------->
> - * order-1 order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.
Yeah this is not good enough sorry, before there was some kind of explanation of
the algortihm, just because you can explain the _code_ more simply, that's not
very useful.
I had to sit down and spend quite a bit of time to figure out how the actual
output looks so I think that should be explained.
> *
> * For each of these, we determine how many PTE entries are occupied in the
> * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> {
> unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> enum scan_result last_result = SCAN_FAIL;
> - int collapsed = 0, stack_size = 0;
> + int collapsed = 0;
> bool alloc_failed = false;
> unsigned long collapse_address;
> - struct mthp_range range;
> - u16 offset;
> - u8 order;
> + unsigned int offset = 0;
> + unsigned int order = HPAGE_PMD_ORDER;
>
> - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> - while (stack_size) {
> - range = collapse_mthp_stack_pop(cc, &stack_size);
> - order = range.order;
> - offset = range.offset;
> + while (offset < HPAGE_PMD_NR) {
> nr_ptes = 1UL << order;
>
> if (!test_bit(order, &enabled_orders))
> goto next_order;
>
> max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
> nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> collapsed += nr_ptes;
> fallthrough;
> case SCAN_PTE_MAPPED_HUGEPAGE:
> - continue;
> + goto next_offset;
> /* Cases where lower orders might still succeed */
> case SCAN_ALLOC_HUGE_PAGE_FAIL:
> alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> }
>
This obviously needs some comments describing what you're doing here. I think
David said so too.
> next_order:
> - if ((BIT(order) - 1) & enabled_orders) {
> - const u8 next_order = order - 1;
> - const u16 mid_offset = offset + (nr_ptes / 2);
> -
> - collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> - next_order);
> - collapse_mthp_stack_push(cc, &stack_size, offset,
> - next_order);
> + if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> + (BIT(order) - 1) & enabled_orders) {
Wait, if I disable an order this changes the way we get mTHP doesn't it?
Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:
9->8->7->6->5->5->6->5->5->7
And we simply can't get order-3 no?
This seems broken doesn't it? Maybe I'm missing something?
> + order = order - 1;
order--?
> + continue;
> }
> +next_offset:
> + offset += nr_ptes;
> + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
Also wouldn't, in the case where an enabled order check above skips an order--,
we could have offset=0 here and end up just looping around checking from (0,
HPAGE_PMD_ORDER) again? That also seems broken?
Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
behaviour no?
https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
Says as much?
I guess we're assuming we're not going to get to 0 here, but that could do with
a comment or a VM_WARN_ON_ONCE() at least.
Also why aren't we making this a function?
static inline unsigned int max_order_from_offset(unsigned int offset)
{
if (!offset)
return HPAGE_PMD_ORDER;
return __ffs(offset);
}
Though __ffs() works on unsigned long... probably... OK?
> }
> done:
> if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>> memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>> nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>>> +
> >>>> + /*
> >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >>>> + * scan all pages to populate the bitmap for mTHP collapse.
> >>>> + */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> >>> I am not sure if scanning some more page table entries is really that critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right now
> > feel like a good optimization, but likely an irrelevant one in a near future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing things
> > here.
> >
>
Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:19 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
usamaarif642
In-Reply-To: <aiGFIrg8_vZZxnPg@lucifer>
On Thu, Jun 04, 2026 at 03:14:59PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
> >
> >
> > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> > >>>
> > >>> Reading this, it is unclear why exactly do we need the stack.
> > >>
> > >> So I looked into your items below. It seems logical, and I think it
> > >> works the same way; however, your method seems slightly harder to
> > >> understand due to all the edge cases and more error-prone to future
> > >> changes (the stack holds implicit knowledge of the offset/order that
> > >> must now be tracked in the edge cases).
> > >>
> > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> > >> worth saving that small amount of memory. Although we would also be
> > >> getting rid of (3?) functions, so both approaches have pros and cons.
> > >
> > > I consider a simple forward loop over the offset ... less complexity compared to
> > > a stack structure :)
> > >
> > >>
> > >> I will implement a patch comparing your solution against mine and send
> > >> it here, then we can decide which approach is better.
> > >
> > > Right, throw it over the fence and I'll see how to improve it further.
> >
> > Ok heres what the diff looks like on top of my V19.
> >
> > you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
> >
> > So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.
> >
> >
> > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> > From: Nico Pache <npache@redhat.com>
> > Date: Tue, 2 Jun 2026 10:26:18 -0600
> > Subject: [PATCH] convert from stack to forward loop
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 96 ++++++++-----------------------------------------
> > 1 file changed, 15 insertions(+), 81 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 498eba009751..6de935e76ceb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > #define KHUGEPAGED_MIN_MTHP_ORDER 2
> > -/*
> > - * mthp_collapse() does an iterative DFS over a binary tree, from
> > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > - * size needed for a DFS on a binary tree is height + 1, where
> > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > - *
> > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > - */
> > -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > -
> > -/*
> > - * Defines a range of PTE entries in a PTE page table which are being
> > - * considered for mTHP collapse.
> > - *
> > - * @offset: the offset of the first PTE entry in a PMD range.
> > - * @order: the order of the PTE entries being considered for collapse.
> > - */
> > -struct mthp_range {
> > - u16 offset;
> > - u8 order;
> > -};
> >
> > struct collapse_control {
> > bool is_khugepaged;
> > @@ -137,7 +115,6 @@ struct collapse_control {
> >
> > /* Each bit represents a single occupied (!none/zero) page. */
> > DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> > - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> > };
> >
> > /**
> > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> > return result;
> > }
> >
> > -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > - u16 offset, u8 order)
> > -{
> > - const int size = *stack_size;
> > - struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > -
> > - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > - stack->order = order;
> > - stack->offset = offset;
> > - (*stack_size)++;
> > -}
> > -
> > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > - int *stack_size)
> > -{
> > - const int size = *stack_size;
> > -
> > - VM_WARN_ON_ONCE(size <= 0);
> > - (*stack_size)--;
> > - return cc->mthp_bitmap_stack[size - 1];
> > -}
> > -
> > /*
> > * mthp_collapse() consumes the bitmap that is generated during
> > * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > *
> > * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> > - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> > - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> > - * of variables (offset, order), indicating the number of PTEs from the start
> > - * of the PMD, and the order of the potential collapse candidate respectively.
> > - * We start at the PMD order and check if it is eligible for collapse; if not,
> > - * we add two entries to the stack at a lower order to represent the left and
> > - * right halves of the PTE page table we are examining.
> > - *
> > - * offset mid_offset
> > - * | |
> > - * | |
> > - * v v
> > - * --------------------------------------
> > - * | cc->mthp_present_ptes |
> > - * --------------------------------------
> > - * <-------><------->
> > - * order-1 order-1
> > + * page. We start at the PMD order and check if it is eligible for collapse;
> > + * if not, we check the left and right halves of the PTE page table we are
> > + * examining at a lower order.
>
> Yeah this is not good enough sorry, before there was some kind of explanation of
> the algortihm, just because you can explain the _code_ more simply, that's not
> very useful.
>
> I had to sit down and spend quite a bit of time to figure out how the actual
> output looks so I think that should be explained.
>
> > *
> > * For each of these, we determine how many PTE entries are occupied in the
> > * range of PTE entries we propose to collapse, then we compare this to a
> > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > {
> > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > enum scan_result last_result = SCAN_FAIL;
> > - int collapsed = 0, stack_size = 0;
> > + int collapsed = 0;
> > bool alloc_failed = false;
> > unsigned long collapse_address;
> > - struct mthp_range range;
> > - u16 offset;
> > - u8 order;
> > + unsigned int offset = 0;
> > + unsigned int order = HPAGE_PMD_ORDER;
> >
> > - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >
> > - while (stack_size) {
> > - range = collapse_mthp_stack_pop(cc, &stack_size);
> > - order = range.order;
> > - offset = range.offset;
> > + while (offset < HPAGE_PMD_NR) {
> > nr_ptes = 1UL << order;
> >
> > if (!test_bit(order, &enabled_orders))
> > goto next_order;
> >
> > max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> > -
> > nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> > offset + nr_ptes);
> >
> > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > collapsed += nr_ptes;
> > fallthrough;
> > case SCAN_PTE_MAPPED_HUGEPAGE:
> > - continue;
> > + goto next_offset;
> > /* Cases where lower orders might still succeed */
> > case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > alloc_failed = true;
> > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > }
> >
>
> This obviously needs some comments describing what you're doing here. I think
> David said so too.
>
> > next_order:
> > - if ((BIT(order) - 1) & enabled_orders) {
> > - const u8 next_order = order - 1;
> > - const u16 mid_offset = offset + (nr_ptes / 2);
> > -
> > - collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > - next_order);
> > - collapse_mthp_stack_push(cc, &stack_size, offset,
> > - next_order);
> > + if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> > + (BIT(order) - 1) & enabled_orders) {
>
> Wait, if I disable an order this changes the way we get mTHP doesn't it?
>
> Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:
>
> 9->8->7->6->5->5->6->5->5->7
>
> And we simply can't get order-3 no?
>
> This seems broken doesn't it? Maybe I'm missing something?
OK it's the way this is written, very confusing. I do not know why you are
writing this code in this 'compressed' way.
(1 << order) - 1) & enabled_orders is to see if there's _any others_ to check.
>
>
> > + order = order - 1;
>
> order--?
>
> > + continue;
> > }
> > +next_offset:
> > + offset += nr_ptes;
> > + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
>
> Also wouldn't, in the case where an enabled order check above skips an order--,
> we could have offset=0 here and end up just looping around checking from (0,
> HPAGE_PMD_ORDER) again? That also seems broken?
Yeah sorry the offset += nr_ptes fixes that anyway.
And the fact it's a mask check above makes this OK.
So the logic seems probably fine but it needs to be clearer.
>
> Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
> behaviour no?
>
> https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
> Says as much?
>
> I guess we're assuming we're not going to get to 0 here, but that could do with
> a comment or a VM_WARN_ON_ONCE() at least.
>
> Also why aren't we making this a function?
>
> static inline unsigned int max_order_from_offset(unsigned int offset)
> {
> if (!offset)
> return HPAGE_PMD_ORDER;
>
> return __ffs(offset);
> }
>
> Though __ffs() works on unsigned long... probably... OK?
>
> > }
> > done:
> > if (collapsed)
> > --
> > 2.54.0
> >
> >
> >
> > >
> > > [...]
> > >
> > >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > >>>> memset(cc->node_load, 0, sizeof(cc->node_load));
> > >>>> nodes_clear(cc->alloc_nmask);
> > >>>> +
> > >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > >>>> +
> > >>>> + /*
> > >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > >>>> + * scan all pages to populate the bitmap for mTHP collapse.
> > >>>> + */
> > >>>
> > >>> You should note here, that we re-verify in mthp_collapse().
> > >>>
> > >>> But the question is, whether we should relocate the check completely into
> > >>> mthp_collapse(), instead of conditionally duplicating it.
> > >>>
> > >>> What speaks against always populating the bitmap and making the decision in
> > >>> mthp_collapse()?
> > >>>
> > >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> > >>> I am not sure if scanning some more page table entries is really that critical here.
> > >>
> > >> Someone asked me to preserve the legacy behavior (PMD only). Although
> > >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> > >> have to do 511 iterations for no reason if PMD collapse is the only
> > >> enabled order rather than bailing immediately.
> > >>
> > >> I'm ok with dropping it, but I think its the correct approach (despite
> > >> the extra complexity). @Usama Arif brought up this point here
> > >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> > >
> > > We talk about regressions, but I am not sure if we care about scanning speed
> > > within a page table that much?
> > >
> > > After all, we locked it and already read some entries.
> > >
> > > Having the same check at two places to optimize for PMD order might right now
> > > feel like a good optimization, but likely an irrelevant one in a near future?
> > >
> > > Anyhow, won't push back, as long as we document why we are special casing things
> > > here.
> > >
> >
>
> Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:40 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lance Yang, Nico Pache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <46bb9d9e-03f0-4e26-9ac9-1cdc5ba9bf4d@kernel.org>
On Wed, Jun 03, 2026 at 10:05:08AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 17:44, Lance Yang wrote:
> >
> >
> > On 2026/6/2 18:58, Nico Pache wrote:
> >> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
> >>>
> >>>
> >>> [...]
> >>>
> >>> Hmm ... don't we lose the allocation-failure result here?
> >>>
> >>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> >>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> >>> in khugepaged_do_scan().
> >>>
> >>> Now if allocation fails and nr_collapsed stays 0, we just return
> >>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> >>
> >> Ok I did the error propagation! I think I handled both of these cases
> >> you brought up pretty easily.
> >
> > Thanks.
> >
> >> However I don't know what to do in the following case: We successfully
> >> collapsed some portion of the PMD, but during that process, we also
> >> hit an allocation failure. Is it best to back off entirely? or can we
> >> treat some forward progress as a sign we can continue trying collapses
> >> without sleeping.
> >>
> >> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> >> successful collapses as the returned value?
> >
> > Thinking out loud, forward progress should win here, the allocation
> > failure only matter if we made no progress at all?
>
> Agreed, in the first approach, forward progress makes sense.
Sounds sensible to me.
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox