* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-05 7:18 UTC
To: Nico Pache
Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <CAA1CXcBSPVG4CJFCBDvbuodcJ_7eXoDQTpK0ZN0HEhkDPi-DEw@mail.gmail.com>
On 6/4/26 19:04, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>>
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
>
> I've implemented something slightly different actually and I *think* its better!
>
> } else {
> /* this is map_anon_folio_pte_nopf with no mmu update */
> __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> /*uffd_wp=*/ false);
> smp_wmb();
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> /*
> * Some architectures (e.g. MIPS) walk the live page table in
> * their implementation. update_mmu_cache_range() must be called
> * with a valid page table hierarchy and the PTE lock held.
> * Acquire it nested inside pmd_ptl when they are distinct locks.
> */
> if (pte_ptl != pmd_ptl)
> spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> if (pte_ptl != pmd_ptl)
> spin_unlock(pte_ptl);
> }
> spin_unlock(pmd_ptl);
>
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
>
> the SMP_WMB makes sure of the above
>
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
>
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.
After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
PMD already? So the case here is rather different.
--
Cheers,
David
* [PATCH] tracing: reject invalid preemptirq_delay_test CPU affinity
From: Samuel Moelius @ 2026-06-05 0:40 UTC
To: Steven Rostedt
Cc: Samuel Moelius, Masami Hiramatsu, Mathieu Desnoyers,
open list:TRACING, open list:TRACING
preemptirq_delay_test accepts cpu_affinity as a module parameter and,
when it is non-negative, writes that CPU directly into a temporary
cpumask from the worker thread. Values outside nr_cpu_ids can set a
bit outside the allocated cpumask before the test reports a normal
affinity error.
Validate the requested CPU before starting the worker thread, and
return -EINVAL for invalid affinity requests.
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
kernel/trace/preemptirq_delay_test.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
index acb0c971a408..0f017799754a 100644
--- a/kernel/trace/preemptirq_delay_test.c
+++ b/kernel/trace/preemptirq_delay_test.c
@@ -14,6 +14,7 @@
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/printk.h>
+#include <linux/cpumask.h>
#include <linux/string.h>
#include <linux/sysfs.h>
#include <linux/completion.h>
@@ -152,6 +153,15 @@ static int preemptirq_run_test(void)
struct task_struct *task;
char task_name[50];
+ if (cpu_affinity > -1) {
+ unsigned int cpu = cpu_affinity;
+
+ if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) {
+ pr_err("cpu_affinity:%d, invalid CPU\n", cpu_affinity);
+ return -EINVAL;
+ }
+ }
+
init_completion(&done);
snprintf(task_name, sizeof(task_name), "%s_test", test_mode);
--
2.43.0
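For illustration only, the check added by this patch can be modelled in userspace C. This is a sketch under stated assumptions, not the kernel code: `TOY_NR_CPU_IDS`, `toy_possible_mask`, `struct toy_cpumask` and the `toy_*` helpers are hypothetical stand-ins for `nr_cpu_ids`, `cpu_possible()` and the real cpumask API. The point it tries to show is that the range check runs before any bit in the fixed-size mask is written.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nr_cpu_ids; the real value is set at boot. */
#define TOY_NR_CPU_IDS 8u
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Fixed-size bitmap, sized for TOY_NR_CPU_IDS bits only. */
struct toy_cpumask {
	unsigned long bits[(TOY_NR_CPU_IDS + TOY_BITS_PER_LONG - 1) /
			   TOY_BITS_PER_LONG];
};

/* Assumption for the sketch: only CPUs 0..5 are "possible". */
static const unsigned long toy_possible_mask = 0x3fUL;

static bool toy_cpu_possible(unsigned int cpu)
{
	return cpu < TOY_NR_CPU_IDS && ((toy_possible_mask >> cpu) & 1UL);
}

/* Negative values mean "no affinity requested" and are accepted. */
static int toy_validate_affinity(int cpu_affinity)
{
	if (cpu_affinity > -1) {
		unsigned int cpu = (unsigned int)cpu_affinity;

		if (cpu >= TOY_NR_CPU_IDS || !toy_cpu_possible(cpu))
			return -EINVAL;
	}
	return 0;
}

/* Validate first, and only then touch the bitmap. */
static int toy_set_affinity(struct toy_cpumask *mask, int cpu_affinity)
{
	int ret = toy_validate_affinity(cpu_affinity);

	if (ret)
		return ret;

	if (cpu_affinity > -1) {
		unsigned int cpu = (unsigned int)cpu_affinity;

		mask->bits[cpu / TOY_BITS_PER_LONG] |=
			1UL << (cpu % TOY_BITS_PER_LONG);
	}
	return 0;
}
```

In this model, an out-of-range request is rejected before the array index is computed, so the mask is never written past its allocated words.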
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-04 21:14 UTC
To: Sean Christopherson
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aiHeDZEPkAcWcSkn@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Wed, Jun 03, 2026, Ackerley Tng wrote:
>> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
>> writes:
>>
>> > This is v7 of guest_memfd in-place conversion support.
>> >
>>
>> Here's the outstanding items after going over everyone's comments
>> including Sashiko's:
>>
>> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>> + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>> leak where populate can put initialized kernel memory into TDX
>> guest
>> + See suggested fix at [1]
>
> That fix works for me. The initial guest image will typically be a tiny subset
> of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
>
In regular usage, moving the zeroing as in [1] doesn't change anything,
since the same zeroing would already have happened when the host faulted
in the pages to write the initial image. When populating, there's no
further zeroing since the pages were already zeroed.
[1] covers the case where the host doesn't write anything to the pages
and directly tries to populate the pages to the guest.
>> + KVM: guest_memfd: Only prepare folios for private pages,
>> + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>> supported for non-CoCo VMs in a later patch in this series
>> + Use Suggested-by: Michael Roth <michael.roth@amd.com>
>> + KVM: selftests: Test that shared/private status is consistent across
>> processes
>> + Improve test reliability using pthread_mutex
>> + I have a fixup patch offline.
>>
>> I would like feedback on these:
>>
>> + KVM: selftests: Test conversion with elevated page refcount
>> + Askar pointed out that soon vmsplice may not pin pages. Should I
>> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> take a dependency on CONFIG_GUP_TEST.
>
> I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> it probably is the least awful choice. E.g. KVM also pins pages is certain flows,
> but we're _also_ actively working to remove the need to pin.
>
> Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
> memory" syscall.
>
Hmm, that takes a dependency on io_uring, which isn't always compiled
in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather depend on
CONFIG_GUP_TEST.
>> + KVM: selftests: Add script to exercise private_mem_conversions_test
>> + Would like to know what people think of a wrapper script before
>> I address Sashiko's comments.
>
> NAK to a wrapper script. This sounds like a perfect fit for Vipin's selftest
> runner (which I'm like 4 months overdue for reviewing, testing, and merging).
> If the runner _can't_ do what you want, then I'd rather improve the runner.
>
> [*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>
Good to know we have this!
Thanks, I'll work on a v8 to clean up the above.
>>
>> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
>> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>>
>> >
>> > [...snip...]
>> >
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-04 20:20 UTC
To: Ackerley Tng
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGpaggjd3=ooyzv7iEbmA-x1mWJHgjLSjPi8=5CPrk-yQ@mail.gmail.com>
On Wed, Jun 03, 2026, Ackerley Tng wrote:
> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
> writes:
>
> > This is v7 of guest_memfd in-place conversion support.
> >
>
> Here's the outstanding items after going over everyone's comments
> including Sashiko's:
>
> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
> + Need to move page clearing into __kvm_gmem_get_pfn to resolve
> leak where populate can put initialized kernel memory into TDX
> guest
> + See suggested fix at [1]
That fix works for me. The initial guest image will typically be a tiny subset
of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
> + KVM: guest_memfd: Only prepare folios for private pages,
> + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
> supported for non-CoCo VMs in a later patch in this series
> + Use Suggested-by: Michael Roth <michael.roth@amd.com>
> + KVM: selftests: Test that shared/private status is consistent across
> processes
> + Improve test reliability using pthread_mutex
> + I have a fixup patch offline.
>
> I would like feedback on these:
>
> + KVM: selftests: Test conversion with elevated page refcount
> + Askar pointed out that soon vmsplice may not pin pages. Should I
> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
> take a dependency on CONFIG_GUP_TEST.
I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
it probably is the least awful choice. E.g. KVM also pins pages in certain flows,
but we're _also_ actively working to remove the need to pin.
Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
memory" syscall.
> + KVM: selftests: Add script to exercise private_mem_conversions_test
> + Would like to know what people think of a wrapper script before
> I address Sashiko's comments.
NAK to a wrapper script. This sounds like a perfect fit for Vipin's selftest
runner (which I'm like 4 months overdue for reviewing, testing, and merging).
If the runner _can't_ do what you want, then I'd rather improve the runner.
[*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>
> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>
> >
> > [...snip...]
> >
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Michael Roth @ 2026-06-04 20:11 UTC
To: Suzuki K Poulose
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>
On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> > From: Michael Roth <michael.roth@amd.com>
> >
> > For vm_memory_attributes=1, in-place conversion/population is not
> > supported, so the initial contents necessarily must need to come
> > from a separate src address, which is enforced by the current
> > implementation. However, for vm_memory_attributes=0, it is possible for
> > guest memory to be initialized directly from userspace by mmap()'ing the
> > guest_memfd and writing to it while the corresponding GPA ranges are in
> > a 'shared' state before converting them to the 'private' state expected
> > by KVM_SEV_SNP_LAUNCH_UPDATE.
> >
> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> > copy in data from a separate memory location. Continue to enforce
> > non-NULL for the original vm_memory_attributes=1 case.
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > [Added src_page check in error handling path when the firmware command fails]
> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
>
>
>
> > ---
> > Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> > arch/x86/kvm/svm/sev.c | 18 +++++++++++++-----
> > virt/kvm/kvm_main.c | 1 +
> > 3 files changed, 25 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > index b2395dd4769de..43085f65b2d85 100644
> > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > @@ -503,7 +503,8 @@ secrets.
> > It is required that the GPA ranges initialized by this command have had the
> > KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> > -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> > +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> > +this aspect.
> > Upon success, this command is not guaranteed to have processed the entire
> > range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > remaining range that has yet to be processed. The caller should continue
> > calling this command until those fields indicate the entire range has been
> > processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> > -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> > -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> > -``uaddr`` will be ignored completely.
> > +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> > +userspace-provided source buffer address plus 1.
> > +
> > +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> > +ignored completely. Otherwise, ``uaddr`` is required if
> > +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> > +in the latter case guest memory can be initialized directly from userspace
> > +prior to converting it to private and passing the GPA range on to this
> > +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
> process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
sev_gmem_prepare() does sort of destroy contents since it finalizes the
shared->private conversion which puts the page in an unusable state
until the guest 'accepts' it as private memory and re-initializes the
contents.
But that's run-time, when the guest is doing conversions. The
documentation here relates to initialization time, when we are
setting up the initial pre-encrypted/pre-measured guest memory image
via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
is then the sev_gmem_post_populate() callback that actually finalizes the
shared->private conversion. The sev_gmem_prepare() hook doesn't get used
in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn(), which skips
preparation).
-Mike
>
> Suzuki
>
>
>
> > Parameters (in): struct kvm_sev_snp_launch_update
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 1a361f08c7a3d..e1dbc827c2807 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > int level;
> > int ret;
> > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> > + /*
> > + * For vm_memory_attributes=1, in-place conversion/population is not
> > + * supported, so the initial contents necessarily need to come from a
> > + * separate src address. For vm_memory_attributes=0, this isn't
> > + * necessarily the case, since the pages may have been populated
> > + * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> > + */
> > + if (vm_memory_attributes &&
> > + sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> > return -EINVAL;
> > ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> > @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > */
> > if (ret && !snp_page_reclaim(kvm, pfn) &&
> > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> > - sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> > + sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> > void *src_vaddr = kmap_local_page(src_page);
> > void *dst_vaddr = kmap_local_pfn(pfn);
> > @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> > return -EFAULT;
> > - pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> > - params.gfn_start, params.len, params.type, params.flags);
> > + pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> > + params.gfn_start, params.len, params.type, params.flags, params.uaddr);
> > if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> > (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> > @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > params.gfn_start += count;
> > params.len -= count * PAGE_SIZE;
> > - if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > + if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > params.uaddr += count * PAGE_SIZE;
> > if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ba195bb239aaa..3bf212fd99193 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> > #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> > bool vm_memory_attributes = true;
> > module_param(vm_memory_attributes, bool, 0444);
> > +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> > #endif
> > DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> > EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> >
>
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-06-04 19:05 UTC
To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
> [...snip...]
>
>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>> +ignored completely. Otherwise, ``uaddr`` is required if
>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>> +in the latter case guest memory can be initialized directly from userspace
>> +prior to converting it to private and passing the GPA range on to this
>> +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
> the process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
>
> Suzuki
>
The following is the guest_memfd perspective; I didn't look at the SNP
spec:
Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
type?
guest_memfd has no plans to do any special zeroing based on type.
guest_memfd decoupled zeroing from preparation a while ago (Michael had
some patches), so zeroing is supposed to happen once during folio ownership
by guest_memfd, tracked by the uptodate flag, and preparation is tracked
outside of guest_memfd. So far only SNP does preparation.
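As a rough illustration of "zero once, tracked by the uptodate flag", here is a toy userspace model in C. It is a sketch only and makes no claim about the actual guest_memfd code: `struct toy_folio`, `toy_get_page()`, `toy_host_fault()` and `toy_populate()` are hypothetical names, and `zero_count` exists purely to make the behaviour observable.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for a folio owned by guest_memfd (hypothetical). */
struct toy_folio {
	bool uptodate;				/* models the uptodate flag */
	unsigned int zero_count;		/* instrumentation for the sketch */
	unsigned char data[TOY_PAGE_SIZE];	/* may hold stale bytes initially */
};

/*
 * Common lookup path: zero the page the first time it is handed out,
 * then remember that via the uptodate flag so it is never zeroed again.
 */
static unsigned char *toy_get_page(struct toy_folio *folio)
{
	if (!folio->uptodate) {
		memset(folio->data, 0, sizeof(folio->data));
		folio->uptodate = true;
		folio->zero_count++;
	}
	return folio->data;
}

/* Host fault path: the host gets a zeroed page it can then write to. */
static unsigned char *toy_host_fault(struct toy_folio *folio)
{
	return toy_get_page(folio);
}

/* Populate path: goes through the same lookup, so it never sees stale bytes. */
static const unsigned char *toy_populate(struct toy_folio *folio)
{
	return toy_get_page(folio);
}
```

In this model, populating a page the host never touched yields zeroes rather than stale data, while a page the host already wrote keeps its contents because the flag suppresses a second zeroing.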
>
>
>>
>> [...snip...]
>>
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 18:12 UTC
To: Nico Pache
Cc: David Hildenbrand (Arm), Lance Yang, linux-doc, linux-kernel,
linux-mm, linux-trace-kernel, aarcange, akpm, anshuman.khandual,
apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
corbet, dave.hansen, dev.jain, gourry, hannes, hughd, jack,
jackmanb, jannh, jglisse, joshua.hahnjy, kas, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
usama.arif
In-Reply-To: <CAA1CXcBSPVG4CJFCBDvbuodcJ_7eXoDQTpK0ZN0HEhkDPi-DEw@mail.gmail.com>
On Thu, Jun 04, 2026 at 11:04:35AM -0600, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> > >
> > > On 6/1/26 12:47, Lance Yang wrote:
> > > >
> > > >
> > > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > > >> On 6/1/26 11:08, Lance Yang wrote:
> > > >>>
> > > >>>
> > > >>>
> > > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > > >>> move it before the earlier pmd_populate().
> > > >>>
> > > >>> IIUC, the ordering we want is still:
> > > >>>
> > > >>> clear old PTEs
> > > >>> smp_wmb()
> > > >>> pmd_populate()
> > > >>>
> > > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > > >>> the old PTEs, right?
> > > >>
> > > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > > >
> > > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > > >
> > > > So yeah, no extra smp_wmb() needed here!
> > >
> > > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > > draft.
> >
> > Okay, I read the above and did some investigating.
> >
> > I will try to implement and verify the changes you suggested :)
>
> I've implemented something slightly different actually and I *think* its better!
>
> } else {
> /* this is map_anon_folio_pte_nopf with no mmu update */
> __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> /*uffd_wp=*/ false);
> smp_wmb();
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> /*
> * Some architectures (e.g. MIPS) walk the live page table in
> * their implementation. update_mmu_cache_range() must be called
> * with a valid page table hierarchy and the PTE lock held.
> * Acquire it nested inside pmd_ptl when they are distinct locks.
> */
> if (pte_ptl != pmd_ptl)
> spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> if (pte_ptl != pmd_ptl)
> spin_unlock(pte_ptl);
> }
> spin_unlock(pmd_ptl);
>
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
>
> the SMP_WMB makes sure of the above
>
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
>
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.
>
> I think both approaches are correct but this prevents any possibility
> of my first point. although mmap_write_lock prevents this too.
>
> Let me know what you think. I can revert to your implementation but
> this is what I tested.
Yeah let's go with the original implementation please :)
Thanks!
>
> Cheers,
> -- Nico
>
> >
> > Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> > before walking a PTE table?
> >
> > -- Nico
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David
> > >
>
Cheers, Lorenzo
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 17:04 UTC
To: David Hildenbrand (Arm)
Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <CAA1CXcAeEGOsqp-ywAQ7GMYQzXEeco-rUxUkk2hEF69HybC4=w@mail.gmail.com>
On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>
> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >
> > On 6/1/26 12:47, Lance Yang wrote:
> > >
> > >
> > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > >> On 6/1/26 11:08, Lance Yang wrote:
> > >>>
> > >>>
> > >>>
> > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > >>> move it before the earlier pmd_populate().
> > >>>
> > >>> IIUC, the ordering we want is still:
> > >>>
> > >>> clear old PTEs
> > >>> smp_wmb()
> > >>> pmd_populate()
> > >>>
> > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > >>> the old PTEs, right?
> > >>
> > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > >
> > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > >
> > > So yeah, no extra smp_wmb() needed here!
> >
> > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > draft.
>
> Okay, I read the above and did some investigating.
>
> I will try to implement and verify the changes you suggested :)
I've implemented something slightly different actually and I *think* it's better!
    } else {
            /* this is map_anon_folio_pte_nopf with no mmu update */
            __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
                                      /*uffd_wp=*/ false);
            smp_wmb();
            pmd_populate(mm, pmd, pmd_pgtable(_pmd));
            /*
             * Some architectures (e.g. MIPS) walk the live page table in
             * their implementation. update_mmu_cache_range() must be called
             * with a valid page table hierarchy and the PTE lock held.
             * Acquire it nested inside pmd_ptl when they are distinct locks.
             */
            if (pte_ptl != pmd_ptl)
                    spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
            update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
            if (pte_ptl != pmd_ptl)
                    spin_unlock(pte_ptl);
    }
    spin_unlock(pmd_ptl);
The logic here is that when the PMD becomes visible, the PTEs are already
populated (no possibility of spurious faults on the local CPU).
The smp_wmb() makes sure of the above.
And the PMD is installed with the PTE and PMD locks both held through
the mmu_cache update.
This follows the conventions used in pmd_install() and clears the
potential for local CPU faults hitting cleared PTE entries.
I think both approaches are correct, but this prevents any possibility
of my first point, although mmap_write_lock prevents this too.
Let me know what you think. I can revert to your implementation, but
this is what I tested.
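As an aside, the "populate the entries, barrier, then make the table reachable" ordering can be sketched as a userspace analogy with C11 atomics. This is not kernel code and only models the idea: `toy_table`, `toy_slot`, `toy_publish()` and `toy_lookup()` are hypothetical names, and a release fence is used here as a loose analogue of smp_wmb() (it is somewhat stronger than a pure store-store barrier).

```c
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NR_ENTRIES 8

/* Toy stand-in for a PTE table (hypothetical). */
struct toy_table {
	unsigned long entry[TOY_NR_ENTRIES];
};

/* Toy stand-in for the PMD slot: NULL plays the role of pmd_none(). */
static _Atomic(struct toy_table *) toy_slot;

/*
 * Fill every entry first, then publish the table pointer. The release
 * fence orders the entry stores before the pointer store, so a reader
 * that observes the pointer also observes the populated entries.
 */
static void toy_publish(struct toy_table *table, unsigned long base)
{
	for (size_t i = 0; i < TOY_NR_ENTRIES; i++)
		table->entry[i] = base + i;

	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&toy_slot, table, memory_order_relaxed);
}

/* Reader: returns -1 if nothing is published yet or idx is out of range. */
static int toy_lookup(size_t idx, unsigned long *out)
{
	struct toy_table *table =
		atomic_load_explicit(&toy_slot, memory_order_acquire);

	if (!table || idx >= TOY_NR_ENTRIES)
		return -1;

	*out = table->entry[idx];
	return 0;
}
```

A reader in this model either sees no table at all or sees a fully populated one; it cannot reach the table through the slot and still observe unpopulated entries.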
Cheers,
-- Nico
>
> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> before walking a PTE table?
>
> -- Nico
>
> >
> > --
> > Cheers,
> >
> > David
> >
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 16:28 UTC
To: Lorenzo Stoakes
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <aiFz-VSKSQ-zBfN7@lucifer>
On Thu, Jun 4, 2026 at 6:56 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> > On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > > I will go review the thread about the cache maintenance separately and
> > > > respond about that.
> > > >
> > > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > > are attempting to collapse to, and offset indicates where in the PMD to
> > > > > start the collapse attempt.
> > > > >
> > > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > > temporarily removed during the collapse phase.
> > > > >
> > > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > >
> > > > The logic LGTM generally, some questions for understanding below, and of
> > > > course as per above I want to review the Lance/David subthread.
> > > >
> > > > Thanks!
> > > >
> > > > > ---
> > > > > mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > > > 1 file changed, 55 insertions(+), 38 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index fab35d318641..d64f42f66236 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > > > * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > > > * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > > > */
> > > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > - int referenced, int unmapped, struct collapse_control *cc)
> > > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > > + int referenced, int unmapped, struct collapse_control *cc,
> > > > > + unsigned int order)
> > > > > {
> > > > > + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > > + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > > > LIST_HEAD(compound_pagelist);
> > > > > pmd_t *pmd, _pmd;
> > > > > - pte_t *pte;
> > > > > + pte_t *pte = NULL;
> > > >
> > > > As mentioned elsewhere for some reason this was dropped in
> > > > mm-unstable. Maybe a bad conflict resolution?
> > > >
> > > > > pgtable_t pgtable;
> > > > > struct folio *folio;
> > > > > spinlock_t *pmd_ptl, *pte_ptl;
> > > > > enum scan_result result = SCAN_FAIL;
> > > > > struct vm_area_struct *vma;
> > > > > struct mmu_notifier_range range;
> > > > > + bool anon_vma_locked = false;
> > > > >
> > > > > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > > -
> > > > > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > > + result = alloc_charge_folio(&folio, mm, cc, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_nolock;
> > > > >
> > > > > mmap_read_lock(mm);
> > > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > - HPAGE_PMD_ORDER);
> > > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > + &vma, cc, order);
> > > > > if (result != SCAN_SUCCEED) {
> > > > > mmap_read_unlock(mm);
> > > > > goto out_nolock;
> > > > > }
> > > > >
> > > > > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > > + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > > > if (result != SCAN_SUCCEED) {
> > > > > mmap_read_unlock(mm);
> > > > > goto out_nolock;
> > > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * released when it fails. So we jump out_nolock directly in
> > > > > * that case. Continuing to collapse causes inconsistency.
> > > > > */
> > > > > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > > - referenced, HPAGE_PMD_ORDER);
> > > > > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > > + referenced, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_nolock;
> > > > > }
> > > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * mmap_lock.
> > > > > */
> > > > > mmap_write_lock(mm);
> > > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > - HPAGE_PMD_ORDER);
> > > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > + &vma, cc, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_up_write;
> > > > > /* check if the pmd is still valid */
> > > > > vma_start_write(vma);
> > >
> > > Hmm actually I think we have another problem here.
> > >
> > > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > > will span the entire PMD.
> > >
> > > But for mTHP we have an issue...
> > >
> > > See below...
> > >
> > > > > - result = check_pmd_still_valid(mm, address, pmd);
> > > > > + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_up_write;
> > > > >
> > > > > anon_vma_lock_write(vma->anon_vma);
> > > > > + anon_vma_locked = true;
> > > >
> > > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > > hold might be an issue.
> > > >
> > > > I wonder if we'll observe lock contention as a result?
> > > >
> > > > Correct me if I'm wrong and we're not holding longer than previously,
> > > > however. Just appears that we do.
> > > >
> > > > >
> > > > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > - address + HPAGE_PMD_SIZE);
> > > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > > + end_addr);
> > > > > mmu_notifier_invalidate_range_start(&range);
> > > > >
> > > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * Parallel GUP-fast is fine since GUP-fast will back off when
> > > > > * it detects PMD is changed.
> > > > > */
> > > > > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > > + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> > >
> > > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > > but for mTHP we might have _another_ VMA that spans another part of the range
> > > mapped by the same PMD entry.
> > >
> > > So we clear this, but we do not have a write lock on any other VMA, and so
> > > racing VMA read locks can install a new PMD entry.
> > >
> > > > > spin_unlock(pmd_ptl);
> > >
> > > Especially since you unlock this :)
> > >
> > > And...
> > >
> > > > > mmu_notifier_invalidate_range_end(&range);
> > > > > tlb_remove_table_sync_one();
> > > > >
> > > > > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > > > if (pte) {
> > > > > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > > - HPAGE_PMD_ORDER,
> > > > > - &compound_pagelist);
> > > > > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > > + order, &compound_pagelist);
> > > > > spin_unlock(pte_ptl);
> > > > > } else {
> > > > > result = SCAN_NO_PTE_TABLE;
> > > > > }
> > > > >
> > > > > if (unlikely(result != SCAN_SUCCEED)) {
> > > > > - if (pte)
> > > > > - pte_unmap(pte);
> > > >
> > > > OK I seem to remember this is because we're holding the anon_vma lock
> > > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > > bit longer as well as the anon_vma lock.
> > > >
> > > > I guess it's also because we need to hold anon_vma and pte lock because
> > > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > > THP did.
> > > >
> > > > (Rememberings going on here :)
> > > >
> > > > > spin_lock(pmd_ptl);
> > > > > - BUG_ON(!pmd_none(*pmd));
> > > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > >
> > > ...this will get triggered.
> > >
> > > I don't know whether we can safely hold the PMD lock across everything here for
> > > mTHP?
> > >
> > > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > > and VMA write lock each of them?
> >
> > I believe we've spoken about this before, but because we always make
>
> Maybe worth a comment then...? Ah how rewarding review is :)
I'll expand the commit message and comment in commit 1 of the series! thanks
>
> This is something that somebody else might very well wonder about and
> forget that it happens to be covered there.
>
> Also:
>
> /* Always check the PMD order to ensure its not shared by another VMA */
>
> Is pretty lightweight there. Something about avoiding racing page faults
> would be helpful.
Yeah, fair enough, the commit message of patch 1 also doesn't really do
it justice on the *why*.
>
> > sure the VMA spans the full PMD we won't ever hit this issue. If we
> > wanted to support mTHP collapse on regions smaller than a PMD, the
> > locking gets tricky (hence the design choice to not do that for now).
> >
> > This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().
>
> The existing code is atrocious, and sticking this on top has added to the
> pile of assumptions and conventions and having to go check a bunch of
> functions to 'just know' you're safe for X, Y, Z.
>
> We really need to see some cleanup series coming after this and I'm going
> to get pretty grumpy(ier) if we don't.
Many more to come :) Improvements too but cleanups first!
Cheers,
-- Nico
>
> >
> > /* Always check the PMD order to ensure its not shared by another VMA */
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> >
> > -- Nico
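For illustration, the full-PMD requirement discussed above can be sketched
in userspace C. This is a simplified model, not the kernel's
thp_vma_suitable_order(); the helper name vma_spans_pmd() and the
PMD_SHIFT value of 21 (4K pages, 2M PMD) are assumptions made for the
example.

```c
#include <stdbool.h>

/* Assumed for the example: 4K base pages and a 2M PMD. */
#define PMD_SHIFT 21
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/*
 * Hypothetical helper: true only if [vm_start, vm_end) covers the whole
 * PMD-sized, PMD-aligned range containing addr. If this holds, no other
 * VMA can map any part of the range served by the same PMD entry.
 */
static bool vma_spans_pmd(unsigned long vm_start, unsigned long vm_end,
			  unsigned long addr)
{
	unsigned long pmd_start = addr & PMD_MASK;
	unsigned long pmd_end = pmd_start + PMD_SIZE;

	return vm_start <= pmd_start && pmd_end <= vm_end;
}
```

With such a check in place, a VMA that starts or ends partway through the
PMD range is rejected before any collapse is attempted, which is why the
racing-fault scenario should not arise.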
> >
> > >
> > > That could cause some 'interesting' lock contention issues though? Then again,
> > > we will be releasing the mmap write lock soon enough which will drop the VMA
> > > write locks.
> > >
> > > > > /*
> > > > > * We can only use set_pmd_at when establishing
> > > > > * hugepmds and never for establishing regular pmds that
> > > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > */
> > > > > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > > > spin_unlock(pmd_ptl);
> > > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > > goto out_up_write;
> > > > > }
> > > > >
> > > > > /*
> > > > > - * All pages are isolated and locked so anon_vma rmap
> > > > > - * can't run anymore.
> > > > > + * For PMD collapse all pages are isolated and locked so anon_vma
> > > > > + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > > + * removed and not all pages are isolated and locked, so we must hold
> > > >
> > > > Right because some PTE entries will be unaffected by the change.
> > > >
> > > > > + * the lock to prevent neighboring folios from attempting to access
> > > > > + * this PMD until its reinstalled.
> > > >
> > > > OK. This is slightly annoying for my CoW context work as it means there's
> > > > another case where we need to explicitly hold an anon_vma lock for
> > > > correctness :)
> > > >
> > > > Anyway I will think about that separately, is what it is. And in fact
> > > > motivates to want this merged earlier so I can work against it :)
> > > >
> > > >
> > > > > */
> > > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > > + if (is_pmd_order(order)) {
> > > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > > + anon_vma_locked = false;
> > > > > + }
> > > > >
> > > > > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > > - vma, address, pte_ptl,
> > > > > - HPAGE_PMD_ORDER,
> > > > > - &compound_pagelist);
> > > > > - pte_unmap(pte);
> > > > > + vma, start_addr, pte_ptl,
> > > > > + order, &compound_pagelist);
> > > > > if (unlikely(result != SCAN_SUCCEED))
> > > > > goto out_up_write;
> > > > >
> > > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * write.
> > > > > */
> > > > > __folio_mark_uptodate(folio);
> > > > > - pgtable = pmd_pgtable(_pmd);
> > > > > -
> > > > > spin_lock(pmd_ptl);
> > > > > - BUG_ON(!pmd_none(*pmd));
> > > > > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > > > + if (is_pmd_order(order)) {
> > > > > + pgtable = pmd_pgtable(_pmd);
> > > > > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > > + } else {
> > > > > + /*
> > > > > + * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > > + * pmd_ptl lock still held; this is safe as the PMD is expected
> > > >
> > > > PMD entry you mean?
> > > >
> > > > > + * to be none. The pmd entry is then repopulated below.
> > > > > + */
> > > > > + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > > >
> > > > So here we populate entries in the existing PTE _table_ to point at the new
> > > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > > >
> > > > > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > >
> > > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > > before?
> > > >
> > > > > + }
> > > > > spin_unlock(pmd_ptl);
> > > > >
> > > > > folio = NULL;
> > > > >
> > > > > result = SCAN_SUCCEED;
> > > > > out_up_write:
> > > > > + if (anon_vma_locked)
> > > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > > + if (pte)
> > > > > + pte_unmap(pte);
> > > > > mmap_write_unlock(mm);
> > > > > out_nolock:
> > > > > if (folio)
> > > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > > > /* collapse_huge_page expects the lock to be dropped before calling */
> > > > > mmap_read_unlock(mm);
> > > > > result = collapse_huge_page(mm, start_addr, referenced,
> > > > > - unmapped, cc);
> > > > > + unmapped, cc, HPAGE_PMD_ORDER);
> > > > > /* collapse_huge_page will return with the mmap_lock released */
> > > > > *lock_dropped = true;
> > > > > }
> > > > > --
> > > > > 2.54.0
> > > > >
> > >
> > > Thanks, Lorenzo
> > >
> >
>
* RE: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Zhuo, Qiuxu @ 2026-06-04 15:48 UTC (permalink / raw)
To: Xie Yuanbin, david@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, rostedt@goodmis.org,
linmiaohe@huawei.com
Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
mchehab+huawei@kernel.org, Luck, Tony,
torvalds@linux-foundation.org, Lai, Yi1
In-Reply-To: <20260604134209.111533-1-xieyuanbin1@huawei.com>
> From: Xie Yuanbin <xieyuanbin1@huawei.com>
> Sent: Thursday, June 4, 2026 9:42 PM
> To: david@kernel.org; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; bp@alien8.de;
> akpm@linux-foundation.org; rostedt@goodmis.org; linmiaohe@huawei.com
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; linux-trace-kernel@vger.kernel.org;
> mchehab+huawei@kernel.org; Luck, Tony <tony.luck@intel.com>;
> torvalds@linux-foundation.org; xieyuanbin1@huawei.com; Lai, Yi1
> <yi1.lai@intel.com>
> Subject: Re: mm/memory-failure tracepoint change breaks userspace
> rasdaemon
>
> On Thu, 4 Jun 2026 08:42:37 +0200, David Hildenbrand (Arm) wrote:
> > Yeah, if only I had known that we would break user space by changing
> > trace events ... now we know :)
> >
> > Do you have capacity to send a fix?
>
> Sure, with pleasure.
Thanks Yuanbin,
When your patch is ready, we can help test it again if needed.
-Qiuxu
* RE: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Zhuo, Qiuxu @ 2026-06-04 15:43 UTC (permalink / raw)
To: David Hildenbrand (Arm), Steven Rostedt
Cc: Borislav Petkov, mchehab+huawei@kernel.org, Luck, Tony,
akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
> From: David Hildenbrand (Arm) <david@kernel.org>
> [...]
> Would the following be sufficient to avoid a full revert and the dependency
> on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
Thanks all for the discussion on this issue.
We applied David's above fix to v7.1-rc3, tested it, and confirmed that rasdaemon
can again enable and receive the memory_failure event.
Rasdaemon logs:
...
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
...
<...>-2513 [000] ..... 0.000021 memory_failure_event [ALERT] 2026-06-04 23:30:43 +0800 pfn=0x144e6f page_type=dirty LRU page action_result=Recovered
...
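For context on why the TRACE_SYSTEM name is visible to user space:
tracefs exposes each event under events/<system>/<event>/, so a tool that
opens the ras path no longer finds the event once the system is renamed.
A minimal sketch follows; the helper name is hypothetical and this is not
rasdaemon's code.

```c
#include <stdio.h>

/*
 * Build the tracefs "enable" path for an event. The mount point is a
 * parameter because it is commonly /sys/kernel/tracing but can differ.
 * Returns the snprintf() result: the length needed, excluding the NUL.
 */
static int event_enable_path(char *buf, size_t len, const char *tracefs,
			     const char *system, const char *event)
{
	return snprintf(buf, len, "%s/events/%s/%s/enable",
			tracefs, system, event);
}
```

Called with system "ras" and event "memory_failure_event", this yields
the path matching the ras:memory_failure_event name in the log above.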
-Qiuxu
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-04 15:29 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-20-2f0fae496530@google.com>
On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> From: Michael Roth <michael.roth@amd.com>
>
> For vm_memory_attributes=1, in-place conversion/population is not
> supported, so the initial contents must necessarily come
> from a separate src address, which is enforced by the current
> implementation. However, for vm_memory_attributes=0, it is possible for
> guest memory to be initialized directly from userspace by mmap()'ing the
> guest_memfd and writing to it while the corresponding GPA ranges are in
> a 'shared' state before converting them to the 'private' state expected
> by KVM_SEV_SNP_LAUNCH_UPDATE.
>
> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> copy in data from a separate memory location. Continue to enforce
> non-NULL for the original vm_memory_attributes=1 case.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> [Added src_page check in error handling path when the firmware command fails]
> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> arch/x86/kvm/svm/sev.c | 18 +++++++++++++-----
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b2395dd4769de..43085f65b2d85 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -503,7 +503,8 @@ secrets.
>
> It is required that the GPA ranges initialized by this command have had the
> KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> +this aspect.
>
> Upon success, this command is not guaranteed to have processed the entire
> range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> remaining range that has yet to be processed. The caller should continue
> calling this command until those fields indicate the entire range has been
> processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> -``uaddr`` will be ignored completely.
> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> +userspace-provided source buffer address plus 1.
> +
> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> +ignored completely. Otherwise, ``uaddr`` is required if
> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> +in the latter case guest memory can be initialized directly from userspace
> +prior to converting it to private and passing the GPA range on to this
> +interface.
Just to confirm: sev_gmem_prepare() doesn't destroy the contents in
the process of making it "private"? I.e., the contents of an SNP shared
page are preserved while transitioning to "SNP Private" (via RMP
update)?
Suzuki
>
> Parameters (in): struct kvm_sev_snp_launch_update
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 1a361f08c7a3d..e1dbc827c2807 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> int level;
> int ret;
>
> - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> + /*
> + * For vm_memory_attributes=1, in-place conversion/population is not
> + * supported, so the initial contents necessarily need to come from a
> + * separate src address. For vm_memory_attributes=0, this isn't
> + * necessarily the case, since the pages may have been populated
> + * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> + */
> + if (vm_memory_attributes &&
> + sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> return -EINVAL;
>
> ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> */
> if (ret && !snp_page_reclaim(kvm, pfn) &&
> sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> - sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> + sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> void *src_vaddr = kmap_local_page(src_page);
> void *dst_vaddr = kmap_local_pfn(pfn);
>
> @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> return -EFAULT;
>
> - pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> - params.gfn_start, params.len, params.type, params.flags);
> + pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> + params.gfn_start, params.len, params.type, params.flags, params.uaddr);
>
> if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>
> params.gfn_start += count;
> params.len -= count * PAGE_SIZE;
> - if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> + if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> params.uaddr += count * PAGE_SIZE;
>
> if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba195bb239aaa..3bf212fd99193 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> bool vm_memory_attributes = true;
> module_param(vm_memory_attributes, bool, 0444);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> #endif
> DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
>
* [GIT PULL] rv fixes for v7.1 (resend with changed attribution)
From: Gabriele Monaco @ 2026-06-04 14:50 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel
Cc: linux-trace-kernel, Gabriele Monaco, Wen Yang
Steve,
The following changes since commit e43ffb69e0438cddd72aaa30898b4dc446f664f8:
Linux 7.1-rc6 (2026-05-31 15:14:24 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/gmonaco/linux.git rv-fixes2-7.1
for you to fetch changes up to df996599cc69a9b74ff437c67751cf8a61f62e39:
verification/rvgen: Fix ltl2k writing True as a literal (2026-06-04 16:44:25 +0200)
----------------------------------------------------------------
rv fixes for v7.1
Summary of changes:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bounds memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitor initialization does not rely on potentially corrupted
state such as the monitoring flag, which is not reset by all monitor
types and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
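
As a rough illustration of the release-acquire item above, here is a
userspace C11 sketch. The struct and function names are hypothetical, and
kernel code would use its own primitives (e.g. smp_store_release() and
smp_load_acquire()) rather than <stdatomic.h>; the actual patch may
differ.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical monitor: state must be visible before the flag is. */
struct monitor {
	int curr_state;
	atomic_bool monitoring;
};

/* Writer: set up the state, then publish it with a release store. */
static void monitor_start(struct monitor *m, int initial_state)
{
	m->curr_state = initial_state;
	atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store above, so a
 * handler that observes monitoring == true also observes curr_state.
 */
static bool monitor_handle_event(struct monitor *m, int *state_out)
{
	if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
		return false;
	*state_out = m->curr_state;
	return true;
}
```

Without the pairing, a handler on another CPU could see the flag set
while still reading a stale state.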
----------------------------------------------------------------
Gabriele Monaco (16):
rv: Fix __user specifier usage in extract_params()
rv: Reset per-task DA monitors before releasing the slot
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Do not rely on clean monitor when initialising HA
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Ensure synchronous cleanup for HA monitors
rv: Prevent task migration while handling per-CPU events
rv: Use 0 to check preemption enabled in opid
tools/rv: Ensure monitor name and desc are NUL-terminated
tools/rv: Fix substring match bug in monitor name search
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix cleanup after failed trace setup
verification/rvgen: Fix suffix strip in dot2k
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix ltl2k writing True as a literal
Wen Yang (1):
rv: Fix monitor start ordering and memory ordering for monitoring flag
include/rv/da_monitor.h | 139 +++++++++++++++++----
include/rv/ha_monitor.h | 91 +++++++++++++-
include/rv/ltl_monitor.h | 1 +
kernel/trace/rv/monitors/deadline/deadline.h | 3 +-
kernel/trace/rv/monitors/nomiss/nomiss.c | 4 +-
kernel/trace/rv/monitors/opid/opid.c | 12 +-
kernel/trace/rv/monitors/stall/stall.c | 4 +-
tools/verification/rv/src/in_kernel.c | 65 +++++-----
tools/verification/rvgen/__main__.py | 10 +-
tools/verification/rvgen/rvgen/dot2k.py | 4 +-
tools/verification/rvgen/rvgen/ltl2ba.py | 9 +-
.../rvgen/rvgen/templates/dot2k/main.c | 4 +-
12 files changed, 263 insertions(+), 83 deletions(-)
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Wen Yang <wen.yang@linux.dev>
* Re: [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Zhang Yi, Andreas Dilger, Li Chen
Cc: Theodore Ts'o, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>
On Fri, 15 May 2026 17:18:20 +0800, Li Chen wrote:
> (This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
> commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
> depends on patch "ext4: fix fast commit wait/wake bit mapping on
> 64-bit" [0]).
>
> Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
> masks the issue, and that sleeping in ext4_fc_track_inode() while holding
> i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
> i_data_sem while the inode is in FC_COMMITTING.
>
> [...]
Applied, thanks!
[1/7] ext4: fast commit: snapshot inode state before writing log
commit: e9c6e0b8e096255feb71ec996c77bdfbe9c36e91
[2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
commit: 7f473f971382d73a58e386afa7efdaac294b89f0
[3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
commit: b3060e96533dc3157fc6d3d45dc19927c566977b
[4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
commit: 2b9b216628fd9352f9c791701c8990d05736aa90
[5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
commit: 22d887e06a57261df58404c8dce50c4ef37549ed
[6/7] ext4: fast commit: add lock_updates tracepoint
commit: d2f6e83bbbef31169ea363af4277f5c09c914eda
[7/7] ext4: fast commit: export snapshot stats in fc_info
commit: 56bb0b64f4b198bad5ce674509c10793d471148f
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:45 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260522150009.121603-12-npache@redhat.com>
On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
>
> To enable mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search, which avoids
> deep recursion on the limited kernel stack. The algorithm repeatedly splits
> the bitmap into smaller chunks to find the highest order mTHPs that satisfy
> the collapse criteria. We start by attempting the PMD order, then move on to
> consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
> 1) push (0, HPAGE_PMD_ORDER) onto the stack
> 2) pop the stack
> 3) check if the number of set bits in that (offset,order) pair
> satisfies the max_ptes_none threshold for that order
> 4) if yes, attempt collapse
> 5) if no (or collapse fails), push two new stack items representing
> the left and right halves of the current bitmap range, at the
> next lower order
> 6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
> offset mid_offset
> | |
> | |
> v v
> ____________________________________
> | PTE Page Table |
> --------------------------------------
> <-------><------->
> order-1 order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
> - max_ptes_none=0: Never introduce new empty pages during collapse
> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> available mTHP order
>
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
>
> Once we determine which mTHP size fits best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> + u16 offset;
> + u8 order;
> +};
> +
> struct collapse_control {
> bool is_khugepaged;
>
> @@ -110,6 +134,12 @@ struct collapse_control {
>
> /* nodemask for allocation fallback */
> nodemask_t alloc_nmask;
> +
> + /* Each bit represents a single occupied (!none/zero) page. */
> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> + /* A mask of the current range being considered for mTHP collapse. */
> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> + u16 offset, u8 order)
> +{
> + const int size = *stack_size;
> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> + stack->order = order;
> + stack->offset = offset;
> + (*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> + int *stack_size)
> +{
> + const int size = *stack_size;
> +
> + VM_WARN_ON_ONCE(size <= 0);
> + (*stack_size)--;
> + return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> + u16 offset, unsigned int nr_ptes)
> +{
> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + * offset mid_offset
> + * | |
> + * | |
> + * v v
> + * --------------------------------------
> + * | cc->mthp_bitmap |
> + * --------------------------------------
> + * <-------><------->
> + * order-1 order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, int referenced, int unmapped,
> + struct collapse_control *cc, unsigned long enabled_orders)
> +{
> + unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> + int collapsed = 0, stack_size = 0;
> + unsigned long collapse_address;
> + struct mthp_range range;
> + u16 offset;
> + u8 order;
> +
> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> + while (stack_size) {
> + range = collapse_mthp_stack_pop(cc, &stack_size);
> + order = range.order;
> + offset = range.offset;
> + nr_ptes = 1UL << order;
> +
> + if (!test_bit(order, &enabled_orders))
> + goto next_order;
> +
> + max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> + nr_ptes);
> +
> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> + int ret;
> +
> + collapse_address = address + offset * PAGE_SIZE;
> + ret = collapse_huge_page(mm, collapse_address, referenced,
> + unmapped, cc, order);
> + if (ret == SCAN_SUCCEED) {
> + collapsed += nr_ptes;
> + continue;
> + }
> + }
> +
> +next_order:
> + if ((BIT(order) - 1) & enabled_orders) {
> + const u8 next_order = order - 1;
> + const u16 mid_offset = offset + (nr_ptes / 2);
> +
> + collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> + next_order);
> + collapse_mthp_stack_push(cc, &stack_size, offset,
> + next_order);
> + }
> + }
> + return collapsed;
> +}
> +
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> + unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> pmd_t *pmd;
> - pte_t *pte, *_pte;
> - int none_or_zero = 0, shared = 0, referenced = 0;
> + pte_t *pte, *_pte, pteval;
> + int i;
> + int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr;
> + unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> + /*
> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> + * scan all pages to populate the bitmap for mTHP collapse.
> + */
> + if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> + max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
Hmm, this is a bit odd, what if the user set max_ptes_none = 0?
I assume we handle the 0/511 thing elsewhere?
> +
> pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> if (!pte) {
> cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> - for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> - _pte++, addr += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + _pte = pte + i;
> + addr = start_addr + i * PAGE_SIZE;
> + pteval = ptep_get(_pte);
> +
> cc->progress++;
>
> - pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
> + /* Set bit for occupied pages */
> + __set_bit(i, cc->mthp_bitmap);
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> if (result == SCAN_SUCCEED) {
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
> - result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc, HPAGE_PMD_ORDER);
> - /* collapse_huge_page will return with the mmap_lock released */
> + nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> + unmapped, cc, enabled_orders);
I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
It feels like this name is a bit confusing then :)
But I guess we can do a follow up to think of a better name possibly.
> + /* mmap_lock was released above, set lock_dropped */
> *lock_dropped = true;
> + result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>
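For reference, a minimal userspace sketch of the stack-based walk described in
the commit message above. It is not the kernel code. It assumes 512 PTEs per
PMD, uses a plain bool array in place of cc->mthp_bitmap, assumes the per-order
threshold is max_ptes_none scaled down by (PMD order - order), and pretends
every eligible collapse succeeds. The names stack_walk, walk_stats and
count_present are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER	9	/* assumption: 4K pages, 512 PTEs per PMD */
#define PMD_NR		(1u << PMD_ORDER)
#define MIN_MTHP_ORDER	2
/* DFS on a binary tree needs at most height + 1 stack slots. */
#define STACK_SIZE	(PMD_ORDER - MIN_MTHP_ORDER + 1)

struct range {
	unsigned short offset;	/* first PTE index of the candidate range */
	unsigned char order;	/* order of the candidate collapse */
};

struct walk_stats {
	unsigned int visited;	/* (offset, order) pairs popped */
	unsigned int max_depth;	/* deepest the stack ever got */
	unsigned int collapsed;	/* PTEs covered by simulated collapses */
};

/* Stand-in for counting set bits of the bitmap in [offset, offset + nr). */
static unsigned int count_present(const bool *present, unsigned int offset,
				  unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = offset; i < offset + nr; i++)
		n += present[i];
	return n;
}

/* enabled_orders is assumed to only contain orders >= MIN_MTHP_ORDER. */
static struct walk_stats stack_walk(const bool *present,
				    unsigned long enabled_orders,
				    unsigned int max_ptes_none)
{
	struct range stack[STACK_SIZE];
	struct walk_stats st = { 0, 1, 0 };
	unsigned int depth = 0;

	stack[depth++] = (struct range){ 0, PMD_ORDER };
	while (depth) {
		struct range r = stack[--depth];
		unsigned int nr_ptes = 1u << r.order;

		st.visited++;
		if (enabled_orders & (1ul << r.order)) {
			/* Assumed scaling of the threshold per order. */
			unsigned int limit = max_ptes_none >> (PMD_ORDER - r.order);

			if (count_present(present, r.offset, nr_ptes) >= nr_ptes - limit) {
				/* Simulated collapse: always succeeds here. */
				st.collapsed += nr_ptes;
				continue;
			}
		}
		/* Split only if some lower order is still enabled. */
		if (((1ul << r.order) - 1) & enabled_orders) {
			unsigned short mid = (unsigned short)(r.offset + nr_ptes / 2);
			unsigned char next = (unsigned char)(r.order - 1);

			assert(depth + 2 <= STACK_SIZE);
			stack[depth++] = (struct range){ mid, next };	/* right half */
			stack[depth++] = (struct range){ r.offset, next };	/* left half, popped first */
			if (depth > st.max_depth)
				st.max_depth = depth;
		}
	}
	return st;
}
```

With an empty bitmap, orders 2..9 enabled and max_ptes_none=0, nothing is
eligible, so the walk visits the whole tree (255 ranges) and the stack peaks at
8 entries, which matches the height + 1 bound behind MTHP_STACK_SIZE.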
Thanks, Lorenzo
* Re: [PATCH v7 3/3] locking: Add contended_release tracepoint to sleepable locks
From: Usama Arif @ 2026-06-04 14:45 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Usama Arif, Peter Zijlstra, Dennis Zhou, Tejun Heo,
Christoph Lameter, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, linux-mm, linux-kernel, linux-trace-kernel,
kernel-team, Paul E. McKenney
In-Reply-To: <02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com>
On Thu, 4 Jun 2026 07:15:07 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> Add the contended_release trace event. This tracepoint fires on the
> holder side when a contended lock is released, complementing the
> existing contention_begin/contention_end tracepoints which fire on the
> waiter side.
>
> This enables correlating lock hold time under contention with waiter
> events by lock address.
>
> Add trace_contended_release()/trace_call__contended_release() calls to
> the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
> rwsem, percpu-rwsem, and RT-specific rwbase locks.
>
> Where possible, trace_contended_release() fires before the lock is
> released and before the waiter is woken. For some lock types, the
> tracepoint fires after the release but before the wake. Making the
> placement consistent across all lock types is not worth the added
> complexity.
>
> For reader/writer locks, the tracepoint fires for every reader releasing
> while a writer is waiting, not only for the last reader.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:40 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lance Yang, Nico Pache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <46bb9d9e-03f0-4e26-9ac9-1cdc5ba9bf4d@kernel.org>
On Wed, Jun 03, 2026 at 10:05:08AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 17:44, Lance Yang wrote:
> >
> >
> > On 2026/6/2 18:58, Nico Pache wrote:
> >> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
> >>>
> >>>
> >>> [...]
> >>>
> >>> Hmm ... don't we lose the allocation-failure result here?
> >>>
> >>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> >>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> >>> in khugepaged_do_scan().
> >>>
> >>> Now if allocation fails and nr_collapsed stays 0, we just return
> >>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> >>
> >> Ok I did the error propagation! I think I handled both of these cases
> >> you brought up pretty easily.
> >
> > Thanks.
> >
> >> However I don't know what to do in the following case: We successfully
> >> collapsed some portion of the PMD, but during that process, we also
> >> hit an allocation failure. Is it best to back off entirely? or can we
> >> treat some forward progress as a sign we can continue trying collapses
> >> without sleeping.
> >>
> >> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> >> successful collapses as the returned value?
> >
> > Thinking out loud, forward progress should win here, the allocation
> > failure only matter if we made no progress at all?
>
> Agreed, in the first approach, forward progress makes sense.
Sounds sensible to me.
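A minimal sketch of that priority, assuming the walk tracks a count of
collapsed PTEs, a flag for any allocation failure, and the last result seen.
The enum is a hypothetical stand-in for the kernel's enum scan_result and
pick_result is a made-up helper name, so treat this as an illustration rather
than the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; the kernel's enum scan_result has many more values. */
enum sketch_scan_result {
	SKETCH_SCAN_FAIL,
	SKETCH_SCAN_SUCCEED,
	SKETCH_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/*
 * Forward progress wins: any successful collapse reports success, an
 * allocation failure only matters when nothing was collapsed at all.
 */
static enum sketch_scan_result pick_result(unsigned int collapsed,
					   bool alloc_failed,
					   enum sketch_scan_result last_result)
{
	if (collapsed)
		return SKETCH_SCAN_SUCCEED;
	if (alloc_failed)
		return SKETCH_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return last_result;
}
```

Under this ordering khugepaged would only back off and sleep when a scan of
the PMD range made no progress and hit an allocation failure.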
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:19 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
usamaarif642
In-Reply-To: <aiGFIrg8_vZZxnPg@lucifer>
On Thu, Jun 04, 2026 at 03:14:59PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
> >
> >
> > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> > >>>
> > >>> Reading this, it is unclear why exactly do we need the stack.
> > >>
> > >> So I looked into your items below. It seems logical, and I think it
> > >> works the same way; however, your method seems slightly harder to
> > >> understand due to all the edge cases and more error-prone to future
> > >> changes (the stack holds implicit knowledge of the offset/order that
> > >> must now be tracked in the edge cases).
> > >>
> > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> > >> worth saving that small amount of memory. Although we would also be
> > >> getting rid of (3?) functions, so both approaches have pros and cons.
> > >
> > > I consider a simple forward loop over the offset ... less complexity compared to
> > > a stack structure :)
> > >
> > >>
> > >> I will implement a patch comparing your solution against mine and send
> > >> it here, then we can decide which approach is better.
> > >
> > > Right, throw it over the fence and I'll see how to improve it further.
> >
> > Ok here's what the diff looks like on top of my V19.
> >
> > you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
> >
> > So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.
> >
> >
> > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> > From: Nico Pache <npache@redhat.com>
> > Date: Tue, 2 Jun 2026 10:26:18 -0600
> > Subject: [PATCH] convert from stack to forward loop
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 96 ++++++++-----------------------------------------
> > 1 file changed, 15 insertions(+), 81 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 498eba009751..6de935e76ceb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > #define KHUGEPAGED_MIN_MTHP_ORDER 2
> > -/*
> > - * mthp_collapse() does an iterative DFS over a binary tree, from
> > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > - * size needed for a DFS on a binary tree is height + 1, where
> > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > - *
> > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> > - */
> > -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > -
> > -/*
> > - * Defines a range of PTE entries in a PTE page table which are being
> > - * considered for mTHP collapse.
> > - *
> > - * @offset: the offset of the first PTE entry in a PMD range.
> > - * @order: the order of the PTE entries being considered for collapse.
> > - */
> > -struct mthp_range {
> > - u16 offset;
> > - u8 order;
> > -};
> >
> > struct collapse_control {
> > bool is_khugepaged;
> > @@ -137,7 +115,6 @@ struct collapse_control {
> >
> > /* Each bit represents a single occupied (!none/zero) page. */
> > DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> > - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> > };
> >
> > /**
> > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> > return result;
> > }
> >
> > -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > - u16 offset, u8 order)
> > -{
> > - const int size = *stack_size;
> > - struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > -
> > - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > - stack->order = order;
> > - stack->offset = offset;
> > - (*stack_size)++;
> > -}
> > -
> > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > - int *stack_size)
> > -{
> > - const int size = *stack_size;
> > -
> > - VM_WARN_ON_ONCE(size <= 0);
> > - (*stack_size)--;
> > - return cc->mthp_bitmap_stack[size - 1];
> > -}
> > -
> > /*
> > * mthp_collapse() consumes the bitmap that is generated during
> > * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > *
> > * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> > - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> > - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> > - * of variables (offset, order), indicating the number of PTEs from the start
> > - * of the PMD, and the order of the potential collapse candidate respectively.
> > - * We start at the PMD order and check if it is eligible for collapse; if not,
> > - * we add two entries to the stack at a lower order to represent the left and
> > - * right halves of the PTE page table we are examining.
> > - *
> > - * offset mid_offset
> > - * | |
> > - * | |
> > - * v v
> > - * --------------------------------------
> > - * | cc->mthp_present_ptes |
> > - * --------------------------------------
> > - * <-------><------->
> > - * order-1 order-1
> > + * page. We start at the PMD order and check if it is eligible for collapse;
> > + * if not, we check the left and right halves of the PTE page table we are
> > + * examining at a lower order.
>
> Yeah this is not good enough sorry, before there was some kind of explanation of
> the algorithm, just because you can explain the _code_ more simply, that's not
> very useful.
>
> I had to sit down and spend quite a bit of time to figure out how the actual
> output looks so I think that should be explained.
>
> > *
> > * For each of these, we determine how many PTE entries are occupied in the
> > * range of PTE entries we propose to collapse, then we compare this to a
> > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > {
> > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > enum scan_result last_result = SCAN_FAIL;
> > - int collapsed = 0, stack_size = 0;
> > + int collapsed = 0;
> > bool alloc_failed = false;
> > unsigned long collapse_address;
> > - struct mthp_range range;
> > - u16 offset;
> > - u8 order;
> > + unsigned int offset = 0;
> > + unsigned int order = HPAGE_PMD_ORDER;
> >
> > - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >
> > - while (stack_size) {
> > - range = collapse_mthp_stack_pop(cc, &stack_size);
> > - order = range.order;
> > - offset = range.offset;
> > + while (offset < HPAGE_PMD_NR) {
> > nr_ptes = 1UL << order;
> >
> > if (!test_bit(order, &enabled_orders))
> > goto next_order;
> >
> > max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> > -
> > nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> > offset + nr_ptes);
> >
> > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > collapsed += nr_ptes;
> > fallthrough;
> > case SCAN_PTE_MAPPED_HUGEPAGE:
> > - continue;
> > + goto next_offset;
> > /* Cases where lower orders might still succeed */
> > case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > alloc_failed = true;
> > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> > }
> >
>
> This obviously needs some comments describing what you're doing here. I think
> David said so too.
>
> > next_order:
> > - if ((BIT(order) - 1) & enabled_orders) {
> > - const u8 next_order = order - 1;
> > - const u16 mid_offset = offset + (nr_ptes / 2);
> > -
> > - collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > - next_order);
> > - collapse_mthp_stack_push(cc, &stack_size, offset,
> > - next_order);
> > + if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> > + (BIT(order) - 1) & enabled_orders) {
>
> Wait, if I disable an order this changes the way we get mTHP doesn't it?
>
> Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:
>
> 9->8->7->6->5->5->6->5->5->7
>
> And we simply can't get order-3 no?
>
> This seems broken doesn't it? Maybe I'm missing something?
OK it's the way this is written, very confusing. I do not know why you are
writing this code in this 'compressed' way.
((1 << order) - 1) & enabled_orders is to see if there are _any other_ (lower)
enabled orders left to check.
>
>
> > + order = order - 1;
>
> order--?
>
> > + continue;
> > }
> > +next_offset:
> > + offset += nr_ptes;
> > + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
>
> Also wouldn't, in the case where an enabled order check above skips an order--,
> we could have offset=0 here and end up just looping around checking from (0,
> HPAGE_PMD_ORDER) again? That also seems broken?
Yeah sorry the offset += nr_ptes fixes that anyway.
And the fact it's a mask check above makes this OK.
So the logic seems probably fine but it needs to be clearer.
>
> Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
> behaviour no?
>
> https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
> Says as much?
>
> I guess we're assuming we're not going to get to 0 here, but that could do with
> a comment or a VM_WARN_ON_ONCE() at least.
>
> Also why aren't we making this a function?
>
> static inline unsigned int max_order_from_offset(unsigned int offset)
> {
> if (!offset)
> return HPAGE_PMD_ORDER;
>
> return __ffs(offset);
> }
>
> Though __ffs() works on unsigned long... probably... OK?
>
> > }
> > done:
> > if (collapsed)
> > --
> > 2.54.0
> >
> >
> >
> > >
> > > [...]
> > >
> > >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > >>>> memset(cc->node_load, 0, sizeof(cc->node_load));
> > >>>> nodes_clear(cc->alloc_nmask);
> > >>>> +
> > >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > >>>> +
> > >>>> + /*
> > >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > >>>> + * scan all pages to populate the bitmap for mTHP collapse.
> > >>>> + */
> > >>>
> > >>> You should note here, that we re-verify in mthp_collapse().
> > >>>
> > >>> But the question is, whether we should relocate the check completely into
> > >>> mthp_collapse(), instead of conditionally duplicating it.
> > >>>
> > >>> What speaks against always populating the bitmap and making the decision in
> > >>> mthp_collapse()?
> > >>>
> > >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> > >>> I am not sure if scanning some more page table entries is really that critical here.
> > >>
> > >> Someone asked me to preserve the legacy behavior (PMD only). Although
> > >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> > >> have to do 511 iterations for no reason if PMD collapse is the only
> > >> enabled order rather than bailing immediately.
> > >>
> > >> I'm ok with dropping it, but I think its the correct approach (despite
> > >> the extra complexity). @Usama Arif brought up this point here
> > >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> > >
> > > We talk about regressions, but I am not sure if we care about scanning speed
> > > within a page table that much?
> > >
> > > After all, we locked it and already read some entries.
> > >
> > > Having the same check at two places to optimize for PMD order might right now
> > > feel like a good optimization, but likely an irrelevant one in a near future?
> > >
> > > Anyhow, won't push back, as long as we document why we are special casing things
> > > here.
> > >
> >
>
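To make the visiting order of the forward loop concrete, here is a minimal
userspace sketch. It is not the patch itself. It assumes 512 PTEs per PMD, a
bool array in place of cc->mthp_present_ptes, a per-order threshold of
max_ptes_none scaled down by (PMD order - order), and a simulated collapse that
always succeeds. forward_walk, count_present and struct range are made-up
names, and max_order_from_offset() uses a portable loop instead of __ffs().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PMD_ORDER	9	/* assumption: 4K pages, 512 PTEs per PMD */
#define PMD_NR		(1u << PMD_ORDER)
#define MIN_MTHP_ORDER	2

struct range {
	unsigned int offset;
	unsigned int order;
};

/* Largest order a range starting at offset can have; offset 0 fits a PMD. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order = 0;

	if (!offset)
		return PMD_ORDER;
	while (!(offset & 1u)) {	/* count trailing zero bits */
		offset >>= 1;
		order++;
	}
	return order < PMD_ORDER ? order : PMD_ORDER;
}

static unsigned int count_present(const bool *present, unsigned int offset,
				  unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = offset; i < offset + nr; i++)
		n += present[i];
	return n;
}

/* Records each simulated collapse in out[] and returns the PTEs collapsed. */
static unsigned int forward_walk(const bool *present,
				 unsigned long enabled_orders,
				 unsigned int max_ptes_none,
				 struct range *out, size_t out_cap,
				 size_t *out_len)
{
	unsigned int offset = 0, order = PMD_ORDER, collapsed = 0;

	*out_len = 0;
	while (offset < PMD_NR) {
		unsigned int nr_ptes = 1u << order;

		if (enabled_orders & (1ul << order)) {
			unsigned int limit = max_ptes_none >> (PMD_ORDER - order);

			if (count_present(present, offset, nr_ptes) >= nr_ptes - limit) {
				if (*out_len < out_cap)
					out[(*out_len)++] = (struct range){ offset, order };
				collapsed += nr_ptes;
				goto next_offset;
			}
		}
		/* Mask check: is any lower order enabled at all? */
		if (order > MIN_MTHP_ORDER &&
		    (((1ul << order) - 1) & enabled_orders)) {
			order--;
			continue;
		}
next_offset:
		/* offset is nonzero from here on, so the order is well defined. */
		offset += nr_ptes;
		order = max_order_from_offset(offset);
	}
	return collapsed;
}
```

With the first 16 PTEs present and max_ptes_none=0, disabling order-4 still
reaches order-3: the mask check only asks whether any lower order is enabled,
so the loop steps through the disabled order and collapses two order-3 ranges
at offsets 0 and 8.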
> Thanks, Lorenzo
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:14 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, vbabka, vishal.moola, wangkefeng.wang,
will, willy, yang, ying.huang, ziy, zokeefe, Usama Arif,
usamaarif642
In-Reply-To: <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com>
On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote:
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok here's what the diff looks like on top of my V19.
>
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.
>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <npache@redhat.com>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 96 ++++++++-----------------------------------------
> 1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> #define KHUGEPAGED_MIN_MTHP_ORDER 2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> - u16 offset;
> - u8 order;
> -};
>
> struct collapse_control {
> bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
> /* Each bit represents a single occupied (!none/zero) page. */
> DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> - u16 offset, u8 order)
> -{
> - const int size = *stack_size;
> - struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> - stack->order = order;
> - stack->offset = offset;
> - (*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> - int *stack_size)
> -{
> - const int size = *stack_size;
> -
> - VM_WARN_ON_ONCE(size <= 0);
> - (*stack_size)--;
> - return cc->mthp_bitmap_stack[size - 1];
> -}
> -
> /*
> * mthp_collapse() consumes the bitmap that is generated during
> * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> *
> * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - * offset mid_offset
> - * | |
> - * | |
> - * v v
> - * --------------------------------------
> - * | cc->mthp_present_ptes |
> - * --------------------------------------
> - * <-------><------->
> - * order-1 order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.
Yeah this is not good enough, sorry. Before, there was some kind of explanation of
the algorithm; just because you can explain the _code_ more simply doesn't make
that very useful.
I had to sit down and spend quite a bit of time to figure out how the actual
output looks, so I think that should be explained.
> *
> * For each of these, we determine how many PTE entries are occupied in the
> * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> {
> unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> enum scan_result last_result = SCAN_FAIL;
> - int collapsed = 0, stack_size = 0;
> + int collapsed = 0;
> bool alloc_failed = false;
> unsigned long collapse_address;
> - struct mthp_range range;
> - u16 offset;
> - u8 order;
> + unsigned int offset = 0;
> + unsigned int order = HPAGE_PMD_ORDER;
>
> - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> - while (stack_size) {
> - range = collapse_mthp_stack_pop(cc, &stack_size);
> - order = range.order;
> - offset = range.offset;
> + while (offset < HPAGE_PMD_NR) {
> nr_ptes = 1UL << order;
>
> if (!test_bit(order, &enabled_orders))
> goto next_order;
>
> max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
> nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> collapsed += nr_ptes;
> fallthrough;
> case SCAN_PTE_MAPPED_HUGEPAGE:
> - continue;
> + goto next_offset;
> /* Cases where lower orders might still succeed */
> case SCAN_ALLOC_HUGE_PAGE_FAIL:
> alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> }
>
This obviously needs some comments describing what you're doing here. I think
David said so too.
> next_order:
> - if ((BIT(order) - 1) & enabled_orders) {
> - const u8 next_order = order - 1;
> - const u16 mid_offset = offset + (nr_ptes / 2);
> -
> - collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> - next_order);
> - collapse_mthp_stack_push(cc, &stack_size, offset,
> - next_order);
> + if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> + (BIT(order) - 1) & enabled_orders) {
Wait, if I disable an order this changes the way we get mTHP doesn't it?
Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get:
9->8->7->6->5->5->6->5->5->7
And we simply can't get order-3 no?
This seems broken doesn't it? Maybe I'm missing something?
> + order = order - 1;
order--?
> + continue;
> }
> +next_offset:
> + offset += nr_ptes;
> + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
Also, in the case where the enabled-order check above skips the order--,
couldn't we have offset=0 here and end up just looping around, checking from
(0, HPAGE_PMD_ORDER) again? That also seems broken?
Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined
behaviour no?
https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5
Says as much?
I guess we're assuming we're not going to get to 0 here, but that could do with
a comment or a VM_WARN_ON_ONCE() at least.
Also why aren't we making this a function?
static inline unsigned int max_order_from_offset(unsigned int offset)
{
	if (!offset)
		return HPAGE_PMD_ORDER;
	return __ffs(offset);
}
Though __ffs() works on unsigned long... probably... OK?
> }
> done:
> if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>> memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>> nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>>> +
> >>>> + /*
> >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >>>> + * scan all pages to populate the bitmap for mTHP collapse.
> >>>> + */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> >>> I am not sure if scanning some more page table entries is really that critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right now
> > feel like a good optimization, but likely an irrelevant one in a near future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing things
> > here.
> >
>
Thanks, Lorenzo
* [PATCH] rtla/tests: Fix pgrep filter in get_workload_pids.sh
From: Tomas Glozar @ 2026-06-04 14:05 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar
Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
Wander Lairson Costa, LKML, linux-trace-kernel
Multiple runtime tests in RTLA rely on the get_workload_pids() shell
helper function to get the PIDs of both kernel and user workloads.
On some systems (e.g. Fedora 43), pgrep matches kernel thread names
including square brackets: "[osnoise/0]"; on other systems (e.g.
RHEL 9.8), brackets are not included: "osnoise/0".
Accept both as valid workload PIDs rather than just the non-bracket form
to make the tests work on all systems.
Fixes: a98dad63cda3 ("rtla/tests: Add runtime test for -k and -u options")
Reported-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
Note: the file touched by this commit is included by .gitignore, that is
an error that will be fixed by [1].
[1] https://lore.kernel.org/linux-trace-kernel/20260601091835.3118094-1-tglozar@redhat.com/
tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh b/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
index 8aff98cd2c1f..d10a4e3b321d 100644
--- a/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
+++ b/tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
@@ -5,7 +5,7 @@ get_workload_pids() {
local rtla_pid=$(ps -o ppid= $shell_pid)
# kernel threads
- pgrep -P $(pgrep ^kthreadd$) -f '^(osnoise|timerlat)/[0-9]+$'
+ pgrep -P $(pgrep ^kthreadd$) -f '^\[?(osnoise|timerlat)/[0-9]+\]?$'
# user threads
pgrep -P $rtla_pid | grep -v "^$shell_pid$"
}
--
2.54.0
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 13:59 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <aiF25dvH4qd_C4aj@lucifer>
On Thu, Jun 04, 2026 at 02:53:39PM +0100, Lorenzo Stoakes wrote:
> (Checking the algorithm here)
>
> On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> > On 5/22/26 17:00, Nico Pache wrote:
> >
> > Finally time for the core piece :)
> >
> > > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > > main scanning logic using a bitmap to track occupied pages and a stack
> > > structure that allows us to find optimal collapse sizes.
> > >
> > > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > > scanning phase (mmap_read_lock) that determines a potential PMD
> > > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > > phase (mmap_write_lock).
> > >
> > > > To enable mTHP collapse we make the following changes:
> > >
> > > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > > orders are enabled, we remove the restriction of max_ptes_none during the
> > > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > > have scanned the full PMD range and updated the bitmap to track occupied
> > > pages, we use the bitmap to find the optimal mTHP size.
> > >
> > > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > > and determine the best eligible order for the collapse. A stack structure
> > > is used instead of traditional recursion to manage the search. This also
> > > prevents a traditional recursive approach when the kernel stack struct is
> > > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > > find the highest order mTHPs that satisfy the collapse criteria. We start
> > > > by attempting the PMD order, then move on to consecutively lower orders
> > > (mTHP collapse). The stack maintains a pair of variables (offset, order),
>
> This is inaccurate, it's only consecutively smaller until you hit smallest then
> it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.
>
> More like consecutively smaller, then always trying for the smallest possible
> fit?
>
> Would be good to describe why we do this, presumably to get a best _fit_?
>
> > > indicating the number of PTEs from the start of the PMD, and the order of
> > > the potential collapse candidate.
> > >
> > > The algorithm for consuming the bitmap works as such:
> > > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > > 2) pop the stack
> > > 3) check if the number of set bits in that (offset,order) pair
> > > statisfy the max_ptes_none threshold for that order
> > > 4) if yes, attempt collapse
> > > 5) if no (or collapse fails), push two new stack items representing
> > > the left and right halves of the current bitmap range, at the
> > > next lower order
>
> I notice the ordering is wrong here, you actually push the mid_offset first then
> the offset (i.e. 'right', then 'left'):
>
> collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> next_order);
> collapse_mthp_stack_push(cc, &stack_size, offset,
> next_order);
>
> So that way you are popping the 'left' first then the 'right'.
>
> So seems you'll get:
>
> stack={0, 9}
>
> Pop (0, order=9):
>
> |----------------------------------------|
> |########################################|
> |----------------------------------------|
>
> stack={256, 8}, {0, 8}
>
> Pop (0, order=8):
>
> |--------------------|-------------------|
> |####################| |
> |--------------------|-------------------|
>
>
> stack={256, 8}, {128, 7}, {0, 7}
>
> Pop (0, order=7):
>
> |----------|-----------------------------|
> |##########| |
> |----------|-----------------------------|
>
> stack={256, 8}, {128, 7}, {64, 6}, {0, 6}
>
> Pop (0, order=6):
>
> |----|-----------------------------------|
> |####| |
> |----|-----------------------------------|
>
> ...
>
> stack={256, 8}, ..., { 8, 3 }, {0, 2}
>
> Pop (0, order=2):
>
> |-|--------------------------------------|
> |#| |
> |-|--------------------------------------|
>
> Then finally :) we get the offsets :)
>
> stack={256, 8}, ..., {8, 3}, {4, 2}
>
> Pop (4, order=2):
>
> |-|-|------------------------------------|
> | |#| |
> |-|-|------------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {8, 3}
(Shouldn't be {12, 2} there :)
> Pop (8, order=3):
>
> |---|--|---------------------------------|
> | |##| |
> |---|--|---------------------------------|
>
> stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}
(Shouldn't duplicate {12, 2} there :)
>
> Pop (8, order=2):
>
> |---|-|----------------------------------|
> | |#| |
> |---|-|----------------------------------|
>
> etc.
>
>
> It seems to me that you're going to keep iterating down until you match an mTHP
> when a larger mTHP could have been had?
>
> So we're going:
>
> order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2
>
> I guess the point is to avoid only getting the largest possible
>
>
>
>
> I guess if we did try to get the largest then we'd only get 2 of the largest
> possible then exhaust the whole PMD, should a PMD-sized entry not be possible.
>
> > > 6) repeat at step (2) until stack is empty.
> > >
> > > Below is a diagram representing the algorithm and stack items:
> > >
> > > offset mid_offset
> > > | |
> > > | |
> > > v v
> > > ____________________________________
> > > | PTE Page Table |
> > > --------------------------------------
> > > <-------><------->
> > > order-1 order-1
> >
> >
> > Reading this, it is unclear why exactly do we need the stack.
> >
> > Why can't you work with offset + cur_order?
> >
> > Initially,
> >
> > offset = 0;
> > cur_order = HPAGE_PMD_ORDER;
> >
> > If collapse succeeded, advance to next range.
> > If collapse failed, try next smaller order, keeping offset unchanged.
> >
> > if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> > /* Try next smaller order. */
> > cur_order = cur_order - 1;
>
> OK this matches the stack for the 0 offset entries...
>
> > } else {
> > /* Skip to next chunk. */
> > offset += 1 << cur_order;
> > cur_order = max_order_from_offset(offset);
>
> Then 1 << 2 -> 4 so go to offset=4.
>
> max_order_from_offset(4) = 2, so (4, order=2), same as above.
>
> Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3
>
> And, yeah this seems equivalent.
>
> > }
>
> >
> > Of course, handling disabled orders. max_order_from_offset() is rather trivial
> > (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> Something like?
>
> static unsigned long max_order_from_offset(unsigned long offset)
> {
> if (!offset)
> return HPAGE_PMD_ORDER;
>
> return ilog2(offset);
Oops, we need the LSB so ffs.
> }
>
> >
> > What's the benefit of the stack?
>
> Yeah it seems equivalent. Good idea!
>
> Thanks, Lorenzo
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 13:53 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <b8380eb3-096a-49f1-9ace-99c1e75888b4@kernel.org>
(Checking the algorithm here)
On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)
>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > > To enable mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > > by attempting the PMD order, then move on to consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),
This is inaccurate, it's only consecutively smaller until you hit smallest then
it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.
More like consecutively smaller, then always trying for the smallest possible
fit?
Would be good to describe why we do this, presumably to get a best _fit_?
> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > 2) pop the stack
> > 3) check if the number of set bits in that (offset,order) pair
> > > satisfy the max_ptes_none threshold for that order
> > 4) if yes, attempt collapse
> > 5) if no (or collapse fails), push two new stack items representing
> > the left and right halves of the current bitmap range, at the
> > next lower order
I notice the ordering is wrong here, you actually push the mid_offset first then
the offset (i.e. 'right', then 'left'):
collapse_mthp_stack_push(cc, &stack_size, mid_offset,
next_order);
collapse_mthp_stack_push(cc, &stack_size, offset,
next_order);
So that way you are popping the 'left' first then the 'right'.
So seems you'll get:
stack={0, 9}
Pop (0, order=9):
|----------------------------------------|
|########################################|
|----------------------------------------|
stack={256, 8}, {0, 8}
Pop (0, order=8):
|--------------------|-------------------|
|####################| |
|--------------------|-------------------|
stack={256, 8}, {128, 7}, {0, 7}
Pop (0, order=7):
|----------|-----------------------------|
|##########| |
|----------|-----------------------------|
stack={256, 8}, {128, 7}, {64, 6}, {0, 6}
Pop (0, order=6):
|----|-----------------------------------|
|####| |
|----|-----------------------------------|
...
stack={256, 8}, ..., { 8, 3 }, {0, 2}
Pop (0, order=2):
|-|--------------------------------------|
|#| |
|-|--------------------------------------|
Then finally :) we get the offsets :)
stack={256, 8}, ..., {8, 3}, {4, 2}
Pop (4, order=2):
|-|-|------------------------------------|
| |#| |
|-|-|------------------------------------|
stack={256, 8}, ..., { 12, 2 }, {8, 3}
Pop (8, order=3):
|---|--|---------------------------------|
| |##| |
|---|--|---------------------------------|
stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}
Pop (8, order=2):
|---|-|----------------------------------|
| |#| |
|---|-|----------------------------------|
etc.
It seems to me that you're going to keep iterating down until you match an mTHP
when a larger mTHP could have been had?
So we're going:
order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2
I guess the point is to avoid only getting the largest possible
I guess if we did try to get the largest then we'd only get 2 of the largest
possible then exhaust the whole PMD, should a PMD-sized entry not be possible.
> > 6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> > offset mid_offset
> > | |
> > | |
> > v v
> > ____________________________________
> > | PTE Page Table |
> > --------------------------------------
> > <-------><------->
> > order-1 order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.
>
> Why can't you work with offset + cur_order?
>
> Initially,
>
> offset = 0;
> cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
> if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> /* Try next smaller order. */
> cur_order = cur_order - 1;
OK this matches the stack for the 0 offset entries...
> } else {
> /* Skip to next chunk. */
> offset += 1 << cur_order;
> cur_order = max_order_from_offset(offset);
Then 1 << 2 -> 4 so go to offset=4.
max_order_from_offset(4) = 2, so (4, order=2), same as above.
Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3
And, yeah this seems equivalent.
> }
>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).
Something like?
static unsigned long max_order_from_offset(unsigned long offset)
{
	if (!offset)
		return HPAGE_PMD_ORDER;
	return ilog2(offset);
}
>
> What's the benefit of the stack?
Yeah it seems equivalent. Good idea!
Thanks, Lorenzo
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Xie Yuanbin @ 2026-06-04 13:42 UTC (permalink / raw)
To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, xieyuanbin1, yi1.lai
In-Reply-To: <4de75e51-025c-4926-871a-4b9da479cefa@kernel.org>
On Thu, 4 Jun 2026 08:42:37 +0200, David Hildenbrand (Arm) wrote:
> Yeah, if only I had known that we would break user space by changing trace
> events ... now we know :)
>
> Do you have capacity to send a fix?
Sure, with pleasure.
* [GIT PULL] RTLA fixes for v7.1-rc7
From: Tomas Glozar @ 2026-06-04 13:25 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, linux-trace-kernel, Tomas Glozar
Steven,
The following changes since commit e43ffb69e0438cddd72aaa30898b4dc446f664f8:
Linux 7.1-rc6 (2026-05-31 15:14:24 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-fixes-v7.1-rc7
for you to fetch changes up to e9e41d3035032ed6053d8bad7b7077e1cb3a6540:
rtla: Fix parsing of multi-character short options (2026-06-04 10:53:25 +0200)
----------------------------------------------------------------
RTLA fixes for v7.1-rc7
- Fix multi-character short option parsing
Fix regression in parsing of multiple-character short options (e.g.
-p100 /= -p 100/, -un /= -u -n/) caused by getopt_long() internal state
corruption after a refactoring.
Build, runtime tests, unit tests pass. Extended runtime tests from next
also pass, except for timerlat hist --dump-tasks (expected).
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
----------------------------------------------------------------
Tomas Glozar (1):
rtla: Fix parsing of multi-character short options
tools/tracing/rtla/src/common.c | 28 +++++-----------------------
tools/tracing/rtla/src/common.h | 12 +++++++++++-
tools/tracing/rtla/src/osnoise_hist.c | 7 ++++---
tools/tracing/rtla/src/osnoise_top.c | 7 ++++---
tools/tracing/rtla/src/timerlat_hist.c | 7 ++++---
tools/tracing/rtla/src/timerlat_top.c | 7 ++++---
6 files changed, 32 insertions(+), 36 deletions(-)
* Re: [GIT PULL] rv fixes for v7.1
From: Gabriele Monaco @ 2026-06-04 13:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Tomas Glozar, linux-kernel, linux-trace-kernel, unknownbbqrx,
Wen Yang
In-Reply-To: <20260604085405.234d22eb@fedora>
On Thu, 2026-06-04 at 08:54 -0400, Steven Rostedt wrote:
> On Thu, 04 Jun 2026 14:42:02 +0200
> Gabriele Monaco <gmonaco@redhat.com> wrote:
>
> > All this to say that, in my opinion unknownbbqrx
> > <dev@unknownbbqr.xyz>
> > is NOT an anonymous contribution, just a nickname that differs from
> > the legal name of this person (which we wouldn't validate anyway),
> > so I would say it complies with the rules.
>
> It's a username on github and not a nickname. I did a search for
> "unknownbbqr" and it doesn't come up anywhere but Google tries to
> find similar matches which brings me to an OnlyFans account :-p
>
> It *DOES NOT* qualify because there's no accountability for this. For
> people who have a nickname as their entire internet persona, sure,
> I'll take patches from them as there's an entity that exists behind
> it.
> But I'm not going to take some username on github as a persona. To
> me, that's still anonymous.
Alright, fair. In the link I sent, the signoff got changed to Ali Ahmet
MEMIS <dev@unknownbbqr.xyz>, but I believe we cannot use that unless
the user themselves adds it (and they seem unreachable).
I posted the re-authored patch in [1], I'm not sure that's the proper
way though (the patch is so simple that it is unmodified). But if you give
me a green light I can send you a pull request with that patch instead.
Thanks,
Gabriele
[1] -
https://lore.kernel.org/lkml/20260604120946.90302-2-gmonaco@redhat.com/
>
> -- Steve