* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-05 9:35 UTC (permalink / raw)
To: Miaohe Lin
Cc: David Hildenbrand (Arm), linux-mm, linux-kernel, linux-doc,
linux-kselftest, linux-trace-kernel, kernel-team, Lance Yang,
Andrew Morton, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <4b27467e-935f-5587-2f48-5a794c30a592@huawei.com>
On Wed, Jun 03, 2026 at 10:33:04AM +0800, Miaohe Lin wrote:
> On 2026/6/2 17:41, David Hildenbrand (Arm) wrote:
> > On 6/2/26 05:08, Miaohe Lin wrote:
> >> On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
> >>> On 6/1/26 14:28, Miaohe Lin wrote:
> >>>>
> >>>> Thanks for your patch.
> >>>>
> >>>>
> >>>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
> >>>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
> >>>>
> >>>>
> >>>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
> >>>> PageTable and PageLargeKmalloc without extra page refcnt?
> >>>
> >>> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
> >>> PageLargeKmalloc).
> >>
> >> Got it. Thanks.
> >>
> >>> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
> >>> allow checking it on compound pages.
> >>
> >> It seems PageReserved is not intended to be set on compound pages. I see there are PF_NO_COMPOUND
> >> in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
> >>
> >>>
> >>> For PageLargeKmalloc, we would want to check the head page, though. The page
> >>> type is only stored for the head page.
> >>
> >> Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
> >> set PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
> >> on folio.
> >>
> >>>
> >>> So maybe we want to lookup the compound head (if any) and perform the type
> >>> checks against that?
> >>
> >> Maybe we should or we might miss some pages that could have been handled. And
> >> if compound head is required, should we hold an extra page refcnt to guard against
> >> possible folio split race?
> >
> > Races are fine. We might miss some pages, but that can happen on races either way.
> >
> >
> > I'd just do something like
> >
> > if (PageReserved(page))
> > return true;
> >
> > head = compound_head(page);
>
> What if @head is split just after compound_head(), and then @head is freed into buddy and re-allocated as a slab
> page while @page is still in the buddy? We would panic in this scenario because @head is PageSlab, but we were
> supposed to successfully handle @page. Or am I missing something?
You're right that it is racy, but I think it is an acceptable race here.
For it to happen, the poisoned @page has to be a tail of a live compound page
at the time of the fault, and then -- in the few instructions between
compound_head() and the PageSlab(head) test -- that compound page has to be
split, the old head freed to buddy, and that head re-allocated as a slab page,
all while @page lands back in the buddy. It cannot happen without concurrent
split/free/alloc activity in that exact window.
It is also worth noting the page in question genuinely took an unrecoverable ECC
error, and panic_on_unrecoverable_memory_failure is opt-in -- an operator who
enables it has explicitly chosen to crash rather than risk running on corrupted
memory. Mis-attributing one such rare, genuinely-poisoned page as
unrecoverable is within that contract.
Thanks for the review and discussions,
--breno
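The check order discussed above (PageReserved on the page itself, then the type checks against the compound head) can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct model_page, model_compound_head() and model_page_unhandlable() are hypothetical stand-ins, not the kernel's struct page or the helper from the patch, and the model is single-threaded, so it shows the lookup order but not the split/free/alloc race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page types; the real kernel stores these differently. */
enum model_page_type {
	MODEL_NONE,
	MODEL_SLAB,
	MODEL_TABLE,
	MODEL_LARGE_KMALLOC,
};

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool reserved;             /* per-page flag, checked on the page itself */
	enum model_page_type type; /* only meaningful on the head page */
	struct model_page *head;   /* NULL unless this page is a tail page */
};

/* Return the head of a compound page, or the page itself. */
struct model_page *model_compound_head(struct model_page *page)
{
	return page->head ? page->head : page;
}

/* Reserved is tested on @page; the typed checks are tested on its head. */
bool model_page_unhandlable(struct model_page *page)
{
	struct model_page *head;

	assert(page != NULL);

	if (page->reserved)
		return true;

	head = model_compound_head(page);
	return head->type == MODEL_SLAB ||
	       head->type == MODEL_TABLE ||
	       head->type == MODEL_LARGE_KMALLOC;
}
```

In this model a tail page whose head is typed as slab is reported as unhandlable even though the tail carries no type of its own, which is the reason for looking up the head first.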
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05 9:06 UTC (permalink / raw)
To: Michael Roth
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <4muegrza5iyyhqx6wevdlssnb6wvlc4m4wmuz5hmd3xikkftc4@3e2lpuq6tjgr>
On 04/06/2026 21:11, Michael Roth wrote:
> On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
>> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
>>> From: Michael Roth <michael.roth@amd.com>
>>>
>>> For vm_memory_attributes=1, in-place conversion/population is not
>>> supported, so the initial contents necessarily must need to come
>>> from a separate src address, which is enforced by the current
>>> implementation. However, for vm_memory_attributes=0, it is possible for
>>> guest memory to be initialized directly from userspace by mmap()'ing the
>>> guest_memfd and writing to it while the corresponding GPA ranges are in
>>> a 'shared' state before converting them to the 'private' state expected
>>> by KVM_SEV_SNP_LAUNCH_UPDATE.
>>>
>>> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>>> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>>> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>>> copy in data from a separate memory location. Continue to enforce
>>> non-NULL for the original vm_memory_attributes=1 case.
>>>
>>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>>> [Added src_page check in error handling path when the firmware command fails]
>>> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>>
>>
>>
>>> ---
>>> Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
>>> arch/x86/kvm/svm/sev.c | 18 +++++++++++++-----
>>> virt/kvm/kvm_main.c | 1 +
>>> 3 files changed, 25 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> index b2395dd4769de..43085f65b2d85 100644
>>> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> @@ -503,7 +503,8 @@ secrets.
>>> It is required that the GPA ranges initialized by this command have had the
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
>>> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
>>> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
>>> +this aspect.
>>> Upon success, this command is not guaranteed to have processed the entire
>>> range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>> remaining range that has yet to be processed. The caller should continue
>>> calling this command until those fields indicate the entire range has been
>>> processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
>>> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
>>> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
>>> -``uaddr`` will be ignored completely.
>>> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
>>> +userspace-provided source buffer address plus 1.
>>> +
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
>> process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
>
> sev_gmem_prepare() does sort of destroy contents since it finalizes the
> shared->private conversion which puts the page in an unusable state
> until the guest 'accepts' it as private memory and re-initializes the
> contents.
>
> But that's run-time, when the guest is doing conversions. The
> documentation here is relating to initialization time when we are
> setting up the initial pre-encrypted/pre-measured guest memory image,
> via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
> is then sev_gmem_post_populate() callback that actually finalizes the
> shared->private conversion. The sev_gmem_prepare() hook doesn't get used
> in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which skips
> preparation).
Thanks, that's the bit I was missing: the populate path skips preparation
by going through __kvm_gmem_get_pfn(). I was under the assumption that
kvm_arch_gmem_prepare() was called for all PFNs allocated from gmem,
and was wondering how SNP handled this populate case.
Thanks
Suzuki
>
> -Mike
>
>>
>> Suzuki
>>
>>
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-05 8:59 UTC (permalink / raw)
To: ljs, david, npache
Cc: lance.yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <aiJ90SWqXvwN9dNT@lucifer>
On Fri, Jun 05, 2026 at 09:07:23AM +0100, Lorenzo Stoakes wrote:
>On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/4/26 19:04, Nico Pache wrote:
>> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>> >>
>> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>> >>>
>> >>>
>> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>> >>> draft.
>> >>
>> >> Okay, I read the above and did some investigating.
>> >>
>> >> I will try to implement and verify the changes you suggested :)
>> >
>> > I've implemented something slightly different actually and I *think* it's better!
>> >
>> > } else {
>> > /* this is map_anon_folio_pte_nopf with no mmu update */
>> > __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
>> > /*uffd_wp=*/ false);
>> > smp_wmb();
>> > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> > /*
>> > * Some architectures (e.g. MIPS) walk the live page table in
>> > * their implementation. update_mmu_cache_range() must be called
>> > * with a valid page table hierarchy and the PTE lock held.
>> > * Acquire it nested inside pmd_ptl when they are distinct locks.
>> > */
>> > if (pte_ptl != pmd_ptl)
>> > spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
>> > update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
>> > if (pte_ptl != pmd_ptl)
>> > spin_unlock(pte_ptl);
>> > }
>> > spin_unlock(pmd_ptl);
>> >
>> > The logic here is that when the PMD becomes visible, PTEs are already
>> > populated (no possibility of spurious faults on local CPU)
>> >
>> > the SMP_WMB makes sure of the above
>
>The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
>think this is necessary.
>
>> >
>> > And the pmd is installed with the pte and pmd lock both held through
>> > the mmu_cache update.
>> >
>> > This follows the conventions used in pmd_install() and clears the
>> > potential for local CPU faults hitting cleared PTE entries.
>>
>> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
>> PMD already? So the case here is rather different.
The issue I was worried about: update_mmu_cache_range() can re-walk
vma->vm_mm while the PTE page table is still not reachable through the
PMD. And, yeah, that assumption is ugly, but it is what it is, and there
may be similar code elsewhere ...
So the ordering we need is "the PMD points to the PTE page table from
_pmd before update_mmu_cache_range()", not "new PTEs before PMD".
Those PTEs are cleared, but we hold the PTL, so nobody else can install
anything there :)
So David's original suggestion looks sufficient to me:
if (pte_ptl != pmd_ptl)
spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
pmd_populate();
map_anon_folio_pte_nopf();
if (pte_ptl != pmd_ptl)
spin_unlock(pte_ptl);
>Yeah conceptually the code above is problematic because you immediately make the
>PTE available right at the point you populate, so taking a PTE lock after that
>is rather shutting the stable door after the horse has bolted.
>
>Doing it this way is not a good idea in any case because we're adding
>complexity, an extra function and an open-coded cache maintenance call for
>really no benefit.
>
>I asked Nico to abstract the anon folio mapping stuff explicitly so we could
>avoid this sort of duplication so let's not roll that back :)
>
>So again, I think going with the original suggestion (with an updated comment)
>is the right thing to do.
>
>
>Anyway, an aside: in practice we can't have page faults here, right? The VMA is:
>
>- Ensured to span at least the PMD range (this isn't immediately obvious in the
> code)
>- VMA write locked (mmap write lock held)
>
>And we hold the anon_vma lock so no rmap walkers can walk the page tables here
>either.
>
>So I actually wonder, given that, whether we need the PTE PTL at all.
I'd keep it. Cheap, and lets us sleep better at night :P
>But.
>
>At this stage it'll almost certainly be an owned exclusive cache line so it's
>very low cost to do it, and it means we honour the update_mmu_cache_range()
>contract.
>
>And it also makes it clear that we're gating changes on the PTE being
>untouchable so any future stuff that maybe changes some of these rules doesn't
>get caught out.
>
>So probably worth keeping.
Yes!
Cheers, Lance
>>
>> --
>> Cheers,
>>
>> David
>
>Thanks, Lorenzo
>
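The locking shape agreed on in this message (take the PTE lock nested inside the PMD lock only when the two are distinct, make the page table reachable through the PMD, then map the PTEs) can be modelled in userspace. This is a minimal sketch under stated assumptions: struct model_lock, model_lock_acquire(), model_lock_release() and model_collapse_tail() are hypothetical names, the "locks" are plain flags rather than spinlocks, and the model takes the PMD lock itself whereas the kernel path already holds it at this point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock: a flag that asserts on double acquire/release. */
struct model_lock {
	bool held;
};

void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held); /* a real spinlock would deadlock here */
	l->held = true;
}

void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

enum model_step {
	STEP_PMD_POPULATE = 1, /* page table becomes reachable via the PMD */
	STEP_MAP_PTES = 2,     /* PTEs mapped, MMU cache updated */
};

/*
 * Record the order of the two steps while holding the PMD lock and,
 * only if it is a different lock, the PTE lock nested inside it.
 */
void model_collapse_tail(struct model_lock *pmd_ptl,
			 struct model_lock *pte_ptl,
			 enum model_step order[2])
{
	model_lock_acquire(pmd_ptl);
	if (pte_ptl != pmd_ptl)
		model_lock_acquire(pte_ptl);

	order[0] = STEP_PMD_POPULATE;
	order[1] = STEP_MAP_PTES;

	if (pte_ptl != pmd_ptl)
		model_lock_release(pte_ptl);
	model_lock_release(pmd_ptl);
}
```

The pointer comparison matters when both levels share one lock: without it, the second acquire would hit the double-acquire assertion in this model, which corresponds to a self-deadlock on a real spinlock.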
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05 8:54 UTC (permalink / raw)
To: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgF43RBv77RgM67kXRRHDnQw4L5uwQTuvkJHzkHJWB1mag@mail.gmail.com>
On 04/06/2026 20:05, Ackerley Tng wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
>>
>> [...snip...]
>>
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
>> the process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
>>
>> Suzuki
>>
>
> The following is the guest_memfd perspective, I didn't look at the SNP
> spec:
>
> Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
> type?
>
> guest_memfd has no plans to do any special zeroing based on type.
>
> guest_memfd decoupled zeroing from preparation a while ago (Michael had
> some patches), so zeroing is supposed to be once during folio ownership
> by guest_memfd, tracked by the uptodate flag, and preparation is tracked
> outside of guest_memfd. So far only SNP does preparation.
I am talking about the SEV SNP conversions (specifically quoted in my
response), I will follow up on Michael's response.
Suzuki
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-05 8:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603153115.775a2e81@fedora>
On 6/3/26 21:31, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 21:13:30 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
>> Thanks, that makes sense!
>>
>> So, would it be fair to say that, in general, what's exposed through
>>
>> /sys/kernel/tracing/events/
>>
>> is stable ABI?
>
> It's only stable if something depends on it. It changes all the time.
> It's only when someone complains about it that it becomes "stable"!
Heh, so we only know that it's stable when we break it ...
Let me figure out how to document that.
--
Cheers,
David
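Since the subsystem directory under /sys/kernel/tracing/events/ is what moved in this regression, a userspace consumer could, as one possible mitigation, probe both locations for memory_failure_event. The sketch below is an assumption about how such a probe might look and is not what rasdaemon does; event_dir() and find_memory_failure_event() are hypothetical helper names, and the only paths used are the ones named in this thread ("ras" and "memory_failure").

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build "<root>/<subsystem>/memory_failure_event" into buf.
 * Returns 0 if the path fit, -1 if it was truncated or formatting failed.
 */
int event_dir(char *buf, size_t len, const char *root, const char *subsystem)
{
	int n = snprintf(buf, len, "%s/%s/memory_failure_event",
			 root, subsystem);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Return the subsystem under which memory_failure_event exists below
 * @root (e.g. "/sys/kernel/tracing/events"), or NULL if neither is found.
 */
const char *find_memory_failure_event(const char *root)
{
	static const char *const candidates[] = { "ras", "memory_failure" };
	char path[256];
	size_t i;

	assert(root != NULL);

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		if (event_dir(path, sizeof(path), root, candidates[i]) == 0 &&
		    access(path, F_OK) == 0)
			return candidates[i];
	}
	return NULL;
}
```

A caller would pass the tracefs events directory as @root and enable the event under whichever subsystem is returned; a NULL result means the event is not exposed under either name.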
* Re: [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: David Hildenbrand (Arm) @ 2026-06-05 8:51 UTC (permalink / raw)
To: Xie Yuanbin, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe,
nao.horiguchi, mhiramat, mchehab+huawei, tony.luck, yi1.lai
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel, torvalds,
lilinjie8, liaohua4
In-Reply-To: <20260605081213.154660-1-xieyuanbin1@huawei.com>
On 6/5/26 10:12, Xie Yuanbin wrote:
> Historically, commit 97f0b1345219 ("tracing: add trace event
> for memory-failure") introduced memory_failure_event in the ras subsystem.
> Commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> moved memory_failure_event to the memory_failure subsystem. This breaks
> backward compatibility, as some user programs rely on the old location.
>
> Move memory_failure_event back to the ras subsystem to restore backward
> compatibility.
>
> Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
>
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
> ---
> include/trace/events/memory-failure.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..7a8ee5d1a44e 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,10 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/*
> + * For historical versions, memory_failure_event is in ras subsystem,
> + * some user programs depend on it.
> + */
> +#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
We should
Cc: <stable@vger.kernel.org>
given that it's in v6.19 and nobody noticed :(
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Thanks, and fortunately I have now learned about the possible ABI stability of trace events.
--
Cheers,
David
* [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: Xie Yuanbin @ 2026-06-05 8:12 UTC (permalink / raw)
To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe, nao.horiguchi,
mhiramat, mchehab+huawei, tony.luck, yi1.lai
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel, torvalds,
lilinjie8, liaohua4, Xie Yuanbin
Historically, commit 97f0b1345219 ("tracing: add trace event
for memory-failure") introduced memory_failure_event in the ras subsystem.
Commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
moved memory_failure_event to the memory_failure subsystem. This breaks
backward compatibility, as some user programs rely on the old location.
Move memory_failure_event back to the ras subsystem to restore backward
compatibility.
Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
Reported-by: Yi Lai <yi1.lai@intel.com>
Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
Cc: David Hildenbrand <david@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
include/trace/events/memory-failure.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
index aa57cc8f896b..7a8ee5d1a44e 100644
--- a/include/trace/events/memory-failure.h
+++ b/include/trace/events/memory-failure.h
@@ -1,6 +1,10 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM memory_failure
+/*
+ * For historical versions, memory_failure_event is in ras subsystem,
+ * some user programs depend on it.
+ */
+#define TRACE_SYSTEM ras
#define TRACE_INCLUDE_FILE memory-failure
#if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
--
2.53.0
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-05 8:07 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Nico Pache, Lance Yang, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <0ef96c28-9e6c-4d04-90ae-ac43c81d465d@kernel.org>
On Fri, Jun 05, 2026 at 09:18:27AM +0200, David Hildenbrand (Arm) wrote:
> On 6/4/26 19:04, Nico Pache wrote:
> > On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >>
> >> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>>
> >>>
> >>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> >>> draft.
> >>
> >> Okay, I read the above and did some investigating.
> >>
> >> I will try to implement and verify the changes you suggested :)
> >
> > I've implemented something slightly different actually and I *think* it's better!
> >
> > } else {
> > /* this is map_anon_folio_pte_nopf with no mmu update */
> > __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> > /*uffd_wp=*/ false);
> > smp_wmb();
> > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > /*
> > * Some architectures (e.g. MIPS) walk the live page table in
> > * their implementation. update_mmu_cache_range() must be called
> > * with a valid page table hierarchy and the PTE lock held.
> > * Acquire it nested inside pmd_ptl when they are distinct locks.
> > */
> > if (pte_ptl != pmd_ptl)
> > spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> > update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> > if (pte_ptl != pmd_ptl)
> > spin_unlock(pte_ptl);
> > }
> > spin_unlock(pmd_ptl);
> >
> > The logic here is that when the PMD becomes visible, PTEs are already
> > populated (no possibility of spurious faults on local CPU)
> >
> > the SMP_WMB makes sure of the above
The locks prevent those 'spurious' (really: incorrect) faults anyway so I don't
think this is necessary.
> >
> > And the pmd is installed with the pte and pmd lock both held through
> > the mmu_cache update.
> >
> > This follows the conventions used in pmd_install() and clears the
> > potential for local CPU faults hitting cleared PTE entries.
>
> After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
> PMD already? So the case here is rather different.
Yeah conceptually the code above is problematic because you immediately make the
PTE available right at the point you populate, so taking a PTE lock after that
is rather shutting the stable door after the horse has bolted.
Doing it this way is not a good idea in any case because we're adding
complexity, an extra function and an open-coded cache maintenance call for
really no benefit.
I asked Nico to abstract the anon folio mapping stuff explicitly so we could
avoid this sort of duplication so let's not roll that back :)
So again, I think going with the original suggestion (with an updated comment)
is the right thing to do.
Anyway, an aside: in practice we can't have page faults here, right? The VMA is:
- Ensured to span at least the PMD range (this isn't immediately obvious in the
code)
- VMA write locked (mmap write lock held)
And we hold the anon_vma lock so no rmap walkers can walk the page tables here
either.
So I actually wonder, given that, whether we need the PTE PTL at all.
But.
At this stage it'll almost certainly be an owned exclusive cache line so it's
very low cost to do it, and it means we honour the update_mmu_cache_range()
contract.
And it also makes it clear that we're gating changes on the PTE being
untouchable so any future stuff that maybe changes some of these rules doesn't
get caught out.
So probably worth keeping.
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-05 7:18 UTC (permalink / raw)
To: Nico Pache
Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <CAA1CXcBSPVG4CJFCBDvbuodcJ_7eXoDQTpK0ZN0HEhkDPi-DEw@mail.gmail.com>
On 6/4/26 19:04, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>>
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
>
> I've implemented something slightly different actually and I *think* it's better!
>
> } else {
> /* this is map_anon_folio_pte_nopf with no mmu update */
> __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> /*uffd_wp=*/ false);
> smp_wmb();
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> /*
> * Some architectures (e.g. MIPS) walk the live page table in
> * their implementation. update_mmu_cache_range() must be called
> * with a valid page table hierarchy and the PTE lock held.
> * Acquire it nested inside pmd_ptl when they are distinct locks.
> */
> if (pte_ptl != pmd_ptl)
> spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> if (pte_ptl != pmd_ptl)
> spin_unlock(pte_ptl);
> }
> spin_unlock(pmd_ptl);
>
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
>
> the SMP_WMB makes sure of the above
>
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
>
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.
After the pmdp_collapse_flush() we'd be getting CPU faults due to the cleared
PMD already? So the case here is rather different.
--
Cheers,
David
* [PATCH] tracing: reject invalid preemptirq_delay_test CPU affinity
From: Samuel Moelius @ 2026-06-05 0:40 UTC (permalink / raw)
To: Steven Rostedt
Cc: Samuel Moelius, Masami Hiramatsu, Mathieu Desnoyers,
open list:TRACING, open list:TRACING
preemptirq_delay_test accepts cpu_affinity as a module parameter and,
when it is non-negative, writes that CPU directly into a temporary
cpumask from the worker thread. Values outside nr_cpu_ids can set a
bit outside the allocated cpumask before the test reports a normal
affinity error.
Validate the requested CPU before starting the worker thread, and
return -EINVAL for invalid affinity requests.
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
kernel/trace/preemptirq_delay_test.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
index acb0c971a408..0f017799754a 100644
--- a/kernel/trace/preemptirq_delay_test.c
+++ b/kernel/trace/preemptirq_delay_test.c
@@ -14,6 +14,7 @@
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/printk.h>
+#include <linux/cpumask.h>
#include <linux/string.h>
#include <linux/sysfs.h>
#include <linux/completion.h>
@@ -152,6 +153,15 @@ static int preemptirq_run_test(void)
struct task_struct *task;
char task_name[50];
+ if (cpu_affinity > -1) {
+ unsigned int cpu = cpu_affinity;
+
+ if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) {
+ pr_err("cpu_affinity:%d, invalid CPU\n", cpu_affinity);
+ return -EINVAL;
+ }
+ }
+
init_completion(&done);
snprintf(task_name, sizeof(task_name), "%s_test", test_mode);
--
2.43.0
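The idea behind the patch above (validate the requested CPU before any bit is written into the mask) can be illustrated with a userspace model. This is a sketch under stated assumptions: MODEL_NR_CPU_IDS, struct model_cpumask, model_check_affinity() and model_set_affinity() are hypothetical names, the mask is a single unsigned long rather than a kernel cpumask, and the cpu_possible() half of the kernel check is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical number of CPU ids; the kernel uses nr_cpu_ids. */
#define MODEL_NR_CPU_IDS 8u

/* Hypothetical mask: only the low MODEL_NR_CPU_IDS bits are valid. */
struct model_cpumask {
	unsigned long bits;
};

/*
 * Same shape as the check in the patch: a negative value means
 * "no affinity requested"; anything else must be a valid CPU index.
 * Returns 0 if acceptable, -EINVAL otherwise.
 */
int model_check_affinity(int cpu_affinity)
{
	if (cpu_affinity > -1) {
		unsigned int cpu = (unsigned int)cpu_affinity;

		if (cpu >= MODEL_NR_CPU_IDS)
			return -EINVAL;
	}
	return 0;
}

/* Validate first, and only then touch the mask. */
int model_set_affinity(struct model_cpumask *mask, int cpu_affinity)
{
	int ret;

	assert(mask != NULL);

	ret = model_check_affinity(cpu_affinity);
	if (ret)
		return ret;

	if (cpu_affinity > -1)
		mask->bits |= 1UL << cpu_affinity;
	return 0;
}
```

With the check placed before the write, an out-of-range request leaves the mask untouched and fails with -EINVAL, instead of setting a bit beyond the valid range and failing later.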
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-04 21:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aiHeDZEPkAcWcSkn@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Wed, Jun 03, 2026, Ackerley Tng wrote:
>> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
>> writes:
>>
>> > This is v7 of guest_memfd in-place conversion support.
>> >
>>
>> Here's the outstanding items after going over everyone's comments
>> including Sashiko's:
>>
>> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>> + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>> leak where populate can put initialized kernel memory into TDX
>> guest
>> + See suggested fix at [1]
>
> That fix works for me. The initial guest image will typically be a tiny subset
> of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
>
In regular usage, moving the zeroing as in [1] doesn't change anything,
since the same zeroing would already have happened when the host faulted
in the pages to write the initial image. By the time of populate, the
pages have already been zeroed, so no further zeroing happens.
[1] covers the case where the host doesn't write anything to the pages
and tries to populate them into the guest directly.
>> + KVM: guest_memfd: Only prepare folios for private pages,
>> + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>> supported for non-CoCo VMs in a later patch in this series"
>> + Use Suggested-by: Michael Roth <michael.roth@amd.com>
>> + KVM: selftests: Test that shared/private status is consistent across
>> processes
>> + Improve test reliability using pthread_mutex
>> + I have a fixup patch offline.
>>
>> I would like feedback on these:
>>
>> + KVM: selftests: Test conversion with elevated page refcount
>> + Askar pointed out that soon vmsplice may not pin pages. Should I
>> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> take a dependency on CONFIG_GUP_TEST.
>
> I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> it probably is the least awful choice. E.g. KVM also pins pages in certain flows,
> but we're _also_ actively working to remove the need to pin.
>
> Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
> memory" syscall.
>
Hmm, that takes a dependency on io_uring, which isn't always compiled
in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather depend on
CONFIG_GUP_TEST.
>> + KVM: selftests: Add script to exercise private_mem_conversions_test
>> + Would like to know what people think of a wrapper script before
>> I address Sashiko's comments.
>
> NAK to a wrapper script. This sounds like a perfect fit for Vipin's selftest
> runner (which I'm like 4 months overdue for reviewing, testing, and merging).
> If the runner _can't_ do what you want, then I'd rather improve the runner.
>
> [*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>
Good to know we have this!
Thanks, I'll work on a v8 to clean up the above.
>>
>> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
>> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>>
>> >
>> > [...snip...]
>> >
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-04 20:20 UTC
To: Ackerley Tng
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGpaggjd3=ooyzv7iEbmA-x1mWJHgjLSjPi8=5CPrk-yQ@mail.gmail.com>
On Wed, Jun 03, 2026, Ackerley Tng wrote:
> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
> writes:
>
> > This is v7 of guest_memfd in-place conversion support.
> >
>
> Here's the outstanding items after going over everyone's comments
> including Sashiko's:
>
> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
> + Need to move page clearing into __kvm_gmem_get_pfn to resolve
> leak where populate can put initialized kernel memory into TDX
> guest
> + See suggested fix at [1]
That fix works for me. The initial guest image will typically be a tiny subset
of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
> + KVM: guest_memfd: Only prepare folios for private pages,
> + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
> supported for non-CoCo VMs in a later patch in this series"
> + Use Suggested-by: Michael Roth <michael.roth@amd.com>
> + KVM: selftests: Test that shared/private status is consistent across
> processes
> + Improve test reliability using pthread_mutex
> + I have a fixup patch offline.
>
> I would like feedback on these:
>
> + KVM: selftests: Test conversion with elevated page refcount
> + Askar pointed out that soon vmsplice may not pin pages. Should I
> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
> take a dependency on CONFIG_GUP_TEST.
I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
it probably is the least awful choice. E.g. KVM also pins pages in certain flows,
but we're _also_ actively working to remove the need to pin.
Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
memory" syscall.
> + KVM: selftests: Add script to exercise private_mem_conversions_test
> + Would like to know what people think of a wrapper script before
> I address Sashiko's comments.
NAK to a wrapper script. This sounds like a perfect fit for Vipin's selftest
runner (which I'm like 4 months overdue for reviewing, testing, and merging).
If the runner _can't_ do what you want, then I'd rather improve the runner.
[*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>
> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>
> >
> > [...snip...]
> >
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Michael Roth @ 2026-06-04 20:11 UTC
To: Suzuki K Poulose
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>
On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> > From: Michael Roth <michael.roth@amd.com>
> >
> > For vm_memory_attributes=1, in-place conversion/population is not
> > supported, so the initial contents necessarily need to come
> > from a separate src address, which is enforced by the current
> > implementation. However, for vm_memory_attributes=0, it is possible for
> > guest memory to be initialized directly from userspace by mmap()'ing the
> > guest_memfd and writing to it while the corresponding GPA ranges are in
> > a 'shared' state before converting them to the 'private' state expected
> > by KVM_SEV_SNP_LAUNCH_UPDATE.
> >
> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> > copy in data from a separate memory location. Continue to enforce
> > non-NULL for the original vm_memory_attributes=1 case.
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > [Added src_page check in error handling path when the firmware command fails]
> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
>
>
>
> > ---
> > Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> > arch/x86/kvm/svm/sev.c | 18 +++++++++++++-----
> > virt/kvm/kvm_main.c | 1 +
> > 3 files changed, 25 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > index b2395dd4769de..43085f65b2d85 100644
> > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > @@ -503,7 +503,8 @@ secrets.
> > It is required that the GPA ranges initialized by this command have had the
> > KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> > -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> > +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> > +this aspect.
> > Upon success, this command is not guaranteed to have processed the entire
> > range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > remaining range that has yet to be processed. The caller should continue
> > calling this command until those fields indicate the entire range has been
> > processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> > -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> > -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> > -``uaddr`` will be ignored completely.
> > +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> > +userspace-provided source buffer address plus 1.
> > +
> > +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> > +ignored completely. Otherwise, ``uaddr`` is required if
> > +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> > +in the latter case guest memory can be initialized directly from userspace
> > +prior to converting it to private and passing the GPA range on to this
> > +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
> process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
sev_gmem_prepare() does sort of destroy contents since it finalizes the
shared->private conversion which puts the page in an unusable state
until the guest 'accepts' it as private memory and re-initializes the
contents.
But that's run-time, when the guest is doing conversions. The
documentation here relates to initialization time, when we are
setting up the initial pre-encrypted/pre-measured guest memory image
via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
is then the sev_gmem_post_populate() callback that actually finalizes the
shared->private conversion. The sev_gmem_prepare() hook doesn't get used
in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn(), which skips
preparation).
-Mike
>
> Suzuki
>
>
>
> > Parameters (in): struct kvm_sev_snp_launch_update
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 1a361f08c7a3d..e1dbc827c2807 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > int level;
> > int ret;
> > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> > + /*
> > + * For vm_memory_attributes=1, in-place conversion/population is not
> > + * supported, so the initial contents necessarily need to come from a
> > + * separate src address. For vm_memory_attributes=0, this isn't
> > + * necessarily the case, since the pages may have been populated
> > + * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> > + */
> > + if (vm_memory_attributes &&
> > + sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> > return -EINVAL;
> > ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> > @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > */
> > if (ret && !snp_page_reclaim(kvm, pfn) &&
> > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> > - sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> > + sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> > void *src_vaddr = kmap_local_page(src_page);
> > void *dst_vaddr = kmap_local_pfn(pfn);
> > @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> > return -EFAULT;
> > - pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> > - params.gfn_start, params.len, params.type, params.flags);
> > + pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> > + params.gfn_start, params.len, params.type, params.flags, params.uaddr);
> > if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> > (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> > @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > params.gfn_start += count;
> > params.len -= count * PAGE_SIZE;
> > - if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > + if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > params.uaddr += count * PAGE_SIZE;
> > if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ba195bb239aaa..3bf212fd99193 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> > #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> > bool vm_memory_attributes = true;
> > module_param(vm_memory_attributes, bool, 0444);
> > +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> > #endif
> > DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> > EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> >
>
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-06-04 19:05 UTC
To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
> [...snip...]
>
>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>> +ignored completely. Otherwise, ``uaddr`` is required if
>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>> +in the latter case guest memory can be initialized directly from userspace
>> +prior to converting it to private and passing the GPA range on to this
>> +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
> the process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
>
> Suzuki
>
The following is the guest_memfd perspective, I didn't look at the SNP
spec:
Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
type?
guest_memfd has no plans to do any special zeroing based on type.
guest_memfd decoupled zeroing from preparation a while ago (Michael had
some patches), so zeroing is supposed to happen once during folio
ownership by guest_memfd, tracked by the uptodate flag, and preparation
is tracked outside of guest_memfd. So far only SNP does preparation.
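To make the split concrete, here is a minimal userspace sketch of "zero once, tracked by uptodate" with preparation tracked separately. Everything named model_* is hypothetical and exists only for illustration; it is not the guest_memfd implementation and says nothing about what an architecture's prepare hook does to the page contents.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a guest_memfd folio; not the kernel's struct folio. */
struct model_folio {
	bool uptodate;	/* models the uptodate flag: page has been zeroed */
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical arch-side state: preparation is tracked outside the folio. */
struct model_arch_state {
	bool prepared;
};

/* Zero the page at most once; returns true if zeroing happened on this call. */
static bool model_zero_once(struct model_folio *f)
{
	if (f->uptodate)
		return false;
	memset(f->data, 0, sizeof(f->data));
	f->uptodate = true;
	return true;
}

/*
 * guest_memfd-side view only: no zeroing happens here, the call just
 * records preparation in the separately tracked state.
 */
static void model_prepare(struct model_folio *f, struct model_arch_state *s)
{
	(void)f;
	s->prepared = true;
}
```

In this model a second model_zero_once() call is a no-op, so data written after the first zeroing survives until preparation, independent of any page type.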
>
>
>>
>> [...snip...]
>>
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 18:12 UTC
To: Nico Pache
Cc: David Hildenbrand (Arm), Lance Yang, linux-doc, linux-kernel,
linux-mm, linux-trace-kernel, aarcange, akpm, anshuman.khandual,
apopple, baohua, baolin.wang, byungchul, catalin.marinas, cl,
corbet, dave.hansen, dev.jain, gourry, hannes, hughd, jack,
jackmanb, jannh, jglisse, joshua.hahnjy, kas, liam,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
usama.arif
In-Reply-To: <CAA1CXcBSPVG4CJFCBDvbuodcJ_7eXoDQTpK0ZN0HEhkDPi-DEw@mail.gmail.com>
On Thu, Jun 04, 2026 at 11:04:35AM -0600, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> > >
> > > On 6/1/26 12:47, Lance Yang wrote:
> > > >
> > > >
> > > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > > >> On 6/1/26 11:08, Lance Yang wrote:
> > > >>>
> > > >>>
> > > >>>
> > > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > > >>> move it before the earlier pmd_populate().
> > > >>>
> > > >>> IIUC, the ordering we want is still:
> > > >>>
> > > >>> clear old PTEs
> > > >>> smp_wmb()
> > > >>> pmd_populate()
> > > >>>
> > > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > > >>> the old PTEs, right?
> > > >>
> > > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > > >
> > > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > > >
> > > > So yeah, no extra smp_wmb() needed here!
> > >
> > > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > > draft.
> >
> > Okay, I read the above and did some investigating.
> >
> > I will try to implement and verify the changes you suggested :)
>
> I've implemented something slightly different actually and I *think* it's better!
>
> } else {
> /* this is map_anon_folio_pte_nopf with no mmu update */
> __map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> /*uffd_wp=*/ false);
> smp_wmb();
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> /*
> * Some architectures (e.g. MIPS) walk the live page table in
> * their implementation. update_mmu_cache_range() must be called
> * with a valid page table hierarchy and the PTE lock held.
> * Acquire it nested inside pmd_ptl when they are distinct locks.
> */
> if (pte_ptl != pmd_ptl)
> spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> if (pte_ptl != pmd_ptl)
> spin_unlock(pte_ptl);
> }
> spin_unlock(pmd_ptl);
>
> The logic here is that when the PMD becomes visible, PTEs are already
> populated (no possibility of spurious faults on local CPU)
>
> the SMP_WMB makes sure of the above
>
> And the pmd is installed with the pte and pmd lock both held through
> the mmu_cache update.
>
> This follows the conventions used in pmd_install() and clears the
> potential for local CPU faults hitting cleared PTE entries.
>
> I think both approaches are correct but this prevents any possibility
> of my first point. although mmap_write_lock prevents this too.
>
> Let me know what you think. I can revert to your implementation but
> this is what I tested.
Yeah let's go with the original implementation please :)
Thanks!
>
> Cheers,
> -- Nico
>
> >
> > Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> > before walking a PTE table?
> >
> > -- Nico
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David
> > >
>
Cheers, Lorenzo
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 17:04 UTC
To: David Hildenbrand (Arm)
Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <CAA1CXcAeEGOsqp-ywAQ7GMYQzXEeco-rUxUkk2hEF69HybC4=w@mail.gmail.com>
On Mon, Jun 1, 2026 at 9:00 AM Nico Pache <npache@redhat.com> wrote:
>
> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >
> > On 6/1/26 12:47, Lance Yang wrote:
> > >
> > >
> > > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> > >> On 6/1/26 11:08, Lance Yang wrote:
> > >>>
> > >>>
> > >>>
> > >>> One small thing, I think we should probably keep the smp_wmb(), and just
> > >>> move it before the earlier pmd_populate().
> > >>>
> > >>> IIUC, the ordering we want is still:
> > >>>
> > >>> clear old PTEs
> > >>> smp_wmb()
> > >>> pmd_populate()
> > >>>
> > >>> so another CPU cannot walk through the re-installed PMD and still observe
> > >>> the old PTEs, right?
> > >>
> > >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> > >
> > > Ah, cool! __folio_mark_uptodate() already does the job :P
> > >
> > > So yeah, no extra smp_wmb() needed here!
> >
> > Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> > draft.
>
> Okay, I read the above and did some investigating.
>
> I will try to implement and verify the changes you suggested :)
I've implemented something slightly different actually and I *think* it's better!
} else {
/* this is map_anon_folio_pte_nopf with no mmu update */
__map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
/*uffd_wp=*/ false);
smp_wmb();
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
/*
* Some architectures (e.g. MIPS) walk the live page table in
* their implementation. update_mmu_cache_range() must be called
* with a valid page table hierarchy and the PTE lock held.
* Acquire it nested inside pmd_ptl when they are distinct locks.
*/
if (pte_ptl != pmd_ptl)
spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
if (pte_ptl != pmd_ptl)
spin_unlock(pte_ptl);
}
spin_unlock(pmd_ptl);
The logic here is that when the PMD becomes visible, the PTEs are already
populated (no possibility of spurious faults on the local CPU).
The smp_wmb() makes sure of the above.
And the PMD is installed with the PTE and PMD locks both held through
the mmu_cache update.
This follows the conventions used in pmd_install() and clears the
potential for local CPU faults hitting cleared PTE entries.
I think both approaches are correct, but this one rules out the
possibility raised in my first point, although mmap_write_lock prevents
that too.
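The ordering argument (fill the PTEs, barrier, then make the PMD visible) can be modelled in portable C11. This is a sketch only: the model_* types and functions are hypothetical, the release fence merely plays the role of smp_wmb(), and the acquire load on the reader side is an assumption of the C11 model rather than what a kernel or hardware page table walker does.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_NR_PTES 8

/* Hypothetical stand-ins; not the kernel's pte_t/pmd_t. */
struct model_pte_table {
	unsigned long pte[MODEL_NR_PTES];
};

struct model_pmd {
	_Atomic(struct model_pte_table *) table;	/* NULL models pmd_none() */
};

/* Writer: fill every PTE first, then publish the table through the PMD. */
static void model_install(struct model_pmd *pmd, struct model_pte_table *t,
			  unsigned long first_pfn)
{
	for (size_t i = 0; i < MODEL_NR_PTES; i++)
		t->pte[i] = first_pfn + i;

	/* Stands in for smp_wmb(): PTE stores are ordered before the publish. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&pmd->table, t, memory_order_relaxed);
}

/* Reader: returns 0 while the PMD is still empty, else the PTE value. */
static unsigned long model_walk(struct model_pmd *pmd, size_t idx)
{
	struct model_pte_table *t =
		atomic_load_explicit(&pmd->table, memory_order_acquire);

	return t ? t->pte[idx] : 0;
}
```

A walker that observes a non-NULL table in this model is guaranteed to see the populated entries, which is the property the barrier before pmd_populate() is meant to provide.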
Let me know what you think. I can revert to your implementation but
this is what I tested.
Cheers,
-- Nico
>
> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> before walking a PTE table?
>
> -- Nico
>
> >
> > --
> > Cheers,
> >
> > David
> >
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 16:28 UTC
To: Lorenzo Stoakes
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <aiFz-VSKSQ-zBfN7@lucifer>
On Thu, Jun 4, 2026 at 6:56 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> > On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > > I will go review the thread about the cache maintenance separately and
> > > > respond about that.
> > > >
> > > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > > are attempting to collapse to, and offset indicates where in the PMD to
> > > > > start the collapse attempt.
> > > > >
> > > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > > temporarily removed during the collapse phase.
> > > > >
> > > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > >
> > > > The logic LGTM generally, some questions for understanding below, and of
> > > > course as per above I want to review the Lance/David subthread.
> > > >
> > > > Thanks!
> > > >
> > > > > ---
> > > > > mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > > > 1 file changed, 55 insertions(+), 38 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index fab35d318641..d64f42f66236 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > > > * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > > > * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > > > */
> > > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > - int referenced, int unmapped, struct collapse_control *cc)
> > > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > > + int referenced, int unmapped, struct collapse_control *cc,
> > > > > + unsigned int order)
> > > > > {
> > > > > + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > > + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > > > LIST_HEAD(compound_pagelist);
> > > > > pmd_t *pmd, _pmd;
> > > > > - pte_t *pte;
> > > > > + pte_t *pte = NULL;
> > > >
> > > > As mentioned elsewhere for some reason this was dropped in
> > > > mm-unstable. Maybe a bad conflict resolution?
> > > >
> > > > > pgtable_t pgtable;
> > > > > struct folio *folio;
> > > > > spinlock_t *pmd_ptl, *pte_ptl;
> > > > > enum scan_result result = SCAN_FAIL;
> > > > > struct vm_area_struct *vma;
> > > > > struct mmu_notifier_range range;
> > > > > + bool anon_vma_locked = false;
> > > > >
> > > > > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > > -
> > > > > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > > + result = alloc_charge_folio(&folio, mm, cc, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_nolock;
> > > > >
> > > > > mmap_read_lock(mm);
> > > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > - HPAGE_PMD_ORDER);
> > > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > + &vma, cc, order);
> > > > > if (result != SCAN_SUCCEED) {
> > > > > mmap_read_unlock(mm);
> > > > > goto out_nolock;
> > > > > }
> > > > >
> > > > > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > > + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > > > if (result != SCAN_SUCCEED) {
> > > > > mmap_read_unlock(mm);
> > > > > goto out_nolock;
> > > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * released when it fails. So we jump out_nolock directly in
> > > > > * that case. Continuing to collapse causes inconsistency.
> > > > > */
> > > > > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > > - referenced, HPAGE_PMD_ORDER);
> > > > > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > > + referenced, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_nolock;
> > > > > }
> > > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * mmap_lock.
> > > > > */
> > > > > mmap_write_lock(mm);
> > > > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > > - HPAGE_PMD_ORDER);
> > > > > + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > > + &vma, cc, order);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_up_write;
> > > > > /* check if the pmd is still valid */
> > > > > vma_start_write(vma);
> > >
> > > Hmm actually I think we have another problem here.
> > >
> > > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > > will span the entire PMD.
> > >
> > > But for mTHP we have an issue...
> > >
> > > See below...
> > >
> > > > > - result = check_pmd_still_valid(mm, address, pmd);
> > > > > + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > > > if (result != SCAN_SUCCEED)
> > > > > goto out_up_write;
> > > > >
> > > > > anon_vma_lock_write(vma->anon_vma);
> > > > > + anon_vma_locked = true;
> > > >
> > > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > > hold might be an issue.
> > > >
> > > > I wonder if we'll observe lock contention as a result?
> > > >
> > > > Correct me if I'm wrong and we're not holding longer than previously,
> > > > however. Just appears that we do.
> > > >
> > > > >
> > > > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > - address + HPAGE_PMD_SIZE);
> > > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > > + end_addr);
> > > > > mmu_notifier_invalidate_range_start(&range);
> > > > >
> > > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * Parallel GUP-fast is fine since GUP-fast will back off when
> > > > > * it detects PMD is changed.
> > > > > */
> > > > > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > > + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> > >
> > > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > > but for mTHP we might have _another_ VMA that spans another part of the range
> > > mapped by the same PMD entry.
> > >
> > > So we clear this, but we do not have a write lock on any other VMA, and so
> > > racing VMA read locks can install a new PMD entry.
> > >
> > > > > spin_unlock(pmd_ptl);
> > >
> > > Especially since you unlock this :)
> > >
> > > And...
> > >
> > > > > mmu_notifier_invalidate_range_end(&range);
> > > > > tlb_remove_table_sync_one();
> > > > >
> > > > > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > > > if (pte) {
> > > > > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > > - HPAGE_PMD_ORDER,
> > > > > - &compound_pagelist);
> > > > > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > > + order, &compound_pagelist);
> > > > > spin_unlock(pte_ptl);
> > > > > } else {
> > > > > result = SCAN_NO_PTE_TABLE;
> > > > > }
> > > > >
> > > > > if (unlikely(result != SCAN_SUCCEED)) {
> > > > > - if (pte)
> > > > > - pte_unmap(pte);
> > > >
> > > > OK I seem to remember this is because we're holding the anon_vma lock
> > > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > > bit longer also, as well as the anon_vma lock.
> > > >
> > > > I guess it's also because we need to hold anon_vma and pte lock because
> > > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > > THP did.
> > > >
> > > > (Rememberings going on here :)
> > > >
> > > > > spin_lock(pmd_ptl);
> > > > > - BUG_ON(!pmd_none(*pmd));
> > > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > >
> > > ...this will get triggered.
> > >
> > > I don't know whether we can safely hold the PMD lock across everything here for
> > > mTHP?
> > >
> > > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > > and VMA write lock each of them?
> >
> > I believe we've spoken about this before, but because we always make
>
> Maybe worth a comment then...? Ah how rewarding review is :)
I'll expand the commit message and the comment in patch 1 of the series. Thanks!
>
> This is something that somebody else might very well wonder about and
> forget that it happens to be covered there.
>
> Also:
>
> /* Always check the PMD order to ensure its not shared by another VMA */
>
> Is pretty lightweight there. Something about avoiding racing page faults
> would be helpful.
Yeah, fair enough. The commit message of patch 1 also doesn't really do
it justice on the *why*.
>
> > sure the VMA spans the full PMD we won't ever hit this issue. If we
> > wanted to support mTHP collapse on regions smaller than a PMD, the
> > locking gets tricky (hence the design choice to not do that for now).
> >
> > This is handled by the HPAGE_ORDER in hugepage_vma_revalidate().
>
> The existing code is atrocious, and sticking this on top has added to the
> pile of assumptions and conventions and having to go check a bunch of
> functions to 'just know' you're safe for X, Y, Z.
>
> We really need to see some cleanup series coming after this and I'm going
> to get pretty grumpy(ier) if we don't.
Many more to come :) Improvements too but cleanups first!
Cheers,
-- Nico
>
> >
> > /* Always check the PMD order to ensure its not shared by another VMA */
> > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> >
> > -- Nico
> >
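To make the invariant above concrete, here is a minimal user-space sketch of
"the VMA must span the whole PMD". It is not the kernel's
thp_vma_suitable_order(); the sketch_* names, the struct, and the 2 MiB PMD
size are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages and a 2 MiB PMD, as on x86-64. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_SIZE  (1UL << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical, simplified stand-in for a VMA covering [start, end). */
struct sketch_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Returns true when the PMD-aligned range containing addr lies entirely
 * inside the VMA. If it does, no other VMA can map part of the same PMD
 * entry, which is the property the discussion above relies on.
 */
static bool sketch_vma_spans_pmd(const struct sketch_vma *vma,
				 unsigned long addr)
{
	unsigned long pmd_start = addr & SKETCH_PMD_MASK;
	unsigned long pmd_end = pmd_start + SKETCH_PMD_SIZE;

	return pmd_start >= vma->start && pmd_end <= vma->end;
}
```

With a check of this shape done at PMD order even for smaller collapse
orders, write-locking the single VMA is enough to exclude faults on the
whole PMD range.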
> > >
> > > That could cause some 'interesting' lock contention issues though? Then again,
> > > we will be releasing the mmap write lock soon enough which will drop the VMA
> > > write locks.
> > >
> > > > > /*
> > > > > * We can only use set_pmd_at when establishing
> > > > > * hugepmds and never for establishing regular pmds that
> > > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > */
> > > > > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > > > spin_unlock(pmd_ptl);
> > > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > > goto out_up_write;
> > > > > }
> > > > >
> > > > > /*
> > > > > - * All pages are isolated and locked so anon_vma rmap
> > > > > - * can't run anymore.
> > > > > + * For PMD collapse all pages are isolated and locked so anon_vma
> > > > > + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > > + * removed and not all pages are isolated and locked, so we must hold
> > > >
> > > > Right, because some PTE entries will be unaffected by the change.
> > > >
> > > > > + * the lock to prevent neighboring folios from attempting to access
> > > > > + * this PMD until its reinstalled.
> > > >
> > > > OK. This is slightly annoying for my CoW context work as it means there's
> > > > another case where we need to explicitly hold an anon_vma lock for
> > > > correctness :)
> > > >
> > > > Anyway I will think about that separately, is what it is. And in fact
> > > > motivates to want this merged earlier so I can work against it :)
> > > >
> > > >
> > > > > */
> > > > > - anon_vma_unlock_write(vma->anon_vma);
> > > > > + if (is_pmd_order(order)) {
> > > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > > + anon_vma_locked = false;
> > > > > + }
> > > > >
> > > > > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > > - vma, address, pte_ptl,
> > > > > - HPAGE_PMD_ORDER,
> > > > > - &compound_pagelist);
> > > > > - pte_unmap(pte);
> > > > > + vma, start_addr, pte_ptl,
> > > > > + order, &compound_pagelist);
> > > > > if (unlikely(result != SCAN_SUCCEED))
> > > > > goto out_up_write;
> > > > >
> > > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > > > * write.
> > > > > */
> > > > > __folio_mark_uptodate(folio);
> > > > > - pgtable = pmd_pgtable(_pmd);
> > > > > -
> > > > > spin_lock(pmd_ptl);
> > > > > - BUG_ON(!pmd_none(*pmd));
> > > > > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > > > + if (is_pmd_order(order)) {
> > > > > + pgtable = pmd_pgtable(_pmd);
> > > > > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > > + } else {
> > > > > + /*
> > > > > + * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > > + * pmd_ptl lock still held; this is safe as the PMD is expected
> > > >
> > > > PMD entry you mean?
> > > >
> > > > > + * to be none. The pmd entry is then repopulated below.
> > > > > + */
> > > > > + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > > >
> > > > So here we populate entries in the existing PTE _table_ to point at the new
> > > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > > >
> > > > > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > >
> > > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > > before?
> > > >
> > > > > + }
> > > > > spin_unlock(pmd_ptl);
> > > > >
> > > > > folio = NULL;
> > > > >
> > > > > result = SCAN_SUCCEED;
> > > > > out_up_write:
> > > > > + if (anon_vma_locked)
> > > > > + anon_vma_unlock_write(vma->anon_vma);
> > > > > + if (pte)
> > > > > + pte_unmap(pte);
> > > > > mmap_write_unlock(mm);
> > > > > out_nolock:
> > > > > if (folio)
> > > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > > > /* collapse_huge_page expects the lock to be dropped before calling */
> > > > > mmap_read_unlock(mm);
> > > > > result = collapse_huge_page(mm, start_addr, referenced,
> > > > > - unmapped, cc);
> > > > > + unmapped, cc, HPAGE_PMD_ORDER);
> > > > > /* collapse_huge_page will return with the mmap_lock released */
> > > > > *lock_dropped = true;
> > > > > }
> > > > > --
> > > > > 2.54.0
> > > > >
> > >
> > > Thanks, Lorenzo
> > >
> >
>
^ permalink raw reply
* RE: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Zhuo, Qiuxu @ 2026-06-04 15:48 UTC (permalink / raw)
To: Xie Yuanbin, david@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, rostedt@goodmis.org,
linmiaohe@huawei.com
Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
mchehab+huawei@kernel.org, Luck, Tony,
torvalds@linux-foundation.org, Lai, Yi1
In-Reply-To: <20260604134209.111533-1-xieyuanbin1@huawei.com>
> From: Xie Yuanbin <xieyuanbin1@huawei.com>
> Sent: Thursday, June 4, 2026 9:42 PM
> To: david@kernel.org; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; bp@alien8.de;
> akpm@linux-foundation.org; rostedt@goodmis.org; linmiaohe@huawei.com
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; linux-trace-kernel@vger.kernel.org;
> mchehab+huawei@kernel.org; Luck, Tony <tony.luck@intel.com>;
> torvalds@linux-foundation.org; xieyuanbin1@huawei.com; Lai, Yi1
> <yi1.lai@intel.com>
> Subject: Re: mm/memory-failure tracepoint change breaks userspace
> rasdaemon
>
> On Thu, 4 Jun 2026 08:42:37 +0200, David Hildenbrand (Arm) wrote:
> > Yeah, if only I had known that we would break user space by changing
> > trace events ... now we know :)
> >
> > Do you have capacity to send a fix?
>
> Sure, with pleasure.
Thanks Yuanbin,
When your patch is ready, we can help test it again if needed.
-Qiuxu
^ permalink raw reply
* RE: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Zhuo, Qiuxu @ 2026-06-04 15:43 UTC (permalink / raw)
To: David Hildenbrand (Arm), Steven Rostedt
Cc: Borislav Petkov, mchehab+huawei@kernel.org, Luck, Tony,
akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
> From: David Hildenbrand (Arm) <david@kernel.org>
> [...]
> Would the following be sufficient to avoid a full revert and the dependency
> on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
Thanks all for the discussion on this issue.
We applied David's fix above to v7.1-rc3, tested it, and confirmed that rasdaemon
can again enable and receive the memory_failure event.
Rasdaemon logs:
...
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
...
<...>-2513 [000] ..... 0.000021 memory_failure_event [ALERT] 2026-06-04 23:30:43 +0800 pfn=0x144e6f page_type=dirty LRU page action_result=Recovered
...
-Qiuxu
^ permalink raw reply
* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-04 15:29 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-20-2f0fae496530@google.com>
On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> From: Michael Roth <michael.roth@amd.com>
>
> For vm_memory_attributes=1, in-place conversion/population is not
> supported, so the initial contents necessarily need to come
> from a separate src address, which is enforced by the current
> implementation. However, for vm_memory_attributes=0, it is possible for
> guest memory to be initialized directly from userspace by mmap()'ing the
> guest_memfd and writing to it while the corresponding GPA ranges are in
> a 'shared' state before converting them to the 'private' state expected
> by KVM_SEV_SNP_LAUNCH_UPDATE.
>
> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> copy in data from a separate memory location. Continue to enforce
> non-NULL for the original vm_memory_attributes=1 case.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> [Added src_page check in error handling path when the firmware command fails]
> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> arch/x86/kvm/svm/sev.c | 18 +++++++++++++-----
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b2395dd4769de..43085f65b2d85 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -503,7 +503,8 @@ secrets.
>
> It is required that the GPA ranges initialized by this command have had the
> KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> +this aspect.
>
> Upon success, this command is not guaranteed to have processed the entire
> range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> remaining range that has yet to be processed. The caller should continue
> calling this command until those fields indicate the entire range has been
> processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> -``uaddr`` will be ignored completely.
> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> +userspace-provided source buffer address plus 1.
> +
> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> +ignored completely. Otherwise, ``uaddr`` is required if
> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> +in the latter case guest memory can be initialized directly from userspace
> +prior to converting it to private and passing the GPA range on to this
> +interface.
Just to confirm: sev_gmem_prepare doesn't destroy the contents in the
process of making the page "private"? I.e., the contents of an SNP shared
page are preserved while transitioning to "SNP Private" (via RMP
update)?
Suzuki
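For reference, the documented rule for when ``uaddr`` is required can be
written down as a small predicate. This is only a sketch of the rule as the
quoted documentation states it, not the code in sev.c; the enum values and
the sketch_uaddr_ok() helper are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page-type values for illustration; not the UAPI constants. */
enum sketch_page_type {
	SKETCH_PAGE_TYPE_NORMAL,
	SKETCH_PAGE_TYPE_ZERO,
};

/*
 * Returns true when the (type, uaddr) combination is acceptable under the
 * documented rule:
 *   - ZERO pages ignore uaddr completely;
 *   - with vm_memory_attributes=1 a separate source buffer is mandatory;
 *   - with vm_memory_attributes=0 uaddr is optional, because the memory
 *     may already have been populated in place from userspace.
 */
static bool sketch_uaddr_ok(bool vm_memory_attributes,
			    enum sketch_page_type type, uint64_t uaddr)
{
	if (type == SKETCH_PAGE_TYPE_ZERO)
		return true;
	if (vm_memory_attributes)
		return uaddr != 0;
	return true;
}
```

Whether in-place contents survive the shared-to-private transition is
exactly the open question above; the sketch does not answer it.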
>
> Parameters (in): struct kvm_sev_snp_launch_update
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 1a361f08c7a3d..e1dbc827c2807 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> int level;
> int ret;
>
> - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> + /*
> + * For vm_memory_attributes=1, in-place conversion/population is not
> + * supported, so the initial contents necessarily need to come from a
> + * separate src address. For vm_memory_attributes=0, this isn't
> + * necessarily the case, since the pages may have been populated
> + * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> + */
> + if (vm_memory_attributes &&
> + sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> return -EINVAL;
>
> ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> */
> if (ret && !snp_page_reclaim(kvm, pfn) &&
> sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> - sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> + sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> void *src_vaddr = kmap_local_page(src_page);
> void *dst_vaddr = kmap_local_pfn(pfn);
>
> @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> return -EFAULT;
>
> - pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> - params.gfn_start, params.len, params.type, params.flags);
> + pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> + params.gfn_start, params.len, params.type, params.flags, params.uaddr);
>
> if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>
> params.gfn_start += count;
> params.len -= count * PAGE_SIZE;
> - if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> + if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> params.uaddr += count * PAGE_SIZE;
>
> if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba195bb239aaa..3bf212fd99193 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> bool vm_memory_attributes = true;
> module_param(vm_memory_attributes, bool, 0444);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> #endif
> DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
>
^ permalink raw reply
* [GIT PULL] rv fixes for v7.1 (resend with changed attribution)
From: Gabriele Monaco @ 2026-06-04 14:50 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel
Cc: linux-trace-kernel, Gabriele Monaco, Wen Yang
Steve,
The following changes since commit e43ffb69e0438cddd72aaa30898b4dc446f664f8:
Linux 7.1-rc6 (2026-05-31 15:14:24 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/gmonaco/linux.git rv-fixes2-7.1
for you to fetch changes up to df996599cc69a9b74ff437c67751cf8a61f62e39:
verification/rvgen: Fix ltl2k writing True as a literal (2026-06-04 16:44:25 +0200)
----------------------------------------------------------------
rv fixes for v7.1
Summary of changes:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bounds memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitor initialization does not rely on potentially corrupted
state such as the monitoring flag, which is not reset by all monitor
types and may hold an unknown value in monitors that reuse storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
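As background for the "release-acquire semantics for the monitoring flag"
item above, the following is a minimal user-space sketch of the general
pattern using C11 atomics. It is not the code from the series: the
sketch_monitor struct and both helpers are hypothetical, and kernel code
would use its own primitives (for example smp_store_release() and
smp_load_acquire()) rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical monitor: plain state published through an atomic flag. */
struct sketch_monitor {
	int curr_state;
	atomic_bool monitoring;
};

/*
 * Writer: initialise the state first, then set the flag with release
 * semantics so the state write cannot be reordered after the flag write.
 */
static void sketch_monitor_start(struct sketch_monitor *m, int initial_state)
{
	m->curr_state = initial_state;
	atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * Reader: load the flag with acquire semantics. If the flag is observed
 * as set, the state written before the release store is visible too.
 */
static bool sketch_monitor_read(struct sketch_monitor *m, int *state)
{
	if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
		return false;
	*state = m->curr_state;
	return true;
}
```

Without the release/acquire pairing, a handler could see the flag set and
still read stale state, which is the kind of start-up race the summary
describes.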
----------------------------------------------------------------
Gabriele Monaco (16):
rv: Fix __user specifier usage in extract_params()
rv: Reset per-task DA monitors before releasing the slot
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Do not rely on clean monitor when initialising HA
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Ensure synchronous cleanup for HA monitors
rv: Prevent task migration while handling per-CPU events
rv: Use 0 to check preemption enabled in opid
tools/rv: Ensure monitor name and desc are NUL-terminated
tools/rv: Fix substring match bug in monitor name search
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix cleanup after failed trace setup
verification/rvgen: Fix suffix strip in dot2k
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix ltl2k writing True as a literal
Wen Yang (1):
rv: Fix monitor start ordering and memory ordering for monitoring flag
include/rv/da_monitor.h | 139 +++++++++++++++++----
include/rv/ha_monitor.h | 91 +++++++++++++-
include/rv/ltl_monitor.h | 1 +
kernel/trace/rv/monitors/deadline/deadline.h | 3 +-
kernel/trace/rv/monitors/nomiss/nomiss.c | 4 +-
kernel/trace/rv/monitors/opid/opid.c | 12 +-
kernel/trace/rv/monitors/stall/stall.c | 4 +-
tools/verification/rv/src/in_kernel.c | 65 +++++-----
tools/verification/rvgen/__main__.py | 10 +-
tools/verification/rvgen/rvgen/dot2k.py | 4 +-
tools/verification/rvgen/rvgen/ltl2ba.py | 9 +-
.../rvgen/rvgen/templates/dot2k/main.c | 4 +-
12 files changed, 263 insertions(+), 83 deletions(-)
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Wen Yang <wen.yang@linux.dev>
^ permalink raw reply
* Re: [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
To: Zhang Yi, Andreas Dilger, Li Chen
Cc: Theodore Ts'o, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>
On Fri, 15 May 2026 17:18:20 +0800, Li Chen wrote:
> (This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
> commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
> depends on patch "ext4: fix fast commit wait/wake bit mapping on
> 64-bit" [0]).
>
> Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
> masks the issue, and that sleeping in ext4_fc_track_inode() while holding
> i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
> i_data_sem while the inode is in FC_COMMITTING.
>
> [...]
Applied, thanks!
[1/7] ext4: fast commit: snapshot inode state before writing log
commit: e9c6e0b8e096255feb71ec996c77bdfbe9c36e91
[2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
commit: 7f473f971382d73a58e386afa7efdaac294b89f0
[3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
commit: b3060e96533dc3157fc6d3d45dc19927c566977b
[4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
commit: 2b9b216628fd9352f9c791701c8990d05736aa90
[5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
commit: 22d887e06a57261df58404c8dce50c4ef37549ed
[6/7] ext4: fast commit: add lock_updates tracepoint
commit: d2f6e83bbbef31169ea363af4277f5c09c914eda
[7/7] ext4: fast commit: export snapshot stats in fc_info
commit: 56bb0b64f4b198bad5ce674509c10793d471148f
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:45 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260522150009.121603-12-npache@redhat.com>
On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
>
> To enable mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search, which
> avoids deep call chains on the limited kernel stack. The algorithm
> recursively splits the bitmap into smaller chunks to find the highest
> order mTHPs that satisfy the collapse criteria. We start by attempting
> the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
> 1) push (0, HPAGE_PMD_ORDER) onto the stack
> 2) pop the stack
> 3) check if the number of set bits in that (offset,order) pair
> satisfies the max_ptes_none threshold for that order
> 4) if yes, attempt collapse
> 5) if no (or collapse fails), push two new stack items representing
> the left and right halves of the current bitmap range, at the
> next lower order
> 6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
> offset mid_offset
> | |
> | |
> v v
> ____________________________________
> | PTE Page Table |
> --------------------------------------
> <-------><------->
> order-1 order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse
> introduces at least 2x the number of pages, so a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function, which imposes the max_ptes_none
> restrictions described below.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
> - max_ptes_none=0: Never introduce new empty pages during collapse
> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> available mTHP order
>
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
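The policy described above could look roughly like the sketch below. It is
not the patch's collapse_max_ptes_none(): the helper name, its signature,
the PMD order of 9, and in particular the per-order value of
(1 << order) - 1 for the "always collapse" setting are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_ORDER 9	/* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR    (1u << SKETCH_PMD_ORDER)

/*
 * Returns the number of empty PTEs tolerated when collapsing at the given
 * order. *warn is set when an unsupported tunable value was overridden.
 */
static unsigned int sketch_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order, bool *warn)
{
	*warn = false;

	/* PMD collapse keeps its existing behavior. */
	if (order == SKETCH_PMD_ORDER)
		return max_ptes_none;

	/* Never introduce new empty pages. */
	if (max_ptes_none == 0)
		return 0;

	/* Always collapse: tolerate all but one PTE being empty. */
	if (max_ptes_none == SKETCH_PMD_NR - 1)
		return (1u << order) - 1;

	/* Any other value: warn and fall back to 0 for mTHP. */
	*warn = true;
	return 0;
}
```

Restricting mTHP collapse to the two extreme values sidesteps the repeated
promotion problem that intermediate thresholds could cause.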
>
> Once we determine which mTHP size fits best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> + u16 offset;
> + u8 order;
> +};
> +
> struct collapse_control {
> bool is_khugepaged;
>
> @@ -110,6 +134,12 @@ struct collapse_control {
>
> /* nodemask for allocation fallback */
> nodemask_t alloc_nmask;
> +
> + /* Each bit represents a single occupied (!none/zero) page. */
> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> + /* A mask of the current range being considered for mTHP collapse. */
> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> + u16 offset, u8 order)
> +{
> + const int size = *stack_size;
> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> + stack->order = order;
> + stack->offset = offset;
> + (*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> + int *stack_size)
> +{
> + const int size = *stack_size;
> +
> + VM_WARN_ON_ONCE(size <= 0);
> + (*stack_size)--;
> + return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> + u16 offset, unsigned int nr_ptes)
> +{
> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + * offset mid_offset
> + * | |
> + * | |
> + * v v
> + * --------------------------------------
> + * | cc->mthp_bitmap |
> + * --------------------------------------
> + * <-------><------->
> + * order-1 order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, int referenced, int unmapped,
> + struct collapse_control *cc, unsigned long enabled_orders)
> +{
> + unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> + int collapsed = 0, stack_size = 0;
> + unsigned long collapse_address;
> + struct mthp_range range;
> + u16 offset;
> + u8 order;
> +
> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> + while (stack_size) {
> + range = collapse_mthp_stack_pop(cc, &stack_size);
> + order = range.order;
> + offset = range.offset;
> + nr_ptes = 1UL << order;
> +
> + if (!test_bit(order, &enabled_orders))
> + goto next_order;
> +
> + max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> + nr_ptes);
> +
> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> + int ret;
> +
> + collapse_address = address + offset * PAGE_SIZE;
> + ret = collapse_huge_page(mm, collapse_address, referenced,
> + unmapped, cc, order);
> + if (ret == SCAN_SUCCEED) {
> + collapsed += nr_ptes;
> + continue;
> + }
> + }
> +
> +next_order:
> + if ((BIT(order) - 1) & enabled_orders) {
> + const u8 next_order = order - 1;
> + const u16 mid_offset = offset + (nr_ptes / 2);
> +
> + collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> + next_order);
> + collapse_mthp_stack_push(cc, &stack_size, offset,
> + next_order);
> + }
> + }
> + return collapsed;
> +}
> +
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> + unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> pmd_t *pmd;
> - pte_t *pte, *_pte;
> - int none_or_zero = 0, shared = 0, referenced = 0;
> + pte_t *pte, *_pte, pteval;
> + int i;
> + int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr;
> + unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> + /*
> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> + * scan all pages to populate the bitmap for mTHP collapse.
> + */
> + if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> + max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
Hmm, this is a bit odd: what if the user set max_ptes_none = 0?
I assume we handle the 0/511 thing elsewhere?
> +
> pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> if (!pte) {
> cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> - for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> - _pte++, addr += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + _pte = pte + i;
> + addr = start_addr + i * PAGE_SIZE;
> + pteval = ptep_get(_pte);
> +
> cc->progress++;
>
> - pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
> + /* Set bit for occupied pages */
> + __set_bit(i, cc->mthp_bitmap);
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> if (result == SCAN_SUCCEED) {
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
> - result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc, HPAGE_PMD_ORDER);
> - /* collapse_huge_page will return with the mmap_lock released */
> + nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> + unmapped, cc, enabled_orders);
I guess mthp_collapse() also does PMD collapse if only PMD is enabled?
It feels like this name is a bit confusing then :)
But I guess we can do a follow up to think of a better name possibly.
> + /* mmap_lock was released above, set lock_dropped */
> *lock_dropped = true;
> + result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>
Thanks, Lorenzo
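
For readers following the review, the traversal described in the mthp_collapse() comment can be modeled in user space. The sketch below is not the kernel code. It assumes a 512-entry PTE table (PMD order 9), represents the occupancy bitmap as a plain bool array, and treats every attempted collapse as successful. scaled_max_none() is a hypothetical stand-in for collapse_max_ptes_none(), whose body is not shown in this patch; scaling the PMD-level value down by the order difference is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration: 512 PTEs per PMD, minimum mTHP order 2. */
#define SIM_PMD_ORDER	9
#define SIM_MIN_ORDER	2
#define SIM_NR_PTES	(1u << SIM_PMD_ORDER)
/* Iterative DFS on a binary tree needs height + 1 stack slots. */
#define SIM_STACK_SIZE	(SIM_PMD_ORDER - SIM_MIN_ORDER + 1)

struct sim_range {
	uint16_t offset;	/* first PTE index of the candidate range */
	uint8_t order;		/* candidate collapse order */
};

/* Hypothetical stand-in for collapse_max_ptes_none(): scale by order. */
static unsigned int scaled_max_none(unsigned int max_ptes_none,
				    unsigned int order)
{
	return max_ptes_none >> (SIM_PMD_ORDER - order);
}

/* Count occupied PTEs in [offset, offset + nr). */
static unsigned int count_present(const bool *occupied, unsigned int offset,
				  unsigned int nr)
{
	unsigned int i, present = 0;

	for (i = offset; i < offset + nr; i++)
		present += occupied[i];
	return present;
}

/*
 * Returns how many PTEs end up covered by a simulated collapse.
 * enabled_orders must only contain orders in [SIM_MIN_ORDER, SIM_PMD_ORDER].
 */
static unsigned int sim_mthp_collapse(const bool *occupied,
				      unsigned long enabled_orders,
				      unsigned int max_ptes_none)
{
	struct sim_range stack[SIM_STACK_SIZE];
	unsigned int collapsed = 0;
	int top = 0;

	assert(max_ptes_none < SIM_NR_PTES);
	stack[top++] = (struct sim_range){ .offset = 0, .order = SIM_PMD_ORDER };

	while (top) {
		const struct sim_range r = stack[--top];
		const unsigned int nr = 1u << r.order;

		if (enabled_orders & (1ul << r.order)) {
			unsigned int present = count_present(occupied, r.offset, nr);

			/* Enough occupied PTEs: "collapse" and skip the subtree. */
			if (present >= nr - scaled_max_none(max_ptes_none, r.order)) {
				collapsed += nr;
				continue;
			}
		}

		/* Descend only if some lower order is still enabled. */
		if (enabled_orders & ((1ul << r.order) - 1)) {
			assert(top + 2 <= SIM_STACK_SIZE);
			/* Push the right half first so the left half is popped next. */
			stack[top++] = (struct sim_range){
				.offset = (uint16_t)(r.offset + nr / 2),
				.order = (uint8_t)(r.order - 1),
			};
			stack[top++] = (struct sim_range){
				.offset = r.offset,
				.order = (uint8_t)(r.order - 1),
			};
		}
	}
	return collapsed;
}
```

In this model, with only the first 16 PTEs occupied, max_ptes_none of 0, and orders 4 and 9 enabled, the PMD-order attempt fails and the walk descends until the order-4 range at offset 0 passes the threshold, so 16 PTEs are reported as collapsed. It also shows why the function name is slightly misleading: when only the PMD order is enabled, the same loop performs a plain PMD collapse attempt and never descends.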
* Re: [PATCH v7 3/3] locking: Add contended_release tracepoint to sleepable locks
From: Usama Arif @ 2026-06-04 14:45 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Usama Arif, Peter Zijlstra, Dennis Zhou, Tejun Heo,
Christoph Lameter, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, linux-mm, linux-kernel, linux-trace-kernel,
kernel-team, Paul E. McKenney
In-Reply-To: <02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com>
On Thu, 4 Jun 2026 07:15:07 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> Add the contended_release trace event. This tracepoint fires on the
> holder side when a contended lock is released, complementing the
> existing contention_begin/contention_end tracepoints which fire on the
> waiter side.
>
> This enables correlating lock hold time under contention with waiter
> events by lock address.
>
> Add trace_contended_release()/trace_call__contended_release() calls to
> the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
> rwsem, percpu-rwsem, and RT-specific rwbase locks.
>
> Where possible, trace_contended_release() fires before the lock is
> released and before the waiter is woken. For some lock types, the
> tracepoint fires after the release but before the wake. Making the
> placement consistent across all lock types is not worth the added
> complexity.
>
> For reader/writer locks, the tracepoint fires for every reader releasing
> while a writer is waiting, not only for the last reader.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 14:40 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lance Yang, Nico Pache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <46bb9d9e-03f0-4e26-9ac9-1cdc5ba9bf4d@kernel.org>
On Wed, Jun 03, 2026 at 10:05:08AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 17:44, Lance Yang wrote:
> >
> >
> > On 2026/6/2 18:58, Nico Pache wrote:
> >> On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
> >>>
> >>>
> >>> [...]
> >>>
> >>> Hmm ... don't we lose the allocation-failure result here?
> >>>
> >>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> >>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> >>> in khugepaged_do_scan().
> >>>
> >>> Now if allocation fails and nr_collapsed stays 0, we just return
> >>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
> >>
> >> Ok I did the error propagation! I think I handled both of these cases
> >> you brought up pretty easily.
> >
> > Thanks.
> >
> >> However I don't know what to do in the following case: We successfully
> >> collapsed some portion of the PMD, but during that process, we also
> >> hit an allocation failure. Is it best to back off entirely? or can we
> >> treat some forward progress as a sign we can continue trying collapses
> >> without sleeping.
> >>
> >> Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
> >> successful collapses as the returned value?
> >
> > > Thinking out loud, forward progress should win here; the allocation
> > > failure only matters if we made no progress at all?
>
> Agreed, in the first approach, forward progress makes sense.
Sounds sensible to me.
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
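
The policy agreed on above (forward progress wins, an allocation failure is only reported when nothing was collapsed) can be sketched as a small helper. This is an illustration only: pick_scan_result() and its saw_alloc_failure parameter are hypothetical names, and the enum below is a local stand-in that borrows three names from the kernel's enum scan_result without matching its values.

```c
#include <stdbool.h>

/* Local stand-in; the kernel's enum scan_result has more members. */
enum sim_scan_result {
	SIM_SCAN_FAIL,
	SIM_SCAN_SUCCEED,
	SIM_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/*
 * Hypothetical helper: decide what a PMD scan reports after trying
 * several mTHP collapses within one PMD range.
 */
static enum sim_scan_result pick_scan_result(unsigned int nr_collapsed,
					     bool saw_alloc_failure)
{
	/* Any forward progress counts as success, even with a failed allocation. */
	if (nr_collapsed)
		return SIM_SCAN_SUCCEED;
	/* No progress and an allocation failed: let the caller back off. */
	if (saw_alloc_failure)
		return SIM_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return SIM_SCAN_FAIL;
}
```

Under this ordering, khugepaged would only reach its allocation back-off path when a scan both failed to allocate and collapsed nothing, which matches the behaviour Lance asked about.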