Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-04-29 14:48 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <c82a0b73-67ff-45e6-a792-e610b35a5b2f@kernel.org>

On 4/27/26 1:52 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> The following cleanup reworks all the max_ptes_* handling into helper
>> functions. This increases the code readability and will later be used to
>> implement the mTHP handling of these variables.
>>
>> With these changes we abstract all the madvise_collapse() special casing
>> (dont respect the sysctls) away from the functions that utilize them. And
>> will later in this series to cleanly restrict mTHP collapses behaviors.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 78 insertions(+), 36 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index afac6bc4e76d..f42b55421191 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -348,6 +348,58 @@ static bool pte_none_or_zero(pte_t pte)
>>      return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>>   }
>>
>> +/**
>> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>
> empty PTE or PTE mapping the shared zeropage ? That should be clarified also below.

Ah fair point, "empty" isn't the best representation of a "none"/zeropage.

>
>> + * @cc: The collapse control struct
>> + * @vma: The vma to check for userfaultfd
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * empty page.
>
> Not completely accurate due to uffd. And it's not really "empty page".

Sorry I forgot to update this comment. I originally planned on skipping
the VMA passing, but then figured later that it would make the code even
more uniform (as you suggested)

>
> Is that information really necessary for the caller? I'd suggest you drop this
> here and instead add a comment inline above the "return HPAGE_PMD_NR;".

Yeah, I'm not really sure; I can shorten them. I was heeding to lorenzos
request to add these with docstring headers

>
>> + *
>> + * Return: Maximum number of empty PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>> +            struct vm_area_struct *vma)
>> +{
>> +    if (vma && userfaultfd_armed(vma))
>> +            return 0;
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_none;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>
> "shared PTE" is not quite clear.
>
> "PTEs that map shared anonymous pages" ?

That works for me, thank you

>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * shared page.
>
> Same comment as above.

ack

>
>> + *
>> + * Return: Maximum number of shared PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +{
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_shared;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>
> We're actually checking non-present page table entries (anonymous THP collapse)
> or non-present pagecache entries (file THP collapse).
>
> I wonder if there is an easy way to clarify that here, at least in the
> description (confusing name can stay unless we find something better).

I'll update the comment to include some form of this. In my mind the
name should probably stay relatively consistent to the sysctl value.

>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * swap page.
>
> Dito.

ack!

>
>> + *
>> + * Return: Maximum number of swap PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +{
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_swap;
>> +}
>> +
>>   int hugepage_madvise(struct vm_area_struct *vma,
>>                   vm_flags_t *vm_flags, int advice)
>>   {
>> @@ -546,21 +598,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>      pte_t *_pte;
>>      int none_or_zero = 0, shared = 0, referenced = 0;
>>      enum scan_result result = SCAN_FAIL;
>> +    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> These could be const, right? Or will that change in future patches?

Yes I believe these can be const now! Thank you

>
>>
>>      for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>           _pte++, addr += PAGE_SIZE) {
>>              pte_t pteval = ptep_get(_pte);
>>              if (pte_none_or_zero(pteval)) {
>> -                    ++none_or_zero;
>> -                    if (!userfaultfd_armed(vma) &&
>> -                        (!cc->is_khugepaged ||
>> -                         none_or_zero <= khugepaged_max_ptes_none)) {
>> -                            continue;
>> -                    } else {
>> +                    if (++none_or_zero > max_ptes_none) {
>>                              result = SCAN_EXCEED_NONE_PTE;
>>                              count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>                              goto out;
>>                      }
>> +                    continue;
>>              }
>>              if (!pte_present(pteval)) {
>>                      result = SCAN_PTE_NON_PRESENT;
>> @@ -591,9 +641,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>
>>              /* See collapse_scan_pmd(). */
>>              if (folio_maybe_mapped_shared(folio)) {
>> -                    ++shared;
>> -                    if (cc->is_khugepaged &&
>> -                        shared > khugepaged_max_ptes_shared) {
>> +                    if (++shared > max_ptes_shared) {
>>                              result = SCAN_EXCEED_SHARED_PTE;
>>                              count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>                              goto out;
>> @@ -1270,6 +1318,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>      unsigned long addr;
>>      spinlock_t *ptl;
>>      int node = NUMA_NO_NODE, unmapped = 0;
>> +    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>
> Same question here.

ack! will adjust.

>
>>
>>      VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>
>
>
> In general, LGTM. With the doc fixed up
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thank you Ill get those updated.
>


^ permalink raw reply

* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: Steven Rostedt @ 2026-04-29 14:52 UTC (permalink / raw)
  To: SUVONOV BUNYOD
  Cc: akpm, vbabka, linux-mm, mhiramat, mathieu desnoyers,
	linux-trace-kernel, linux-kernel, surenb, mhocko, jackmanb,
	hannes, ziy, david, vishal moola, corbet, skhan, linux-doc
In-Reply-To: <1453063691.2584758.1777433513691.JavaMail.zimbra@sjtu.edu.cn>

On Wed, 29 Apr 2026 11:31:53 +0800 (CST)
SUVONOV BUNYOD <b.suvonov@sjtu.edu.cn> wrote:

> Thanks for reviewing Steven,
> 
> >Why this change? It makes it much harder to understand.
> >
> >The above is not a normal macro. Ignore any checkpatch warnings about it.
> >The proper way to do the TP_STRUCT__entry() is to make it just like a struct:
> >
> >struct {
> >	unsigned long		pfn;
> >	unsigned int		order;
> >	int			migratetype;
> >};
> >
> >Thus, the macro should be:
> >
> >	TP_STRUCT__entry(
> >		__field(	unsigned long,	pfn		)
> >		__field(	unsigned int,	order		)
> >		__field(	int,		migratetype	)
> >		),  
> 
> 
> Yeah sorry for the formatting issue, will fix in v3. Any other concerns?
> What do you think about the introduction of those tracepoints themselves?
>

It's a basic tracepoint and nothing unusual about it. I only watch over how
tracepoints are created and some use cases and make sure they are done
properly. But the introduction of tracepoints in other subsystems are up to
the maintainers of those subsystems. They are the ones that know what is
useful or not.

In other words, it's up to the MM subsystem maintainers to decide.

-- Steve
 

^ permalink raw reply

* Re: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-29 15:05 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260420135554.27067-1-usama.arif@linux.dev>

On 4/20/26 7:55 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:41 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>> functions to support future mTHP collapse.
>>
>> The current mechanism for determining collapse with the
>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>> raises a key design issue: if we support user defined max_pte_none values
>> (even those scaled by order), a collapse of a lower order can introduces
>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
>> than HPAGE_PMD_NR / 2.
>>
>> With this configuration, a successful collapse to order N will populate
>> enough pages to satisfy the collapse condition on order N+1 on the next
>> scan. This leads to unnecessary work and memory churn.
>>
>> To fix this issue introduce a helper function that will limit mTHP
>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>> This effectively supports two modes:
>>
>> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>    available mTHP order.
>>
>> This removes the possiblilty of "creep", while not modifying any uAPI
>> expectations. A warning will be emitted if any non-supported
>> max_ptes_none value is configured with mTHP enabled.
>>
>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>> shared or swapped entry.
>>
>> No functional changes in this patch; however it defines future behavior
>> for mTHP collapse.
>>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 88 insertions(+), 36 deletions(-)
>>
>
> Small nits. Most might not need change.

No you brought some good points :) Thanks for your reviews!

>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index f42b55421191..283bb63854a5 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
>>    * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>>    * @cc: The collapse control struct
>>    * @vma: The vma to check for userfaultfd
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> - * empty page.
>> + * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
>> + * configured khugepaged_max_ptes_none value.
>> + *
>> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
>> + * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
>> + * no mTHP collapse will be attempted
>>    *
>>    * Return: Maximum number of empty PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>> -            struct vm_area_struct *vma)
>> +static int collapse_max_ptes_none(struct collapse_control *cc,
>> +            struct vm_area_struct *vma, unsigned int order)
>>   {
>>      if (vma && userfaultfd_armed(vma))
>>              return 0;
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> -    return khugepaged_max_ptes_none;
>> +    if (is_pmd_order(order))
>> +            return khugepaged_max_ptes_none;
>> +    /* Zero/non-present collapse disabled. */
>> +    if (!khugepaged_max_ptes_none)
>> +            return 0;
>> +    if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>> +            return (1 << order) - 1;
>> +
>
> There are 2 reads of khugepaged_max_ptes_none here.
> A concurrent sysctl write between reads can yield "0 then non-zero" or "LIMIT
> then mid-value".
>
> Would be good to just snapshot once at the start of the function and use that
> value?

Yeah good point, that would avoid any really hard to reproduce, but
probably very unlikely to occur bugs.

>
>> +    pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>> +                  KHUGEPAGED_MAX_PTES_LIMIT);
>
> IMO, warn_once can get lost quickly in dmesg. Maybe pr_warn_ratelimited?
>
> Not sure what others opinions are..

I see David already reply'd to this. I guess we keep the warn once or
hard limit to 0. My fear with the latter is that would then violate the
whole concern (and the only reason we have 0/511) support in the first
place. If we could violate this uAPI expectation then I would then want
to reintroduce hardcapping max_ptes_none to HPAGE_PMD_NR/2 if its above
this value.

So in eyes, lets just keep this as is for now.

>
>> +    return -EINVAL;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * shared page.
>>    *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * shared memory.
>> + *
>>    * Return: Maximum number of shared PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_shared;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * swap page.
>>    *
>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>> + * khugepaged_max_ptes_swap value.
>> + *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * swapped out memory.
>> + *
>>    * Return: Maximum number of swap PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_swap;
>>   }
>>
>> @@ -590,18 +625,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>
>>   static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>              unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>> -            struct list_head *compound_pagelist)
>> +            unsigned int order, struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      struct page *page = NULL;
>>      struct folio *folio = NULL;
>>      unsigned long addr = start_addr;
>>      pte_t *_pte;
>>      int none_or_zero = 0, shared = 0, referenced = 0;
>>      enum scan_result result = SCAN_FAIL;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> -    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
>> +
>> +    if (max_ptes_none < 0)
>> +            return result;
>
> Would a dedicated SCAN_INVALID_PTES_NONE make more sense here instead
> of SCAN_FAIL?

Yeah thats a good idea, let me see if i can make that work.

>
>>
>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +    for (_pte = pte; _pte < pte + nr_pages;
>>           _pte++, addr += PAGE_SIZE) {
>>              pte_t pteval = ptep_get(_pte);
>>              if (pte_none_or_zero(pteval)) {
>> @@ -734,18 +773,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>   }
>>
>>   static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>> -                                            struct vm_area_struct *vma,
>> -                                            unsigned long address,
>> -                                            spinlock_t *ptl,
>> -                                            struct list_head *compound_pagelist)
>> +            struct vm_area_struct *vma, unsigned long address,
>> +            spinlock_t *ptl, unsigned int order,
>> +            struct list_head *compound_pagelist)
>>   {
>> -    unsigned long end = address + HPAGE_PMD_SIZE;
>> +    const unsigned long nr_pages = 1UL << order;
>> +    unsigned long end = address + (PAGE_SIZE << order);
>>      struct folio *src, *tmp;
>>      pte_t pteval;
>>      pte_t *_pte;
>>      unsigned int nr_ptes;
>>
>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>> +    for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>>           address += nr_ptes * PAGE_SIZE) {
>>              nr_ptes = 1;
>>              pteval = ptep_get(_pte);
>> @@ -798,13 +837,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   }
>>
>>   static void __collapse_huge_page_copy_failed(pte_t *pte,
>> -                                         pmd_t *pmd,
>> -                                         pmd_t orig_pmd,
>> -                                         struct vm_area_struct *vma,
>> -                                         struct list_head *compound_pagelist)
>> +            pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> +            unsigned int order, struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      spinlock_t *pmd_ptl;
>> -
>
> Shouldn't remove the newline above?

Ack thank you

>
>>      /*
>>       * Re-establish the PMD to point to the original page table
>>       * entry. Restoring PMD needs to be done prior to releasing
>> @@ -818,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>       * Release both raw and compound pages isolated
>>       * in __collapse_huge_page_isolate.
>>       */
>> -    release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>> +    release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>>   }
>>
>>   /*
>> @@ -838,16 +875,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>    */
>>   static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>              pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> -            unsigned long address, spinlock_t *ptl,
>> +            unsigned long address, spinlock_t *ptl, unsigned int order,
>>              struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      unsigned int i;
>>      enum scan_result result = SCAN_SUCCEED;
>> -
>
> Same here?

its probably from me reordering nr_pages to the top. ack! thanks

>
>>      /*
>>       * Copying pages' contents is subject to memory poison at any iteration.
>>       */
>> -    for (i = 0; i < HPAGE_PMD_NR; i++) {
>> +    for (i = 0; i < nr_pages; i++) {
>>              pte_t pteval = ptep_get(pte + i);
>>              struct page *page = folio_page(folio, i);
>>              unsigned long src_addr = address + i * PAGE_SIZE;
>> @@ -866,10 +903,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>>
>>      if (likely(result == SCAN_SUCCEED))
>>              __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>> -                                                compound_pagelist);
>> +                                                order, compound_pagelist);
>>      else
>>              __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>> -                                             compound_pagelist);
>> +                                             order, compound_pagelist);
>>
>>      return result;
>>   }
>> @@ -1040,12 +1077,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>>    * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>>    */
>>   static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>> -            struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>> -            int referenced)
>> +            struct vm_area_struct *vma, unsigned long start_addr,
>> +            pmd_t *pmd, int referenced, unsigned int order)
>
> Will probably find out in later reviews, but there is tracepoint in __collapse_huge_page_swapin.
> Would be good to add order in that tracepoint if you are adding order here?

Yep! There is a patch that updates the tracepoints.

>
>>   {
>>      int swapped_in = 0;
>>      vm_fault_t ret = 0;
>> -    unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>> +    unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>>      enum scan_result result;
>>      pte_t *pte = NULL;
>>      spinlock_t *ptl;
>> @@ -1077,6 +1114,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>                  pte_present(vmf.orig_pte))
>>                      continue;
>>
>> +            /*
>> +             * TODO: Support swapin without leading to further mTHP
>> +             * collapses. Currently bringing in new pages via swapin may
>> +             * cause a future higher order collapse on a rescan of the same
>> +             * range.
>> +             */
>> +            if (!is_pmd_order(order)) {
>
> Would it be good to introduce this in the patch that activates it? No strong
> preference btw. Just that its dead code in this patch itself.

No we are trying to get everything ready before the patch(es) that
actually activates this feature. Everything related to order is
currently dead code at this moment until the later commits.

Cheers,
-- Nico


>
>> +                    pte_unmap(pte);
>> +                    mmap_read_unlock(mm);
>> +                    result = SCAN_EXCEED_SWAP_PTE;
>> +                    goto out;
>> +            }
>> +
>>              vmf.pte = pte;
>>              vmf.ptl = ptl;
>>              ret = do_swap_page(&vmf);
>> @@ -1196,7 +1246,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               * that case.  Continuing to collapse causes inconsistency.
>>               */
>>              result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -                                                 referenced);
>> +                                                 referenced, HPAGE_PMD_ORDER);
>>              if (result != SCAN_SUCCEED)
>>                      goto out_nolock;
>>      }
>> @@ -1244,6 +1294,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>      pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>      if (pte) {
>>              result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> +                                                  HPAGE_PMD_ORDER,
>>                                                    &compound_pagelist);
>>              spin_unlock(pte_ptl);
>>      } else {
>> @@ -1274,6 +1325,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>
>>      result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>                                         vma, address, pte_ptl,
>> +                                       HPAGE_PMD_ORDER,
>>                                         &compound_pagelist);
>>      pte_unmap(pte);
>>      if (unlikely(result != SCAN_SUCCEED))
>> @@ -1318,9 +1370,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>      unsigned long addr;
>>      spinlock_t *ptl;
>>      int node = NUMA_NO_NODE, unmapped = 0;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> -    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> -    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>>
>>      VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>
>> @@ -2371,8 +2423,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>>      int present, swap;
>>      int node = NUMA_NO_NODE;
>>      enum scan_result result = SCAN_SUCCEED;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
>> -    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>>
>>      present = 0;
>>      swap = 0;
>> --
>> 2.53.0
>>
>>
>


^ permalink raw reply

* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-04-29 15:06 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

On Tue, Apr 28, 2026, Ackerley Tng wrote:
> This is RFC v5 of guest_memfd in-place conversion support.

...

> TODOs
> 
> + Perhaps further clarify PRESERVE flag: [8]
> + Resolve issue where guest_memfd_conversions_test, which uses the
>   kselftest framework, doesn't perform teardown on assertion
>   failure. Please see proposal at [9]
> + Test with TDX selftests. We're in the process of rebasing TDX selftests
>   on this series and will post updates when that's tested.

Why exactly is this still RFC?  The TODOs here don't strike me as things that
would make this RFC.  Blockers for merge, yes/maybe/probably, but at a glance,
it feels like we've moved beyond RFC for the code itself.

^ permalink raw reply

* Re: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-29 15:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <e9dbe863-d125-4fe5-8ecc-91ad7293e5cf@kernel.org>

On 4/27/26 2:07 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>> functions to support future mTHP collapse.
>>
>> The current mechanism for determining collapse with the
>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>> raises a key design issue: if we support user defined max_pte_none values
>> (even those scaled by order), a collapse of a lower order can introduces
>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
>> than HPAGE_PMD_NR / 2.
>>
>> With this configuration, a successful collapse to order N will populate
>> enough pages to satisfy the collapse condition on order N+1 on the next
>> scan. This leads to unnecessary work and memory churn.
>
> You could add a link here to previous discussions.

Ack, not a bad idea for historical reasons.

>
>>
>> To fix this issue introduce a helper function that will limit mTHP
>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>> This effectively supports two modes:
>>
>> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
>
> "introduce" reads wrong in this context. And I don't know what a "none-page" is :)

Thats a page that is none duh ;P Ill clean that up

>
> "never collapses if it encounters an empty PTE or a PTE that maps the shared
> zeropage. Consequently, no memory bloat."

Thanks that does read way better!

>
>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>    available mTHP order.
>>
>> This removes the possiblilty of "creep", while not modifying any uAPI
>> expectations. A warning will be emitted if any non-supported
>> max_ptes_none value is configured with mTHP enabled.
>>
>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>> shared or swapped entry.
>>
>> No functional changes in this patch; however it defines future behavior
>> for mTHP collapse.
>>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 88 insertions(+), 36 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index f42b55421191..283bb63854a5 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
>>    * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>>    * @cc: The collapse control struct
>>    * @vma: The vma to check for userfaultfd
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> - * empty page.
>> + * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
>> + * configured khugepaged_max_ptes_none value.
>> + *
>> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
>> + * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
>> + * no mTHP collapse will be attempted
>
> Not sure if we discussed it (and maybe I had a different opinion back then ...),
> but could we simply to fallback to max_ptes_none=0, so we can avoid returning
> errors here?>
> max_ptes_none=0 is ok, because we will not waste any memory. The warning clearly
> tells the user that this combination is not supported as is.
>
> ... and it would make this function a lot easier to handle. In the warning, we
> can just state that "falling back to ... "max_ptes_non = 0".

We'd then be "violating" the uAPI expetation and is the whole reason we
have this 0/511 behavior in the first place ;/

If it wasnt for that issue I would have a completely different design.

>
>
> [...]
>
>>
>>   /**
>>    * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * shared page.
>>    *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * shared memory.
>
> "do not"
>
> "shared memory" is misleading, as we do support shmem. What you mean is maybe
> "collapsing with anonymous memory pages that are shared between processes
> through CoW" or soemthing like that?

Ok ill clear that up thank you!

>
>> + *
>>    * Return: Maximum number of shared PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_shared;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * swap page.
>>    *
>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>> + * khugepaged_max_ptes_swap value.
>> + *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * swapped out memory.
>
> "do not". Given that this is also used for the pagecache, can we make this clearer?

Yeah! I originally didnt plan on using these helpers for the file
collapse operation but figured hell, why not clean them both up, might
end up helping (hopefully not hurting) when Baolin goes to add mTHP
shmem support.

Cheers,
-- Nico


>
>> + *
>>    * Return: Maximum number of swap PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>> +            unsigned int order)


^ permalink raw reply

* Re: [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu @ 2026-04-29 15:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428155508.4f47279e@gandalf.local.home>

On Tue, 28 Apr 2026 15:55:08 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:27 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > @@ -5648,11 +5668,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> >   again:
> >  	/*
> >  	 * This should normally only loop twice. But because the
> > -	 * start of the reader inserts an empty page, it causes
> > -	 * a case where we will loop three times. There should be no
> > -	 * reason to loop four times (that I know of).
> > +	 * start of the reader inserts an empty page, it causes a
> > +	 * case where we will loop three times. There should be no
> > +	 * reason to loop four times unless the ring buffer is a
> > +	 * recovered persistent ring buffer.
> 
> Can you explain more to why this is allowed for persistent ring buffer?

Ah, that was introduced in v15.

  Changes in v15:
  - Skip reader_page loop check on persistent ring buffer because
    there can be contiguous empty(invalidated) pages. 


So, finding next reader_page, we need to skip empty pages, which is normally
not contiguous. However, if we see more than 3 invalid pages on recovering
persistent ring buffer, it will be reset and become empty.


> 
> Note, I do not like any loops that can go into an infinite loop and lock up
> the machine. If something goes wrong with a persistent ring buffer, then
> this could possibly go into an infinite loop.

Yeah, so I think we should not use goto here. OK, let me update it to
an actual loop.

> 
> I want to understand why this is allowed, and possibly add a check that
> prevents this from never ending.

It should not be a never ending loop (there are other exit conditions),
but I agreed. What about limiting with nr_subbufs?

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5326924615a4..aa89ec96e964 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5630,7 +5630,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 * a case where we will loop three times. There should be no
 	 * reason to loop four times (that I know of).
 	 */
-	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+	if (RB_WARN_ON(cpu_buffer,
+		++nr_loops > (cpu_buffer->ring_meta ? cpu_buffer->nr_subbufs : 3))) {
 		reader = NULL;
 		goto out;
 	}

> 
> -- Steve
> 
> 
> >  	 */
> > -	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
> > +	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
> >  		reader = NULL;
> >  		goto out;
> >  	}
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply related

* Re: [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu @ 2026-04-29 15:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428162146.78e52988@gandalf.local.home>

On Tue, 28 Apr 2026 16:21:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:35 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > @@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
> >  	 * subbuf_size is considered invalid.
> >  	 */
> >  	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
> > -	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > -		return -1;
> > -	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > +	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > +		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > +
> 
> This code seriously needs comments.

OK, I'll add it, or let code explain clearer?

	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
	else
		ret = -1;

Thanks,

> 
> -- Steve
> 
> > +	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
> > +		local_set(&bpage->entries, 0);
> > +		local_set(&bpage->page->commit, 0);
> > +		bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
> > +		ret = -1;
> > +	} else {
> > +		local_set(&bpage->entries, ret);
> > +	}
> > +
> > +	return ret;
> >  }
> >  


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Masami Hiramatsu @ 2026-04-29 15:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428162457.1ca8c4b6@gandalf.local.home>

On Tue, 28 Apr 2026 16:24:57 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:59 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Cleanup rb_meta_validate_events() function to make it easier to read.
> > This includes the following cleanups:
> >  - Introduce rb_validatation_state to hold working variables in
> >    validation.
> >  - Move repleated validation state updates into rb_validate_buffer().
> >  - Move reader_page injection code outside of rb_meta_validate_events().
> > 
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  kernel/trace/ring_buffer.c |  186 ++++++++++++++++++++++----------------------
> >  1 file changed, 95 insertions(+), 91 deletions(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index de653a8e3cec..9850a0d8d24b 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1883,8 +1883,16 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
> >  	return events;
> >  }
> >  
> > -static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
> > -			      struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
> > +struct rb_validation_state {
> > +	unsigned long entries;
> > +	unsigned long entry_bytes;
> > +	int discarded;
> > +	u64 ts;
> > +};
> > +
> > +static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
> > +				struct ring_buffer_cpu_meta *meta,
> > +				u64 prev_ts, u64 next_ts)
> >  {
> 
> This can still use those comments (from patch 4).
> 
> Also, could you rebase on top of v7.1-rc1?

Is it v7.1-rc1, not ring-buffer/for-next?

Thanks,

> 
> Thanks Masami!
> 
> -- Steve
> 
> >  	struct buffer_data_page *dpage = bpage->page;
> >  	unsigned long long ts;
> > @@ -1914,16 +1922,82 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
> >  	return ret;
> >  }


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Oleg Nesterov @ 2026-04-29 15:24 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Darko Tominac, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, xe-linux-external, danielwa,
	linux-kernel, linux-trace-kernel, linux-perf-users, linux-mm
In-Reply-To: <889e4186-9f75-45d3-bd9a-1c90b218a635@kernel.org>

On 04/29, David Hildenbrand (Arm) wrote:
>
> On 4/29/26 15:15, Darko Tominac wrote:
>
> > uprobe infrastructure does not re-instrument pages on individual page
> > faults (uprobe_mmap() is only called during VMA creation, not on
> > page-in), the breakpoints are silently lost once the discarded pages are
> > re-read from the backing file.  The probes stop firing with no error
> > indication, and the only recovery is to unregister and re-register the
> > affected uprobes.
>
> Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
> MADV_DONTNEED debugger breakpoints/set data etc. :)

Agreed, thanks.

Oleg.


^ permalink raw reply

* Re: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-04-29 15:34 UTC (permalink / raw)
  To: Usama Arif, Lorenzo Stoakes (Oracle), David Hildenbrand (Red Hat)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260420142057.392263-1-usama.arif@linux.dev>

On 4/20/26 8:20 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:42 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
>>   1 file changed, 57 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 283bb63854a5..ff6f9f1883ed 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>      return SCAN_SUCCEED;
>>   }
>>
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -            int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +            int referenced, int unmapped, struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      LIST_HEAD(compound_pagelist);
>>      pmd_t *pmd, _pmd;
>> -    pte_t *pte;
>> +    pte_t *pte = NULL;
>>      pgtable_t pgtable;
>>      struct folio *folio;
>>      spinlock_t *pmd_ptl, *pte_ptl;
>>      enum scan_result result = SCAN_FAIL;
>>      struct vm_area_struct *vma;
>>      struct mmu_notifier_range range;
>> +    bool anon_vma_locked = false;
>> +    const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +    const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>>
>> -    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -    /*
>> -     * Before allocating the hugepage, release the mmap_lock read lock.
>> -     * The allocation can take potentially a long time if it involves
>> -     * sync compaction, and we do not need to hold the mmap_lock during
>> -     * that. We will recheck the vma after taking it again in write mode.
>> -     */
>> -    mmap_read_unlock(mm);
>> -
>
> My understanding now is that the caller will need to drop the mmap_read lock?
>
> This needs explicit documentation at the start of the function IMO. I think
> there are callers in later patches and its very easy to miss this.

Yeah as David suggested I will probably seperate this out into its own
patch. And add some proper comments to describe the requirement. I didnt
even realize I had forgot to update the patch to describe this change
(apart from the coverletter). Thank you :)

>
>
>> -    result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +    result = alloc_charge_folio(&folio, mm, cc, order);
>>      if (result != SCAN_SUCCEED)
>>              goto out_nolock;
>>
>>      mmap_read_lock(mm);
>> -    result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -                                     HPAGE_PMD_ORDER);
>> +    result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +                                     &vma, cc, order);
>>      if (result != SCAN_SUCCEED) {
>>              mmap_read_unlock(mm);
>>              goto out_nolock;
>>      }
>>
>> -    result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +    result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>>      if (result != SCAN_SUCCEED) {
>>              mmap_read_unlock(mm);
>>              goto out_nolock;
>> @@ -1245,8 +1239,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               * released when it fails. So we jump out_nolock directly in
>>               * that case.  Continuing to collapse causes inconsistency.
>>               */
>> -            result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -                                                 referenced, HPAGE_PMD_ORDER);
>> +            result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +                                                 referenced, order);
>>              if (result != SCAN_SUCCEED)
>>                      goto out_nolock;
>>      }
>> @@ -1261,20 +1255,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * mmap_lock.
>>       */
>>      mmap_write_lock(mm);
>> -    result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -                                     HPAGE_PMD_ORDER);
>> +    result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +                                     &vma, cc, order);
>>      if (result != SCAN_SUCCEED)
>>              goto out_up_write;
>>      /* check if the pmd is still valid */
>>      vma_start_write(vma);
>> -    result = check_pmd_still_valid(mm, address, pmd);
>> +    result = check_pmd_still_valid(mm, pmd_addr, pmd);
>>      if (result != SCAN_SUCCEED)
>>              goto out_up_write;
>>
>>      anon_vma_lock_write(vma->anon_vma);
>> +    anon_vma_locked = true;
>>
>> -    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -                            address + HPAGE_PMD_SIZE);
>> +    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +                            end_addr);
>>      mmu_notifier_invalidate_range_start(&range);
>>
>>      pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1286,26 +1281,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * Parallel GUP-fast is fine since GUP-fast will back off when
>>       * it detects PMD is changed.
>>       */
>> -    _pmd = pmdp_collapse_flush(vma, address, pmd);
>> +    _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>
>
> For an mTHP collapse covering, say, 64KiB of a 2MiB PMD, the patch still
> flushes the entire PMD via pmdp_collapse_flush and tlb_remove_table_sync_one.
> That triggers cross-CPU TLB shootdowns for ~1.94MiB of unrelated mappings on
> every successful sub-PMD collapse. Probably acceptable as a first cut?

Hmm interesting consideration, I believe this is neccisary from a GUP
perspective as the comment suggests. Me, Dev, and David had discussed
that once that this lands we wanted to play around with a lot of the
locking and previously defined "requirements" to see what is actually
needed. So yeah I think for now we leave it as is. Hopefully we can make
more improvements in the future.

>
>
>>      spin_unlock(pmd_ptl);
>>      mmu_notifier_invalidate_range_end(&range);
>>      tlb_remove_table_sync_one();
>>
>> -    pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +    pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>>      if (pte) {
>> -            result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -                                                  HPAGE_PMD_ORDER,
>> -                                                  &compound_pagelist);
>> +            result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +                                                  order, &compound_pagelist);
>>              spin_unlock(pte_ptl);
>>      } else {
>>              result = SCAN_NO_PTE_TABLE;
>>      }
>>
>>      if (unlikely(result != SCAN_SUCCEED)) {
>> -            if (pte)
>> -                    pte_unmap(pte);
>>              spin_lock(pmd_ptl);
>> -            BUG_ON(!pmd_none(*pmd));
>> +            WARN_ON_ONCE(!pmd_none(*pmd));
>
> Why was this turned into WARN_ON_ONCE? Would be good to add to commit message
> what the reason is if it has been discussed earlier.
>
> The next line writes a PMD entry over an existing one — that leaks the
> previous page table or PMD-mapped folio and can corrupt VA mappings. BUG_ON
> failed loudly and safely; WARN_ON_ONCE continues into the corruption and falls
> silent after the first hit.

Discussed here:
https://lore.kernel.org/lkml/c781ec19-724b-4c00-9ad6-56b9733cefa0@lucifer.local/

But yes im more so on your side that id rather crash the system then
warn if we are potentially corrupting things in memory.

@lorenzo and @david can we come to a definitive consensus on this?

>
>
>>              /*
>>               * We can only use set_pmd_at when establishing
>>               * hugepmds and never for establishing regular pmds that
>> @@ -1313,21 +1305,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               */
>>              pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>              spin_unlock(pmd_ptl);
>> -            anon_vma_unlock_write(vma->anon_vma);
>>              goto out_up_write;
>>      }
>>
>>      /*
>> -     * All pages are isolated and locked so anon_vma rmap
>> -     * can't run anymore.
>> +     * For PMD collapse all pages are isolated and locked so anon_vma
>> +     * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +     * removed and not all pages are isolated and locked, so we must hold
>> +     * the lock to prevent neighboring folios from attempting to access
>> +     * this PMD until its reinstalled.
>>       */
>> -    anon_vma_unlock_write(vma->anon_vma);
>> +    if (is_pmd_order(order)) {
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +            anon_vma_locked = false;
>> +    }
>>
>>      result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -                                       vma, address, pte_ptl,
>> -                                       HPAGE_PMD_ORDER,
>> -                                       &compound_pagelist);
>> -    pte_unmap(pte);
>> +                                       vma, start_addr, pte_ptl,
>> +                                       order, &compound_pagelist);
>>      if (unlikely(result != SCAN_SUCCEED))
>>              goto out_up_write;
>>
>> @@ -1337,18 +1332,27 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>       * write.
>>       */
>>      __folio_mark_uptodate(folio);
>> -    pgtable = pmd_pgtable(_pmd);
>> -
>>      spin_lock(pmd_ptl);
>> -    BUG_ON(!pmd_none(*pmd));
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -    map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +    WARN_ON_ONCE(!pmd_none(*pmd));
>> +    if (is_pmd_order(order)) { /* PMD collapse */
>> +            pgtable = pmd_pgtable(_pmd);
>> +            pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +            map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +    } else { /* mTHP collapse */
>> +            map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>
> map_anon_folio_pte_nopf calls set_ptes and modifies pagetable, while holding
> pmd_ptl only. It should be safe as we expect pmd_none. But I think you should
> put a comment about this?

Yeah doesnt hurt. Ill try to clarify that.


>
>
>> +            smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> +            pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> +    }
>>      spin_unlock(pmd_ptl);
>>
>>      folio = NULL;
>>
>>      result = SCAN_SUCCEED;
>>   out_up_write:
>> +    if (anon_vma_locked)
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +    if (pte)
>> +            pte_unmap(pte);
>>      mmap_write_unlock(mm);
>>   out_nolock:
>>      if (folio)
>> @@ -1525,8 +1529,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>   out_unmap:
>>      pte_unmap_unlock(pte, ptl);
>>      if (result == SCAN_SUCCEED) {
>> +            /*
>> +             * Before allocating the hugepage, release the mmap_lock read lock.
>> +             * The allocation can take potentially a long time if it involves
>> +             * sync compaction, and we do not need to hold the mmap_lock during
>> +             * that. We will recheck the vma after taking it again in write mode.
>> +             */
>> +            mmap_read_unlock(mm);
>>              result = collapse_huge_page(mm, start_addr, referenced,
>> -                                        unmapped, cc);
>> +                                        unmapped, cc, HPAGE_PMD_ORDER);
>>              /* collapse_huge_page will return with the mmap_lock released */
>>              *lock_dropped = true;
>>      }
>> --
>> 2.53.0
>>
>>
>


^ permalink raw reply

* Re: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-04-29 15:34 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <6025c087-378b-4754-b1f2-3c5448b49d6c@kernel.org>

On 4/27/26 2:13 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
>>   1 file changed, 57 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 283bb63854a5..ff6f9f1883ed 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>      return SCAN_SUCCEED;
>>   }
>>
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -            int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +            int referenced, int unmapped, struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      LIST_HEAD(compound_pagelist);
>>      pmd_t *pmd, _pmd;
>> -    pte_t *pte;
>> +    pte_t *pte = NULL;
>>      pgtable_t pgtable;
>>      struct folio *folio;
>>      spinlock_t *pmd_ptl, *pte_ptl;
>>      enum scan_result result = SCAN_FAIL;
>>      struct vm_area_struct *vma;
>>      struct mmu_notifier_range range;
>> +    bool anon_vma_locked = false;
>> +    const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +    const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>
> In general, const read better when they are at the very top of this list.

Ack! still getting in the habit of this, sorry.

>
>>
>> -    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -    /*
>> -     * Before allocating the hugepage, release the mmap_lock read lock.
>> -     * The allocation can take potentially a long time if it involves
>> -     * sync compaction, and we do not need to hold the mmap_lock during
>> -     * that. We will recheck the vma after taking it again in write mode.
>> -     */
>> -    mmap_read_unlock(mm);
>> -
>
> You should spell out that locking change (moving it to the caller), and why it
> is required, in the patch description.
>
> I'd even have put this into a separate patch, as it's independent to the
> order-passing changes.

Yeah good point ill separate this out into its own patch, I also
realized I never updated the commit message for this patch to properly
describe this change. Whoops.


> [...]
>
>>       */
>>      __folio_mark_uptodate(folio);
>> -    pgtable = pmd_pgtable(_pmd);
>> -
>>      spin_lock(pmd_ptl);
>> -    BUG_ON(!pmd_none(*pmd));
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -    map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +    WARN_ON_ONCE(!pmd_none(*pmd));
>> +    if (is_pmd_order(order)) { /* PMD collapse */
>> +            pgtable = pmd_pgtable(_pmd);
>> +            pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +            map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +    } else { /* mTHP collapse */
>
> Do both these comments (PMD collapse ...) really add any value? I'd say it's
> pretty self-documenting code already.

Yeah with the new is_pmd_order helper its def a little overkill. I'll
remove them. Thanks!


>
>> +            map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> +            smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>> +            pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> +    }
>>      spin_unlock(pmd_ptl);
>>
>>      folio = NULL;
>>
>>      result = SCAN_SUCCEED;
>>   out_up_write:
>> +    if (anon_vma_locked)
>> +            anon_vma_unlock_write(vma->anon_vma);
>> +    if (pte)
>> +            pte_unmap(pte);
>


^ permalink raw reply

* [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 15:43 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
	Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
	Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
	H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
	Heiko Carstens, Vasily Gorbik

[-- Attachment #1: Type: text/plain, Size: 15652 bytes --]

From: Steven Rostedt <rostedt@goodmis.org>

[
   This is an RFC that adds a system call for dynamic linkers to use to
   tell the kernel where the sframe sections are when it loads dynamic
   libraries.

   It is built on top of Jens's sframe implementation for v3:

      https://lore.kernel.org/linux-trace-kernel/20260127150554.2760964-1-jremus@linux.ibm.com/

   I have a repo with that code that this applies on top of here:

      git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git sframe/core
       

   The name of the system call is "stacktrace_setup", but I'm not attached
   to this name. If anyone can think of a better name I'm happy to take
   suggestions.

   This patch is just to get the conversation going and the final result
   may be much different. I tested this with the attached program which is a
   major hack. I built glibc with sframe v3 support and I used readelf to
   find the sframe size and location of glibc.

   readelf -e /work/usr/lib/libc.so.6 | grep sframe
     [19] .sframe           GNU_SFRAME       00000000001d3fc0  001d3fc0

   Then I wrote a program that takes the above location and size of the
   .sframe section in libc as parameters, scans /proc/self/maps to find
   where it loaded libc and then calls this new system call with a pointer
   to the location of the sframe along with its size, as well as where the
   libc text is located.

   It then spins for 2 seconds, calls the system call again to remove the
   sframe section it loaded, and spins for another 2 seconds.

   I ran perf record --call-graph fp,defer on the program and looked for
   the do_spin() function.

   With sframe loaded:

sframe-test    1350  1396.333593:     202366 cpu/cycles/P: 
            7fdf0ec38a44 [unknown] ([vdso])
            5621a6b97243 get_time+0x19 (/work/c/sframe-test)
            5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
            5621a6b975cd main+0xd4 (/work/c/sframe-test)
            7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)
            7fdf0ea26d05 __libc_start_main@@GLIBC_2.34+0x85 (/work/usr/lib/libc.so.6)
            5621a6b97131 _start+0x21 (/work/c/sframe-test)

   After it unloads the sframe:

sframe-test    1350  1400.332902:     657582 cpu/cycles/P: 
            7fdf0ec38a5e [unknown] ([vdso])
            5621a6b97243 get_time+0x19 (/work/c/sframe-test)
            5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
            5621a6b97602 main+0x109 (/work/c/sframe-test)
            7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)

   As you can see, with the sframe loaded, it was able to walk further up
   the libc library.

   Again, this is just an RFC, but I would like to get agreement on the
   system call so that we can then update the dynamic linker to do this
   instead of using my hack ;-)
]

Add a system call that can be used by dynamic linkers to tell the kernel
where the sframe section is in memory for libraries it loads.

The system call stacktrace_setup takes 5 parameters:

  op - the type of operation to perform
  addr_start - The virtual address of the sframe section
  addr_length - The length of the sframe section
  text_start - the text section the sframe represents
  test_length - the length of the section

The current op values are:

  STACKTRACE_REGISTER_SFRAME - This registers the sframe
  STACKTRACE_UNREGISTER_SFRAME - This removes the sframe

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/tools/syscall_32.tbl             |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  5 ++-
 include/uapi/linux/stacktrace.h             | 10 ++++++
 kernel/sys_ni.c                             |  2 ++
 kernel/unwind/sframe.c                      | 37 +++++++++++++++++++++
 scripts/syscall.tbl                         |  1 +
 22 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/stacktrace.h

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..8c320029a156 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,4 @@
 579	common	file_setattr			sys_file_setattr
 580	common	listns				sys_listns
 581	common	rseq_slice_yield		sys_rseq_slice_yield
+582	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..60f9a33b2dc5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..a0bd04a23006 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..266ec877300a 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..916294849393 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..20fec148901e 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,4 @@
 469	n32	file_setattr			sys_file_setattr
 470	n32	listns				sys_listns
 471	n32	rseq_slice_yield		sys_rseq_slice_yield
+472	n32	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2743bbcab143 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,4 @@
 469	n64	file_setattr			sys_file_setattr
 470	n64	listns				sys_listns
 471	n64	rseq_slice_yield		sys_rseq_slice_yield
+472	n64	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..187eadc4a42e 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,4 @@
 469	o32	file_setattr			sys_file_setattr
 470	o32	listns				sys_listns
 471	o32	rseq_slice_yield		sys_rseq_slice_yield
+472	o32	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..9442a92ef0aa 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..005441233932 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	nospu	rseq_slice_yield		sys_rseq_slice_yield
+472	nospu	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..bc9894b25584 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..5766251b4d2d 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..20e7f3b856e4 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..652ede93b724 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,4 @@
 469	i386	file_setattr		sys_file_setattr
 470	i386	listns			sys_listns
 471	i386	rseq_slice_yield	sys_rseq_slice_yield
+472	i386	stacktrace_setup	sys_stacktrace_setup
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..5da918e912a6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	stacktrace_setup	sys_stacktrace_setup
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..34f0de06baee 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..fdbea39c1b38 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,7 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_stacktrace_setup(void);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..d3f57d8454d7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_stacktrace_setup 472
+__SYSCALL(__NR_stacktrace_setup, sys_stacktrace_setup)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h
new file mode 100644
index 000000000000..60b581f55995
--- /dev/null
+++ b/include/uapi/linux/stacktrace.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_STACKTRACE_H
+#define _UAPI_LINUX_STACKTRACE_H
+
+enum stacktrace_setup_types {
+	STACKTRACE_REGISTER_SFRAME	= 1,
+	STACKTRACE_UNREGISTER_SFRAME	= 2,
+};
+
+#endif /* _UAPI_LINUX_STACKTRACE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..76998b0f811a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,5 @@ COND_SYSCALL(rseq_slice_yield);
 
 COND_SYSCALL(uretprobe);
 COND_SYSCALL(uprobe);
+
+COND_SYSCALL(stacktrace_setup);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index f24997e84e05..a842038fb03b 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
 #include <linux/mm.h>
 #include <linux/string_helpers.h>
 #include <linux/sframe.h>
+#include <linux/syscalls.h>
 #include <asm/unwind_user_sframe.h>
 #include <linux/unwind_user_types.h>
+#include <uapi/linux/stacktrace.h>
 
 #include "sframe.h"
 #include "sframe_debug.h"
@@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
 
 	mtree_destroy(&mm->sframe_mt);
 }
+
+/**
+ * sys_stacktrace_setup - register an address for user space stacktrace walking.
+ * @op: Type of operation to perform
+ * @addr_start: The virtual address of the stacktrace information
+ * @addr_length: The length of the stacktrace information
+ * @text_start: The virtual address of the text that @addr_start represents
+ * @text_length: The length of teh text
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Currently only sframes are supported, but in the future, this may be used
+ * to tell the kernel about JIT code which will most likely have a different
+ * format.
+ *
+ * The type command may be extended and parameters may be used for other
+ * purposes.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
+		unsigned long, addr_length, unsigned long, text_start,
+		unsigned long, text_length)
+{
+	switch (op) {
+	case STACKTRACE_REGISTER_SFRAME:
+		return sframe_add_section(addr_start, addr_start + addr_length,
+					  text_start, text_start+text_length);
+	case STACKTRACE_UNREGISTER_SFRAME:
+		return sframe_remove_section(addr_start);
+	}
+	return -EINVAL;
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..54a99cffeec4 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
-- 
2.53.0


[-- Attachment #2: sframe-test.c --]
[-- Type: text/x-c++src, Size: 2716 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/syscall.h>

#undef SYS_stacktrace_setup
#define SYS_stacktrace_setup		472

enum {
	STACKTRACE_REGISTER_SFRAME = 1,
	STACKTRACE_UNREGISTER_SFRAME,
};

#define stacktrace_setup(op, start_addr, length, text_addr, text_length)\
	syscall(SYS_stacktrace_setup, op, start_addr, length, text_addr, text_length)

static void usage(const char *name)
{
	printf("usage: %s sframe-offset sframe-length\n"
	       "\n",name);
	exit(-1);
}

static unsigned long long get_time(void)
{
	struct timeval tv;
	unsigned long long time;

	gettimeofday(&tv, NULL);

	time = tv.tv_sec * 1000000ULL;
	time += tv.tv_usec;

	return time;
}

void do_spin(void)
{
	unsigned long long start = get_time();

	/* 2 seconds */
	start += 2 * 1000 * 1000;

	while (get_time() < start)
		;
}

static int dump_self(unsigned long idx, unsigned long *sframe,
		     unsigned long *text, unsigned long *text_len)
{
	FILE *fp;
	char *line = NULL;
	size_t size;
	unsigned long ptr = 0;

	*text = 0;

	fp = fopen("/proc/self/maps", "r");
	if (!fp) {
		perror("self");
		exit(-1);
	}

	while ((!ptr || !*text) && getline(&line, &size, fp) > 0) {
		unsigned long start, end;
		char *file;
		char *perm;
		int r;
		int l;

		r = sscanf(line, "%lx-%lx %ms %*x %*s %*d %ms\n",
			   &start, &end, &perm, &file);
		if (r != 4)
			continue;
		printf("%s", line);
//		printf("file=%s\n", file);
		l = strlen(file);
		for (l--; l > 0; l--) {
			if (file[l] == '/') {
				l++;
				break;
			}
		}
		// r-xp	
		if (!strncmp(file + l, "libc.so", 7)) {
			if (!ptr) {
				ptr = start + idx;
				printf("found sframe libc.so %lx-%lx (%lx)\n", start, end, ptr);
				printf("VAL=%lx\n", *(unsigned long*)ptr);
			}
			if (!*text && !strcmp(perm, "r-xp")) {
				*text = start;
				*text_len = end - start;
				printf("found text libc.so %lx-%lx (%lx)\n", start, end, ptr);
			}
		}
		free(file);
		free(perm);
	}
	free(line);

	*sframe = ptr;
	return ptr && *text;
}

int main (int argc, char **argv)
{
	unsigned long idx; // 0x001d3fc0;
	unsigned long len; // 0x303f9;
	unsigned long ptr;
	unsigned long text;
	unsigned long text_len;
	int ret = 0;

	if (argc < 3)
		usage(argv[0]);


	idx = strtoul(argv[1], NULL, 0);
	len = strtoul(argv[2], NULL, 0);

	if (!dump_self(idx, &ptr, &text, &text_len)) {
		fprintf(stderr, "Could not find data");
		exit(-1);
	}
	ret = stacktrace_setup(STACKTRACE_REGISTER_SFRAME, ptr, len, text, text_len);
	if (ret < 0) {
		perror("Registering stacktrace");
		exit(-1);
	}

	do_spin();

	stacktrace_setup(STACKTRACE_UNREGISTER_SFRAME, ptr, len, text, text_len);

	do_spin();
	
	return 0;
}

^ permalink raw reply related

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 15:47 UTC (permalink / raw)
  To: Madhavan Srinivasan, Michael Ellerman
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Jens Remus, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>

On Wed, 29 Apr 2026 11:43:55 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 4fcc7c58a105..005441233932 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -562,3 +562,4 @@
>  469	common	file_setattr			sys_file_setattr
>  470	common	listns				sys_listns
>  471	nospu	rseq_slice_yield		sys_rseq_slice_yield
> +472	nospu	stacktrace_setup		sys_stacktrace_setup

BTW, I have no idea what "nospu" is. I did some searches for PowerPC and
spu vs nospu and the only thing I found was a stackoverflow asking about
it, and the response was "don't worry about it".

I only picked nospu because I noticed that rseq_slice_yield used it, but
that doesn't give me a good feeling that it's correct.

-- Steve

^ permalink raw reply

* Re: [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Steven Rostedt @ 2026-04-29 15:50 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430002139.c3b0e4c92ae52aeeaf86e1bf@kernel.org>

On Thu, 30 Apr 2026 00:21:39 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Also, could you rebase on top of v7.1-rc1?  
> 
> Is it v7.1-rc1, not ring-buffer/for-next?

The ring-buffer/for-next has been merged with conflicts. I rather not add
more conflicts ;-)  v7.1-rc1 includes ring-buffer/for-next.

-- Steve

^ permalink raw reply

* Re: [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Steven Rostedt @ 2026-04-29 15:56 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Shijia Hu, naveen, davem, ananth, akpm, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260429174053.b66f3a0032408f544128afd4@kernel.org>

On Wed, 29 Apr 2026 17:40:53 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Shijia Hu <hushijia1@uniontech.com> wrote:
> 
> > When a kernel module that registered kprobes is unloaded without calling
> > unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
> > mark the probe(s) GONE.  If the probe is an aggrprobe, kill_kprobe()
> > also marks all child probes GONE, but it does not remove them from
> > the aggrprobe's list.  
> 
> That sounds like a bug in the module.

Agreed.

> 
> > 
> > The problem is that child probes whose struct kprobe resides in the
> > unloading module's memory are freed along with the module, yet they
> > remain on the aggrprobe's list.  Later, when another caller registers
> > a kprobe at the same address, __get_valid_kprobe() walks that list
> > and dereferences the freed child probe, causing a use-after-free.
> > 
> > Reproduction steps:
> > 
> >     1) Load module A which registers two kprobes on the same kernel
> >        function address (e.g., do_nanosleep), causing them to be
> >        aggregated under one aggrprobe.
> > 
> >     2) Unload module A without calling unregister_kprobe().
> >        Module A's memory is freed, but its two child probes remain
> >        on the aggrprobe's list as dangling pointers.  
> 
> Would you mean "load a buggy kernel module and unload it, the kernel cause
> use-after-free."? for example:
> 
> ----
> struct kprobe my_probe = {...};
> 
> init_module() {
> 	register_kprobe(&my_probe);
> }
> exit_module() {
> 	/* do nothing */
> }
> ----
> 
> Yes, this cause UAF because that module has a bug. Please call
> unregister_kprobe().

Yes, this is one of those...

  Patient: Doctor it hurts me when I do this
  Doctor: Then don't do that

... reports.

No, the kernel isn't responsible for fixing buggy modules.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Steven Rostedt @ 2026-04-29 15:57 UTC (permalink / raw)
  To: Masami Hiramatsu (Google); +Cc: LKML, Linux Trace Kernel, Mathieu Desnoyers
In-Reply-To: <20260429175144.9d3d6e13e935f1dfd480f6a0@kernel.org>

On Wed, 29 Apr 2026 17:51:44 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Hi Steve,
> 
> BTW, to prevent regressions during future expansions, how about adding the following line?

Nothing against this, but I think it should be a separate patch.

Thanks,

-- Steve


> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 2cabf8a23ec5..c5ee7920dec6 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -979,6 +979,7 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
>  	ucb = uprobe_buffer_get();
>  	ucb->dsize = tu->tp.size + dsize;
>  
> +	BUILD_BUG_ON(MAX_UCB_BUFFER_SIZE < MAX_PROBE_EVENT_SIZE);
>  	if (WARN_ON_ONCE(ucb->dsize > MAX_UCB_BUFFER_SIZE)) {
>  		ucb->dsize = MAX_UCB_BUFFER_SIZE;
>  		dsize = MAX_UCB_BUFFER_SIZE - tu->tp.size;
> 

^ permalink raw reply

* Re: [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Steven Rostedt @ 2026-04-29 16:32 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430001559.a2d0489d956c06059e13e58d@kernel.org>

On Thu, 30 Apr 2026 00:15:59 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> It should not be a never ending loop (there are other exit conditions),
> but I agreed. What about limiting with nr_subbufs?
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5326924615a4..aa89ec96e964 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5630,7 +5630,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 * a case where we will loop three times. There should be no
>  	 * reason to loop four times (that I know of).
>  	 */
> -	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
> +	if (RB_WARN_ON(cpu_buffer,
> +		++nr_loops > (cpu_buffer->ring_meta ? cpu_buffer->nr_subbufs : 3))) {
>  		reader = NULL;
>  		goto out;
>  	}

Yeah, this looks more reasonable.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Steven Rostedt @ 2026-04-29 16:39 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260430002023.b19608c27a746284e8f73b80@kernel.org>

On Thu, 30 Apr 2026 00:20:23 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Tue, 28 Apr 2026 16:21:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 24 Apr 2026 15:52:35 +0900
> > "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> >   
> > > @@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
> > >  	 * subbuf_size is considered invalid.
> > >  	 */
> > >  	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
> > > -	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > > -		return -1;
> > > -	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > > +	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> > > +		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> > > +  
> > 
> > This code seriously needs comments.  
> 
> OK, I'll add it, or let code explain clearer?
> 
> 	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
> 		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
> 	else
> 		ret = -1;

That's better...

> 
> Thanks,
> 

The below should have some explanation too. I can figure it out, but it
wasted more brain cycles than I would have liked ;-)

-- Steve


> >   
> > > +	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
> > > +		local_set(&bpage->entries, 0);
> > > +		local_set(&bpage->page->commit, 0);
> > > +		bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
> > > +		ret = -1;
> > > +	} else {
> > > +		local_set(&bpage->entries, ret);
> > > +	}
> > > +
> > > +	return ret;
> > >  }
> > >    
> 
> 


^ permalink raw reply

* [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: Qian-Yu Lin @ 2026-04-29 16:57 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel, Qian-Yu Lin

The macro trace_printk() uses a hardcoded identifier _______STR
within a statement expression, which can lead to variable name
shadowing if a caller happens to use the same name in its scope.

Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
 ("min/max: remove sparse warnings when they're nested"), replace the
 hardcoded identifier with __UNIQUE_ID(STR).

Since __UNIQUE_ID() must be expanded once to remain consistent across
declaration and sizeof() within the statement expression, introduce a
nested helper macro ___trace_printk.

Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
---
 include/linux/trace_printk.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 2670ec7f4262..060eccb40838 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_TRACE_PRINTK_H
 #define _LINUX_TRACE_PRINTK_H
 
+#include <linux/compiler.h>
 #include <linux/compiler_attributes.h>
 #include <linux/instruction_pointer.h>
 #include <linux/stddef.h>
@@ -84,15 +85,18 @@ do {									\
  * let gcc optimize the rest.
  */
 
-#define trace_printk(fmt, ...)				\
+#define ___trace_printk(fmt, str, ...)				\
 do {							\
-	char _______STR[] = __stringify((__VA_ARGS__));	\
-	if (sizeof(_______STR) > 3)			\
+	char str[] = __stringify((__VA_ARGS__));	\
+	if (sizeof(str) > 3)			\
 		do_trace_printk(fmt, ##__VA_ARGS__);	\
 	else						\
 		trace_puts(fmt);			\
 } while (0)
 
+#define trace_printk(fmt, ...) \
+	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
+
 #define do_trace_printk(fmt, args...)					\
 do {									\
 	static const char *trace_printk_fmt __used			\
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: Steven Rostedt @ 2026-04-29 17:42 UTC (permalink / raw)
  To: Qian-Yu Lin; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260429165707.7020-1-tiffany019230@gmail.com>

On Thu, 30 Apr 2026 00:57:07 +0800
Qian-Yu Lin <tiffany019230@gmail.com> wrote:

> The macro trace_printk() uses a hardcoded identifier _______STR
> within a statement expression, which can lead to variable name
> shadowing if a caller happens to use the same name in its scope.

Has this ever been a problem?

> 
> Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
> and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
>  ("min/max: remove sparse warnings when they're nested"), replace the
>  hardcoded identifier with __UNIQUE_ID(STR).
> 
> Since __UNIQUE_ID() must be expanded once to remain consistent across
> declaration and sizeof() within the statement expression, introduce a
> nested helper macro ___trace_printk.

Hmm, so we are replacing one name with underscores with another name
with underscores?

> 
> Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
> ---
>  include/linux/trace_printk.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
> index 2670ec7f4262..060eccb40838 100644
> --- a/include/linux/trace_printk.h
> +++ b/include/linux/trace_printk.h
> @@ -2,6 +2,7 @@
>  #ifndef _LINUX_TRACE_PRINTK_H
>  #define _LINUX_TRACE_PRINTK_H
>  
> +#include <linux/compiler.h>

People are already saying that trace_printk.h slows down the compile.
Does this add any overhead to the compile?

-- Steve


>  #include <linux/compiler_attributes.h>
>  #include <linux/instruction_pointer.h>
>  #include <linux/stddef.h>
> @@ -84,15 +85,18 @@ do {									\
>   * let gcc optimize the rest.
>   */
>  
> -#define trace_printk(fmt, ...)				\
> +#define ___trace_printk(fmt, str, ...)				\
>  do {							\
> -	char _______STR[] = __stringify((__VA_ARGS__));	\
> -	if (sizeof(_______STR) > 3)			\
> +	char str[] = __stringify((__VA_ARGS__));	\
> +	if (sizeof(str) > 3)			\
>  		do_trace_printk(fmt, ##__VA_ARGS__);	\
>  	else						\
>  		trace_puts(fmt);			\
>  } while (0)
>  
> +#define trace_printk(fmt, ...) \
> +	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
> +
>  #define do_trace_printk(fmt, args...)					\
>  do {									\
>  	static const char *trace_printk_fmt __used			\


^ permalink raw reply

* Re: [PATCH 7.2 v16 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Nico Pache @ 2026-04-29 18:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <f05f4506-2930-44ba-918a-e0e5bcb9d0f9@kernel.org>

On 4/27/26 2:21 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Add three new mTHP statistics to track collapse failures for different
>> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>>
>> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
>>      PTEs
>>
>> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>>      exceeding the none PTE threshold for the given order
>>
>> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
>>      PTEs
>>
>> These statistics complement the existing THP_SCAN_EXCEED_* events by
>> providing per-order granularity for mTHP collapse attempts. The stats are
>> exposed via sysfs under
>> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
>> supported hugepage size.
>>
>> As we currently dont support collapsing mTHPs that contain a swap or
>
> s/dont/do not/
>
>> shared entry, those statistics keep track of how often we are
>> encountering failed mTHP collapses due to these restrictions.
>>
>> Now that we plan to support mTHP collapse for anon pages, lets also track
>
> "We will add support for mTHP collapse for anonymous pages next; let's also ..."
>
>> when this happens at the PMD level within the per-mTHP stats.
>
> What about file collapse? For example, we do adjust
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE) and
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE) there.
>
> Wouldn't we want to update the HPAGE_PMD_ORDER side of things there already? or
> would we want to use a different counter for that?

Maybe? My thought process was that because we dont support mTHP in Shmem
that those stats shouldnt be updated? not sure the right answer tbh--
hence why this comment says "now that.. for anon pages, lets track..".

>
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
>>   include/linux/huge_mm.h                    |  3 +++
>>   mm/huge_memory.c                           |  7 +++++++
>>   mm/khugepaged.c                            | 21 +++++++++++++++++--
>>   4 files changed, 53 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index c51932e6275d..eebb1f6bbc6c 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -714,6 +714,30 @@ nr_anon_partially_mapped
>>          an anonymous THP as "partially mapped" and count it here, even though it
>>          is not actually partially mapped anymore.
>>
>> +collapse_exceed_none_pte
>> +       The number of collapse attempts that failed due to exceeding the
>> +       max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
>> +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
>> +       emit a warning and no mTHP collapse will be attempted. khugepaged will
>> +       try to collapse to the largest enabled (m)THP size; if it fails, it will
>> +       try the next lower enabled mTHP size. This counter records the number of
>> +       times a collapse attempt was skipped for exceeding the max_ptes_none
>> +       threshold, and khugepaged will move on to the next available mTHP size.
>
> Why is everything after the first sentence worth documenting here? This doesn't
> read like it belongs to a failure counter?
>
>> +
>> +collapse_exceed_swap_pte
>> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
>> +       to containing at least one swap PTE. Currently khugepaged does not
>> +       support collapsing mTHP regions that contain a swap PTE. This counter can
>> +       be used to monitor the number of khugepaged mTHP collapses that failed
>> +       due to the presence of a swap PTE.
>
> Can we similarly simplify that (and make it consistent with the one above) to
>
> "The number of collapse attempts that failed due to exceeding the max_ptes_swap
> threshold."
>
>> +
>> +collapse_exceed_shared_pte
>> +       The number of anonymous mTHP PTE ranges which were unable to collapse due
>> +       to containing at least one shared PTE. Currently khugepaged does not
>> +       support collapsing mTHP PTE ranges that contain a shared PTE. This
>> +       counter can be used to monitor the number of khugepaged mTHP collapses
>> +       that failed due to the presence of a shared PTE.
>
> Same here
>
> "The number of collapse attempts that failed due to exceeding the
> max_ptes_shared threshold."

These all used to be very similar to what you have here. But Lorenzo
asked me to expand all this (IIRC) several times. So here we are! I
believe I even mentioned in the V15 that these are probably the "best"
defined counters in the whole doc lol.

Cheers,

-- Nico

>
> ?
>
>> +
>
> [...]
>


^ permalink raw reply

* Re: [PATCH 7.2 v16 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
From: Nico Pache @ 2026-04-29 18:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, ljs
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <5564e170-d052-4445-9914-70bfd852d3b1@kernel.org>

On 4/27/26 2:24 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> Add collapse_allowable_orders() to generalize THP order eligibility. The
>> function determines which THP orders are permitted based on collapse
>> context (khugepaged vs madv_collapse).
>>
>> This consolidates collapse configuration logic and provides a clean
>> interface for future mTHP collapse support where the orders may be
>> different.
>>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>
>
> [...]
>
>>      cc = kmalloc_obj(*cc);
>> diff --git a/mm/vma.c b/mm/vma.c
>> index 377321b48734..c0398fb597b3 100644
>> --- a/mm/vma.c
>> +++ b/mm/vma.c
>> @@ -989,7 +989,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
>>              goto abort;
>>
>>      vma_set_flags_mask(vmg->target, sticky_flags);
>> -    khugepaged_enter_vma(vmg->target, vmg->vm_flags);
>> +    khugepaged_enter_vma(vmg->target);
>>      vmg->state = VMA_MERGE_SUCCESS;
>>      return vmg->target;
>>
>> @@ -1110,7 +1110,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
>>       * following VMA if we have VMAs on both sides.
>>       */
>>      if (vmg->target && !vma_expand(vmg)) {
>> -            khugepaged_enter_vma(vmg->target, vmg->vm_flags);
>> +            khugepaged_enter_vma(vmg->target);
>>              vmg->state = VMA_MERGE_SUCCESS;
>>              return vmg->target;
>>      }
>> @@ -2589,7 +2589,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
>>       * call covers the non-merge case.
>>       */
>>      if (!vma_is_anonymous(vma))
>> -            khugepaged_enter_vma(vma, map->vm_flags);
>> +            khugepaged_enter_vma(vma);
>>      *vmap = vma;
>
> Are you sure that in all cases, vma->vm_flags already corresponds to
> vmg->vm_flags / map->vm_flags?

I reviewed most of them and nothing stuck out, but I can go over them
again. Lorenzo may also have more insight into this as he is more
familiar and has been working on this stuff.

@lorenzo?

>
>
> That's a change that makes this patch unnecessary hard to follow, in particular,
> because it's not documented in the patch description.

Thats really weird I could have sworn I did update this description...
There was a lot of changes this round, so it was hard to keep track of
everything. Sorry.

>
> If you think the change is fine, you should better move that into a separate
> cleanup patch where you only drop the flags parameter from  khugepaged_enter_vma().

Yeah thats a better idea. Ill separate it out, thank you for the reviews :)

>


^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Mathieu Desnoyers @ 2026-04-29 18:32 UTC (permalink / raw)
  To: Steven Rostedt, Madhavan Srinivasan, Michael Ellerman
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429114747.4cf74d04@gandalf.local.home>

On 2026-04-29 11:47, Steven Rostedt wrote:
> On Wed, 29 Apr 2026 11:43:55 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
> 
>> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
>> index 4fcc7c58a105..005441233932 100644
>> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
>> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>> @@ -562,3 +562,4 @@
>>   469	common	file_setattr			sys_file_setattr
>>   470	common	listns				sys_listns
>>   471	nospu	rseq_slice_yield		sys_rseq_slice_yield
>> +472	nospu	stacktrace_setup		sys_stacktrace_setup
> 
> BTW, I have no idea what "nospu" is. I did some searches for PowerPC and
> spu vs nospu and the only thing I found was a stackoverflow asking about
> it, and the response was "don't worry about it".
> 
> I only picked nospu because I noticed that rseq_slice_yield used it, but
> that doesn't give me a good feeling that it's correct.

AFAIR PowerPC SPUs are the Cell "Synergistic Processing Units", e.g.
for Playstation 3.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Mathieu Desnoyers @ 2026-04-29 18:42 UTC (permalink / raw)
  To: Steven Rostedt, LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>

On 2026-04-29 11:43, Steven Rostedt wrote:
[...]

AFAIU this is a functional iteration of my prior RFC here:

https://lore.kernel.org/lkml/2fa31347-3021-4604-bec3-e5a2d57b77b5@efficios.com/

perhaps it would be valuable to refer to that thread ?

>   }
> +
> +/**
> + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> + * @op: Type of operation to perform
> + * @addr_start: The virtual address of the stacktrace information
> + * @addr_length: The length of the stacktrace information
> + * @text_start: The virtual address of the text that @addr_start represents
> + * @text_length: The length of teh text

the

also, perhaps add @flags just in case ? This allows changing the
behavior without having to insert entirely new set of ops.

> + *
> + * This system call is used by dynamic library utilities to inform the kernel

Out of curiosity: what would happen if the application chooses to invoke
this system call to register sframe for the main executable ? Should it be
allowed ?

> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Currently only sframes are supported, but in the future, this may be used
> + * to tell the kernel about JIT code which will most likely have a different
> + * format.
> + *
> + * The type command may be extended and parameters may be used for other
> + * purposes.

"type" or "op" ?

Thanks,

Mathieu

> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> +		unsigned long, addr_length, unsigned long, text_start,
> +		unsigned long, text_length)
> +{
> +	switch (op) {
> +	case STACKTRACE_REGISTER_SFRAME:
> +		return sframe_add_section(addr_start, addr_start + addr_length,
> +					  text_start, text_start+text_length);
> +	case STACKTRACE_UNREGISTER_SFRAME:
> +		return sframe_remove_section(addr_start);
> +	}
> +	return -EINVAL;
> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 18:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <ade15347-6eca-4204-9f64-9b2eed9e9aec@efficios.com>

On Wed, 29 Apr 2026 14:42:52 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-04-29 11:43, Steven Rostedt wrote:
> [...]
> 
> AFAIU this is a functional iteration of my prior RFC here:
> 
> https://lore.kernel.org/lkml/2fa31347-3021-4604-bec3-e5a2d57b77b5@efficios.com/
> 
> perhaps it would be valuable to refer to that thread ?

Thanks, I actually decided to start this from scratch as I took a slightly
different approach than yours. But you're right, I should have mentioned it
anyway.

> 
> >   }
> > +
> > +/**
> > + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> > + * @op: Type of operation to perform
> > + * @addr_start: The virtual address of the stacktrace information
> > + * @addr_length: The length of the stacktrace information
> > + * @text_start: The virtual address of the text that @addr_start represents
> > + * @text_length: The length of teh text  
> 
> the
> 
> also, perhaps add @flags just in case ? This allows changing the
> behavior without having to insert entirely new set of ops.

I thought about it and decided it was somewhat redundant to the ops. After
working on the futex code, I thought following that API was a better
approach.

> 
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel  
> 
> Out of curiosity: what would happen if the application chooses to invoke
> this system call to register sframe for the main executable ? Should it be
> allowed ?

I haven't tried, and should look at the code. Ideally, no two sframes
sections should cover the same code. I think it should error out if it does.

If the main executable doesn't have a sframe section, I don't see why this
shouldn't be allowed to add one.

> 
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Currently only sframes are supported, but in the future, this may be used
> > + * to tell the kernel about JIT code which will most likely have a different
> > + * format.
> > + *
> > + * The type command may be extended and parameters may be used for other
> > + * purposes.  
> 
> "type" or "op" ?

Bah, I originally had called it "type" and then renamed it to "op" (to
be consistent with the futex commands). I missed updating this. Thanks for
catching that.

-- Steve

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox