Linux Trace Kernel
* Re: [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu @ 2026-04-29 15:15 UTC
  To: Steven Rostedt
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260428155508.4f47279e@gandalf.local.home>

On Tue, 28 Apr 2026 15:55:08 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Apr 2026 15:52:27 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > @@ -5648,11 +5668,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> >   again:
> >  	/*
> >  	 * This should normally only loop twice. But because the
> > -	 * start of the reader inserts an empty page, it causes
> > -	 * a case where we will loop three times. There should be no
> > -	 * reason to loop four times (that I know of).
> > +	 * start of the reader inserts an empty page, it causes a
> > +	 * case where we will loop three times. There should be no
> > +	 * reason to loop four times unless the ring buffer is a
> > +	 * recovered persistent ring buffer.
> 
> Can you explain more to why this is allowed for persistent ring buffer?

Ah, that was introduced in v15.

  Changes in v15:
  - Skip reader_page loop check on persistent ring buffer because
    there can be contiguous empty(invalidated) pages. 


So, when finding the next reader_page, we need to skip empty pages, which are
normally not contiguous. However, if we see more than 3 invalid pages when
recovering a persistent ring buffer, it will be reset and become empty.


> 
> Note, I do not like any loops that can go into an infinite loop and lock up
> the machine. If something goes wrong with a persistent ring buffer, then
> this could possibly go into an infinite loop.

Yeah, so I think we should not use goto here. OK, let me update it to
an actual loop.

> 
> I want to understand why this is allowed, and possibly add a check that
> prevents this from never ending.

It should not be a never-ending loop (there are other exit conditions),
but I agree. What about limiting it with nr_subbufs?

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5326924615a4..aa89ec96e964 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5630,7 +5630,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 * a case where we will loop three times. There should be no
 	 * reason to loop four times (that I know of).
 	 */
-	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+	if (RB_WARN_ON(cpu_buffer,
+		++nr_loops > (cpu_buffer->ring_meta ? cpu_buffer->nr_subbufs : 3))) {
 		reader = NULL;
 		goto out;
 	}
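
As a standalone illustration of that bound (not the kernel code), a minimal
userspace sketch might look like the following. Only nr_subbufs and ring_meta
mirror the fields used in the diff above; struct fake_cpu_buffer, loop_limit()
and find_reader() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ring_buffer_per_cpu. */
struct fake_cpu_buffer {
	const void *ring_meta;   /* non-NULL for a persistent buffer */
	unsigned int nr_subbufs; /* number of sub-buffers in the ring */
	const bool *empty;       /* empty[i]: sub-buffer i has no data */
};

/* Retry budget: 3 normally, nr_subbufs for a persistent buffer. */
static unsigned int loop_limit(const struct fake_cpu_buffer *b)
{
	return b->ring_meta ? b->nr_subbufs : 3;
}

/*
 * Walk the ring from @start, skipping empty sub-buffers. Returns the index
 * of the first non-empty sub-buffer, or -1 once the budget is exhausted.
 */
static int find_reader(const struct fake_cpu_buffer *b, unsigned int start)
{
	unsigned int nr_loops = 0;
	unsigned int idx = start;

	assert(b->nr_subbufs > 0 && start < b->nr_subbufs);

	while (++nr_loops <= loop_limit(b)) {
		if (!b->empty[idx])
			return (int)idx;
		idx = (idx + 1) % b->nr_subbufs;
	}
	return -1;
}
```

With this budget the walk visits each sub-buffer at most once, so even a ring
made up entirely of empty sub-buffers terminates.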

> 
> -- Steve
> 
> 
> >  	 */
> > -	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
> > +	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
> >  		reader = NULL;
> >  		goto out;
> >  	}
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-29 15:12 UTC
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <e9dbe863-d125-4fe5-8ecc-91ad7293e5cf@kernel.org>

On 4/27/26 2:07 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>> functions to support future mTHP collapse.
>>
>> The current mechanism for determining collapse with the
>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>> raises a key design issue: if we support user defined max_pte_none values
>> (even those scaled by order), a collapse of a lower order can introduce
>> a feedback loop, or "creep", when max_ptes_none is set to a value greater
>> than HPAGE_PMD_NR / 2.
>>
>> With this configuration, a successful collapse to order N will populate
>> enough pages to satisfy the collapse condition on order N+1 on the next
>> scan. This leads to unnecessary work and memory churn.
>
> You could add a link here to previous discussions.

Ack, not a bad idea for historical reasons.
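
For the record, the creep is easy to see with a toy model. The sketch below is
not khugepaged code: it assumes 4k pages (PMD order 9), assumes max_ptes_none
is scaled by order with a right shift, and models a successful collapse as
filling the whole range. scaled_max_none() and creep_to_order() are made-up
names.

```c
#include <assert.h>

#define TOY_PMD_ORDER 9 /* assumption: 4k pages, 512 PTEs per PMD */

/* Assumed scaling: shrink the PMD-sized limit in proportion to the order. */
static unsigned int scaled_max_none(unsigned int max_ptes_none,
				    unsigned int order)
{
	assert(order <= TOY_PMD_ORDER);
	return max_ptes_none >> (TOY_PMD_ORDER - order);
}

/*
 * Start with one fully populated, aligned range of @start_order and nothing
 * else mapped in the enclosing PMD range. Each scan tries the next order up:
 * that range holds 2^order present PTEs out of 2^(order + 1), so it collapses
 * if its 2^order none entries are within the scaled limit.
 * Returns the order at which the creep stops.
 */
static unsigned int creep_to_order(unsigned int max_ptes_none,
				   unsigned int start_order)
{
	unsigned int order = start_order;

	while (order < TOY_PMD_ORDER) {
		unsigned int none = 1U << order; /* empty half of the next order */

		if (none > scaled_max_none(max_ptes_none, order + 1))
			break;
		order++; /* collapse succeeded, range is now fully populated */
	}
	return order;
}
```

In this model a single order-2 range creeps all the way up to the PMD order
with max_ptes_none=300, while with max_ptes_none=255 it stays at order 2.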

>
>>
>> To fix this issue introduce a helper function that will limit mTHP
>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>> This effectively supports two modes:
>>
>> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
>
> "introduce" reads wrong in this context. And I don't know what a "none-page" is :)

That's a page that is none, duh ;P I'll clean that up.

>
> "never collapses if it encounters an empty PTE or a PTE that maps the shared
> zeropage. Consequently, no memory bloat."

Thanks, that does read way better!

>
>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>    available mTHP order.
>>
>> This removes the possibility of "creep", while not modifying any uAPI
>> expectations. A warning will be emitted if any non-supported
>> max_ptes_none value is configured with mTHP enabled.
>>
>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>> shared or swapped entry.
>>
>> No functional changes in this patch; however it defines future behavior
>> for mTHP collapse.
>>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 88 insertions(+), 36 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index f42b55421191..283bb63854a5 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
>>    * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>>    * @cc: The collapse control struct
>>    * @vma: The vma to check for userfaultfd
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> - * empty page.
>> + * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
>> + * configured khugepaged_max_ptes_none value.
>> + *
>> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
>> + * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
>> + * no mTHP collapse will be attempted
>
> Not sure if we discussed it (and maybe I had a different opinion back then ...),
> but could we simply fall back to max_ptes_none=0, so we can avoid returning
> errors here?
>
> max_ptes_none=0 is ok, because we will not waste any memory. The warning clearly
> tells the user that this combination is not supported as is.
>
> ... and it would make this function a lot easier to handle. In the warning, we
> can just state that "falling back to ... max_ptes_none = 0".

We'd then be "violating" the uAPI expectation, which is the whole reason we
have this 0/511 behavior in the first place ;/

If it wasn't for that issue I would have a completely different design.

>
>
> [...]
>
>>
>>   /**
>>    * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * shared page.
>>    *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * shared memory.
>
> "do not"
>
> "shared memory" is misleading, as we do support shmem. What you mean is maybe
> "collapsing with anonymous memory pages that are shared between processes
> through CoW" or something like that?

OK, I'll clear that up, thank you!

>
>> + *
>>    * Return: Maximum number of shared PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_shared;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * swap page.
>>    *
>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>> + * khugepaged_max_ptes_swap value.
>> + *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * swapped out memory.
>
> "do not". Given that this is also used for the pagecache, can we make this clearer?

Yeah! I originally didn't plan on using these helpers for the file
collapse operation, but figured hell, why not clean them both up. It might
end up helping (hopefully not hurting) when Baolin goes to add mTHP
shmem support.

Cheers,
-- Nico


>
>> + *
>>    * Return: Maximum number of swap PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>> +            unsigned int order)



* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-04-29 15:06 UTC
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>

On Tue, Apr 28, 2026, Ackerley Tng wrote:
> This is RFC v5 of guest_memfd in-place conversion support.

...

> TODOs
> 
> + Perhaps further clarify PRESERVE flag: [8]
> + Resolve issue where guest_memfd_conversions_test, which uses the
>   kselftest framework, doesn't perform teardown on assertion
>   failure. Please see proposal at [9]
> + Test with TDX selftests. We're in the process of rebasing TDX selftests
>   on this series and will post updates when that's tested.

Why exactly is this still RFC?  The TODOs here don't strike me as things that
would make this RFC.  Blockers for merge, yes/maybe/probably, but at a glance,
it feels like we've moved beyond RFC for the code itself.


* Re: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-29 15:05 UTC
  To: Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260420135554.27067-1-usama.arif@linux.dev>

On 4/20/26 7:55 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:41 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>> functions to support future mTHP collapse.
>>
>> The current mechanism for determining collapse with the
>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>> raises a key design issue: if we support user defined max_pte_none values
>> (even those scaled by order), a collapse of a lower order can introduce
>> a feedback loop, or "creep", when max_ptes_none is set to a value greater
>> than HPAGE_PMD_NR / 2.
>>
>> With this configuration, a successful collapse to order N will populate
>> enough pages to satisfy the collapse condition on order N+1 on the next
>> scan. This leads to unnecessary work and memory churn.
>>
>> To fix this issue introduce a helper function that will limit mTHP
>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>> This effectively supports two modes:
>>
>> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>    available mTHP order.
>>
>> This removes the possibility of "creep", while not modifying any uAPI
>> expectations. A warning will be emitted if any non-supported
>> max_ptes_none value is configured with mTHP enabled.
>>
>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>> shared or swapped entry.
>>
>> No functional changes in this patch; however it defines future behavior
>> for mTHP collapse.
>>
>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 88 insertions(+), 36 deletions(-)
>>
>
> Small nits. Most might not need change.

No, you brought up some good points :) Thanks for your reviews!

>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index f42b55421191..283bb63854a5 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
>>    * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>>    * @cc: The collapse control struct
>>    * @vma: The vma to check for userfaultfd
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> - * empty page.
>> + * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
>> + * configured khugepaged_max_ptes_none value.
>> + *
>> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
>> + * of 0 or (KHUGEPAGED_MAX_PTES_LIMIT). Any other value will emit a warning and
>> + * no mTHP collapse will be attempted
>>    *
>>    * Return: Maximum number of empty PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>> -            struct vm_area_struct *vma)
>> +static int collapse_max_ptes_none(struct collapse_control *cc,
>> +            struct vm_area_struct *vma, unsigned int order)
>>   {
>>      if (vma && userfaultfd_armed(vma))
>>              return 0;
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> -    return khugepaged_max_ptes_none;
>> +    if (is_pmd_order(order))
>> +            return khugepaged_max_ptes_none;
>> +    /* Zero/non-present collapse disabled. */
>> +    if (!khugepaged_max_ptes_none)
>> +            return 0;
>> +    if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>> +            return (1 << order) - 1;
>> +
>
> There are 2 reads of khugepaged_max_ptes_none here.
> A concurrent sysctl write between reads can yield "0 then non-zero" or "LIMIT
> then mid-value".
>
> Would be good to just snapshot once at the start of the function and use that
> value?

Yeah, good point. That would avoid bugs that are really hard to reproduce,
though probably very unlikely to occur.
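
Something along these lines, as a rough userspace sketch rather than the
actual patch: TOY_READ_ONCE() is a stand-in for the kernel's READ_ONCE(), the
toy_ names and constants are assumptions for 4k pages, and the uffd check and
the warning are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_PMD_ORDER 9
#define TOY_PMD_NR (1 << TOY_PMD_ORDER)
#define TOY_MAX_PTES_LIMIT (TOY_PMD_NR - 1)

/* Userspace stand-in for the kernel's READ_ONCE(): force a single load. */
#define TOY_READ_ONCE(x) (*(const volatile unsigned int *)&(x))

/* Stand-in for the tunable khugepaged_max_ptes_none. */
static unsigned int toy_max_ptes_none = TOY_MAX_PTES_LIMIT;

static int toy_collapse_max_ptes_none(bool is_khugepaged, unsigned int order)
{
	/* One snapshot; every check below sees the same value. */
	const unsigned int max_ptes_none = TOY_READ_ONCE(toy_max_ptes_none);

	assert(order <= TOY_PMD_ORDER);

	if (!is_khugepaged)
		return TOY_PMD_NR;
	if (order == TOY_PMD_ORDER)
		return (int)max_ptes_none;
	/* Zero/non-present collapse disabled. */
	if (!max_ptes_none)
		return 0;
	if (max_ptes_none == TOY_MAX_PTES_LIMIT)
		return (1 << order) - 1;
	return -EINVAL;
}
```

Since every branch tests the local copy, a concurrent write to the tunable can
no longer produce a mix of old and new values within one call.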

>
>> +    pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>> +                  KHUGEPAGED_MAX_PTES_LIMIT);
>
> IMO, warn_once can get lost quickly in dmesg. Maybe pr_warn_ratelimited?
>
> Not sure what others opinions are..

I see David already replied to this. I guess we either keep the warn-once or
hard-limit to 0. My fear with the latter is that it would violate the uAPI
expectation, which is the whole concern (and the only reason we have the
0/511 support in the first place). If we could violate this uAPI expectation,
then I would want to reintroduce hard-capping max_ptes_none to
HPAGE_PMD_NR/2 if it's above this value.

So in my eyes, let's just keep this as is for now.

>
>> +    return -EINVAL;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * shared page.
>>    *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * shared memory.
>> + *
>>    * Return: Maximum number of shared PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_shared;
>>   }
>>
>>   /**
>>    * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>>    * @cc: The collapse control struct
>> + * @order: The folio order being collapsed to
>>    *
>>    * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>>    * swap page.
>>    *
>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>> + * khugepaged_max_ptes_swap value.
>> + *
>> + * For mTHP collapses, we currently dont support collapsing memory with
>> + * swapped out memory.
>> + *
>>    * Return: Maximum number of swap PTEs allowed for the collapse operation
>>    */
>> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>> +            unsigned int order)
>>   {
>>      if (!cc->is_khugepaged)
>>              return HPAGE_PMD_NR;
>> +    if (!is_pmd_order(order))
>> +            return 0;
>> +
>>      return khugepaged_max_ptes_swap;
>>   }
>>
>> @@ -590,18 +625,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>
>>   static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>              unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>> -            struct list_head *compound_pagelist)
>> +            unsigned int order, struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      struct page *page = NULL;
>>      struct folio *folio = NULL;
>>      unsigned long addr = start_addr;
>>      pte_t *_pte;
>>      int none_or_zero = 0, shared = 0, referenced = 0;
>>      enum scan_result result = SCAN_FAIL;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> -    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
>> +
>> +    if (max_ptes_none < 0)
>> +            return result;
>
> Would a dedicated SCAN_INVALID_PTES_NONE make more sense here instead
> of SCAN_FAIL?

Yeah, that's a good idea, let me see if I can make that work.

>
>>
>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +    for (_pte = pte; _pte < pte + nr_pages;
>>           _pte++, addr += PAGE_SIZE) {
>>              pte_t pteval = ptep_get(_pte);
>>              if (pte_none_or_zero(pteval)) {
>> @@ -734,18 +773,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>   }
>>
>>   static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>> -                                            struct vm_area_struct *vma,
>> -                                            unsigned long address,
>> -                                            spinlock_t *ptl,
>> -                                            struct list_head *compound_pagelist)
>> +            struct vm_area_struct *vma, unsigned long address,
>> +            spinlock_t *ptl, unsigned int order,
>> +            struct list_head *compound_pagelist)
>>   {
>> -    unsigned long end = address + HPAGE_PMD_SIZE;
>> +    const unsigned long nr_pages = 1UL << order;
>> +    unsigned long end = address + (PAGE_SIZE << order);
>>      struct folio *src, *tmp;
>>      pte_t pteval;
>>      pte_t *_pte;
>>      unsigned int nr_ptes;
>>
>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>> +    for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>>           address += nr_ptes * PAGE_SIZE) {
>>              nr_ptes = 1;
>>              pteval = ptep_get(_pte);
>> @@ -798,13 +837,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   }
>>
>>   static void __collapse_huge_page_copy_failed(pte_t *pte,
>> -                                         pmd_t *pmd,
>> -                                         pmd_t orig_pmd,
>> -                                         struct vm_area_struct *vma,
>> -                                         struct list_head *compound_pagelist)
>> +            pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> +            unsigned int order, struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      spinlock_t *pmd_ptl;
>> -
>
> Shouldn't remove the newline above?

Ack thank you

>
>>      /*
>>       * Re-establish the PMD to point to the original page table
>>       * entry. Restoring PMD needs to be done prior to releasing
>> @@ -818,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>       * Release both raw and compound pages isolated
>>       * in __collapse_huge_page_isolate.
>>       */
>> -    release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>> +    release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>>   }
>>
>>   /*
>> @@ -838,16 +875,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>    */
>>   static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>              pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> -            unsigned long address, spinlock_t *ptl,
>> +            unsigned long address, spinlock_t *ptl, unsigned int order,
>>              struct list_head *compound_pagelist)
>>   {
>> +    const unsigned long nr_pages = 1UL << order;
>>      unsigned int i;
>>      enum scan_result result = SCAN_SUCCEED;
>> -
>
> Same here?

It's probably from me reordering nr_pages to the top. Ack! Thanks.

>
>>      /*
>>       * Copying pages' contents is subject to memory poison at any iteration.
>>       */
>> -    for (i = 0; i < HPAGE_PMD_NR; i++) {
>> +    for (i = 0; i < nr_pages; i++) {
>>              pte_t pteval = ptep_get(pte + i);
>>              struct page *page = folio_page(folio, i);
>>              unsigned long src_addr = address + i * PAGE_SIZE;
>> @@ -866,10 +903,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>>
>>      if (likely(result == SCAN_SUCCEED))
>>              __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>> -                                                compound_pagelist);
>> +                                                order, compound_pagelist);
>>      else
>>              __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>> -                                             compound_pagelist);
>> +                                             order, compound_pagelist);
>>
>>      return result;
>>   }
>> @@ -1040,12 +1077,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>>    * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>>    */
>>   static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>> -            struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>> -            int referenced)
>> +            struct vm_area_struct *vma, unsigned long start_addr,
>> +            pmd_t *pmd, int referenced, unsigned int order)
>
> Will probably find out in later reviews, but there is tracepoint in __collapse_huge_page_swapin.
> Would be good to add order in that tracepoint if you are adding order here?

Yep! There is a patch that updates the tracepoints.

>
>>   {
>>      int swapped_in = 0;
>>      vm_fault_t ret = 0;
>> -    unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>> +    unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>>      enum scan_result result;
>>      pte_t *pte = NULL;
>>      spinlock_t *ptl;
>> @@ -1077,6 +1114,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>>                  pte_present(vmf.orig_pte))
>>                      continue;
>>
>> +            /*
>> +             * TODO: Support swapin without leading to further mTHP
>> +             * collapses. Currently bringing in new pages via swapin may
>> +             * cause a future higher order collapse on a rescan of the same
>> +             * range.
>> +             */
>> +            if (!is_pmd_order(order)) {
>
> Would it be good to introduce this in the patch that activates it? No strong
> preference btw. Just that its dead code in this patch itself.

No, we are trying to get everything ready before the patch(es) that
actually activate this feature. Everything related to order is
dead code until the later commits.

Cheers,
-- Nico


>
>> +                    pte_unmap(pte);
>> +                    mmap_read_unlock(mm);
>> +                    result = SCAN_EXCEED_SWAP_PTE;
>> +                    goto out;
>> +            }
>> +
>>              vmf.pte = pte;
>>              vmf.ptl = ptl;
>>              ret = do_swap_page(&vmf);
>> @@ -1196,7 +1246,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>               * that case.  Continuing to collapse causes inconsistency.
>>               */
>>              result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -                                                 referenced);
>> +                                                 referenced, HPAGE_PMD_ORDER);
>>              if (result != SCAN_SUCCEED)
>>                      goto out_nolock;
>>      }
>> @@ -1244,6 +1294,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>      pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>      if (pte) {
>>              result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> +                                                  HPAGE_PMD_ORDER,
>>                                                    &compound_pagelist);
>>              spin_unlock(pte_ptl);
>>      } else {
>> @@ -1274,6 +1325,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>
>>      result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>                                         vma, address, pte_ptl,
>> +                                       HPAGE_PMD_ORDER,
>>                                         &compound_pagelist);
>>      pte_unmap(pte);
>>      if (unlikely(result != SCAN_SUCCEED))
>> @@ -1318,9 +1370,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>      unsigned long addr;
>>      spinlock_t *ptl;
>>      int node = NUMA_NO_NODE, unmapped = 0;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> -    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> -    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>>
>>      VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>
>> @@ -2371,8 +2423,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>>      int present, swap;
>>      int node = NUMA_NO_NODE;
>>      enum scan_result result = SCAN_SUCCEED;
>> -    unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
>> -    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>> +    int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>>
>>      present = 0;
>>      swap = 0;
>> --
>> 2.53.0
>>
>>
>



* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: Steven Rostedt @ 2026-04-29 14:52 UTC
  To: SUVONOV BUNYOD
  Cc: akpm, vbabka, linux-mm, mhiramat, mathieu desnoyers,
	linux-trace-kernel, linux-kernel, surenb, mhocko, jackmanb,
	hannes, ziy, david, vishal moola, corbet, skhan, linux-doc
In-Reply-To: <1453063691.2584758.1777433513691.JavaMail.zimbra@sjtu.edu.cn>

On Wed, 29 Apr 2026 11:31:53 +0800 (CST)
SUVONOV BUNYOD <b.suvonov@sjtu.edu.cn> wrote:

> Thanks for reviewing Steven,
> 
> >Why this change? It makes it much harder to understand.
> >
> >The above is not a normal macro. Ignore any checkpatch warnings about it.
> >The proper way to do the TP_STRUCT__entry() is to make it just like a struct:
> >
> >struct {
> >	unsigned long		pfn;
> >	unsigned int		order;
> >	int			migratetype;
> >};
> >
> >Thus, the macro should be:
> >
> >	TP_STRUCT__entry(
> >		__field(	unsigned long,	pfn		)
> >		__field(	unsigned int,	order		)
> >		__field(	int,		migratetype	)
> >		),  
> 
> 
> Yeah sorry for the formatting issue, will fix in v3. Any other concerns?
> What do you think about the introduction of those tracepoints themselves?
>

It's a basic tracepoint and nothing unusual about it. I only watch over how
tracepoints are created and some use cases and make sure they are done
properly. But the introduction of tracepoints in other subsystems is up to
the maintainers of those subsystems. They are the ones that know what is
useful or not.

In other words, it's up to the MM subsystem maintainers to decide.

-- Steve
 


* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-04-29 14:48 UTC
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <c82a0b73-67ff-45e6-a792-e610b35a5b2f@kernel.org>

On 4/27/26 1:52 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> The following cleanup reworks all the max_ptes_* handling into helper
>> functions. This increases the code readability and will later be used to
>> implement the mTHP handling of these variables.
>>
>> With these changes we abstract all the madvise_collapse() special casing
>> (don't respect the sysctls) away from the functions that utilize them. This
>> will later be used in this series to cleanly restrict mTHP collapse behaviors.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 78 insertions(+), 36 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index afac6bc4e76d..f42b55421191 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -348,6 +348,58 @@ static bool pte_none_or_zero(pte_t pte)
>>      return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>>   }
>>
>> +/**
>> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>
> empty PTE or PTE mapping the shared zeropage ? That should be clarified also below.

Ah fair point, "empty" isn't the best representation of a "none"/zeropage.

>
>> + * @cc: The collapse control struct
>> + * @vma: The vma to check for userfaultfd
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * empty page.
>
> Not completely accurate due to uffd. And it's not really "empty page".

Sorry, I forgot to update this comment. I originally planned on skipping
the VMA passing, but then figured later that it would make the code even
more uniform (as you suggested).

>
> Is that information really necessary for the caller? I'd suggest you drop this
> here and instead add a comment inline above the "return HPAGE_PMD_NR;".

Yeah, I'm not really sure; I can shorten them. I was heeding Lorenzo's
request to add these with docstring headers.

>
>> + *
>> + * Return: Maximum number of empty PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>> +            struct vm_area_struct *vma)
>> +{
>> +    if (vma && userfaultfd_armed(vma))
>> +            return 0;
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_none;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
>
> "shared PTE" is not quite clear.
>
> "PTEs that map shared anonymous pages" ?

That works for me, thank you

>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * shared page.
>
> Same comment as above.

ack

>
>> + *
>> + * Return: Maximum number of shared PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>> +{
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_shared;
>> +}
>> +
>> +/**
>> + * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
>
> We're actually checking non-present page table entries (anonymous THP collapse)
> or non-present pagecache entries (file THP collapse).
>
> I wonder if there is an easy way to clarify that here, at least in the
> description (confusing name can stay unless we find something better).

I'll update the comment to include some form of this. In my mind the
name should probably stay relatively consistent to the sysctl value.

>
>> + * @cc: The collapse control struct
>> + *
>> + * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
>> + * swap page.
>
> Ditto.

ack!

>
>> + *
>> + * Return: Maximum number of swap PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>> +{
>> +    if (!cc->is_khugepaged)
>> +            return HPAGE_PMD_NR;
>> +    return khugepaged_max_ptes_swap;
>> +}
>> +
>>   int hugepage_madvise(struct vm_area_struct *vma,
>>                   vm_flags_t *vm_flags, int advice)
>>   {
>> @@ -546,21 +598,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>      pte_t *_pte;
>>      int none_or_zero = 0, shared = 0, referenced = 0;
>>      enum scan_result result = SCAN_FAIL;
>> +    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> These could be const, right? Or will that change in future patches?

Yes I believe these can be const now! Thank you

>
>>
>>      for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>           _pte++, addr += PAGE_SIZE) {
>>              pte_t pteval = ptep_get(_pte);
>>              if (pte_none_or_zero(pteval)) {
>> -                    ++none_or_zero;
>> -                    if (!userfaultfd_armed(vma) &&
>> -                        (!cc->is_khugepaged ||
>> -                         none_or_zero <= khugepaged_max_ptes_none)) {
>> -                            continue;
>> -                    } else {
>> +                    if (++none_or_zero > max_ptes_none) {
>>                              result = SCAN_EXCEED_NONE_PTE;
>>                              count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>                              goto out;
>>                      }
>> +                    continue;
>>              }
>>              if (!pte_present(pteval)) {
>>                      result = SCAN_PTE_NON_PRESENT;
>> @@ -591,9 +641,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>
>>              /* See collapse_scan_pmd(). */
>>              if (folio_maybe_mapped_shared(folio)) {
>> -                    ++shared;
>> -                    if (cc->is_khugepaged &&
>> -                        shared > khugepaged_max_ptes_shared) {
>> +                    if (++shared > max_ptes_shared) {
>>                              result = SCAN_EXCEED_SHARED_PTE;
>>                              count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>                              goto out;
>> @@ -1270,6 +1318,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>>      unsigned long addr;
>>      spinlock_t *ptl;
>>      int node = NUMA_NO_NODE, unmapped = 0;
>> +    unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>> +    unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>> +    unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>
> Same question here.

ack! will adjust.

>
>>
>>      VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>
>
>
> In general, LGTM. With the doc fixed up
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thank you, I'll get those updated.
>


^ permalink raw reply

* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-04-29 14:43 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260420131549.3673619-1-usama.arif@linux.dev>

On 4/20/26 7:15 AM, Usama Arif wrote:
> On Sun, 19 Apr 2026 12:57:40 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> The following cleanup reworks all the max_ptes_* handling into helper
>> functions. This increases the code readability and will later be used to
>> implement the mTHP handling of these variables.
>>
>> With these changes we abstract all the madvise_collapse() special casing
>> (don't respect the sysctls) away from the functions that utilize them. This
>> will later be used in this series to cleanly restrict mTHP collapse behaviors.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 78 insertions(+), 36 deletions(-)
>>
>
> The old code re-read khugepaged_max_ptes_* on every loop iteration; the new
> code snapshots them once per scan call. If userspace writes the sysctl
> mid-scan, old behavior reacted within the scan, new behavior uses the value
> sampled at entry. This is completely ok IMO, but might be good to call out.
>
> Also might be good to write no functional change intended apart from
> above in the commit message?

Ah, good point! I'll clear that up.
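
To make the behavioral difference concrete, here is a small user-space
sketch of the snapshot-at-entry semantics. It is an illustration under
assumptions, not the patch itself: HPAGE_PMD_NR is fixed at 512, the VMA
check is reduced to a bool, and scan_none() with its change_at/new_limit
parameters is a hypothetical harness that simulates a sysctl write in the
middle of a scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_NR 512u	/* assumed: 2 MiB PMD with 4 KiB pages */

/* Stand-in for the sysctl-backed tunable; may change at any time. */
static unsigned int khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;

struct collapse_control {
	bool is_khugepaged;
};

/* Mirrors the helper quoted in the patch, with the VMA reduced to a bool. */
static unsigned int collapse_max_ptes_none(const struct collapse_control *cc,
					   bool uffd_armed)
{
	if (uffd_armed)
		return 0;
	if (!cc->is_khugepaged)
		return HPAGE_PMD_NR;
	return khugepaged_max_ptes_none;
}

enum scan_result { SCAN_SUCCEED, SCAN_EXCEED_NONE_PTE };

/*
 * Scan nr entries; none[i] marks a none/zero PTE. The limit is sampled
 * once at entry, so the simulated tunable write at index change_at does
 * not affect the scan already in progress.
 */
static enum scan_result scan_none(const bool *none, size_t nr,
				  const struct collapse_control *cc,
				  bool uffd_armed,
				  size_t change_at, unsigned int new_limit)
{
	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, uffd_armed);
	unsigned int none_or_zero = 0;

	for (size_t i = 0; i < nr; i++) {
		if (i == change_at)	/* simulated sysctl write mid-scan */
			khugepaged_max_ptes_none = new_limit;
		if (none[i] && ++none_or_zero > max_ptes_none)
			return SCAN_EXCEED_NONE_PTE;
	}
	return SCAN_SUCCEED;
}
```

With a limit of 4 sampled at entry, lowering the tunable to 0 partway
through a four-entry scan still succeeds; only the next scan sees the new
value. Passing change_at equal to nr disables the simulated write.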

>
> Acked-by: Usama Arif <usama.arif@linux.dev>

Thank you :)


>


^ permalink raw reply

* Re: [PATCH 7.2 v16 02/13] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-04-29 14:36 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <06c7e6a7-60af-480e-afd9-700e985ca2ba@kernel.org>

On 4/27/26 1:41 PM, David Hildenbrand (Arm) wrote:
> On 4/19/26 20:57, Nico Pache wrote:
>> From: Dev Jain <dev.jain@arm.com>
>>
>> Pass order to alloc_charge_folio() and update mTHP statistics.
>>
>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Co-developed-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>
> Your SOB should come last, the order represents the history of this patch

Ah ok thank you, sorry about that.


>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
>


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-29 13:42 UTC (permalink / raw)
  To: Arun George/Arun George
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <1891546521.01777455002601.JavaMail.epsvc@epcpadp1new>

On Wed, Apr 29, 2026 at 11:45:26AM +0530, Arun George/Arun George wrote:
> On 28-04-2026 03:58 am, Gregory Price wrote:
> > On Mon, Apr 27, 2026 at 06:02:57PM +0530, Arun George wrote:
> >>
> >> Any particular workload you are targeting with
> >> this (which can tolerate this latency)?
> >>
> >> Any deployments you think of where the goal is a capacity expansion
> >> with a compromise in performance?
> >>
> > Primary use cases for us are any workload that benefits from zswap -
> > which is many, many (many, many [many, many]) workloads.
> > 
> A curious question please. If the primary use case is swap, can't we 
> handle this problem statement by re-using the zsmalloc allocation classes?
>

I'm using swap semantics for allocation ("demote + leafent"), but on fault,
rather than removing the swap entry, we leave it cached and replace the
page table entry with a read-only mapping (on a read fault).

If there's a writable budget, and the node is under that budget, we may
also allow upgrading the read-only page to be writable (at which point
we would reap the swap entry).

This requires careful reverse-mapping in case there are multiple mappers
of the same folio.

Since otherwise the allocation is just alloc_pages_node(), and the fault
patterns differ from typical swap - I didn't see the need to overcomplicate
things by cramming the logic into zswap/zsmalloc instead of just making
it its own vswap[1] backend that sits in front of zswap.

vswap makes it easy to writeback a cram page to swap in the case where
the device is over-pressured and we need to make room (close the node,
disallow new cram entries, writeback existing cram entries to swap).

[1] vswap: https://lore.kernel.org/linux-mm/?t=20260320192741

> A separate size class can be reserved for non-compressed pages in 
> zsmalloc. And this interface could be used by zswap, zram etc. (We have 
> been using this implementation for testing btw.). This does not require 
> additional book-keeping or buddy allocator.
> 

The other reason not to overload an existing mechanism is because these
devices (that I've seen) cannot provide per-page compressibility stats,
and so it would end up just looking like a bunch of either
incompressible capacity or unknown compressed capacity.

That makes it harder for those components to reason about what to do
with their normal software-compressed capacity (for which they do have
that data).

> So write-control part need to handled in the specific back end driver of 
> private pages while the allocation control is a generic front-end sort 
> of, right? (Ex: zswap cram back end for compressed devices case.)


write control is handled by the OS in three ways:

   1) No file memory (no page cache)
      We get this for free using the swap semantics
      This prevents buffered i/o from bypassing page table controls

   2) User allocations only (or at least swap-eligible only)
      This prevents catastrophic system failure if the device fails

   3) Page table mapping control (disallow direct writes)
      This prevents uncontended writes to compressed memory by the cpu


allocation control is handled via private nodes - the driver which
hotplugs the private nodes hands that node to cram - and cram is now
aware of that capacity and will use __GFP_PRIVATE to allocate from that
node.   Removal of the private node from the fallback zonelist and the
lack of __GFP_PRIVATE in all other paths prevent normal buddy allocator
users from accessing that memory.

> 
> Great! I believe "writable budget" could be an interesting idea which 
> can solve the 'bus error' sort of scenarios due to device not capable of 
> taking any more writes. The write budget could be replenished using the 
> control path and writes will not go ahead without the budget available, 
> right?
>

Write budget is simple:

budget=1  (up to 1 page can be writable)
   1) swap 1 page ->  cram alloc 1 page, put VSWAP_CRAM in PTE
   2) read-fault  ->  cram upgrades VSWAP_CRAM to R/O PTE
   3) write-fault ->
      a) if (writable_cnt < budget) { writable_cnt++; mkwrite(pte); }
      b) else:  normal swap semantic -> promote to normal memory
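
The three steps above can be sketched as a tiny user-space state machine.
This is only an illustration under assumptions: every name here
(cram_node, cram_pte_state, cram_read_fault(), cram_write_fault()) is
hypothetical, and the sketch assumes step 3a counts the page against the
number of writable pages rather than growing the budget.

```c
#include <assert.h>

/* Hypothetical per-page states for the sketch. */
enum cram_pte_state {
	CRAM_SWAPPED,	/* VSWAP_CRAM entry in the PTE */
	CRAM_READ_ONLY,	/* mapped R/O from the cram node */
	CRAM_WRITABLE,	/* mapped R/W, counted against the budget */
	CRAM_PROMOTED,	/* moved to normal memory */
};

/* Hypothetical per-node accounting. */
struct cram_node {
	unsigned int budget;		/* max pages allowed to be writable */
	unsigned int writable_cnt;	/* pages currently writable */
};

/* Step 2: a read fault upgrades a swapped entry to a read-only mapping. */
static enum cram_pte_state cram_read_fault(enum cram_pte_state st)
{
	return st == CRAM_SWAPPED ? CRAM_READ_ONLY : st;
}

/* Step 3: a write fault either consumes budget or promotes the page. */
static enum cram_pte_state cram_write_fault(struct cram_node *node,
					    enum cram_pte_state st)
{
	if (st == CRAM_WRITABLE || st == CRAM_PROMOTED)
		return st;		/* already writable somewhere */
	if (node->writable_cnt < node->budget) {
		node->writable_cnt++;	/* 3a: stay on the node, now R/W */
		return CRAM_WRITABLE;
	}
	return CRAM_PROMOTED;		/* 3b: normal swap semantics */
}
```

With budget=1, the first page that takes a write fault becomes writable in
place and the second is promoted to normal memory, which matches the
walkthrough above.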

The catch with the writable budget is we may not always be able to catch
all frees of the vswap pages - meaning we get zombie pages in the vswap
tables.  But this is ok if we run a regular kthread to scan the vswap entry
list to reap zombies.

This also gives us a great place to TRIM/FLUSH those pages to release
the capacity without zeroing them.


Meanwhile - use ballooning and a simple shrinker to dynamically size the
region to respond to real compression ratio.


All said and done - you get something close to zswap but with R/O
mappings for all entries, and optional R/W-mappings for administrators
who know something about their workload and can afford to take the risk
of some amount of capacity being written to uncontended in exchange for
performance.

The writable-budget is a risk-dial:  How much do you trust your workload
to not spew un/poorly-compressible memory?  The write-budget is a direct
measure of that. (so take P99.99999 compression ratios, and you can make
a good chunk of that writable).

~Gregory


^ permalink raw reply

* Re: [PATCH v3 00/28] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Chuck Lever @ 2026-04-29 13:41 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260428-dir-deleg-v3-0-5a0780ba9def@kernel.org>



On Tue, Apr 28, 2026, at 3:09 AM, Jeff Layton wrote:
> Re-posting the set per Christian's request. The only difference in this
> version is a small error handling fix in alloc_init_dir_deleg(). The old
> version could crash since release_pages() can't handle an array with
> NULL pointers in it.
>
> ---------------------------------8<------------------------------------
>
> This patchset builds on the directory delegation work we did a few
> months ago, to add support for CB_NOTIFY callbacks for some events. In
> particular, creates, unlinks and renames. The server also sends updated
> directory attributes in the notifications. With this support, the client
> can register interest in a directory and get notifications about changes
> within it without losing its lease.
>
> The series starts with patches to allow the vfs to ignore certain types
> of events on directories. nfsd can then request these sorts of
> delegations on directories, and then set up inotify watches on the
> directory to trigger sending CB_NOTIFY events.
>
> This has mainly been tested with pynfs, with some new testcases that
> I'll be posting soon. They seem to work fine with those tests, but I
> don't think we'll want to merge these until we have a complete
> client-side implementation to test against.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Changes in v3:
> - Fix error handling in alloc_init_dir_deleg()
> - Link to v2: 
> https://lore.kernel.org/r/20260416-dir-deleg-v2-0-851426a550f6@kernel.org
>
> Changes in v2:
> - Fix __break_lease handling with different lease types on flc_lease 
> list
> - Add FSNOTIFY_EVENT_RENAME data type to properly handle 
> cross-directory rename events
> - Display fsnotify mask symbolically in tracepoints
> - New tracepoint in fsnotify()
> - Recalc fsnotify mask after unlocking lease instead of before
> - Don't notify client that is making the changes
> - After sending CB_NOTIFY, requeue if new events came in while running
> - Document removal of NFS4_VERIFIER_SIZE/NFS4_FHSIZE from UAPI headers
> - Properly release nfsd_dir_fsnotify_group on server shutdown
> - Link to v1: 
> https://lore.kernel.org/r/20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org
>
> ---
> Jeff Layton (28):
>       filelock: pass current blocking lease to 
> trace_break_lease_block() rather than "new_fl"
>       filelock: add support for ignoring deleg breaks for dir change 
> events
>       filelock: add a tracepoint to start of break_lease()
>       filelock: add an inode_lease_ignore_mask helper
>       fsnotify: new tracepoint in fsnotify()
>       fsnotify: add fsnotify_modify_mark_mask()
>       fsnotify: add FSNOTIFY_EVENT_RENAME data type
>       nfsd: check fl_lmops in nfsd_breaker_owns_lease()
>       nfsd: add protocol support for CB_NOTIFY
>       nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
>       nfsd: allow nfsd to get a dir lease with an ignore mask
>       nfsd: update the fsnotify mark when setting or removing a dir 
> delegation
>       nfsd: make nfsd4_callback_ops->prepare operation bool return
>       nfsd: add callback encoding and decoding linkages for CB_NOTIFY
>       nfsd: use RCU to protect fi_deleg_file
>       nfsd: add data structures for handling CB_NOTIFY
>       nfsd: add notification handlers for dir events
>       nfsd: add tracepoint to dir_event handler
>       nfsd: apply the notify mask to the delegation when requested
>       nfsd: add helper to marshal a fattr4 from completed args
>       nfsd: allow nfsd4_encode_fattr4_change() to work with no export
>       nfsd: send basic file attributes in CB_NOTIFY
>       nfsd: allow encoding a filehandle into fattr4 without a svc_fh
>       nfsd: add a fi_connectable flag to struct nfs4_file
>       nfsd: add the filehandle to returned attributes in CB_NOTIFY
>       nfsd: properly track requested child attributes
>       nfsd: track requested dir attributes
>       nfsd: add support to CB_NOTIFY for dir attribute changes
>
>  Documentation/sunrpc/xdr/nfs4_1.x    | 264 ++++++++++++++-
>  fs/attr.c                            |   2 +-
>  fs/locks.c                           | 118 +++++--
>  fs/namei.c                           |  31 +-
>  fs/nfsd/filecache.c                  |  70 +++-
>  fs/nfsd/nfs4callback.c               |  60 +++-
>  fs/nfsd/nfs4layouts.c                |   5 +-
>  fs/nfsd/nfs4proc.c                   |  17 +
>  fs/nfsd/nfs4state.c                  | 551 ++++++++++++++++++++++++++++----
>  fs/nfsd/nfs4xdr.c                    | 323 +++++++++++++++++--
>  fs/nfsd/nfs4xdr_gen.c                | 601 ++++++++++++++++++++++++++++++++++-
>  fs/nfsd/nfs4xdr_gen.h                |  20 +-
>  fs/nfsd/state.h                      |  72 ++++-
>  fs/nfsd/trace.h                      |  23 ++
>  fs/nfsd/xdr4.h                       |   5 +
>  fs/nfsd/xdr4cb.h                     |  12 +
>  fs/notify/fsnotify.c                 |   5 +
>  fs/notify/mark.c                     |  29 ++
>  fs/posix_acl.c                       |   4 +-
>  fs/xattr.c                           |   4 +-
>  include/linux/filelock.h             |  54 +++-
>  include/linux/fsnotify.h             |   8 +-
>  include/linux/fsnotify_backend.h     |  21 ++
>  include/linux/nfs4.h                 | 127 --------
>  include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
>  include/trace/events/filelock.h      |  38 ++-
>  include/trace/events/fsnotify.h      |  51 +++
>  include/trace/misc/fsnotify.h        |  35 ++
>  include/uapi/linux/nfs4.h            |   2 -
>  29 files changed, 2519 insertions(+), 324 deletions(-)
> ---
> base-commit: f4d71dd7fd9cec357c32431fa55c107b96008312
> change-id: 20260325-dir-deleg-339066dd1017
>
> Best regards,
> -- 
> Jeff Layton <jlayton@kernel.org>

For the series:

Acked-by: Chuck Lever <chuck.lever@oracle.com>


-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: David Hildenbrand (Arm) @ 2026-04-29 13:31 UTC (permalink / raw)
  To: Darko Tominac, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
  Cc: xe-linux-external, danielwa, linux-kernel, linux-trace-kernel,
	linux-perf-users, linux-mm
In-Reply-To: <20260429131522.4049054-1-dtominac@cisco.com>

On 4/29/26 15:15, Darko Tominac wrote:
> When uprobes are active, MADV_DONTNEED can discard file-backed pages
> that contain uprobe software breakpoint instructions.  Because the

If my memory serves me right, uprobes can only be installed in MAP_PRIVATE file
mappings. Installing a uprobe breaks CoW by installing an anonymous page.

Not a file-backed page.

> uprobe infrastructure does not re-instrument pages on individual page
> faults (uprobe_mmap() is only called during VMA creation, not on
> page-in), the breakpoints are silently lost once the discarded pages are
> re-read from the backing file.  The probes stop firing with no error
> indication, and the only recovery is to unregister and re-register the
> affected uprobes.

Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
MADV_DONTNEED debugger breakpoints/set data etc. :)

> 
> Note that MADV_FREE is not affected: it only operates on anonymous VMAs
> (madvise_free_single_vma() rejects non-anonymous VMAs with -EINVAL),
> while uprobes only instrument file-backed mappings, so the two can never
> overlap.
> 
> A concrete example is a userspace memory reclamation subsystem that
> periodically calls madvise(MADV_DONTNEED) on file-backed text pages to
> release memory. 

It shouldn't do that on a MAP_PRIVATE file-backed VMA. It breaks the program,
including uprobes and anything else that breaks CoW in there.

If it's using MADV_DONTNEED, it is damaging the application.

MADV_DONTNEED is not for memory reclaim.

-- 
Cheers,

David

^ permalink raw reply

* [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Darko Tominac @ 2026-04-29 13:15 UTC (permalink / raw)
  To: Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	David Hildenbrand, Vlastimil Babka, Jann Horn
  Cc: xe-linux-external, danielwa, linux-kernel, linux-trace-kernel,
	linux-perf-users, linux-mm

When uprobes are active, MADV_DONTNEED can discard file-backed pages
that contain uprobe software breakpoint instructions.  Because the
uprobe infrastructure does not re-instrument pages on individual page
faults (uprobe_mmap() is only called during VMA creation, not on
page-in), the breakpoints are silently lost once the discarded pages are
re-read from the backing file.  The probes stop firing with no error
indication, and the only recovery is to unregister and re-register the
affected uprobes.

Note that MADV_FREE is not affected: it only operates on anonymous VMAs
(madvise_free_single_vma() rejects non-anonymous VMAs with -EINVAL),
while uprobes only instrument file-backed mappings, so the two can never
overlap.

A concrete example is a userspace memory reclamation subsystem that
periodically calls madvise(MADV_DONTNEED) on file-backed text pages to
release memory.  This silently clears uprobe breakpoints placed by
eBPF-based security and tracing tools that use uprobes to attach eBPF
programs to user-space functions, causing those tools to stop
functioning within seconds of the first reclamation pass.

Add a check in madvise_dontneed_free(), which handles MADV_DONTNEED,
MADV_DONTNEED_LOCKED and MADV_FREE, that when CONFIG_UPROBES is enabled
detects whether the target range contains active uprobes:

  - Fast path: if no uprobes are registered system-wide, or the VMA is
    not file-backed (uprobes only instrument file-backed mappings, so
    anonymous VMAs -- including MADV_FREE targets -- can never contain
    breakpoints), or no uprobes are present in the VMA range, proceed
    with the discard as before.
  - Slow path: when uprobes are detected in the range, use
    vma_first_uprobe_addr() to jump directly to each uprobe page via
    the rbtree, zapping the clean ranges between them.  This is
    O(M * log N) where M is the number of uprobes in the range and
    N is the total uprobe count, rather than O(pages).  madvise()
    still returns success, consistent with the advisory nature of
    MADV_DONTNEED.

When CONFIG_UPROBES is not configured, the original behaviour is
preserved with no overhead.

To support the above, export vma_has_uprobes() and add new helpers
any_uprobes_registered() and vma_first_uprobe_addr() in the uprobes
subsystem.  vma_first_uprobe_addr() returns the page-aligned virtual
address of the lowest-offset uprobe in a given VMA range by leveraging
the (inode, offset)-sorted global rbtree.
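
The slow path described above can be illustrated with a user-space sketch.
It is not the patch code: a sorted array of probe addresses stands in for
the (inode, offset)-sorted rbtree, the page size is assumed to be 4096, and
all demo_* names are hypothetical. As in the description, 0 means "no
probe found", so the sketch assumes a page-aligned, non-zero start.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size */

struct demo_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Lowest index i with probes[i] >= addr; probes must be sorted ascending. */
static size_t demo_lower_bound(const unsigned long *probes, size_t n,
			       unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (probes[mid] < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Page-aligned address of the first probe in [start, end), or 0 if none. */
static unsigned long demo_first_probe_page(const unsigned long *probes,
					   size_t n, unsigned long start,
					   unsigned long end)
{
	size_t i = demo_lower_bound(probes, n, start);

	if (i == n || probes[i] >= end)
		return 0;
	return probes[i] & ~(DEMO_PAGE_SIZE - 1);
}

/*
 * Collect the probe-free sub-ranges of [start, end) that would be zapped,
 * skipping each page that holds a probe. Returns the number of ranges
 * written to out (at most max_out).
 */
static size_t demo_clean_ranges(const unsigned long *probes, size_t n,
				unsigned long start, unsigned long end,
				struct demo_range *out, size_t max_out)
{
	unsigned long cur = start;
	size_t cnt = 0;

	while (cur < end && cnt < max_out) {
		unsigned long page = demo_first_probe_page(probes, n, cur, end);
		unsigned long stop = page ? page : end;

		if (stop > cur) {
			out[cnt].start = cur;
			out[cnt].end = stop;
			cnt++;
		}
		if (!page)
			break;
		cur = page + DEMO_PAGE_SIZE;	/* jump past the probe page */
	}
	return cnt;
}
```

Each iteration performs one logarithmic lookup and advances past one probe
page, so the cost scales with the number of probe pages in the range
rather than with the number of pages.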

Cc: xe-linux-external@cisco.com
Cc: danielwa@cisco.com
Signed-off-by: Darko Tominac <dtominac@cisco.com>
---
 include/linux/uprobes.h | 21 +++++++++++
 kernel/events/uprobes.c | 79 +++++++++++++++++++++++++++++++++++++++--
 mm/madvise.c            | 73 +++++++++++++++++++++++++++++++++----
 3 files changed, 164 insertions(+), 9 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f548fea2adec..9ce5c46fd2e9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -212,6 +212,11 @@ extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consum
 extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+					   unsigned long start,
+					   unsigned long end);
+extern bool any_uprobes_registered(void);
 extern void uprobe_start_dup_mmap(void);
 extern void uprobe_end_dup_mmap(void);
 extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
@@ -278,6 +283,22 @@ static inline void
 uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 }
+static inline bool
+vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end)
+{
+	return false;
+}
+static inline unsigned long
+vma_first_uprobe_addr(struct vm_area_struct *vma, unsigned long start,
+		      unsigned long end)
+{
+	return 0;
+}
+static inline bool any_uprobes_registered(void)
+{
+	return false;
+}
 static inline void uprobe_start_dup_mmap(void)
 {
 }
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..0f8aea99b96f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -152,6 +152,19 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
 	return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
 }
 
+/**
+ * any_uprobes_registered - check if any uprobes are currently registered
+ *
+ * Check whether the global uprobe rbtree has any entries, indicating
+ * that at least one uprobe is currently active in the system.
+ *
+ * Return: true if one or more uprobes are registered, false otherwise.
+ */
+bool any_uprobes_registered(void)
+{
+	return !no_uprobe_events();
+}
+
 /**
  * is_swbp_insn - check if instruction is breakpoint instruction.
  * @insn: instruction to be checked.
@@ -1635,8 +1648,16 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	return 0;
 }
 
-static bool
-vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+/**
+ * vma_has_uprobes - check whether a vma range contains any uprobes.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Return: true if at least one uprobe is registered in [@start, @end),
+ * false otherwise.
+ */
+bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 	loff_t min, max;
 	struct inode *inode;
@@ -1654,6 +1675,60 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	return !!n;
 }
 
+/**
+ * vma_first_uprobe_addr - find first uprobe in a vma range.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Used by madvise to skip directly to uprobe pages.
+ *
+ * Return: the page-aligned virtual address of the first uprobe in
+ * [@start, @end), or 0 if none exists.
+ */
+unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+				    unsigned long start, unsigned long end)
+{
+	loff_t min, max, first_offset;
+	struct inode *inode;
+	struct rb_node *n, *t;
+	struct uprobe *u;
+
+	/* No uprobes possible on anonymous mappings */
+	if (!vma->vm_file)
+		return 0;
+
+	/* Empty range -- nothing to search */
+	if (start >= end)
+		return 0;
+
+	inode = file_inode(vma->vm_file);
+
+	min = vaddr_to_offset(vma, start);
+	max = min + (end - start) - 1;
+
+	read_lock(&uprobes_treelock);
+	n = find_node_in_range(inode, min, max);
+	if (!n) {
+		read_unlock(&uprobes_treelock);
+		return 0;
+	}
+
+	/* Walk left to find the lowest offset in range */
+	u = rb_entry(n, struct uprobe, rb_node);
+	first_offset = u->offset;
+	for (t = rb_prev(n); t; t = rb_prev(t)) {
+		u = rb_entry(t, struct uprobe, rb_node);
+		if (u->inode != inode || u->offset < min)
+			break;
+		first_offset = u->offset;
+	}
+	read_unlock(&uprobes_treelock);
+
+	/* Return page-aligned vaddr containing this uprobe */
+	return PAGE_ALIGN_DOWN(offset_to_vaddr(vma, first_offset));
+}
+
 /*
  * Called in context of a munmap of a vma.
  */
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..c73f1131224b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
 #include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/uprobes.h>
 
 #include <asm/tlb.h>
 
@@ -862,6 +863,30 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
 	return 0;
 }
 
+static long madvise_dontneed_free_range(struct madvise_behavior *madv_behavior,
+					unsigned long start, unsigned long end)
+{
+	struct madvise_behavior_range *range = &madv_behavior->range;
+	unsigned long saved_start = range->start;
+	unsigned long saved_end = range->end;
+	int behavior = madv_behavior->behavior;
+	long ret;
+
+	range->start = start;
+	range->end = end;
+
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
+		ret = madvise_dontneed_single_vma(madv_behavior);
+	else if (behavior == MADV_FREE)
+		ret = madvise_free_single_vma(madv_behavior);
+	else
+		ret = -EINVAL;
+
+	range->start = saved_start;
+	range->end = saved_end;
+	return ret;
+}
+
 static
 bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior)
 {
@@ -898,7 +923,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
 {
 	struct mm_struct *mm = madv_behavior->mm;
 	struct madvise_behavior_range *range = &madv_behavior->range;
-	int behavior = madv_behavior->behavior;
+	unsigned long cur, end, uprobe_addr;
 
 	if (!madvise_dontneed_free_valid_vma(madv_behavior))
 		return -EINVAL;
@@ -947,12 +972,46 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
 		VM_WARN_ON(range->start > range->end);
 	}
 
-	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
-		return madvise_dontneed_single_vma(madv_behavior);
-	else if (behavior == MADV_FREE)
-		return madvise_free_single_vma(madv_behavior);
-	else
-		return -EINVAL;
+	/*
+	 * Preserve uprobes: if any uprobes are active in this VMA range,
+	 * avoid discarding pages that contain active breakpoints.
+	 *
+	 * Fast path: if no uprobes are registered system-wide, or the VMA
+	 * is not file-backed (uprobes only instrument file-backed mappings,
+	 * so anonymous VMAs can never contain breakpoints), or no uprobes
+	 * are present in this VMA range, proceed with the full operation.
+	 */
+	if (likely(!any_uprobes_registered()) ||
+	    !madv_behavior->vma->vm_file ||
+	    !vma_has_uprobes(madv_behavior->vma, range->start, range->end))
+		return madvise_dontneed_free_range(madv_behavior,
+						   range->start, range->end);
+
+	/*
+	 * Slow path: jump from uprobe to uprobe via rbtree lookup, zapping
+	 * the clean range before each uprobe page. This is O(M * log N)
+	 * where M is the number of uprobes in the range and N is the total
+	 * uprobe count, versus O(pages) for a page-by-page scan. 'cur'
+	 * tracks the beginning of the current clean range.
+	 */
+	cur = range->start;
+	end = range->end;
+	while (cur < end) {
+		uprobe_addr = vma_first_uprobe_addr(madv_behavior->vma,
+						    cur, end);
+		if (!uprobe_addr) {
+			/* No more uprobes - zap the rest */
+			madvise_dontneed_free_range(madv_behavior, cur, end);
+			break;
+		}
+		/* Zap the clean range before the uprobe page */
+		if (cur < uprobe_addr)
+			madvise_dontneed_free_range(madv_behavior, cur,
+						    uprobe_addr);
+		/* Skip past the uprobe page */
+		cur = uprobe_addr + PAGE_SIZE;
+	}
+	return 0;
 }
 
 static long madvise_populate(struct madvise_behavior *madv_behavior)
-- 
2.35.6
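
The slow-path loop in madvise_dontneed_free() above can be sketched in
user space as follows. This is only an illustration under stated
assumptions: sketch_first_probe_page() is a hypothetical stand-in for
vma_first_uprobe_addr() that uses a linear scan instead of the rbtree
lookup, the page size is assumed to be 4096, probe addresses are assumed
to be non-zero, and the range start is assumed to be page-aligned.
Instead of zapping, the sketch records each probe-free sub-range.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

struct sketch_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/*
 * Hypothetical stand-in for vma_first_uprobe_addr(): return the
 * page-aligned address of the lowest probe in [start, end), or 0 if
 * there is none. A linear scan replaces the rbtree lookup.
 */
static unsigned long sketch_first_probe_page(const unsigned long *probes,
					     size_t nr, unsigned long start,
					     unsigned long end)
{
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long page = probes[i] & ~(SKETCH_PAGE_SIZE - 1);

		if (probes[i] < start || probes[i] >= end)
			continue;
		if (!best || page < best)
			best = page;
	}
	return best;
}

/*
 * Walk [start, end) from probe page to probe page and record every
 * probe-free sub-range in out[]. Returns the number of sub-ranges
 * found; at most max_out of them are stored.
 */
static size_t sketch_clean_ranges(const unsigned long *probes, size_t nr,
				  unsigned long start, unsigned long end,
				  struct sketch_range *out, size_t max_out)
{
	unsigned long cur = start;
	size_t n = 0;

	while (cur < end) {
		unsigned long page = sketch_first_probe_page(probes, nr,
							     cur, end);
		unsigned long stop = page ? page : end;

		/* Record the clean range before the probe page (or the rest) */
		if (cur < stop) {
			if (n < max_out) {
				out[n].start = cur;
				out[n].end = stop;
			}
			n++;
		}
		if (!page)
			break;
		/* Skip past the probe page */
		cur = page + SKETCH_PAGE_SIZE;
	}
	return n;
}
```

With probes at 0x3010, 0x3800 and 0x6000 and the range [0x1000, 0x8000),
the sketch records [0x1000, 0x3000), [0x4000, 0x6000) and
[0x7000, 0x8000): two probes on the same page cost only one skip.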


^ permalink raw reply related

* [PATCH] sched: Use trace_call__<tp>() to save a static branch
From: Gabriele Monaco @ 2026-04-29  9:41 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: linux-trace-kernel, Gabriele Monaco

The wrapper functions __trace_set_current_state() and
__trace_set_need_resched() allow the tracepoints to be called from code
outside sched/core.c. Those calls are already guarded by a
tracepoint_enabled(<tp>) check, so there is no need to repeat the check
inside the wrapper by using trace_<tp>().

Use the new trace_call__<tp>() API to call the tracepoint directly,
without the check. The wrapper functions must only be called after the
appropriate check.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/sched/core.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..c37562b02e24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -537,10 +537,14 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
 /* need a wrapper since we may need to trace from modules */
 EXPORT_TRACEPOINT_SYMBOL(sched_set_state_tp);
 
-/* Call via the helper macro trace_set_current_state. */
+/*
+ * Call via the helper macro trace_set_current_state.
+ * Calls to this function MUST be guarded by a
+ * tracepoint_enabled(sched_set_state_tp)
+ */
 void __trace_set_current_state(int state_value)
 {
-	trace_sched_set_state_tp(current, state_value);
+	trace_call__sched_set_state_tp(current, state_value);
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
@@ -1203,9 +1207,13 @@ static void __resched_curr(struct rq *rq, int tif)
 	}
 }
 
+/*
+ * Calls to this function MUST be guarded by a
+ * tracepoint_enabled(sched_set_need_resched_tp)
+ */
 void __trace_set_need_resched(struct task_struct *curr, int tif)
 {
-	trace_sched_set_need_resched_tp(curr, smp_processor_id(), tif);
+	trace_call__sched_set_need_resched_tp(curr, smp_processor_id(), tif);
 }
 EXPORT_SYMBOL_GPL(__trace_set_need_resched);
 

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.54.0
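
The pattern described in the commit message above can be sketched in
user space as follows. This is not the kernel tracepoint machinery:
every sketch_* name is hypothetical, a plain bool stands in for the
static branch, and counters make the number of checks visible. It only
illustrates why the second check is redundant once the caller has
already tested the guard.

```c
#include <stdbool.h>

static bool sketch_tp_on;	/* stands in for the static branch */
static int sketch_checks;	/* how many times the branch was tested */
static int sketch_events;	/* how many events were emitted */

/* Analogue of tracepoint_enabled(<tp>) */
static bool sketch_tracepoint_enabled(void)
{
	sketch_checks++;
	return sketch_tp_on;
}

/* Analogue of trace_call__<tp>(): emit without testing the branch */
static void sketch_trace_call(int state)
{
	(void)state;
	sketch_events++;
}

/* Analogue of trace_<tp>(): test the branch, then emit */
static void sketch_trace(int state)
{
	if (sketch_tracepoint_enabled())
		sketch_trace_call(state);
}

/* Out-of-line wrappers, before and after the change */
static void sketch_wrapper_before(int state)
{
	sketch_trace(state);
}

static void sketch_wrapper_after(int state)
{
	sketch_trace_call(state);
}

/* Caller-side helper: the guard the commit message relies on */
static void sketch_set_state(int state, void (*wrapper)(int))
{
	if (sketch_tracepoint_enabled())
		wrapper(state);
}
```

With the guard enabled, going through sketch_wrapper_before() tests the
branch twice per event, while sketch_wrapper_after() tests it once.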


^ permalink raw reply related

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Arun George/Arun George @ 2026-04-29  6:15 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <ae_i9IlIndumJWN3@gourry-fedora-PF4VCD3F>

On 28-04-2026 03:58 am, Gregory Price wrote:
> On Mon, Apr 27, 2026 at 06:02:57PM +0530, Arun George wrote:
>>
>> Any particular workload you are targeting with
>> this (which can tolerate this latency)?
>>
>> Any deployments you think of where the goal is a capacity expansion
>> with a compromise in performance?
>>
> Primary use cases for us are any workload that benefits from zswap -
> which is many, many (many, many [many, many]) workloads.
> 
A curious question, please: if the primary use case is swap, can't we
handle this by reusing the zsmalloc size classes?

A separate size class could be reserved for non-compressed pages in
zsmalloc, and this interface could be used by zswap, zram, etc. (We have
been using this implementation for testing, btw.) This does not require
additional bookkeeping or a buddy allocator.

But that approach would not give a generic solution, and it would not be
available to userland anyway!

>> And I believe the bear-proof cage might work in the normal scenarios,
>> but may not work for all.
> 
> If it can't work for all workloads, then it's likely not general purpose
> enough to find core kernel support and should seek to use the existing
> interfaces (DAX and friends).
> 
I agree. That is a good point.

> 
> You need two controls over compressed RAM for it to be reliable:
> 
>    - Allocation control (acquiring new struct page to write to)
>    - Write-control (preventing new writes to compressed pages)
> 
> Private nodes provide the allocation control.
> 
> A read-only mapping, and guarantee that only memory that can reach
> the device is userland memory - is the only way to control the cpu
> writes from the OS perspective.
> 
So the write-control part needs to be handled in the specific back-end
driver for private pages, while the allocation control is a sort of
generic front end, right? (E.g. a zswap cram back end for the
compressed-device case.)

>
> In the next version of the RFC i'll demonstrate cram.c as a new swap
> backend that allows for read-only mappings to be soft-faulted in,
> migration on write, isolation to ANON memory, and some optional
> settings that allow a device or administrator a "writable budget"
> which allows some number of pages to be made writable without migration.

Great! I believe the "writable budget" could be an interesting idea that
solves the 'bus error' sort of scenarios, where the device is not capable
of taking any more writes. The write budget could be replenished through
the control path, and writes would not go ahead without budget available,
right?

>
> ~Gregory
> 
~Arun George


^ permalink raw reply

* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Masami Hiramatsu @ 2026-04-29  8:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>

Hi Steve,

BTW, to prevent regressions during future expansions, how about adding the following line?

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2cabf8a23ec5..c5ee7920dec6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -979,6 +979,7 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
 	ucb = uprobe_buffer_get();
 	ucb->dsize = tu->tp.size + dsize;
 
+	BUILD_BUG_ON(MAX_UCB_BUFFER_SIZE < MAX_PROBE_EVENT_SIZE);
 	if (WARN_ON_ONCE(ucb->dsize > MAX_UCB_BUFFER_SIZE)) {
 		ucb->dsize = MAX_UCB_BUFFER_SIZE;
 		dsize = MAX_UCB_BUFFER_SIZE - tu->tp.size;
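
For illustration, the same compile-time guard can be sketched in plain
C11 with _Static_assert, the user-space counterpart of BUILD_BUG_ON().
The SKETCH_* names are hypothetical, and the buffer size of 4096 is an
assumed value; only the 3072 limit comes from the patch. The sketch also
assumes the fixed part (tp_size) never exceeds the event size limit,
which is what makes the subtraction in the clamp safe.

```c
#include <stddef.h>

/* Assumed values for illustration; only 3072 appears in the patch. */
#define SKETCH_MAX_UCB_BUFFER_SIZE	4096
#define SKETCH_MAX_PROBE_EVENT_SIZE	3072

/* The build fails if someone later raises the event limit past the buffer */
_Static_assert(SKETCH_MAX_UCB_BUFFER_SIZE >= SKETCH_MAX_PROBE_EVENT_SIZE,
	       "per-CPU buffer must hold the largest fixed-size event");

/*
 * Clamp the dynamic part so that fixed + dynamic fits in the buffer.
 * The subtraction cannot underflow as long as tp_size is at most
 * SKETCH_MAX_PROBE_EVENT_SIZE, which the assertion ties to the buffer.
 */
static size_t sketch_clamp_dsize(size_t tp_size, size_t dsize)
{
	if (tp_size + dsize > SKETCH_MAX_UCB_BUFFER_SIZE)
		dsize = SKETCH_MAX_UCB_BUFFER_SIZE - tp_size;
	return dsize;
}
```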

Thanks,

On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> There is currently no limit on how large an event probe can be. One could
> make an event greater than PAGE_SIZE, which makes the event useless: if it
> is bigger than the max event that can be recorded into the ring buffer,
> it will never be recorded.
> 
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.
> 
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/trace_probe.c | 6 ++++++
>  kernel/trace/trace_probe.h | 4 +++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e1c73065dae5..e0d3a0da26af 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1523,6 +1523,12 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
>  	parg->offset = *size;
>  	*size += parg->type->size * (parg->count ?: 1);
>  
> +	if (*size > MAX_PROBE_EVENT_SIZE) {
> +		ret = -E2BIG;
> +		trace_probe_log_err(ctx->offset, EVENT_TOO_BIG);
> +		goto fail;
> +	}
> +
>  	if (parg->count) {
>  		len = strlen(parg->type->fmttype) + 6;
>  		parg->fmt = kmalloc(len, GFP_KERNEL);
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 9fc56c937130..262d8707a3df 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -38,6 +38,7 @@
>  #define MAX_BTF_ARGS_LEN	128
>  #define MAX_DENTRY_ARGS_LEN	256
>  #define MAX_STRING_SIZE		PATH_MAX
> +#define MAX_PROBE_EVENT_SIZE	3072
>  
>  /* Reserved field names */
>  #define FIELD_STRING_IP		"__probe_ip"
> @@ -561,7 +562,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(BAD_TYPE4STR,		"This type does not fit for string."),\
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
> -	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),
> +	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply related

* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Masami Hiramatsu @ 2026-04-29  8:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>

On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> There is currently no limit on how large an event probe can be. One could
> make an event greater than PAGE_SIZE, which makes the event useless: if it
> is bigger than the max event that can be recorded into the ring buffer,
> it will never be recorded.
> 
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.

This looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> 
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/trace_probe.c | 6 ++++++
>  kernel/trace/trace_probe.h | 4 +++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e1c73065dae5..e0d3a0da26af 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1523,6 +1523,12 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
>  	parg->offset = *size;
>  	*size += parg->type->size * (parg->count ?: 1);
>  
> +	if (*size > MAX_PROBE_EVENT_SIZE) {
> +		ret = -E2BIG;
> +		trace_probe_log_err(ctx->offset, EVENT_TOO_BIG);
> +		goto fail;
> +	}
> +
>  	if (parg->count) {
>  		len = strlen(parg->type->fmttype) + 6;
>  		parg->fmt = kmalloc(len, GFP_KERNEL);
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 9fc56c937130..262d8707a3df 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -38,6 +38,7 @@
>  #define MAX_BTF_ARGS_LEN	128
>  #define MAX_DENTRY_ARGS_LEN	256
>  #define MAX_STRING_SIZE		PATH_MAX
> +#define MAX_PROBE_EVENT_SIZE	3072
>  
>  /* Reserved field names */
>  #define FIELD_STRING_IP		"__probe_ip"
> @@ -561,7 +562,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(BAD_TYPE4STR,		"This type does not fit for string."),\
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
> -	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),
> +	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
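
The size check in the quoted patch can be sketched in user space as
follows. This is a simplified model, not the parser itself: struct
sketch_arg and sketch_event_size() are hypothetical names, and each
argument is reduced to an element size and an optional array count. As
in the patch, the running total is checked after every argument and the
event is rejected with -E2BIG once it exceeds 3072 bytes.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_PROBE_EVENT_SIZE 3072	/* value from the patch */

struct sketch_arg {
	size_t type_size;	/* size of one element */
	size_t count;		/* array length, 0 for a scalar */
};

/*
 * Accumulate the argument sizes in order. Returns the total event size,
 * or -E2BIG as soon as the running total exceeds the limit.
 */
static long sketch_event_size(const struct sketch_arg *args, size_t nr)
{
	size_t size = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		size += args[i].type_size *
			(args[i].count ? args[i].count : 1);
		if (size > SKETCH_MAX_PROBE_EVENT_SIZE)
			return -E2BIG;
	}
	return (long)size;
}
```

An event of exactly 3072 bytes is still accepted, because the check uses
a strict greater-than comparison.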

^ permalink raw reply

* Re: [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Masami Hiramatsu @ 2026-04-29  8:40 UTC (permalink / raw)
  To: Shijia Hu; +Cc: naveen, davem, ananth, akpm, linux-kernel, linux-trace-kernel
In-Reply-To: <20260429032919.208790-1-hushijia1@uniontech.com>

On Wed, 29 Apr 2026 11:29:19 +0800
Shijia Hu <hushijia1@uniontech.com> wrote:

> When a kernel module that registered kprobes is unloaded without calling
> unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
> mark the probe(s) GONE.  If the probe is an aggrprobe, kill_kprobe()
> also marks all child probes GONE, but it does not remove them from
> the aggrprobe's list.

That sounds like a bug in the module.

> 
> The problem is that child probes whose struct kprobe resides in the
> unloading module's memory are freed along with the module, yet they
> remain on the aggrprobe's list.  Later, when another caller registers
> a kprobe at the same address, __get_valid_kprobe() walks that list
> and dereferences the freed child probe, causing a use-after-free.
> 
> Reproduction steps:
> 
>     1) Load module A which registers two kprobes on the same kernel
>        function address (e.g., do_nanosleep), causing them to be
>        aggregated under one aggrprobe.
> 
>     2) Unload module A without calling unregister_kprobe().
>        Module A's memory is freed, but its two child probes remain
>        on the aggrprobe's list as dangling pointers.

Do you mean "load a buggy kernel module and unload it, and the kernel
hits a use-after-free"? For example:

----
struct kprobe my_probe = {...};

init_module() {
	register_kprobe(&my_probe);
}
exit_module() {
	/* do nothing */
}
----

Yes, this causes a UAF because that module has a bug. Please call
unregister_kprobe() before the module is unloaded.
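
The ownership rule behind this can be sketched with a user-space
analogy. This is not kprobes code: the sketch_* registry is a
hypothetical singly linked list, and the "module" is just an object with
static storage. The point it illustrates is that whoever owns the
storage of a registered node has to unlink it in its exit path, because
the registry keeps a pointer into that storage.

```c
#include <stddef.h>

struct sketch_probe {
	const char *name;
	struct sketch_probe *next;
};

/* Global registry: holds pointers into storage owned by the callers */
static struct sketch_probe *sketch_probes;

static void sketch_register(struct sketch_probe *p)
{
	p->next = sketch_probes;
	sketch_probes = p;
}

static void sketch_unregister(struct sketch_probe *p)
{
	struct sketch_probe **pp;

	for (pp = &sketch_probes; *pp; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			p->next = NULL;
			return;
		}
	}
}

static size_t sketch_count(void)
{
	const struct sketch_probe *p;
	size_t n = 0;

	for (p = sketch_probes; p; p = p->next)
		n++;
	return n;
}

/* The "module": it owns the storage of its probe */
static struct sketch_probe sketch_my_probe = { .name = "do_nanosleep" };

static void sketch_module_init(void)
{
	sketch_register(&sketch_my_probe);
}

static void sketch_module_exit(void)
{
	/* Without this call the registry would keep a dangling pointer */
	sketch_unregister(&sketch_my_probe);
}
```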

Thanks,

> 
>     3) Load module B and register a kprobe on the same address
>        (e.g., do_nanosleep). register_kprobe() -> __get_valid_kprobe()
>        traverses the aggrprobe's list and dereferences the freed child
>        probe from module A, triggering a use-after-free and kernel panic.
> 
> The resulting crash looks like:
>     [  464.950864] BUG: kernel NULL pointer dereference, address: 0000000000000000
>     [  464.950872] #PF: supervisor read access in kernel mode
>     [  464.950874] #PF: error_code(0x0000) - not-present page
>     ...
>     [  464.950915] Call Trace:
>     [  464.950922]  <TASK>
>     [  464.950923]  register_kprobe+0x65/0x2e0
>     [  464.950928]  ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
>     [  464.950933]  stage2_init+0x37/0xff0 [kprobe_leak_stage2]
>     [  464.950938]  ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
>     [  464.950942]  do_one_initcall+0x56/0x2e0
>     [  464.950948]  do_init_module+0x60/0x230
>     ...
> 
>   Fix this by adding selective cleanup in kprobes_module_callback():
>   after calling kill_kprobe() on the aggrprobe, iterate its child list
>   and remove any child probe whose struct kprobe is inside the going
>   module's memory range (within_module_init / within_module_core).
> 
>   This is done in kprobes_module_callback() rather than kill_kprobe()
>   because kill_kprobe()'s semantic is "the probed code is going away,
>   mark probes GONE".  The lifetime of a probe is bound to the probed
>   code, not to the module containing the struct kprobe.  Child probes
>   owned by other still-loaded modules or by kmalloc (ftrace, perf,
>   kprobe-events) must stay on the list so they can be unregistered
>   later.  Only child probes whose memory is about to be freed need to
>   be removed from the list to prevent dangling pointers.
> 
> Fixes: e8386a0cb22f4 ("kprobes: support probing module __exit function")
> Signed-off-by: Shijia Hu <hushijia1@uniontech.com>
> ---
>  kernel/kprobes.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index bfc89083daa9..ff277314183c 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2664,6 +2664,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
>  				   unsigned long val, void *data)
>  {
>  	struct module *mod = data;
> +	struct hlist_node *tmp;
>  	struct hlist_head *head;
>  	struct kprobe *p;
>  	unsigned int i;
> @@ -2685,7 +2686,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
>  	 */
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
> -		hlist_for_each_entry(p, head, hlist)
> +		hlist_for_each_entry_safe(p, tmp, head, hlist) {
>  			if (within_module_init((unsigned long)p->addr, mod) ||
>  			    (checkcore &&
>  			     within_module_core((unsigned long)p->addr, mod))) {
> @@ -2702,6 +2703,26 @@ static int kprobes_module_callback(struct notifier_block *nb,
>  				 */
>  				kill_kprobe(p);
>  			}
> +
> +			/*
> +			 * Child probes are not on the kprobe hash list, so
> +			 * the above loop can not find them. If a child probe
> +			 * is allocated in the module's memory, it will become
> +			 * a dangling pointer after the module is freed.
> +			 */
> +			if (kprobe_aggrprobe(p)) {
> +				struct kprobe *kp, *kptmp;
> +
> +				list_for_each_entry_safe(kp, kptmp, &p->list, list) {
> +					if (within_module_init((unsigned long)kp, mod) ||
> +					    (checkcore &&
> +					     within_module_core((unsigned long)kp, mod))) {
> +						kp->flags |= KPROBE_FLAG_GONE;
> +						list_del_rcu(&kp->list);
> +					}
> +				}
> +			}
> +		}
>  	}
>  	if (val == MODULE_STATE_GOING)
>  		remove_module_kprobe_blacklist(mod);
> -- 
> 2.20.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Masami Hiramatsu @ 2026-04-29  8:30 UTC (permalink / raw)
  To: Jianpeng Chang
  Cc: naveen, davem, catalin.marinas, mark.rutland, linux-kernel,
	linux-trace-kernel, stable
In-Reply-To: <a298419e-c581-41c9-b6d5-9319c24b7995@windriver.com>

On Wed, 29 Apr 2026 16:16:44 +0800
Jianpeng Chang <jianpeng.chang.cn@windriver.com> wrote:

> 
> 
> On 2026/4/28 at 5:43 PM, Masami Hiramatsu (Google) wrote:
> > 
> > Hi,
> > 
> > On Mon, 27 Apr 2026 15:35:44 +0800 Jianpeng Chang
> > <jianpeng.chang.cn@windriver.com> wrote:
> > 
> >> When kprobe_add_area_blacklist() iterates through a section like 
> >> .kprobes.text, the start address may not correspond to a named
> >> symbol. On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y
> >> (introduced by commit baaf553d3bc3 ("arm64: Implement 
> >> HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
> >> -fpatchable-function-entry=4,2 inserts 2 NOPs before each function
> >> entry point for ftrace call_ops. These pre-function NOPs sit at
> >> the section base address, before the first named function symbol.
> >> The compiler emits a $x mapping symbol at offset 0x00 to mark the
> >> start of code, but find_kallsyms_symbol() ignores mapping symbols.
> >> 
> >> Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no 
> >> pre-function NOPs are inserted, the first function starts at
> >> offset 0x00, and the bug does not trigger.
> >> 
> >> This only affects modules that have a .kprobes.text section (i.e.
> >> those using the __kprobes annotation). Modules using
> >> NOKPROBE_SYMBOL() instead (like kretprobe_example.ko) blacklist
> >> exact function addresses via the _kprobe_blacklist section and are
> >> not affected.
> >> 
> >> For kprobe_example.ko on ARM64 with -fpatchable-function-entry=4,2,
> >> the .kprobes.text section layout is:
> >> 
> >>   offset 0x00: $x + 2 NOPs    (mapping symbol + ftrace preamble)
> >>   offset 0x08: handler_post   (64 bytes)
> >>   offset 0x50: handler_pre    (68 bytes)
> > 
> > Ah, OK. It is for the __kprobes attribute. I recommend that users use
> > NOKPROBE_SYMBOL() instead, but I understand the situation.
> > 
> >> 
> >> kprobe_add_area_blacklist() starts iterating from the section base 
> >> address (offset 0x00), which only has the $x mapping symbol. 
> >> kprobe_add_ksym_blacklist() then calls
> >> kallsyms_lookup_size_offset() for this address, which goes
> >> through:
> >> 
> >> kallsyms_lookup_size_offset() -> module_address_lookup() ->
> >> find_kallsyms_symbol()
> >> 
> >> find_kallsyms_symbol() scans all module symbols to find the
> >> closest preceding symbol.
> >> 
> >> Since no named text symbol exists at offset 0x00, 
> >> find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo
> >> symbol whose address is in the temporary image) as the "best"
> >> match. The computed "size" = next_text_symbol - modinfo_symbol
> >> spans across these two unrelated memory regions, creating a
> >> blacklist entry with a bogus range of tens of terabytes.
> >> 
> >> Whether this causes a visible failure depends on address
> >> randomization, here is what happens on Raspberry Pi 4/5:
> >> 
> >> - On RPi5, the bogus size was ~35 TB. start + size stayed within 
> >> 64-bit range, so the blacklist entry covered the entire kernel 
> >> text. register_kprobe() in the module's own init function failed 
> >> with -EINVAL.
> >> 
> >> - On RPi4, the bogus size was ~75 TB. start + size overflowed 64
> >> bits and wrapped to a small address near zero. The range check
> >> (addr >= start && addr < end) then failed because end wrapped
> >> around, so the bogus entry was accidentally harmless and kprobes
> >> worked by luck.
> >> 
> >> The same bug exists on both machines, but randomization determines
> >> whether the integer overflow masks it or not.
> >> 
> >> Fix this by checking the offset returned by
> >> kallsyms_lookup_size_offset(). A non-zero offset means the address
> >> is not at a symbol boundary, so skip forward to the next symbol
> >> instead of creating a blacklist entry with a wrong size.
> >> 
> >> Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
> >> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
> >> ---
> >> Hi,
> >> 
> >> This patch skips non-symbol addresses, fixes the bogus blacklist
> >> entry, but leaves the NOP gap at the start of .kprobes.text
> >> unblacklisted.
> > 
> > That is OK because those NOPs are not executed in kprobe handler.
> > 
> >> 
> >> We could continue to allocate the entry instead of returning, which
> >> would add the gap to the blacklist, or do some more work to merge the
> >> gap into the first symbol's blacklist entry. I'm not sure whether this
> >> is necessary, or whether there is a better way.
> > 
> > Is there any compiler option or attribute to avoid inserting these
> > NOPs into a specific section? (like notrace?)
> > 
> > Also, as you can see, there can be an alias symbol whose size is 0. In
> > that case, we move the entry by + 1 and call
> > kprobe_add_ksym_blacklist() again. Thus, the offset becomes 1.
> > Please make sure it is correctly handled.
> > 
> Regarding the alias symbol concern: kallsyms_lookup_size_offset() 
> computes size as the distance to the next different-address symbol, not 
> from ELF st_size. I tested with a module containing alias symbols in 
> .kprobes.text (created via __attribute__((alias))), and the lookup 
> returned a correct size with offset=0 — the if (ret == 0) ret = 1 path 
> was never triggered.
> 
> That said, #define __kprobes notrace __section(".kprobes.text") is a 
> cleaner fix. The NOPs in .kprobes.text are unnecessary since these 
> functions should never be traced by ftrace. I've tested this on RPi5 — 
> the bug is resolved and all .kprobes.text functions are correctly 
> blacklisted. I'll send the notrace approach in v2.

Ah, great! thanks!

> 
> Thanks,
> Jianpeng
> > Thanks,
> > 
> >> 
> >> Thanks, Jianpeng
> >> 
> >> kernel/kprobes.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >> 
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index bfc89083daa9..be700fb03198 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
> >>  	    !kallsyms_lookup_size_offset(entry, &size, &offset))
> >>  		return -EINVAL;
> >>  
> >> +	/* Not on a symbol boundary -- skip to the next symbol */
> >> +	if (offset)
> >> +		return (int)(size - offset);
> >> +
> >>  	ent = kmalloc_obj(*ent);
> >>  	if (!ent)
> >>  		return -ENOMEM;
> >> --
> >> 2.54.0
> >> 
> > 
> > 
> > -- Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
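
The RPi4/RPi5 difference described in the quoted patch description comes
down to unsigned 64-bit wraparound in the range check. A minimal sketch,
assuming a hypothetical start address of 0xffffc00000000000 (the real
addresses are not given in the thread) and using the approximate sizes
of 35 TB and 75 TB as 35 << 40 and 75 << 40:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same shape as the range check quoted in the patch description:
 * addr >= start && addr < end, with end computed as start + size.
 * Unsigned addition wraps modulo 2^64, so a huge size can make end
 * smaller than start, and then no address matches.
 */
static bool sketch_in_range(uint64_t addr, uint64_t start, uint64_t size)
{
	uint64_t end = start + size;	/* may wrap around */

	return addr >= start && addr < end;
}
```

With the assumed start address, only 64 TiB remain below 2^64. A 35 TiB
range fits, so an address just above start is treated as blacklisted; a
75 TiB range wraps to a small end address, so the same address is not.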

^ permalink raw reply

* Re: [PATCH 2/3] init: use static buffers for bootconfig extra command line
From: Masami Hiramatsu @ 2026-04-29  8:27 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, oss, paulmck, linux-trace-kernel, linux-kernel,
	kernel-team
In-Reply-To: <aeJH8mhxqrwdsjxc@gmail.com>

On Fri, 17 Apr 2026 08:38:16 -0700
Breno Leitao <leitao@debian.org> wrote:

> 
> On Fri, Apr 17, 2026 at 10:44:36AM +0900, Masami Hiramatsu wrote:
> > On Wed, 15 Apr 2026 03:51:11 -0700
> > Breno Leitao <leitao@debian.org> wrote:
> >
> > But if we can do it, should we continue using bootconfig? I mean
> > it is easy to make a tool (or add a feature in tools/bootconfig)
> > which converts bootconfig file to command line string and embeds
> > it in the kernel. Hmm.
> 
> Sure, you are talking about a tool that embeds it in the kernel binary,
> something like:
> 
> 
> 0) Get a kernel and define CONFIG_BOOT_CONFIG_EMBED_FILE=".bootconfig"
> 
> 1) Add an option in tools/bootconfig to convert bootconfig (.bootconfig)
>    to a cmdline string ($ bootconfig -C kernel .bootconfig).
>    Something like:
>    # tools/bootconfig/bootconfig -C kernel .bootconfig
>      mem=2G loglevel=7 debug nokaslr %
> 
> 2) At kernel build time, run that tool on .bootconfig and embed the
>    resulting string into the kernel image as a .init.rodata symbol
>    (embedded_kernel_cmdline[]).
> 
>    # gdb -batch -ex 'x/s &embedded_kernel_cmdline' vmlinux
>    0xffffffff87e108f8:    "mem=2G loglevel=7 debug nokaslr "

Yeah, I think this looks good to me.

> 
> 3) At boot, the arch's setup_arch() prepends that symbol to
>    boot_command_line right before parse_early_param() — so early_param()
>    handlers (mem=, earlycon=, loglevel=, ...) actually see kernel.*
>    keys from the embedded bootconfig.

Ah, I thought it was an arch-independent config, but it depends on the
architecture... Hmm.

> 
>    This needs to be architecture by architecture. Something like:
> 
> 	@@ -924,6 +925,13 @@ void __init setup_arch(char **cmdline_p)
> 		builtin_cmdline_added = true;
> 	#endif
> 
> 	+       /*
> 	+        * Prepend kernel.* keys from the embedded bootconfig (rendered at
> 	+        * build time by tools/bootconfig) so parse_early_param() below sees
> 	+        * them. No-op when CONFIG_BOOT_CONFIG_EMBED=n.
> 	+        */
> 	+       xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
> 	+
> 		strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
> 		*cmdline_p = command_line;
> 
> Am I describing your suggestion accurately?

I think we can start by supporting this option on the architectures which
already support CONFIG_CMDLINE. Maybe we need a CONFIG_ARCH_SUPPORT_CMDLINE
option which indicates that the architecture supports an embedded cmdline.

Then the whole feature can depend on that Kconfig option.
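
For reference, the prepend in step 3 is essentially a bounded in-place string operation. Below is a minimal user-space sketch, assuming the buffer is NUL-terminated and that the call should be a no-op when nothing is embedded. demo_prepend_cmdline() is a hypothetical stand-in, not the proposed xbc_prepend_embedded_cmdline().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Prepend `embedded` to the NUL-terminated string in `cmdline`, whose
 * buffer holds `cap` bytes. Returns 0 on success and -1 if the result
 * would not fit; on failure `cmdline` is left untouched.
 */
static int demo_prepend_cmdline(char *cmdline, size_t cap, const char *embedded)
{
	size_t elen, clen;

	assert(cmdline && embedded);
	elen = strlen(embedded);
	clen = strlen(cmdline);

	if (elen == 0)
		return 0; /* nothing embedded: no-op */
	if (elen + clen + 1 > cap)
		return -1; /* would overflow the buffer */

	/* Shift the existing text (including its NUL), then copy in front. */
	memmove(cmdline + elen, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	return 0;
}
```

Refusing the whole prepend when it does not fit is one possible policy; truncating the bootloader-provided part instead would be another, and which one is right for the kernel is an open question here.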

Thank you,

> 
> Thanks!
> --breno


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Jianpeng Chang @ 2026-04-29  8:16 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: naveen, davem, catalin.marinas, mark.rutland, linux-kernel,
	linux-trace-kernel, stable
In-Reply-To: <20260428184321.309a48036892b8d23a08b566@kernel.org>



On 2026/4/28 5:43 PM, Masami Hiramatsu (Google) wrote:
> 
> Hi,
> 
> On Mon, 27 Apr 2026 15:35:44 +0800 Jianpeng Chang
> <jianpeng.chang.cn@windriver.com> wrote:
> 
>> When kprobe_add_area_blacklist() iterates through a section like 
>> .kprobes.text, the start address may not correspond to a named
>> symbol. On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y
>> (introduced by commit baaf553d3bc3 ("arm64: Implement 
>> HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
>> -fpatchable-function-entry=4,2 inserts 2 NOPs before each function
>> entry point for ftrace call_ops. These pre-function NOPs sit at
>> the section base address, before the first named function symbol.
>> The compiler emits a $x mapping symbol at offset 0x00 to mark the
>> start of code, but find_kallsyms_symbol() ignores mapping symbols.
>> 
>> Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no 
>> pre-function NOPs are inserted, the first function starts at
>> offset 0x00, and the bug does not trigger.
>> 
>> This only affects modules that have a .kprobes.text section (i.e.
>> those using the __kprobes annotation). Modules using
>> NOKPROBE_SYMBOL() instead (like kretprobe_example.ko) blacklist
>> exact function addresses via the _kprobe_blacklist section and are
>> not affected.
>> 
>> For kprobe_example.ko on ARM64 with -fpatchable-function-entry=4,2,
>> the .kprobes.text section layout is:
>> 
>> offset 0x00: $x + 2 NOPs    (mapping symbol + ftrace preamble)
>> offset 0x08: handler_post   (64 bytes)
>> offset 0x50: handler_pre    (68 bytes)
> 
> Ah, OK. This is for the __kprobes attribute. I recommend that users
> use NOKPROBE_SYMBOL() instead, but I understand the situation.
> 
>> 
>> kprobe_add_area_blacklist() starts iterating from the section base 
>> address (offset 0x00), which only has the $x mapping symbol. 
>> kprobe_add_ksym_blacklist() then calls
>> kallsyms_lookup_size_offset() for this address, which goes
>> through:
>> 
>> kallsyms_lookup_size_offset() -> module_address_lookup() ->
>> find_kallsyms_symbol()
>> 
>> find_kallsyms_symbol() scans all module symbols to find the
>> closest preceding symbol.
>> 
>> Since no named text symbol exists at offset 0x00, 
>> find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo
>> symbol whose address is in the temporary image) as the "best"
>> match. The computed "size" = next_text_symbol - modinfo_symbol
>> spans across these two unrelated memory regions, creating a
>> blacklist entry with a bogus range of tens of terabytes.
>> 
>> Whether this causes a visible failure depends on address
>> randomization. Here is what happens on Raspberry Pi 4/5:
>> 
>> - On RPi5, the bogus size was ~35 TB. start + size stayed within 
>> 64-bit range, so the blacklist entry covered the entire kernel 
>> text. register_kprobe() in the module's own init function failed 
>> with -EINVAL.
>> 
>> - On RPi4, the bogus size was ~75 TB. start + size overflowed 64
>> bits and wrapped to a small address near zero. The range check
>> (addr >= start && addr < end) then failed because end wrapped
>> around, so the bogus entry was accidentally harmless and kprobes
>> worked by luck.
>> 
>> The same bug exists on both machines, but randomization determines
>> whether the integer overflow masks it or not.
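
The difference between the two boards can be reproduced with plain unsigned arithmetic. The sketch below is a user-space illustration only: the start addresses are made up, and the function simply mirrors the (addr >= start && addr < end) check quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Blacklist-style range test: end is computed as start + size and may wrap. */
static bool demo_in_range(uint64_t addr, uint64_t start, uint64_t size)
{
	uint64_t end = start + size; /* unsigned overflow wraps modulo 2^64 */

	return addr >= start && addr < end;
}
```

With a hypothetical start of 0xffff800000000000 and a 35 TB size, end stays below 2^64 and every address above start matches. With a hypothetical start of 0xffffc00000000000 and a 75 TB size, end wraps to 0x00000b0000000000, so no address can satisfy both comparisons and the bogus entry matches nothing.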
>> 
>> Fix this by checking the offset returned by
>> kallsyms_lookup_size_offset(). A non-zero offset means the address
>> is not at a symbol boundary, so skip forward to the next symbol
>> instead of creating a blacklist entry with a wrong size.
>> 
>> Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
>> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
>> ---
>> Hi,
>> 
>> This patch skips non-symbol addresses, fixes the bogus blacklist
>> entry, but leaves the NOP gap at the start of .kprobes.text
>> unblacklisted.
> 
> That is OK because those NOPs are not executed in kprobe handler.
> 
>> 
>> We could continue to allocate the entry instead of returning, so
>> that the gap is added to the blacklist, or do some more work to
>> merge the gap into the first symbol's blacklist entry. I'm not sure
>> whether this is necessary, or whether there is a better way.
> 
> Is there any compiler option or attribute that avoids inserting
> these NOPs into a specific section? (like notrace?)
> 
> Also, as you can see, there can be an alias symbol whose size is 0.
> In that case, we move the entry by 1 and call
> kprobe_add_ksym_blacklist() again, so the offset becomes 1.
> Please make sure that case is handled correctly.
> 
Regarding the alias symbol concern: kallsyms_lookup_size_offset() 
computes size as the distance to the next different-address symbol, not 
from ELF st_size. I tested with a module containing alias symbols in 
.kprobes.text (created via __attribute__((alias))), and the lookup 
returned a correct size with offset=0 — the if (ret == 0) ret = 1 path 
was never triggered.
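
For reference, here is a small user-space model of the area walk with the offset check from this patch. It is only a sketch: the symbol table, the addresses, and all demo_* names are made up, and demo_lookup() imitates the "closest preceding symbol, size is the distance to the next symbol" behaviour described above.

```c
#include <stddef.h>

struct demo_sym { unsigned long addr; const char *name; };

/* Sorted by address; the last entry is an end marker. Values are made up. */
static const struct demo_sym demo_syms[] = {
	{ 0x001000, "unrelated_modinfo_sym" },
	{ 0x900008, "handler_post" },
	{ 0x900050, "handler_pre" },
	{ 0x900094, "end_marker" },
};
#define DEMO_NSYMS (sizeof(demo_syms) / sizeof(demo_syms[0]))

struct demo_range { unsigned long start, end; };

/* Closest preceding symbol; size is the distance to the next symbol. */
static int demo_lookup(unsigned long addr, unsigned long *size,
		       unsigned long *offset)
{
	for (size_t i = 0; i + 1 < DEMO_NSYMS; i++) {
		if (addr >= demo_syms[i].addr && addr < demo_syms[i + 1].addr) {
			*size = demo_syms[i + 1].addr - demo_syms[i].addr;
			*offset = addr - demo_syms[i].addr;
			return 1;
		}
	}
	return 0;
}

/* Returns the number of ranges recorded, or -1 on lookup failure or overflow. */
static int demo_add_area(unsigned long start, unsigned long end,
			 struct demo_range *out, size_t cap)
{
	unsigned long size, offset, step = 1;
	size_t n = 0;

	for (unsigned long entry = start; entry < end; entry += step) {
		if (!demo_lookup(entry, &size, &offset))
			return -1;
		if (offset) {
			/* Not on a symbol boundary: skip to the next symbol. */
			step = size - offset;
			continue;
		}
		if (n == cap)
			return -1;
		out[n].start = entry;
		out[n].end = entry + size;
		n++;
		step = size ? size : 1; /* size 0 would mean an alias symbol */
	}
	return (int)n;
}
```

Walking 0x900000..0x900094 in this model skips the 8-byte NOP gap and records only the two handler ranges; without the offset check the first recorded range would start at 0x900000 with the bogus size taken from the unrelated symbol.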

That said, #define __kprobes notrace __section(".kprobes.text") is a 
cleaner fix. The NOPs in .kprobes.text are unnecessary since these 
functions should never be traced by ftrace. I've tested this on RPi5 — 
the bug is resolved and all .kprobes.text functions are correctly 
blacklisted. I'll send the notrace approach in v2.

Thanks,
Jianpeng

> Thanks,
> 
>> 
>> Thanks, Jianpeng
>> 
>>  kernel/kprobes.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
>> index bfc89083daa9..be700fb03198 100644
>> --- a/kernel/kprobes.c
>> +++ b/kernel/kprobes.c
>> @@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
>>  	    !kallsyms_lookup_size_offset(entry, &size, &offset))
>>  		return -EINVAL;
>>  
>> +	/* Not on a symbol boundary -- skip to the next symbol */
>> +	if (offset)
>> +		return (int)(size - offset);
>> +
>>  	ent = kmalloc_obj(*ent);
>>  	if (!ent)
>>  		return -ENOMEM;
>> -- 
>> 2.54.0
>> 
> 
> 
> -- Masami Hiramatsu (Google) <mhiramat@kernel.org>


^ permalink raw reply

* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: SUVONOV BUNYOD @ 2026-04-29  3:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, vbabka, linux-mm, mhiramat, mathieu desnoyers,
	linux-trace-kernel, linux-kernel, surenb, mhocko, jackmanb,
	hannes, ziy, david, vishal moola, corbet, skhan, linux-doc
In-Reply-To: <20260428142335.3bca0166@gandalf.local.home>

Thanks for reviewing Steven,

>Why this change? It makes it much harder to understand.
>
>The above is not a normal macro. Ignore any checkpatch warnings about it.
>The proper way to do the TP_STRUCT__entry() is to make it just like a struct:
>
>struct {
>	unsigned long		pfn;
>	unsigned int		order;
>	int			migratetype;
>};
>
>Thus, the macro should be:
>
>	TP_STRUCT__entry(
>		__field(	unsigned long,	pfn		)
>		__field(	unsigned int,	order		)
>		__field(	int,		migratetype	)
>		),


Yes, sorry about the formatting issue; I will fix it in v3. Any other concerns?
What do you think about the introduction of those tracepoints themselves?

-- Bunyod


^ permalink raw reply

* [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Shijia Hu @ 2026-04-29  3:29 UTC (permalink / raw)
  To: mhiramat, naveen, davem
  Cc: ananth, akpm, linux-kernel, linux-trace-kernel, hushijia1

When a kernel module that registered kprobes is unloaded without calling
unregister_kprobe(), kprobes_module_callback() calls kill_kprobe() to
mark the probe(s) GONE.  If the probe is an aggrprobe, kill_kprobe()
also marks all child probes GONE, but it does not remove them from
the aggrprobe's list.

The problem is that child probes whose struct kprobe resides in the
unloading module's memory are freed along with the module, yet they
remain on the aggrprobe's list.  Later, when another caller registers
a kprobe at the same address, __get_valid_kprobe() walks that list
and dereferences the freed child probe, causing a use-after-free.

Reproduction steps:

    1) Load module A which registers two kprobes on the same kernel
       function address (e.g., do_nanosleep), causing them to be
       aggregated under one aggrprobe.

    2) Unload module A without calling unregister_kprobe().
       Module A's memory is freed, but its two child probes remain
       on the aggrprobe's list as dangling pointers.

    3) Load module B and register a kprobe on the same address
       (e.g., do_nanosleep). register_kprobe() -> __get_valid_kprobe()
       traverses the aggrprobe's list and dereferences the freed child
       probe from module A, triggering a use-after-free and kernel panic.

The resulting crash looks like:
    [  464.950864] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [  464.950872] #PF: supervisor read access in kernel mode
    [  464.950874] #PF: error_code(0x0000) - not-present page
    ...
    [  464.950915] Call Trace:
    [  464.950922]  <TASK>
    [  464.950923]  register_kprobe+0x65/0x2e0
    [  464.950928]  ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
    [  464.950933]  stage2_init+0x37/0xff0 [kprobe_leak_stage2]
    [  464.950938]  ? __pfx_stage2_init+0x10/0x10 [kprobe_leak_stage2]
    [  464.950942]  do_one_initcall+0x56/0x2e0
    [  464.950948]  do_init_module+0x60/0x230
    ...

Fix this by adding selective cleanup in kprobes_module_callback():
after calling kill_kprobe() on the aggrprobe, iterate its child list
and remove any child probe whose struct kprobe is inside the going
module's memory range (within_module_init / within_module_core).

This is done in kprobes_module_callback() rather than kill_kprobe()
because the semantics of kill_kprobe() are "the probed code is going
away, mark probes GONE".  The lifetime of a probe is bound to the probed
code, not to the module containing the struct kprobe.  Child probes
owned by other still-loaded modules or allocated by kmalloc (ftrace,
perf, kprobe-events) must stay on the list so they can be unregistered
later.  Only child probes whose memory is about to be freed need to
be removed from the list to prevent dangling pointers.

Fixes: e8386a0cb22f4 ("kprobes: support probing module __exit function")
Signed-off-by: Shijia Hu <hushijia1@uniontech.com>
---
 kernel/kprobes.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index bfc89083daa9..ff277314183c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2664,6 +2664,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
 				   unsigned long val, void *data)
 {
 	struct module *mod = data;
+	struct hlist_node *tmp;
 	struct hlist_head *head;
 	struct kprobe *p;
 	unsigned int i;
@@ -2685,7 +2686,7 @@ static int kprobes_module_callback(struct notifier_block *nb,
 	 */
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
-		hlist_for_each_entry(p, head, hlist)
+		hlist_for_each_entry_safe(p, tmp, head, hlist) {
 			if (within_module_init((unsigned long)p->addr, mod) ||
 			    (checkcore &&
 			     within_module_core((unsigned long)p->addr, mod))) {
@@ -2702,6 +2703,26 @@ static int kprobes_module_callback(struct notifier_block *nb,
 				 */
 				kill_kprobe(p);
 			}
+
+			/*
+			 * Child probes are not on the kprobe hash list, so
+			 * the above loop can not find them. If a child probe
+			 * is allocated in the module's memory, it will become
+			 * a dangling pointer after the module is freed.
+			 */
+			if (kprobe_aggrprobe(p)) {
+				struct kprobe *kp, *kptmp;
+
+				list_for_each_entry_safe(kp, kptmp, &p->list, list) {
+					if (within_module_init((unsigned long)kp, mod) ||
+					    (checkcore &&
+					     within_module_core((unsigned long)kp, mod))) {
+						kp->flags |= KPROBE_FLAG_GONE;
+						list_del_rcu(&kp->list);
+					}
+				}
+			}
+		}
 	}
 	if (val == MODULE_STATE_GOING)
 		remove_module_kprobe_blacklist(mod);
-- 
2.20.1
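
As an illustration of the idea only (not of the kernel list API), the selective cleanup can be modeled in user space: walk a list and unlink exactly those nodes whose storage lies inside the region about to be freed. All names below are hypothetical, and a plain singly linked list stands in for the aggrprobe's child list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_probe {
	struct demo_probe *next;
	int id;
};

/* True if p points into [base, base + len), similar in spirit to within_module_core(). */
static bool demo_within(const void *p, const void *base, size_t len)
{
	uintptr_t a = (uintptr_t)p, b = (uintptr_t)base;

	return a >= b && a - b < len;
}

/*
 * Unlink every node whose storage lies inside the region that is about to
 * be freed. Walking via a pointer to the link field means removing the
 * current node never disturbs the traversal. Returns the number removed.
 */
static int demo_prune(struct demo_probe **head, const void *base, size_t len)
{
	int removed = 0;

	for (struct demo_probe **link = head; *link; ) {
		struct demo_probe *cur = *link;

		if (demo_within(cur, base, len)) {
			*link = cur->next; /* unlink; cur is not touched again */
			removed++;
		} else {
			link = &cur->next;
		}
	}
	return removed;
}
```

Nodes that live outside the region (the analogue of kmalloc'ed probes or probes owned by other modules) stay on the list, which matches the reasoning in the commit message about why the cleanup must be selective.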


^ permalink raw reply related

* Re: [PATCH] tracing/probes: Limit size of event probe to 3K
From: Steven Rostedt @ 2026-04-29  0:32 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260428122302.706610ba@gandalf.local.home>

On Tue, 28 Apr 2026 12:23:02 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> There currently isn't a maximum size limit for an event probe. One could make an
> event greater than PAGE_SIZE, which makes the event useless because if
> it's bigger than the max event that can be recorded into the ring buffer,
> then it will never be recorded.
> 
> An event probe should never need to be greater than 3K, so make that the
> max size. As long as the max is less than the max that can be recorded
> onto the ring buffer, it should be fine.
> 
> Cc: stable@vger.kernel.org
> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
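
The kind of check described above can be sketched in user space as follows. This is an assumption-laden illustration, not the code from the patch: the macro name, the -E2BIG return value, and reading "3K" as 3 * 1024 bytes are all choices made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical name; "3K" is taken here to mean 3 * 1024 bytes. */
#define DEMO_MAX_PROBE_EVENT_SIZE (3 * 1024)

/*
 * Sum the fixed header and each argument's size, rejecting the event as
 * soon as the running total would exceed the cap. Returns the total size
 * on success or -E2BIG (an assumed error code) when it is too large.
 */
static long demo_probe_event_size(size_t header, const size_t *arg_sizes,
				  size_t nargs)
{
	size_t total = header;

	if (total > DEMO_MAX_PROBE_EVENT_SIZE)
		return -E2BIG;

	for (size_t i = 0; i < nargs; i++) {
		/* Compare against the remaining room to avoid overflow. */
		if (arg_sizes[i] > DEMO_MAX_PROBE_EVENT_SIZE - total)
			return -E2BIG;
		total += arg_sizes[i];
	}
	return (long)total;
}
```

Rejecting at definition time, rather than at record time, means an oversized event is reported to the user immediately instead of silently never showing up in the ring buffer.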

Hi Masami,

I ran this through my tests along with some other fixes I have. If you
ack it, I can push this to Linus along with my other changes.

-- Steve

^ permalink raw reply

* Re: [RFC PATCH 16/19] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-04-29  0:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260428141745.2768ac4e@gandalf.local.home>

On Tue, 28 Apr 2026 14:17:45 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 26 Apr 2026 13:52:17 -0700
> SeongJae Park <sj@kernel.org> wrote:
> 
> > Introduce a new tracepoint for exposing the per-region per-probe
> > positive sample count via tracefs.
> > 
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  include/trace/events/damon.h | 41 ++++++++++++++++++++++++++++++++++++
> >  mm/damon/core.c              |  1 +
> >  2 files changed, 42 insertions(+)
> > 
> > diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
> > index 7e25f4469b81b..121d7bc3a2c27 100644
> > --- a/include/trace/events/damon.h
> > +++ b/include/trace/events/damon.h
> > @@ -130,6 +130,47 @@ TRACE_EVENT(damon_monitor_intervals_tune,
> >  	TP_printk("sample_us=%lu", __entry->sample_us)
> >  );
> >  
> > +TRACE_EVENT(damon_aggregated_v2,
> > +
> > +	TP_PROTO(unsigned int target_id, struct damon_region *r,
> > +		unsigned int nr_regions),
> > +
> > +	TP_ARGS(target_id, r, nr_regions),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(unsigned long, target_id)
> > +		__field(unsigned int, nr_regions)
> 
> Move the nr_regions to after "end" as on 64 bit machines, this creates a 4
> byte hole.

Thank you for the nice suggestion, Steven.  Will do so in the next version.
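
For the record, the hole can be checked in user space with offsetof(). The sketch below assumes that start and end are unsigned long and that some 4-byte field follows them; the rest of the event layout is not shown in the quoted hunk, so the struct and the trailing field name are hypothetical.

```c
#include <stddef.h>

/* Field order as in the quoted hunk; the fields after nr_regions are assumed. */
struct demo_entry_before {
	unsigned long target_id;
	unsigned int nr_regions;  /* followed by padding on LP64 */
	unsigned long start;
	unsigned long end;
	unsigned int tail;        /* hypothetical trailing 4-byte field */
};

/* Same fields with nr_regions moved after end. */
struct demo_entry_after {
	unsigned long target_id;
	unsigned long start;
	unsigned long end;
	unsigned int nr_regions;
	unsigned int tail;
};

/* Bytes of padding between nr_regions and the next field, original order. */
static size_t demo_hole_before(void)
{
	return offsetof(struct demo_entry_before, start) -
	       (offsetof(struct demo_entry_before, nr_regions) +
		sizeof(unsigned int));
}

/* Bytes of padding between nr_regions and the next field, reordered. */
static size_t demo_hole_after(void)
{
	return offsetof(struct demo_entry_after, tail) -
	       (offsetof(struct demo_entry_after, nr_regions) +
		sizeof(unsigned int));
}
```

On a typical LP64 build the first helper reports 4 bytes of padding and the second reports none, because the two 4-byte fields then share one 8-byte slot.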


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH RFC v5 24/53] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-04-28 23:40 UTC (permalink / raw)
  To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-24-d8608ccfca22@google.com>

Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:

> From: Michael Roth <michael.roth@amd.com>
>

Thanks Michael!

>
> [...snip...]
>
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index c2126b3c30724..bf10d24907a00 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	int level;
>  	int ret;
>
> -	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> +	/*
> +	 * For vm_memory_attributes=1, in-place conversion/population is not
> +	 * supported, so the initial contents necessarily need to come from a
> +	 * separate src address. For vm_memory_attributes=0, this isn't
> +	 * necessarily the case, since the pages may have been populated
> +	 * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> +	 */

I dropped the #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES from [1] since
vm_memory_attributes is #define-d as false when
CONFIG_KVM_VM_MEMORY_ATTRIBUTES is not defined.
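
The pattern can be shown with a small user-space sketch. The CONFIG_DEMO_* symbol, the enum, and the function are made-up stand-ins; the point is only that a name which falls back to a constant false lets the condition be written once without an #ifdef at the call site.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifdef CONFIG_DEMO_VM_MEMORY_ATTRIBUTES
static bool vm_memory_attributes = true;  /* stand-in for the real knob */
#else
#define vm_memory_attributes false        /* constant when the option is off */
#endif

enum demo_page_type { DEMO_PAGE_TYPE_NORMAL = 1, DEMO_PAGE_TYPE_ZERO = 3 };

/*
 * A source page is required only when vm_memory_attributes is set and the
 * page type is not ZERO. With the option compiled out, the condition is
 * constant false and the compiler can drop the check.
 */
static int demo_post_populate_check(enum demo_page_type type,
				    const void *src_page)
{
	if (vm_memory_attributes &&
	    type != DEMO_PAGE_TYPE_ZERO && !src_page)
		return -EINVAL;
	return 0;
}
```

Built without CONFIG_DEMO_VM_MEMORY_ATTRIBUTES, a NULL source page is accepted for any page type, which is the in-place population case described in the quoted comment.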

> +	if (vm_memory_attributes &&
> +	    sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
>  		return -EINVAL;
>
>  	ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);

[1] https://github.com/AMDESE/linux/commit/7e7c29afdf3763822ced0b7007fc0f93b8fb993d

>
> [...snip...]
>

^ permalink raw reply

