Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Masami Hiramatsu @ 2026-05-31 23:40 UTC (permalink / raw)
  To: Tengda Wu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel
In-Reply-To: <94179dab-ffb7-4fab-af45-b20bfb686ab3@huaweicloud.com>

On Fri, 29 May 2026 11:39:49 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> > We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
> > 
> > BTW, should task_on_cpu() use READ_ONCE() etc?
> > wait_task_inactive() seems a bit fragile.
> > 
> > Thanks,
> > 
> 
> It seems that using task_on_cpu() is not necessary here because:
> 
> 1. It requires an additional 'rq' parameter not available in the rethook context.
> 2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.

OK. We may need to cleanup task_on_cpu() to remove unused rq, which
was historically introduced by commit 029632fbb7b7 ("sched: Make
separate sched*.c translation units") as a parameter for "task_running()"
because it called task_current(), but was cleaned by commit 0b9d46fc5ef7
("sched: Rename task_running() to task_on_cpu()") and now it is not
used.

> 
> /* file: kernel/sched/sched.h */
> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
> {
> 	return p->on_cpu;
> }
> 
> Given these constraints, staying with tsk->on_cpu seems more straightforward
> for the rethook context.

The reason I asked you to use a wrapper function instead of directly
accessing that on_cpu is to ensure that this part isn't overlooked when
changing the scheduler's behavior in the future.

They may be equivalent at this point in time, but that doesn't necessarily
mean they will remain so in the future.

IMHO, an API interface should represent the behavior and resources
required for that operation, and when accessing it from outside the
subsystem, we should use that API whenever possible. Otherwise, even
if we want to update the internal implementation of one subsystem,
we will have to make changes to other subsystems as well.

Peter, is it OK to drop @rq from task_on_cpu()? Then we can use
it from rethook.

Thank you,

> 
> Thanks,
> Tengda
> 
> >>
> >> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> >> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> >> ---
> >>  kernel/trace/rethook.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> >> index 5a8bdf88999a..bd5e5f455e85 100644
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> >>  	if (WARN_ON_ONCE(!cur))
> >>  		return 0;
> >>  
> >> -	if (tsk != current && task_is_running(tsk))
> >> +	if (tsk != current && tsk->on_cpu)
> >>  		return 0;
> >>  
> >>  	do {
> >> -- 
> >> 2.34.1
> >>
> >>
> > 
> > 
> 

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: David Hildenbrand (Arm) @ 2026-05-31 20:18 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-11-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.

It would have been good to describe here that, for now, it only ever returns
PMDs, and that it will be extended next.

Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit

... and there, it is clear that we don't even want to know the orders?

So can we just call this function

"collapse_possible" and make it return a boolean?

> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4534025bc81d..64ceebc9d8a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>  
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +		vm_flags_t vm_flags, enum tva_type tva_flags)
> +{
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			cc->progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
>  			cc->progress++;
>  			continue;
>  		}
> @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
>  
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
>  		return -EINVAL;
>  
>  	cc = kmalloc_obj(*cc);

Having a simple

static bool collapse_possible(...)
{
	return collapse_allowable_orders(...)
}

Would make the above slightly more readable.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: David Hildenbrand (Arm) @ 2026-05-31 20:09 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-9-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> 	encountering a swap PTE.
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> We will add support for mTHP collapse for anonymous pages next; lets also
> track when this happens at the PMD level within the per-mTHP stats.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>  
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED)

[...]

>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

This comment actually belongs into an earlier patch, no?

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Can be done as a later cleanup, but having a single function that obtains an
order and knows which stats to update would be cleaner (and a good preparation
for shmem mTHP collapse support).

Nothing jumped at me, so

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: David Hildenbrand (Arm) @ 2026-05-31 20:02 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531073102.20318-1-lance.yang@linux.dev>

On 5/31/26 09:31, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>> There are cases where, if an attempted collapse fails, all subsequent
>> orders are guaranteed to also fail. Avoid these collapse attempts by
>> bailing out early.
>>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3d7db8be26c..15b7298bc225 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>> 			collapse_address = address + offset * PAGE_SIZE;
>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>> 						 unmapped, cc, order);
>> -			if (ret == SCAN_SUCCEED) {
>> +
>> +			switch (ret) {
>> +			/* Cases where we continue to next collapse candidate */
>> +			case SCAN_SUCCEED:
>> 				collapsed += nr_ptes;
>> +				fallthrough;
>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>> 				continue;
>> +			/* Cases where lower orders might still succeed */
>> +			case SCAN_LACK_REFERENCED_PAGE:
>> +			case SCAN_EXCEED_NONE_PTE:
>> +			case SCAN_EXCEED_SWAP_PTE:
>> +			case SCAN_EXCEED_SHARED_PTE:
>> +			case SCAN_PAGE_LOCK:
>> +			case SCAN_PAGE_COUNT:
>> +			case SCAN_PAGE_NULL:
>> +			case SCAN_DEL_PAGE_LRU:
>> +			case SCAN_PTE_NON_PRESENT:
>> +			case SCAN_PTE_UFFD_WP:
>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> 
> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
> here?
> 
> If charging the current order fails, a smaller order might still fit :)

I think the reasoning was here, that if we are already that close to our mem
limit, we should just give up instead of trying to squeeze it in .. :)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-05-31 20:00 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531093942.19644-1-lance.yang@linux.dev>

On 5/31/26 11:39, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fab35d318641..d64f42f66236 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>  */
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -		int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned int order)
>> {
>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>> 	LIST_HEAD(compound_pagelist);
>> 	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>> +	pte_t *pte = NULL;
>> 	pgtable_t pgtable;
>> 	struct folio *folio;
>> 	spinlock_t *pmd_ptl, *pte_ptl;
>> 	enum scan_result result = SCAN_FAIL;
>> 	struct vm_area_struct *vma;
>> 	struct mmu_notifier_range range;
>> +	bool anon_vma_locked = false;
>>
>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_nolock;
>>
>> 	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> 	}
>>
>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 * released when it fails. So we jump out_nolock directly in
>> 		 * that case.  Continuing to collapse causes inconsistency.
>> 		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -						     referenced, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +						     referenced, order);
>> 		if (result != SCAN_SUCCEED)
>> 			goto out_nolock;
>> 	}
>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * mmap_lock.
>> 	 */
>> 	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>> 	/* check if the pmd is still valid */
>> 	vma_start_write(vma);
>> -	result = check_pmd_still_valid(mm, address, pmd);
>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>>
>> 	anon_vma_lock_write(vma->anon_vma);
>> +	anon_vma_locked = true;
>>
>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -				address + HPAGE_PMD_SIZE);
>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +				end_addr);
>> 	mmu_notifier_invalidate_range_start(&range);
>>
>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>> 	 * it detects PMD is changed.
>> 	 */
>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>> 	spin_unlock(pmd_ptl);
>> 	mmu_notifier_invalidate_range_end(&range);
>> 	tlb_remove_table_sync_one();
>>
>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>> 	if (pte) {
>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -						      HPAGE_PMD_ORDER,
>> -						      &compound_pagelist);
>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +						      order, &compound_pagelist);
>> 		spin_unlock(pte_ptl);
>> 	} else {
>> 		result = SCAN_NO_PTE_TABLE;
>> 	}
>>
>> 	if (unlikely(result != SCAN_SUCCEED)) {
>> -		if (pte)
>> -			pte_unmap(pte);
>> 		spin_lock(pmd_ptl);
>> -		BUG_ON(!pmd_none(*pmd));
>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>> 		/*
>> 		 * We can only use set_pmd_at when establishing
>> 		 * hugepmds and never for establishing regular pmds that
>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 */
>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> 		spin_unlock(pmd_ptl);
>> -		anon_vma_unlock_write(vma->anon_vma);
>> 		goto out_up_write;
>> 	}
>>
>> 	/*
>> -	 * All pages are isolated and locked so anon_vma rmap
>> -	 * can't run anymore.
>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +	 * removed and not all pages are isolated and locked, so we must hold
>> +	 * the lock to prevent neighboring folios from attempting to access
>> +	 * this PMD until its reinstalled.
>> 	 */
>> -	anon_vma_unlock_write(vma->anon_vma);
>> +	if (is_pmd_order(order)) {
>> +		anon_vma_unlock_write(vma->anon_vma);
>> +		anon_vma_locked = false;
>> +	}
>>
>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -					   vma, address, pte_ptl,
>> -					   HPAGE_PMD_ORDER,
>> -					   &compound_pagelist);
>> -	pte_unmap(pte);
>> +					   vma, start_addr, pte_ptl,
>> +					   order, &compound_pagelist);
>> 	if (unlikely(result != SCAN_SUCCEED))
>> 		goto out_up_write;
>>
>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * write.
>> 	 */
>> 	__folio_mark_uptodate(folio);
>> -	pgtable = pmd_pgtable(_pmd);
>> -
>> 	spin_lock(pmd_ptl);
>> -	BUG_ON(!pmd_none(*pmd));
>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>> +	if (is_pmd_order(order)) {
>> +		pgtable = pmd_pgtable(_pmd);
>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +	} else {
>> +		/*
>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>> +		 * to be none. The pmd entry is then repopulated below.
>> +		 */
>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> 
> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> 
> At this point pmdp_collapse_flush() has cleared the PMD from the page
> tables. The PTE table we are updating is only reachable through the saved
> old PMD value, _pmd, until pmd_populate() below.
> 
> map_anon_folio_pte_nopf() does set_ptes() and then calls
> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> that hook as:
> 
> "
> 	At the end of every page fault, this routine is invoked to tell
> 	the architecture specific code that translations now exists
> 	in the software page tables for address space "vma->vm_mm"
> 	at virtual address "address" for "nr" consecutive pages.
> "
> 
> But that does not seem true here yet, since the PTE table is not
> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> 
> Should we avoid calling update_mmu_cache_range() until after the PTE
> table is reinstalled with pmd_populate()?

I recall that update_mmu_cache* users mostly care about updating folios flags,
for the folio derived from the PTE ... or flushing caches for the user address.

So intuitively I would say "the architecture code doesn't care that the PMD
table will only be visible to HW shortly after". The important thing should be
that it will definetly happen, and that nothing else is curently there or can be
there?



-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v3 00/13] rv: Fixes on Deterministic and Hybrid Automata
From: Wen Yang @ 2026-05-31 15:17 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel; +Cc: Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>


Thank you for this series -- it addresses several issues in the RV 
framework. Our pending series [PATCH v3 tlob] also depends on it; we 
would appreciate if it could be picked up for the next merge window.


--
Best wishes,
Wen


On 5/30/26 22:16, Gabriele Monaco wrote:
> Fix issues that were reported by bots or visible only after integration:
>   * Make sure timers are always terminated and waited for when disabling
>     the monitor or when the target terminates
>   * Run per-cpu monitors with migration disabled since preemption is now
>     enabled from tracepoints
>   * Fix a wrong __user specifier in a helper function
>   * Other cleanup and concurrency issues
> 
> Differences since V2 [1]:
> * Applied from reviews changes to commit messages
> * Rearranged order to put non-fixes and not-reviewed patches in the end
> * Synchronise monitors before resetting them to avoid rearming
> * Protect against racing timer callbacks during destruction
> 
> Differences since V1 [2]:
> * Fix memory consistency with timer callbacks racing with resets
> * Add per-obj deallocation hook in rvgen generated code
> * Do not rely on clean monitor when initialising HA
> * Add tracepoint synchronisation before returning per-task slots
> * Fix suffix strip in dot2k
> * Generate stub deallocation hooks instead of failing build when per-obj
>    miss those
> 
> [1] - https://lore.kernel.org/lkml/20260527062313.39908-1-gmonaco@redhat.com
> [2] - https://lore.kernel.org/lkml/20260512140250.262190-1-gmonaco@redhat.com
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Nam Cao <namcao@linutronix.de>
> Cc: Wen Yang <wen.yang@linux.dev>
> Cc: linux-trace-kernel@vger.kernel.org
> 
> Gabriele Monaco (12):
>    rv: Fix __user specifier usage in extract_params()
>    rv: Reset per-task DA monitors before releasing the slot
>    rv: Prevent in-flight per-task handlers from using invalid slots
>    rv: Ensure all pending probes terminate on per-obj monitor destroy
>    rv: Do not rely on clean monitor when initialising HA
>    rv: Add automatic cleanup handlers for per-task HA monitors
>    rv: Ensure synchronous cleanup for HA monitors
>    rv: Prevent task migration while handling per-CPU events
>    rv: Use 0 to check preemption enabled in opid
>    verification/rvgen: Fix suffix strip in dot2k
>    rv: Fix read_lock scope in per-task DA cleanup
>    verification/rvgen: Generate cleanup hook for per-obj monitor
> 
> Wen Yang (1):
>    rv: Fix monitor start ordering and memory ordering for monitoring flag
> 
>   include/rv/da_monitor.h                       |  67 ++++++++---
>   include/rv/ha_monitor.h                       | 104 +++++++++++++++++-
>   include/rv/ltl_monitor.h                      |   1 +
>   kernel/trace/rv/monitors/deadline/deadline.h  |   3 +-
>   kernel/trace/rv/monitors/nomiss/nomiss.c      |   4 +-
>   kernel/trace/rv/monitors/opid/opid.c          |  12 +-
>   kernel/trace/rv/monitors/stall/stall.c        |   4 +-
>   tools/verification/rvgen/rvgen/dot2k.py       |  19 +++-
>   .../rvgen/rvgen/templates/dot2k/main.c        |   4 +-
>   9 files changed, 180 insertions(+), 38 deletions(-)
> 
> 
> base-commit: 8fde5d1d47f69db6082dfa34500c27f8485389a5

^ permalink raw reply

* Re: [PATCH v2 07/12] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Wen Yang @ 2026-05-31 14:54 UTC (permalink / raw)
  To: Nam Cao, Gabriele Monaco, linux-kernel, Steven Rostedt,
	linux-trace-kernel
In-Reply-To: <877bonoq70.fsf@yellow.woof>



On 5/28/26 17:09, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
>> may racing with the sched_switch handler:
>>
>>    da_monitor_start()               sched_switch handler
>>    -------------------------        ---------------------------------
>>    da_mon->monitoring = 1;
>>                                     if (da_monitoring(da_mon))  /* true  */
>>                                         ha_start_timer_ns(...);
>>                                         /* hrtimer->base == NULL, crash */
>>    da_monitor_init_hook(da_mon);
>>    /* hrtimer_setup() sets base */
>>
>> Fix the ordering and pair with release/acquire semantics:
>>
>>    da_monitor_init_hook(da_mon);
>>    smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>>    return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
>>
>> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
>> the load can observe monitoring=1 while hrtimer->base is still NULL.
>> The plain accesses are also data races under KCSAN.
>>
>> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
>> cover the reset path.
>>
>> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> 
> Looks correct to me.
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> 
> Wen, I am curious, how did you find this issue?
> 

It was caught during development of the tlob monitor by a KUnit test 

(since moved to selftests) that monitored a kthread running concurrently 

on another CPU.  On a multi-core box the window between setting 

monitoring=1 and da_monitor_init_hook() was wide enough to trigger the 

crash, and KCSAN flagged the plain monitoring accesses as data 
                                                races as well.

--
Best wishes,
Wen



^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-05-31  9:39 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>


On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>Pass an order and offset to collapse_huge_page to support collapsing anon
>memory to arbitrary orders within a PMD. order indicates what mTHP size we
>are attempting to collapse to, and offset indicates were in the PMD to
>start the collapse attempt.
>
>For non-PMD collapse we must leave the anon VMA write locked until after
>we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>the mTHP case this is not true, and we must keep the lock to prevent
>access/changes to the page tables. This can happen if the rmap walkers hit
>a pmd_none while the PMD entry is currently unavailable due to being
>temporarily removed during the collapse phase.
>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index fab35d318641..d64f42f66236 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>  */
>-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>-		int referenced, int unmapped, struct collapse_control *cc)
>+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned int order)
> {
>+	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>+	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> 	LIST_HEAD(compound_pagelist);
> 	pmd_t *pmd, _pmd;
>-	pte_t *pte;
>+	pte_t *pte = NULL;
> 	pgtable_t pgtable;
> 	struct folio *folio;
> 	spinlock_t *pmd_ptl, *pte_ptl;
> 	enum scan_result result = SCAN_FAIL;
> 	struct vm_area_struct *vma;
> 	struct mmu_notifier_range range;
>+	bool anon_vma_locked = false;
> 
>-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>-
>-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>+	result = alloc_charge_folio(&folio, mm, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_nolock;
> 
> 	mmap_read_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
> 	}
> 
>-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>+	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
>@@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 * released when it fails. So we jump out_nolock directly in
> 		 * that case.  Continuing to collapse causes inconsistency.
> 		 */
>-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>-						     referenced, HPAGE_PMD_ORDER);
>+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>+						     referenced, order);
> 		if (result != SCAN_SUCCEED)
> 			goto out_nolock;
> 	}
>@@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * mmap_lock.
> 	 */
> 	mmap_write_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 	/* check if the pmd is still valid */
> 	vma_start_write(vma);
>-	result = check_pmd_still_valid(mm, address, pmd);
>+	result = check_pmd_still_valid(mm, pmd_addr, pmd);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 
> 	anon_vma_lock_write(vma->anon_vma);
>+	anon_vma_locked = true;
> 
>-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>-				address + HPAGE_PMD_SIZE);
>+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>+				end_addr);
> 	mmu_notifier_invalidate_range_start(&range);
> 
> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>@@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
> 	 * it detects PMD is changed.
> 	 */
>-	_pmd = pmdp_collapse_flush(vma, address, pmd);
>+	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> 	spin_unlock(pmd_ptl);
> 	mmu_notifier_invalidate_range_end(&range);
> 	tlb_remove_table_sync_one();
> 
>-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> 	if (pte) {
>-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>-						      HPAGE_PMD_ORDER,
>-						      &compound_pagelist);
>+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>+						      order, &compound_pagelist);
> 		spin_unlock(pte_ptl);
> 	} else {
> 		result = SCAN_NO_PTE_TABLE;
> 	}
> 
> 	if (unlikely(result != SCAN_SUCCEED)) {
>-		if (pte)
>-			pte_unmap(pte);
> 		spin_lock(pmd_ptl);
>-		BUG_ON(!pmd_none(*pmd));
>+		WARN_ON_ONCE(!pmd_none(*pmd));
> 		/*
> 		 * We can only use set_pmd_at when establishing
> 		 * hugepmds and never for establishing regular pmds that
>@@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 */
> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> 		spin_unlock(pmd_ptl);
>-		anon_vma_unlock_write(vma->anon_vma);
> 		goto out_up_write;
> 	}
> 
> 	/*
>-	 * All pages are isolated and locked so anon_vma rmap
>-	 * can't run anymore.
>+	 * For PMD collapse all pages are isolated and locked so anon_vma
>+	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>+	 * removed and not all pages are isolated and locked, so we must hold
>+	 * the lock to prevent neighboring folios from attempting to access
>+	 * this PMD until its reinstalled.
> 	 */
>-	anon_vma_unlock_write(vma->anon_vma);
>+	if (is_pmd_order(order)) {
>+		anon_vma_unlock_write(vma->anon_vma);
>+		anon_vma_locked = false;
>+	}
> 
> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>-					   vma, address, pte_ptl,
>-					   HPAGE_PMD_ORDER,
>-					   &compound_pagelist);
>-	pte_unmap(pte);
>+					   vma, start_addr, pte_ptl,
>+					   order, &compound_pagelist);
> 	if (unlikely(result != SCAN_SUCCEED))
> 		goto out_up_write;
> 
>@@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * write.
> 	 */
> 	__folio_mark_uptodate(folio);
>-	pgtable = pmd_pgtable(_pmd);
>-
> 	spin_lock(pmd_ptl);
>-	BUG_ON(!pmd_none(*pmd));
>-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>+	WARN_ON_ONCE(!pmd_none(*pmd));
>+	if (is_pmd_order(order)) {
>+		pgtable = pmd_pgtable(_pmd);
>+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>+	} else {
>+		/*
>+		 * set_ptes is called in map_anon_folio_pte_nopf with the
>+		 * pmd_ptl lock still held; this is safe as the PMD is expected
>+		 * to be none. The pmd entry is then repopulated below.
>+		 */
>+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);

Emm ... is it safe to use map_anon_folio_pte_nopf() here?

At this point pmdp_collapse_flush() has cleared the PMD from the page
tables. The PTE table we are updating is only reachable through the saved
old PMD value, _pmd, until pmd_populate() below.

map_anon_folio_pte_nopf() does set_ptes() and then calls
update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
that hook as:

"
	At the end of every page fault, this routine is invoked to tell
	the architecture specific code that translations now exists
	in the software page tables for address space "vma->vm_mm"
	at virtual address "address" for "nr" consecutive pages.
"

But that does not seem true here yet, since the PTE table is not
reachable from vma->vm_mm when update_mmu_cache_range() is called.

Should we avoid calling update_mmu_cache_range() until after the PTE
table is reinstalled with pmd_populate()?

Cheers, Lance

>+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>+	}
> 	spin_unlock(pmd_ptl);
> 
> 	folio = NULL;
> 
> 	result = SCAN_SUCCEED;
> out_up_write:
>+	if (anon_vma_locked)
>+		anon_vma_unlock_write(vma->anon_vma);
>+	if (pte)
>+		pte_unmap(pte);
> 	mmap_write_unlock(mm);
> out_nolock:
> 	if (folio)
>@@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
> 		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc);
>+					    unmapped, cc, HPAGE_PMD_ORDER);
> 		/* collapse_huge_page will return with the mmap_lock released */
> 		*lock_dropped = true;
> 	}
>-- 
>2.54.0
>
>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-05-31  8:48 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260531071845.10875-1-lance.yang@linux.dev>



On 2026/5/31 15:18, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
>> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> 	if (result == SCAN_SUCCEED) {
>> 		/* collapse_huge_page expects the lock to be dropped before calling */
>> 		mmap_read_unlock(mm);
>> -		result = collapse_huge_page(mm, start_addr, referenced,
>> -					    unmapped, cc, HPAGE_PMD_ORDER);
>> -		/* collapse_huge_page will return with the mmap_lock released */
>> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>> +					     unmapped, cc, enabled_orders);
>> +		/* mmap_lock was released above, set lock_dropped */
>> 		*lock_dropped = true;
>> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> 
> Hmm ... don't we lose the allocation-failure result here?
> 
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
> 
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Looks like this is a more general issue with mthp_collapse() only
returning nr_collapsed.

For example, SCAN_PMD_MAPPED used to be propagated too, and
madvise_collapse() treats that as success. With the new code, if
nothing was collapsed by this call, that can also become SCAN_FAIL ...

So I think we should keep both.

Cheers, Lance


^ permalink raw reply

* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lance Yang @ 2026-05-31  7:31 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260522150009.121603-13-npache@redhat.com>


On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>There are cases where, if an attempted collapse fails, all subsequent
>orders are guaranteed to also fail. Avoid these collapse attempts by
>bailing out early.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index d3d7db8be26c..15b7298bc225 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> 			collapse_address = address + offset * PAGE_SIZE;
> 			ret = collapse_huge_page(mm, collapse_address, referenced,
> 						 unmapped, cc, order);
>-			if (ret == SCAN_SUCCEED) {
>+
>+			switch (ret) {
>+			/* Cases where we continue to next collapse candidate */
>+			case SCAN_SUCCEED:
> 				collapsed += nr_ptes;
>+				fallthrough;
>+			case SCAN_PTE_MAPPED_HUGEPAGE:
> 				continue;
>+			/* Cases where lower orders might still succeed */
>+			case SCAN_LACK_REFERENCED_PAGE:
>+			case SCAN_EXCEED_NONE_PTE:
>+			case SCAN_EXCEED_SWAP_PTE:
>+			case SCAN_EXCEED_SHARED_PTE:
>+			case SCAN_PAGE_LOCK:
>+			case SCAN_PAGE_COUNT:
>+			case SCAN_PAGE_NULL:
>+			case SCAN_DEL_PAGE_LRU:
>+			case SCAN_PTE_NON_PRESENT:
>+			case SCAN_PTE_UFFD_WP:
>+			case SCAN_ALLOC_HUGE_PAGE_FAIL:

Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
here?

If charging the current order fails, a smaller order might still fit :)

Cheers, Lance

>+			case SCAN_PAGE_LAZYFREE:
>+				goto next_order;
>+			/* Cases where no further collapse is possible */
>+			default:
>+				return collapsed;
> 			}
> 		}
> 
>-- 
>2.54.0
>
>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-05-31  7:18 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260522150009.121603-12-npache@redhat.com>


On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
[...]
>@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc, HPAGE_PMD_ORDER);
>-		/* collapse_huge_page will return with the mmap_lock released */
>+		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>+					     unmapped, cc, enabled_orders);
>+		/* mmap_lock was released above, set lock_dropped */
> 		*lock_dropped = true;
>+		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;

Hmm ... don't we lose the allocation-failure result here?

Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
in khugepaged_do_scan().

Now if allocation fails and nr_collapsed stays 0, we just return
SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Cheers, Lance

^ permalink raw reply

* Re: PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-31  1:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260530110754.14870622@gandalf.local.home>

On Sat, 30 May 2026 11:07:54 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 30 May 2026 23:14:27 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > Thanks, and Sashiko reviewed this. 
> > 
> > https://sashiko.dev/#/patchset/20260529110442.0967a64c%40fedora
> > 
> > I think both comments are valid, especially, second one is important.
> 
> Hmm, comment below.
> 
> > 
> > BTW, I updated probes/fixes. Could you also update this and rebase
> > on probes/fixes branch?
> > 
> > I'm working on the nesting and container_of support patches which
> > are on top of this.
> 
> OK, I'll make sure to use your probes branch. Which one should I use? your
> probes/for-next or probes/urgent?

Ah, probes/fixes. (I think I need to update probes/for-next to rebase on it)

> 
> As for sashiko's comments:
> 
> > commit c8393ea9936934c16ba2881d9e7b3cce86c678ff
> > Author: Steven Rostedt <rostedt@goodmis.org>
> > [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
> > This commit introduces syntax to allow event probes to typecast pointer
> > fields to a structure using BTF. This enables dereferencing structures
> > using BTF names rather than having to manually calculate offsets.
> > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > > index 39f040c863e85..0e87409a44033 100644
> > > --- a/kernel/trace/trace_probe.c
> > > +++ b/kernel/trace/trace_probe.c
> > [ ... ]
> > > @@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
> > >  	int i, is_ptr, ret;
> > >  	u32 tid;
> > >  
> > > -	if (WARN_ON_ONCE(!ctx->funcname))
> > > +	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
> > >  		return -EINVAL;
> > >  
> > >  	is_ptr = split_next_field(varname, &field, ctx);
> > > @@ -653,6 +686,16 @@ static int parse_btf_arg(char *varname,
> > >  		return -EOPNOTSUPP;
> > >  	}
> > >  
> > > +	if (ctx->flags & TPARG_FL_TEVENT) {
> > > +		ret = parse_trace_event(varname, code, ctx);
> > > +		if (ret < 0)
> > > +			return ret;
> 
> > Does this silently drop the error on parsing failure?
> > If parse_trace_event() fails and returns -EINVAL, it appears this path
> > returns the error without logging it.
> > Unlike parse_probe_vars(), which explicitly logs the failure when
> > parse_trace_event() fails, this path bypasses the log. Could this make
> > it difficult to debug invalid typecast expressions since nothing is written
> > to tracing/error_log?
> 
> Hmm, I guess I can put an error log here, as this is a unique path.

Yeah, just adding trace_probe_log_err(ctx->offset, BAD_VAR) seems OK.

> 
> > > +		if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> > > +			return -EINVAL;
> > > +		type = ctx->last_struct;
> > > +		goto found_type;
> > > +	}
> > > +
> > >  	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
> > >  		code->op = FETCH_OP_RETVAL;
> > [ ... ]
> > > @@ -758,7 +802,71 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
> > >  	return 0;
> > >  }
> > >  
> > > -#else
> > > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> > [ ... ]
> > > +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> > > +			   struct fetch_insn *end,
> > > +			   struct traceprobe_parse_context *ctx)
> > > +{
> > > +	char *tmp;
> > > +	int ret;
> > > +
> > > +	/* Currently this only works for eprobes */
> > > +	if (!(ctx->flags & TPARG_FL_TEVENT)) {
> > > +		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	tmp = strchr(arg, ')');
> > > +	if (!tmp) {
> > > +		trace_probe_log_err(ctx->offset + strlen(arg),
> > > +				    DEREF_OPEN_BRACE);
> > > +		return -EINVAL;
> > > +	}
> > > +	*tmp = '\0';
> > > +	ret = query_btf_struct(arg + 1, ctx);
> > > +	*tmp = ')';
> > > +
> > > +	if (ret < 0) {
> > > +		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> > > +		ret = -EINVAL;
> > > +		goto out_put;
> > > +	}
> > > +
> > > +	ctx->flags |= TPARG_FL_TYPECAST;
> > > +	tmp++;
> > > +
> > > +	ctx->offset += tmp - arg;
> > > +	ret = parse_btf_arg(tmp, pcode, end, ctx);
> > > +	ctx->flags &= ~TPARG_FL_TYPECAST;
> > > +	ctx->last_struct = NULL;
> > > +out_put:
> > > +	btf_put(ctx->struct_btf);
> 
> > Does this prematurely release the BTF struct reference?
> > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > later steps in traceprobe_parse_probe_arg_body() (like
> > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > will evaluate to ctx->btf (which is NULL for eprobes).
> > Could this potentially lead to silent defaults, such as 64-bit reads for
> > smaller fields, or fail to inject pointer dereferences for string fields,
> > while also leaving ctx->last_type pointing to a prematurely released BTF
> > object?
> 
> Does this mean we need to set ctx->last_type to NULL here too?

No, since the member we refer can be different from unsigned long.
When we don't have ":type" suffix, we use BTF type information to
decide appropriate type.

> 
> Because everything above is pretty much the expected behavior. The put is
> *not* premature. The last_struct and struct_btf are both set to NULL. I
> guess the only thing missing is to reset last_type as well.

No, as I explained, the last_type is used to determine the member type
when user does not specify the ":type" suffix.

So, what we need to do is deferring the btf_put(struct_btf) as below:
(no build test yet.)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index adaa1e4fbdd6..9a73c49e22df 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -538,6 +538,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
 		ctx->params = NULL;
 		ctx->nr_params = 0;
 	}
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+	}
 }
 
 /* Return 1 if the field separator is arrow operator ('->') */
@@ -802,22 +806,18 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
 
 static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
 {
+	struct btf *btf = NULL;
 	int id;
 
-	if (!ctx->struct_btf) {
-		struct btf *btf;
-
-		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
-		if (id < 0)
-			return id;
-		ctx->struct_btf = btf;
-	} else {
-		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
-		if (id < 0)
-			return id;
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
 	}
 
-	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		return id;
+	ctx->struct_btf = btf;
 	return 0;
 }
 
@@ -846,8 +846,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 
 	if (ret < 0) {
 		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
-		ret = -EINVAL;
-		goto out_put;
+		return ret;
 	}
 
 	ctx->flags |= TPARG_FL_TYPECAST;
@@ -857,9 +856,6 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	ret = parse_btf_arg(tmp, pcode, end, ctx);
 	ctx->flags &= ~TPARG_FL_TYPECAST;
 	ctx->last_struct = NULL;
-out_put:
-	btf_put(ctx->struct_btf);
-	ctx->struct_btf = NULL;
 	return ret;
 }
 

Thanks!

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply related

* Re: [PATCH] tracing: fix CFI violation in probestub helper
From: Steven Rostedt @ 2026-05-30 22:57 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Eva Kurchatova, mhiramat, linux-trace-kernel,
	linux-kernel, mathieu.desnoyers, jpoimboe, samitolvanen
In-Reply-To: <20260530221921.50d958cf@pumpkin>

On Sat, 30 May 2026 22:19:21 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> > +/*								\
> > + * The probestub is only used for tprobes and not referenced	\
> > + * anywhere else. This causes objtool to think it's not called	\
> > + * at all and will add it to the seal list which will remove	\
> > + * the ENDBR causing issues if a tprobe is ever used.		\
> > + */								\  
> 
> Isn't the sense of that all wrong?

The above is still correct. It just expresses what the probestub is
used for.

> Maybe:
> 	Annotate the protosub 'CFI_NOSEAL' to stop objtool requesting the
> 	kernel remove the ENDBR because the only references to the
> 	function are in the __tracepoint section that objtool doesn't scan.

I do agree that your version is a bit more concise.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing: fix CFI violation in probestub helper
From: David Laight @ 2026-05-30 21:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Eva Kurchatova, mhiramat, linux-trace-kernel,
	linux-kernel, mathieu.desnoyers, jpoimboe, samitolvanen
In-Reply-To: <20260529195134.37d4f5cc@fedora>

On Fri, 29 May 2026 19:51:34 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 29 May 2026 22:08:26 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, May 28, 2026 at 04:49:02PM -0400, Steven Rostedt wrote:  
> > > On Sun, 24 May 2026 18:43:01 +0300
> > > Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> > >     
> > > > When multiple callbacks are registered on the same tracepoint, probestub
> > > > will be indirectly called via traceiter helper.
> > > > 
> > > > Pointer to probestub callback resides in __tracepoints section, which is
> > > > excluded from ENDBR checks in objtool. Pointers to regfunc/unregfunc
> > > > callbacks reside in extended structure however, which is not affected.
> > > > 
> > > > Registering multiple callbacks will result in a #CP exception due to
> > > > missed ENDBR in __probestub helper on a CFI-enabled machine.
> > > > 
> > > > Fix this by adding CFI_NOSEAL annotation to probestub declaration.
> > > > 
> > > > Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> > > > Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>    
> 
> > 
> > The only place the function address lives is in that __tracepoint
> > section. Since that is explicitly excluded by objtool, it figures there
> > are no actual references to __probestub and the function goes on the
> > seal list and the kernel explicitly scribbles the ENDBR on boot.
> > 
> > Then, if it ever gets used on an IBT enabled host, *boom*.  
> 
> That makes much more sense.
> 
> > 
> > I agree it would've perhaps been clearer if there was part of a splat in
> > the changelog, but the issue is real afaict.
> > 
> > Also, I do think this:
> >   
> > > > @@ -356,6 +357,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > > >  	void __probestub_##_name(void *__data, proto)			\
> > > >  	{								\
> > > >  	}								\
> > > > +	CFI_NOSEAL(__probestub_##_name);				\
> > > >  	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
> > > >  
> > > >  #define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args)		\    
> > 
> > could do with a comment, explaining why it wants the NOSEAL.  
> 
> Yes.
> 
> Thus, the above change log is totally incorrect and should be updated to:
> 
>    tprobes uses a stub function of the tracepoint to allow fprobes to
>    attach to the tracepoint call site and have access to its arguments.
>    The stub function is called __probestub_##_name() and is only
>    referenced as a pointer in the tracepoint structure so that the
>    tprobe can have access to it.
> 
>    The issue is that the probstub function is only referenced in the
>    __tracepoint section and objtool thinks nothing calls it. Since it
>    explicitly excludes the __tracepoint section, objtool thinks there
>    are no callers so it puts the probstub function into the seal list
>    and then the kernel scrubs its ENDBR on boot.
> 
>    This becomes an issue if someone were to use a tprobe which will
>    register the probestub as a callback to the tracepoint so that a
>    fprobe may attach to it and get access to the arguments. Without the
>    ENDBR it will make the kernel go BOOM!
> 
> 
> Then have a comment in the patch with:
> 
>  void __probestub_##_name(void *__data, proto)			\
>  {								\
>  }								\
> +/*								\
> + * The probestub is only used for tprobes and not referenced	\
> + * anywhere else. This causes objtool to think it's not called	\
> + * at all and will add it to the seal list which will remove	\
> + * the ENDBR causing issues if a tprobe is ever used.		\
> + */								\

Isn't the sense of that all wrong?
Maybe:
	Annotate the protosub 'CFI_NOSEAL' to stop objtool requesting the
	kernel remove the ENDBR because the only references to the
	function are in the __tracepoint section that objtool doesn't scan.

-- David
 
> +CFI_NOSEAL(__probestub_##_name);				\
>  DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
> 
> 
> -- Steve
> 


^ permalink raw reply

* Re: [PATCH] tracing: fix CFI violation in probestub helper
From: Masami Hiramatsu @ 2026-05-30 15:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Eva Kurchatova, mhiramat, linux-trace-kernel,
	linux-kernel, mathieu.desnoyers, jpoimboe, samitolvanen
In-Reply-To: <20260529195134.37d4f5cc@fedora>

On Fri, 29 May 2026 19:51:34 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 29 May 2026 22:08:26 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, May 28, 2026 at 04:49:02PM -0400, Steven Rostedt wrote:
> > > On Sun, 24 May 2026 18:43:01 +0300
> > > Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> > >   
> > > > When multiple callbacks are registered on the same tracepoint, probestub
> > > > will be indirectly called via traceiter helper.
> > > > 
> > > > Pointer to probestub callback resides in __tracepoints section, which is
> > > > excluded from ENDBR checks in objtool. Pointers to regfunc/unregfunc
> > > > callbacks reside in extended structure however, which is not affected.
> > > > 
> > > > Registering multiple callbacks will result in a #CP exception due to
> > > > missed ENDBR in __probestub helper on a CFI-enabled machine.
> > > > 
> > > > Fix this by adding CFI_NOSEAL annotation to probestub declaration.
> > > > 
> > > > Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> > > > Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>  
> 
> > 
> > The only place the function address lives is in that __tracepoint
> > section. Since that is explicitly excluded by objtool, it figures there
> > are no actual references to __probestub and the function goes on the
> > seal list and the kernel explicitly scribbles the ENDBR on boot.
> > 
> > Then, if it ever gets used on an IBT enabled host, *boom*.
> 
> That makes much more sense.

Ah, I got it.

> 
> > 
> > I agree it would've perhaps been clearer if there was part of a splat in
> > the changelog, but the issue is real afaict.
> > 
> > Also, I do think this:
> > 
> > > > @@ -356,6 +357,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > > >  	void __probestub_##_name(void *__data, proto)			\
> > > >  	{								\
> > > >  	}								\
> > > > +	CFI_NOSEAL(__probestub_##_name);				\
> > > >  	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
> > > >  
> > > >  #define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args)		\  
> > 
> > could do with a comment, explaining why it wants the NOSEAL.
> 
> Yes.
> 
> Thus, the above change log is totally incorrect and should be updated to:
> 
>    tprobes uses a stub function of the tracepoint to allow fprobes to
>    attach to the tracepoint call site and have access to its arguments.
>    The stub function is called __probestub_##_name() and is only
>    referenced as a pointer in the tracepoint structure so that the
>    tprobe can have access to it.
> 
>    The issue is that the probstub function is only referenced in the
>    __tracepoint section and objtool thinks nothing calls it. Since it
>    explicitly excludes the __tracepoint section, objtool thinks there
>    are no callers so it puts the probstub function into the seal list
>    and then the kernel scrubs its ENDBR on boot.
> 
>    This becomes an issue if someone were to use a tprobe which will
>    register the probestub as a callback to the tracepoint so that a
>    fprobe may attach to it and get access to the arguments. Without the
>    ENDBR it will make the kernel go BOOM!
> 
> 
> Then have a comment in the patch with:
> 
>  void __probestub_##_name(void *__data, proto)			\
>  {								\
>  }								\
> +/*								\
> + * The probestub is only used for tprobes and not referenced	\
> + * anywhere else. This causes objtool to think it's not called	\
> + * at all and will add it to the seal list which will remove	\
> + * the ENDBR causing issues if a tprobe is ever used.		\
> + */								\
> +CFI_NOSEAL(__probestub_##_name);				\
>  DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);

This looks good to me. Eva, can you update the patch?

Thanks for fix.

> 
> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-30 15:07 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260530231427.b079fefffc724a40082cd64b@kernel.org>

On Sat, 30 May 2026 23:14:27 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Thanks, and Sashiko reviewed this. 
> 
> https://sashiko.dev/#/patchset/20260529110442.0967a64c%40fedora
> 
> I think both comments are valid, especially, second one is important.

Hmm, comment below.

> 
> BTW, I updated probes/fixes. Could you also update this and rebase
> on probes/fixes branch?
> 
> I'm working on the nesting and container_of support patches which
> are on top of this.

OK, I'll make sure to use your probes branch. Which one should I use? your
probes/for-next or probes/urgent?

As for sashiko's comments:

> commit c8393ea9936934c16ba2881d9e7b3cce86c678ff
> Author: Steven Rostedt <rostedt@goodmis.org>
> [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
> This commit introduces syntax to allow event probes to typecast pointer
> fields to a structure using BTF. This enables dereferencing structures
> using BTF names rather than having to manually calculate offsets.
> > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > index 39f040c863e85..0e87409a44033 100644
> > --- a/kernel/trace/trace_probe.c
> > +++ b/kernel/trace/trace_probe.c
> [ ... ]
> > @@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
> >  	int i, is_ptr, ret;
> >  	u32 tid;
> >  
> > -	if (WARN_ON_ONCE(!ctx->funcname))
> > +	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
> >  		return -EINVAL;
> >  
> >  	is_ptr = split_next_field(varname, &field, ctx);
> > @@ -653,6 +686,16 @@ static int parse_btf_arg(char *varname,
> >  		return -EOPNOTSUPP;
> >  	}
> >  
> > +	if (ctx->flags & TPARG_FL_TEVENT) {
> > +		ret = parse_trace_event(varname, code, ctx);
> > +		if (ret < 0)
> > +			return ret;

> Does this silently drop the error on parsing failure?
> If parse_trace_event() fails and returns -EINVAL, it appears this path
> returns the error without logging it.
> Unlike parse_probe_vars(), which explicitly logs the failure when
> parse_trace_event() fails, this path bypasses the log. Could this make
> it difficult to debug invalid typecast expressions since nothing is written
> to tracing/error_log?

Hmm, I guess I can put an error log here, as this is a unique path.

> > +		if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> > +			return -EINVAL;
> > +		type = ctx->last_struct;
> > +		goto found_type;
> > +	}
> > +
> >  	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
> >  		code->op = FETCH_OP_RETVAL;
> [ ... ]
> > @@ -758,7 +802,71 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
> >  	return 0;
> >  }
> >  
> > -#else
> > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> [ ... ]
> > +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> > +			   struct fetch_insn *end,
> > +			   struct traceprobe_parse_context *ctx)
> > +{
> > +	char *tmp;
> > +	int ret;
> > +
> > +	/* Currently this only works for eprobes */
> > +	if (!(ctx->flags & TPARG_FL_TEVENT)) {
> > +		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> > +		return -EINVAL;
> > +	}
> > +
> > +	tmp = strchr(arg, ')');
> > +	if (!tmp) {
> > +		trace_probe_log_err(ctx->offset + strlen(arg),
> > +				    DEREF_OPEN_BRACE);
> > +		return -EINVAL;
> > +	}
> > +	*tmp = '\0';
> > +	ret = query_btf_struct(arg + 1, ctx);
> > +	*tmp = ')';
> > +
> > +	if (ret < 0) {
> > +		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> > +		ret = -EINVAL;
> > +		goto out_put;
> > +	}
> > +
> > +	ctx->flags |= TPARG_FL_TYPECAST;
> > +	tmp++;
> > +
> > +	ctx->offset += tmp - arg;
> > +	ret = parse_btf_arg(tmp, pcode, end, ctx);
> > +	ctx->flags &= ~TPARG_FL_TYPECAST;
> > +	ctx->last_struct = NULL;
> > +out_put:
> > +	btf_put(ctx->struct_btf);

> Does this prematurely release the BTF struct reference?
> If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> later steps in traceprobe_parse_probe_arg_body() (like
> find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> will evaluate to ctx->btf (which is NULL for eprobes).
> Could this potentially lead to silent defaults, such as 64-bit reads for
> smaller fields, or fail to inject pointer dereferences for string fields,
> while also leaving ctx->last_type pointing to a prematurely released BTF
> object?

Does this mean we need to set ctx->last_type to NULL here too?

Because everything above is pretty much the expected behavior. The put is
*not* premature. The last_struct and struct_btf are both set to NULL. I
guess the only thing missing is to reset last_type as well.

-- Steve


> > +	ctx->struct_btf = NULL;
> > +	return ret;
> > +}
> > +
> > +#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */

^ permalink raw reply

* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Steven Rostedt @ 2026-05-30 14:55 UTC (permalink / raw)
  To: chenjun (AM)
  Cc: Masami Hiramatsu (Google), mathieu.desnoyers@efficios.com,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
In-Reply-To: <b72c8c6ce638449ea9e99e892a25dc86@huawei.com>

On Sat, 30 May 2026 10:46:35 +0000
"chenjun (AM)" <chenjun102@huawei.com> wrote:

> > Hi Chen,
> > 
> > Do you plan on sending updates to address the comments that Masami and
> > I have made?
> > 
> > -- Steve  
> 
> Hi，
> 
> Sorry, I've been busy with other things lately. I'll release the patch 
> v2 next week.
> 
> One thing I'd like to confirm is whether to use `called_within` as the 
> filter name.

I was talking with my wife about this (who isn't technical at all, but
wanted to get her thoughts on terminology) and we came up with simply using
"within".

That is, we want the event to trigger if it is called "within" a function.
Hence, if you use:

  echo 'within=="$function_name"' > events/../filter

I think it's pretty obvious to what it means.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Steven Rostedt @ 2026-05-30 14:51 UTC (permalink / raw)
  To: Crystal Wood
  Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
	Wander Lairson Costa, sashiko-bot, sashiko-reviews
In-Reply-To: <b555e6ba6f723c244edc88afbc6383022d51ed3c.camel@redhat.com>

On Fri, 29 May 2026 22:33:37 -0500
Crystal Wood <crwood@redhat.com> wrote:
> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index 75678053b21c..bda1e0e0d2e1 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -476,8 +476,11 @@ static void print_osnoise_headers(struct seq_file *s)
> > >  										\
> > >  	rcu_read_lock();							\
> > >  	list_for_each_entry_rcu(inst, &osnoise_instances, list) {		\
> > > +		if (trace_array_get(inst->tr) < 0)				\
> > > +			continue;						\
> > >  		buffer = inst->tr->array_buffer.buffer;				\
> > >  		trace_array_printk_buf(buffer, _THIS_IP_, msg);			\
> > > +		trace_array_put(inst->tr);					\
> > >  	}									\
> > >  	rcu_read_unlock();							\
> > >  	osnoise_data.tainted = true;						\  
> > 
> > OK, I'll prepare a v3.  
> 
> Many osnoise_taint() callers, as well as timerlat_dump_stack(), can have
> preemption disabled, so the mutex in trace_array_get() won't work.

Right. OK, so another solution is to simply call synchronize_rcu() instead
of the kvfree_rcu_mightsleep(inst);

	synchronize_rcu();
	kvfree(inst);

Then there should not be any race, because the rmdir will have to wait for
the synchronization before finishing. This isn't something people should be
running a lot of, so I don't think it would cause too much pain in waiting
to unregister the osnoise tracer.

-- Steve

^ permalink raw reply

* [PATCH v3 13/13] verification/rvgen: Generate cleanup hook for per-obj monitor
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

Per-object monitors can allocate memory dynamically and such memory is
required for the lifetime of the object, then it should be freed with
the appropriate call.

Force the generation scripts to add a cleanup function the user will
need to wire to the appropriate event (e.g. sched_process_exit for
tasks). This can be safely removed if the object will never cease to
exist before disabling the monitor (e.g. if following only static
variables).

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 tools/verification/rvgen/rvgen/dot2k.py | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index 110cfd69e..3060aa4b9 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -17,6 +17,9 @@ from .automata import _EventConstraintKey, _StateConstraintKey, AutomataError
 class dot2k(Monitor, Dot2c):
     template_dir = "dot2k"
 
+    # only needed for the per-obj cleanup hook
+    cleanup_marker = "obj_cleanup"
+
     def __init__(self, file_path, MonitorType, extra_params={}):
         self.monitor_type = MonitorType
         Monitor.__init__(self, extra_params)
@@ -56,18 +59,30 @@ class dot2k(Monitor, Dot2c):
                 buff.append(f"\tda_{handle}({event}{self.enum_suffix});")
             buff.append("}")
             buff.append("")
+        if self.monitor_type == "per_obj":
+            buff.append("/* XXX: obj is being destroyed, remove if not required (e.g. obj is static) */")
+            buff.append(f"static void handle_{self.cleanup_marker}(void *data, /* XXX: fill header */)")
+            buff.append("{")
+            buff.append("\tint id = /* XXX: how do I get the id? */;")
+            buff.append("\tda_destroy_storage(id);")
+            buff.append("}")
+            buff.append("")
         return '\n'.join(buff)
 
     def fill_tracepoint_attach_probe(self) -> str:
         buff = []
         for event in self.events:
             buff.append(f"\trv_attach_trace_probe(\"{self.name}\", /* XXX: tracepoint */, handle_{event});")
+        if self.monitor_type == "per_obj":
+            buff.append(f"\trv_attach_trace_probe(\"{self.name}\", /* XXX: cleanup tracepoint */, handle_{self.cleanup_marker});")
         return '\n'.join(buff)
 
     def fill_tracepoint_detach_helper(self) -> str:
         buff = []
         for event in self.events:
             buff.append(f"\trv_detach_trace_probe(\"{self.name}\", /* XXX: tracepoint */, handle_{event});")
+        if self.monitor_type == "per_obj":
+            buff.append(f"\trv_detach_trace_probe(\"{self.name}\", /* XXX: cleanup tracepoint */, handle_{self.cleanup_marker});")
         return '\n'.join(buff)
 
     def fill_model_h_header(self) -> list[str]:
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 12/13] rv: Fix read_lock scope in per-task DA cleanup
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

The da_monitor_reset_all() function for per-task monitors takes
tasklist_lock while iterating over tasks, then keeps it also while
iterating over idle tasks (one per CPU). The latter is not necessary
since the lock needs to guard only for_each_process_thread().

Use a scoped_guard for more compact syntax and adjust the scope only
where the lock is necessary.

Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index ae8c35fcb..f4fc22cac 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -294,12 +294,12 @@ static void da_monitor_reset_all(void)
 	struct task_struct *g, *p;
 	int cpu;
 
-	read_lock(&tasklist_lock);
-	for_each_process_thread(g, p)
-		da_monitor_reset(da_get_monitor(p));
+	scoped_guard(read_lock, &tasklist_lock) {
+		for_each_process_thread(g, p)
+			da_monitor_reset(da_get_monitor(p));
+	}
 	for_each_present_cpu(cpu)
 		da_monitor_reset(da_get_monitor(idle_task(cpu)));
-	read_unlock(&tasklist_lock);
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 11/13] verification/rvgen: Fix suffix strip in dot2k
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

__start_to_invariant_check() and __get_constraint_env() parse the
environment variable's name from sources that have it padded with the
monitor name. This is removed using rstrip(), which is not meant to
strip a substring but rather a set of characters.

Use removesuffix() to actually get rid of the trailing _<monitor name>.

Fixes: a82adadb16894 ("verification/rvgen: Add support for Hybrid Automata")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 tools/verification/rvgen/rvgen/dot2k.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index e6f476b90..110cfd69e 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -215,14 +215,14 @@ class ha2k(dot2k):
     def __get_constraint_env(self, constr: str) -> str:
         """Extract the second argument from an ha_ function"""
         env = constr.split("(")[1].split()[1].rstrip(")").rstrip(",")
-        assert env.rstrip(f"_{self.name}") in self.envs
+        assert env.removesuffix(f"_{self.name}") in self.envs
         return env
 
     def __start_to_invariant_check(self, constr: str) -> str:
         # by default assume the timer has ns expiration
         env = self.__get_constraint_env(constr)
         clock_type = "ns"
-        if self.env_types.get(env.rstrip(f"_{self.name}")) == "j":
+        if self.env_types.get(env.removesuffix(f"_{self.name}")) == "j":
             clock_type = "jiffy"
 
         return f"return ha_check_invariant_{clock_type}(ha_mon, {env}, time_ns)"
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 10/13] rv: Use 0 to check preemption enabled in opid
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

Tracepoint handlers no longer run with preemption disabled by default
since a46023d5616 ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast"), the opid monitor should now count 1
in the preemption count as preemption disabled.

Change the rule for preempt_off to preempt > 0.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/trace/rv/monitors/opid/opid.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 2922318c6..3b6a85e81 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -22,14 +22,8 @@ static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_opid env, u64 time_ns
 	if (env == irq_off_opid)
 		return irqs_disabled();
 	else if (env == preempt_off_opid) {
-		/*
-		 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
-		 * preemption (adding one to the preempt_count). Since we are
-		 * interested in the preempt_count at the time the tracepoint was
-		 * hit, we consider 1 as still enabled.
-		 */
 		if (IS_ENABLED(CONFIG_PREEMPTION))
-			return (preempt_count() & PREEMPT_MASK) > 1;
+			return (preempt_count() & PREEMPT_MASK) > 0;
 		return true;
 	}
 	return ENV_INVALID_VALUE;
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 09/13] rv: Prevent task migration while handling per-CPU events
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

Tracepoint handlers are fully preemptible after a46023d5616 ("tracing:
Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"). When
a per-CPU monitor handles an event, it retrieves the monitor state using
a per-CPU pointer. If the event itself doesn't disable preemption, the
task can migrate to a different CPU and we risk updating the wrong
monitor.

Mitigate this by explicitly disabling task migration before acquiring
the monitor pointer. This cannot guarantee the monitor runs on the
correct CPU but reduces the race condition window and prevents warnings.

Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 03c1150b1..ae8c35fcb 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -194,6 +194,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_sync_hook();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_CPU
 /*
  * Functions to define, init and get a per-cpu monitor.
@@ -244,6 +248,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_sync_hook();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard() guard(migrate)()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_TASK
 /*
  * Functions to define, init and get a per-task monitor.
@@ -700,6 +708,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
  */
 static inline void da_handle_event(enum events event)
 {
+	da_implicit_guard();
 	__da_handle_event(da_get_monitor(), event, 0);
 }
 
@@ -715,6 +724,7 @@ static inline void da_handle_event(enum events event)
  */
 static inline bool da_handle_start_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_event(da_get_monitor(), event, 0);
 }
 
@@ -726,6 +736,7 @@ static inline bool da_handle_start_event(enum events event)
  */
 static inline bool da_handle_start_run_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_run_event(da_get_monitor(), event, 0);
 }
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

HA monitors may start timers, all cleanup functions currently stop the
timers asynchronously to avoid sleeping in the wrong context.
Nothing makes sure running callbacks terminate on cleanup.

Run the entire HA timer callback in an RCU read-side critical section,
this way we can simply synchronize_rcu() with any pending timer and are
sure any cleanup using kfree_rcu() runs after callbacks terminated.
Additionally make sure any unlikely callback running late won't run any
code if the monitor is marked as disabled or if destruction started.
Use memory barriers to serialise with racing resets.

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 26 +++++++++++++++++++++-----
 include/rv/ha_monitor.h | 28 +++++++++++++++++++++++++---
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 60dc39f26..03c1150b1 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
 #define da_monitor_reset_hook(da_mon)
 #endif
 
+/*
+ * Hook to allow the implementation of hybrid automata: define it with a
+ * function that waits for the termination of all monitors background
+ * activities (e.g. all timers). This hook can sleep.
+ */
+#ifndef da_monitor_sync_hook
+#define da_monitor_sync_hook()
+#endif
+
 /*
  * Type for the target id, default to int but can be overridden.
  * A long type can work as hash table key (PER_OBJ) but will be downgraded to
@@ -83,7 +92,8 @@ static inline void da_monitor_reset(struct da_monitor *da_mon)
 {
 	da_monitor_reset_hook(da_mon);
 	WRITE_ONCE(da_mon->monitoring, 0);
-	da_mon->curr_state = model_get_initial_state();
+	/* Pair with load in __ha_monitor_timer_callback */
+	smp_store_release(&da_mon->curr_state, model_get_initial_state());
 }
 
 /*
@@ -181,6 +191,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_CPU
@@ -230,6 +241,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_TASK
@@ -317,6 +329,7 @@ static inline void da_monitor_destroy(void)
 
 	tracepoint_synchronize_unregister();
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 
 	rv_put_task_monitor_slot(task_mon_slot);
 	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
@@ -495,10 +508,9 @@ static void da_monitor_reset_all(void)
 	struct da_monitor_storage *mon_storage;
 	int bkt;
 
-	rcu_read_lock();
+	guard(rcu)();
 	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
 		da_monitor_reset(&mon_storage->rv.da_mon);
-	rcu_read_unlock();
 }
 
 static inline int da_monitor_init(void)
@@ -514,13 +526,17 @@ static inline void da_monitor_destroy(void)
 	int bkt;
 
 	tracepoint_synchronize_unregister();
+	scoped_guard(rcu) {
+		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
+			da_monitor_reset_hook(&mon_storage->rv.da_mon);
+		}
+	}
+	da_monitor_sync_hook();
 	/*
 	 * This function is called after all probes are disabled and no longer
 	 * pending, we can safely assume no concurrent user.
 	 */
-	synchronize_rcu();
 	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
-		da_monitor_reset_hook(&mon_storage->rv.da_mon);
 		hash_del_rcu(&mon_storage->node);
 		kfree(mon_storage);
 	}
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 0aeb6bc38..86bd1cab8 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -37,6 +37,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_event_hook ha_monitor_handle_constraint
 #define da_monitor_init_hook ha_monitor_init_env
 #define da_monitor_reset_hook ha_monitor_reset_env
+#define da_monitor_sync_hook() synchronize_rcu()
 
 #if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
 /*
@@ -137,11 +138,13 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
 #endif /* HA_CLK_NS */
 
 static bool ha_mon_initializing;
+static bool ha_mon_destroying;
 
 static int ha_monitor_init(void)
 {
 	int ret;
 
+	WRITE_ONCE(ha_mon_destroying, false);
 	ha_mon_initializing = true;
 	ret = da_monitor_init();
 	if (ret == 0)
@@ -152,6 +155,7 @@ static int ha_monitor_init(void)
 
 static void ha_monitor_destroy(void)
 {
+	WRITE_ONCE(ha_mon_destroying, true);
 	ha_monitor_disable_hook();
 	da_monitor_destroy();
 }
@@ -302,12 +306,30 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 	return false;
 }
 
+/*
+ * __ha_monitor_timer_callback - generic callback representation
+ *
+ * This callback runs in an RCU read-side critical section to allow the
+ * destruction sequence to easily synchronize_rcu() with all pending timers
+ * after asynchronously disabling them. The ha_mon_destroying check ensures
+ * any callback entering the RCU section after synchronize_rcu() completes
+ * will see the flag and bail out immediately.
+ */
 static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
 {
-	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
 	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
-	u64 time_ns = ha_get_ns();
-
+	enum states curr_state;
+	u64 time_ns;
+
+	guard(rcu)();
+	if (unlikely(READ_ONCE(ha_mon_destroying)))
+		return;
+	/* Ensure consistent curr_state if we race with da_monitor_reset */
+	curr_state = smp_load_acquire(&ha_mon->da_mon.curr_state);
+	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
+		return;
+
+	time_ns = ha_get_ns();
 	ha_get_env_string(&env_string, ha_mon, time_ns);
 	ha_react(curr_state, EVENT_NONE, env_string.buffer);
 	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-05-30 14:16 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>

Hybrid automata monitors may start timers, depending on the model, these
may remain active on an exiting task and cause false positives or even
access freed memory.

Add an enable/disable hook in the HA code, currently only populated by
the per-task handler for registration and deregistration.
This hooks to the sched_process_exit event and ensures the timer is
stopped for every exiting task. The handler is enabled automatically but
may be disabled, for instance if the monitor uses the event for another
purpose (but should still manually ensure timers are stopped).

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/ha_monitor.h | 45 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index ad3347af1..0aeb6bc38 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
 static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
 static inline void ha_setup_timer(struct ha_monitor *ha_mon);
 static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
 static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 					 enum states curr_state,
 					 enum events event,
@@ -37,6 +38,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_init_hook ha_monitor_init_env
 #define da_monitor_reset_hook ha_monitor_reset_env
 
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+/*
+ * Automatic cleanup handlers for per-task HA monitors, only skip if you know
+ * what you are doing (e.g. you want to implement cleanup manually in another
+ * handler doing more things).
+ */
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead);
+
+#define ha_monitor_enable_hook()                                             \
+	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#define ha_monitor_disable_hook()                                            \
+	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#else
+#define ha_monitor_enable_hook() ((void)0)
+#define ha_monitor_disable_hook() ((void)0)
+#endif
+
 #include <rv/da_monitor.h>
 #include <linux/seq_buf.h>
 
@@ -123,12 +144,15 @@ static int ha_monitor_init(void)
 
 	ha_mon_initializing = true;
 	ret = da_monitor_init();
+	if (ret == 0)
+		ha_monitor_enable_hook();
 	ha_mon_initializing = false;
 	return ret;
 }
 
 static void ha_monitor_destroy(void)
 {
+	ha_monitor_disable_hook();
 	da_monitor_destroy();
 }
 
@@ -229,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
 {
 	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
 }
+
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead)
+{
+	struct da_monitor *da_mon = da_get_monitor(p);
+
+	if (likely(!ha_monitor_uninitialized(da_mon)))
+		ha_cancel_timer_sync(to_ha_monitor(da_mon));
+}
+#endif
+
 #endif /* RV_MON_TYPE */
 
 /*
@@ -441,6 +477,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return timer_delete(&ha_mon->timer);
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	timer_delete_sync(&ha_mon->timer);
+}
 #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
 /*
  * Helper functions to handle the monitor timer.
@@ -492,6 +532,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	hrtimer_cancel(&ha_mon->hrtimer);
+}
 #else /* HA_TIMER_NONE */
 /*
  * Start function is intentionally not defined, monitors using timers must
@@ -502,6 +546,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return false;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
 #endif
 
 #endif
-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox