Linux Trace Kernel


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01  3:28 UTC
  To: david
  Cc: lance.yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <2024af56-5e99-4799-a586-e9ba756cecb9@kernel.org>


On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>On 5/31/26 11:39, Lance Yang wrote:
>> 
>> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>>> Pass an order and offset to collapse_huge_page to support collapsing anon
>>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>>> are attempting to collapse to, and offset indicates where in the PMD to
>>> start the collapse attempt.
>>>
>>> For non-PMD collapse we must leave the anon VMA write locked until after
>>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>>> the mTHP case this is not true, and we must keep the lock to prevent
>>> access/changes to the page tables. This can happen if the rmap walkers hit
>>> a pmd_none while the PMD entry is currently unavailable due to being
>>> temporarily removed during the collapse phase.
>>>
>>> Acked-by: Usama Arif <usama.arif@linux.dev>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fab35d318641..d64f42f66236 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>>  */
>>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>> -		int referenced, int unmapped, struct collapse_control *cc)
>>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>>> +		int referenced, int unmapped, struct collapse_control *cc,
>>> +		unsigned int order)
>>> {
>>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>>> 	LIST_HEAD(compound_pagelist);
>>> 	pmd_t *pmd, _pmd;
>>> -	pte_t *pte;
>>> +	pte_t *pte = NULL;
>>> 	pgtable_t pgtable;
>>> 	struct folio *folio;
>>> 	spinlock_t *pmd_ptl, *pte_ptl;
>>> 	enum scan_result result = SCAN_FAIL;
>>> 	struct vm_area_struct *vma;
>>> 	struct mmu_notifier_range range;
>>> +	bool anon_vma_locked = false;
>>>
>>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>> -
>>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_nolock;
>>>
>>> 	mmap_read_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> 	}
>>>
>>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 * released when it fails. So we jump out_nolock directly in
>>> 		 * that case.  Continuing to collapse causes inconsistency.
>>> 		 */
>>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -						     referenced, HPAGE_PMD_ORDER);
>>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>>> +						     referenced, order);
>>> 		if (result != SCAN_SUCCEED)
>>> 			goto out_nolock;
>>> 	}
>>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * mmap_lock.
>>> 	 */
>>> 	mmap_write_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>> 	/* check if the pmd is still valid */
>>> 	vma_start_write(vma);
>>> -	result = check_pmd_still_valid(mm, address, pmd);
>>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>>
>>> 	anon_vma_lock_write(vma->anon_vma);
>>> +	anon_vma_locked = true;
>>>
>>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -				address + HPAGE_PMD_SIZE);
>>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>>> +				end_addr);
>>> 	mmu_notifier_invalidate_range_start(&range);
>>>
>>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>>> 	 * it detects PMD is changed.
>>> 	 */
>>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>>> 	spin_unlock(pmd_ptl);
>>> 	mmu_notifier_invalidate_range_end(&range);
>>> 	tlb_remove_table_sync_one();
>>>
>>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>>> 	if (pte) {
>>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -						      HPAGE_PMD_ORDER,
>>> -						      &compound_pagelist);
>>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>>> +						      order, &compound_pagelist);
>>> 		spin_unlock(pte_ptl);
>>> 	} else {
>>> 		result = SCAN_NO_PTE_TABLE;
>>> 	}
>>>
>>> 	if (unlikely(result != SCAN_SUCCEED)) {
>>> -		if (pte)
>>> -			pte_unmap(pte);
>>> 		spin_lock(pmd_ptl);
>>> -		BUG_ON(!pmd_none(*pmd));
>>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>>> 		/*
>>> 		 * We can only use set_pmd_at when establishing
>>> 		 * hugepmds and never for establishing regular pmds that
>>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 */
>>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> 		spin_unlock(pmd_ptl);
>>> -		anon_vma_unlock_write(vma->anon_vma);
>>> 		goto out_up_write;
>>> 	}
>>>
>>> 	/*
>>> -	 * All pages are isolated and locked so anon_vma rmap
>>> -	 * can't run anymore.
>>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>>> +	 * removed and not all pages are isolated and locked, so we must hold
>>> +	 * the lock to prevent neighboring folios from attempting to access
>>> +	 * this PMD until it's reinstalled.
>>> 	 */
>>> -	anon_vma_unlock_write(vma->anon_vma);
>>> +	if (is_pmd_order(order)) {
>>> +		anon_vma_unlock_write(vma->anon_vma);
>>> +		anon_vma_locked = false;
>>> +	}
>>>
>>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -					   vma, address, pte_ptl,
>>> -					   HPAGE_PMD_ORDER,
>>> -					   &compound_pagelist);
>>> -	pte_unmap(pte);
>>> +					   vma, start_addr, pte_ptl,
>>> +					   order, &compound_pagelist);
>>> 	if (unlikely(result != SCAN_SUCCEED))
>>> 		goto out_up_write;
>>>
>>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * write.
>>> 	 */
>>> 	__folio_mark_uptodate(folio);
>>> -	pgtable = pmd_pgtable(_pmd);
>>> -
>>> 	spin_lock(pmd_ptl);
>>> -	BUG_ON(!pmd_none(*pmd));
>>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>>> +	if (is_pmd_order(order)) {
>>> +		pgtable = pmd_pgtable(_pmd);
>>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>>> +	} else {
>>> +		/*
>>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>>> +		 * to be none. The pmd entry is then repopulated below.
>>> +		 */
>>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> 
>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>> 
>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>> tables. The PTE table we are updating is only reachable through the saved
>> old PMD value, _pmd, until pmd_populate() below.
>> 
>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>> that hook as:
>> 
>> "
>> 	At the end of every page fault, this routine is invoked to tell
>> 	the architecture specific code that translations now exists
>> 	in the software page tables for address space "vma->vm_mm"
>> 	at virtual address "address" for "nr" consecutive pages.
>> "
>> 
>> But that does not seem true here yet, since the PTE table is not
>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>> 
>> Should we avoid calling update_mmu_cache_range() until after the PTE
>> table is reinstalled with pmd_populate()?
>
>I recall that update_mmu_cache* users mostly care about updating folio flags,
>for the folio derived from the PTE ... or flushing caches for the user address.
>
>So intuitively I would say "the architecture code doesn't care that the PMD
>table will only be visible to HW shortly after". The important thing should be
>that it will definitely happen, and that nothing else is currently there or can be
>there?

Ah, fair point.

I was mostly worried about arch hooks that walk vma->vm_mm again, rather
than only using the pte pointer passed in. For example, mips does:

  update_mmu_cache_range()
    -> __update_tlb()
      -> pgd_offset(vma->vm_mm, address)
      -> pte_offset_map(...)

and __update_tlb() has this assumption:

		/*
		 * update_mmu_cache() is called between pte_offset_map_lock()
		 * and pte_unmap_unlock(), so we can assume that ptep is not
		 * NULL here: and what should be done below if it were NULL?
		 */

So if khugepaged happens to run with current->active_mm == vma->vm_mm
here, could __update_tlb() hit the none PMD, get NULL from
pte_offset_map(), and then dereference it?
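
To make the concern concrete, here is a small userspace toy model, not
kernel code. All names (toy_pmd_t, toy_walk(), toy_update_tlb()) are
invented for illustration, and the "page table" is just one PMD slot
pointing at an array of PTEs. It only shows that a hook which re-walks
from the top sees nothing while the PMD slot is cleared, and that a NULL
check is what keeps it from dereferencing a NULL ptep:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy two-level "page table": one PMD slot pointing at a PTE table. */
#define TOY_PTRS_PER_PTE 8

typedef struct { unsigned long val; } toy_pte_t;
typedef struct { toy_pte_t *table; } toy_pmd_t;	/* NULL table models pmd_none() */

/* Re-walk from the top, as a hook that ignores the passed-in ptep would. */
static toy_pte_t *toy_walk(toy_pmd_t *pmd, unsigned int idx)
{
	assert(pmd != NULL);
	if (!pmd->table)	/* PMD slot temporarily cleared by the collapse */
		return NULL;
	return &pmd->table[idx % TOY_PTRS_PER_PTE];
}

/* Defensive variant of the hook: bail out instead of dereferencing NULL. */
static bool toy_update_tlb(toy_pmd_t *pmd, unsigned int idx, unsigned long *out)
{
	toy_pte_t *ptep = toy_walk(pmd, idx);

	if (!ptep)
		return false;
	*out = ptep->val;
	return true;
}
```

In this model the hook returns false while the slot is cleared and works
again once the table is put back; whether any real arch hook can be
reached in that window is exactly the open question above.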

Just wanted to raise it since some arch code may still have assumptions
like this, and the always-enable-mTHP work is getting closer ...

Probably very very very hard to hit, though :)

Cheers, Lance


* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Masami Hiramatsu @ 2026-06-01  4:31 UTC
  To: Steven Rostedt
  Cc: CaoRuichuang, Shuah Khan, mhiramat, mathieu.desnoyers, shuah,
	linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407203727.442b583c@gandalf.local.home>

On Tue, 7 Apr 2026 20:37:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> Shuah,
> 
> Care to take this through your tree? Probably could even add:
> 
> Cc: stable@vger.kernel.org
> Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")
> 
> As well as:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 

Shuah, here is my ack too.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

BTW, to avoid similar issue, 

Thanks, 

> -- Steve
> 
> 
> On Tue,  7 Apr 2026 18:26:13 +0800
> CaoRuichuang <create0818@163.com> wrote:
> 
> > From: Cao Ruichuang <create0818@163.com>
> > 
> > test_ownership.tc is sourced by ftracetest under /bin/sh.
> > 
> > The script currently declares mount_point with local at file scope,
> > which makes /bin/sh abort with "local: not in a function" before the
> > test can reach the eventfs ownership checks.
> > 
> > Replace the top-level local declaration with a normal shell variable so
> > kernels that support the gid= tracefs mount option can run the test at
> > all.
> > 
> > Signed-off-by: Cao Ruichuang <create0818@163.com>
> > ---
> >  tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > index e71cc3ad0..6d00d3c0f 100644
> > --- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > +++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > @@ -6,7 +6,7 @@
> >  original_group=`stat -c "%g" .`
> >  original_owner=`stat -c "%u" .`
> >  
> > -local mount_point=$(get_mount_point)
> > +mount_point=$(get_mount_point)
> >  
> >  mount_options=$(get_mnt_options "$mount_point")
> >  
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Masami Hiramatsu @ 2026-06-01  4:38 UTC
  To: Steven Rostedt
  Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260528163633.5650f3d6@gandalf.local.home>

On Thu, 28 May 2026 16:36:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 24 May 2026 10:44:39 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > If the get_boot_id() is accepted by the random folks, then I'm fine with
> > > this change.  
> > 
> > Yeah, BTW, Sashiko found this can be initialized before we get enough
> > entropy for random seed. Maybe we need one more delay.
> 
> Well, maybe for adding the boot_id later, but the code that initializes the
> buffers needs to stay early. With the backup instance, the persistent ring
> buffer can restart tracing immediately.

Agreed. So the buffer will be created at an early stage without
initializing the boot_id field, and the field will be updated when the
user reads the boot_id from the kernel. Anyway, the field is meaningless
unless user space reads the boot_id.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Shuai Xue @ 2026-06-01  6:17 UTC
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-2-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> RISC-V uses -fpatchable-function-entry=8,4 when the compressed ISA is
> enabled and -fpatchable-function-entry=4,2 otherwise. In both cases, the
> patchable NOP area starts 8 bytes before the function symbol address.
> The __mcount_loc entries therefore point at the patchable NOP area
> associated with a function, while nm reports the function symbol at the
> entry address used for the function range check.
> 
> After RISC-V selected HAVE_BUILDTIME_MCOUNT_SORT, sorttable started
> applying that range check at build time. Without allowing entries just
> before the reported function address, the mcount sorter treats valid
> RISC-V ftrace callsites as invalid weak-function entries and writes
> them back as zero. The resulting kernel boots with no ftrace entries,
> breaking dynamic ftrace and users such as livepatch.
> 
> The failure is silent during the final link because zeroing weak-function
> entries is an expected sorttable operation. At boot, those zero entries
> are skipped by ftrace_process_locs(), so the only obvious symptom is that
> the vmlinux ftrace table has lost valid callsites and ftrace users cannot
> attach to them.
> 
> CONFIG_FTRACE_SORT_STARTUP_TEST also reports the table as sorted in this
> state: it only checks that the __mcount_loc entries are in ascending
> order, which a fully zeroed table trivially satisfies. The original
> commit relied on this check and did not see the regression.
> 
> On an affected RISC-V QEMU boot with both CONFIG_FTRACE_SORT_STARTUP_TEST
> and CONFIG_FTRACE_STARTUP_TEST enabled, the sort check still passes
> while ftrace reports zero usable entries and the early selftests fail:
> 
>    [    0.000000] ftrace section at ffffffff8101da98 sorted properly
>    [    0.000000] ftrace: allocating 0 entries in 128 pages
>    [    0.054999] Testing tracer function: .. no entries found ..FAILED!
>    [    0.172407] tracer: function failed selftest, disabling
>    [    0.178186] Failed to init function_graph tracer, init returned -19
> 
> Handle RISC-V like arm64 for the function-range check and allow
> patchable entries up to 8 bytes before the function address.
> 
> With this fix, a RISC-V QEMU smoke boot with ftrace startup tests shows
> the vmlinux ftrace table is populated and dynamic ftrace still works:
> 
>    [    0.000000] ftrace: allocating 46749 entries in 184 pages
>    [    0.051115] Testing tracer function: PASSED
>    [    1.283782] Testing dynamic ftrace: PASSED
>    [    6.275456] Testing tracer function_graph: PASSED
> 
> Fixes: 0ca1724b56af ("riscv: ftrace: select HAVE_BUILDTIME_MCOUNT_SORT")
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Link: https://lore.kernel.org/all/20260527113028.4b21a5de@fedora/
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   scripts/sorttable.c | 10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/sorttable.c b/scripts/sorttable.c
> index e8ed11c680c6..4c10e85bb5af 100644
> --- a/scripts/sorttable.c
> +++ b/scripts/sorttable.c
> @@ -891,17 +891,21 @@ static int do_file(char const *const fname, void *addr)
>   	table_sort_t custom_sort = NULL;
>   
>   	switch (elf_map_machine(ehdr)) {
> -	case EM_AARCH64:
>   #ifdef MCOUNT_SORT_ENABLED
> +	case EM_AARCH64:
>   		sort_reloc = true;
>   		rela_type = 0x403;
> -		/* arm64 uses patchable function entry placing before function */
> +		/* fallthrough */
> +	case EM_RISCV:
> +		/* arm64 and RISC-V place patchable entries before the function */
>   		before_func = 8;

Nit: The shared comment now sits under `case EM_RISCV:` but the two
lines above it (sort_reloc / rela_type = 0x403) are strictly
arm64-only — they configure the RELA-based weak-function fixup that
RISC-V does not need. On a quick read it is easy to wonder if RISC-V
is implicitly inheriting that path. Splitting the comments would
help, e.g.:

        case EM_AARCH64:
            /* arm64 needs RELA-based weak-function fixup */
            sort_reloc = true;
            rela_type = 0x403;
            /* fallthrough */
        case EM_RISCV:
            /* arm64 and RISC-V place patchable entries before the function */
            before_func = 8;
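
For readers following along, the effect of before_func on the range
check can be sketched in isolation. This is a standalone illustration,
not the code in scripts/sorttable.c; struct func_range and
callsite_in_func() are hypothetical names, and the only thing taken from
the patch is the idea of accepting entries up to before_func bytes ahead
of the function symbol:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical function range, as a symbol table might report it. */
struct func_range {
	uint64_t start;	/* function symbol address */
	uint64_t size;	/* function size in bytes */
};

/*
 * Return true if a callsite address belongs to the function, allowing
 * it to sit up to before_func bytes ahead of the symbol address.
 */
static bool callsite_in_func(const struct func_range *f, uint64_t addr,
			     uint64_t before_func)
{
	assert(f->start >= before_func);	/* keep the subtraction from wrapping */
	return addr >= f->start - before_func && addr < f->start + f->size;
}
```

With before_func == 0, an entry 8 bytes ahead of the symbol is rejected,
which models the failure described in the commit message; with
before_func == 8 the same entry is accepted.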


> +#else
> +	case EM_AARCH64:
> +	case EM_RISCV:
>   #endif
>   		/* fallthrough */
>   	case EM_386:
>   	case EM_LOONGARCH:
> -	case EM_RISCV:
>   	case EM_S390:
>   	case EM_X86_64:
>   		custom_sort = sort_relative_table_with_data;

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-01  6:54 UTC
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260601032804.96122-1-lance.yang@linux.dev>

On 6/1/26 05:28, Lance Yang wrote:
> 
> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/31/26 11:39, Lance Yang wrote:
>>>
>>>
>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>
>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>> tables. The PTE table we are updating is only reachable through the saved
>>> old PMD value, _pmd, until pmd_populate() below.
>>>
>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>> that hook as:
>>>
>>> "
>>> 	At the end of every page fault, this routine is invoked to tell
>>> 	the architecture specific code that translations now exists
>>> 	in the software page tables for address space "vma->vm_mm"
>>> 	at virtual address "address" for "nr" consecutive pages.
>>> "
>>>
>>> But that does not seem true here yet, since the PTE table is not
>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>
>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>> table is reinstalled with pmd_populate()?
>>
>> I recall that update_mmu_cache* users mostly care about updating folio flags,
>> for the folio derived from the PTE ... or flushing caches for the user address.
>>
>> So intuitively I would say "the architecture code doesn't care that the PMD
>> table will only be visible to HW shortly after". The important thing should be
>> that it will definitely happen, and that nothing else is currently there or can be
>> there?
> 
> Ah, fair point.
> 
> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> than only using the pte pointer passed in. For example, mips does:

Right, a re-walk would be the real problem.

> 
>   update_mmu_cache_range()
>     -> __update_tlb()
>       -> pgd_offset(vma->vm_mm, address)
>       -> pte_offset_map(...)
> 
> and __update_tlb() has this assumption:
> 
> 		/*
> 		 * update_mmu_cache() is called between pte_offset_map_lock()
> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
> 		 * NULL here: and what should be done below if it were NULL?
> 		 */
> 
> So if khugepaged happens to run with current->active_mm == vma->vm_mm
> here, could __update_tlb() hit the none PMD, get NULL from
> pte_offset_map(), and then dereference it?

Likely yes -- that MIPS code is horrible. And the comment in MIPS code
even spells that out. :(

Do you know about other code like that, or is MIPS the only one doing a
re-walk and crossing fingers?

> 
> Just wanted to raise it since some arch code may still have assumptions
> like this, and the always-enable-mTHP work is getting closer ...

Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.

> 
> Probably very very very hard to hit, though :)

Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
nobody can interfere in the meantime ... and the PMD lock will not be sufficient.

Maybe we could reinstall the page table with the cleared (none) entries while
still holding the PTL?

Thinking out loud:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5ba298d420b7..e39b750b1e6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
                map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
        } else {
                /*
-                * set_ptes is called in map_anon_folio_pte_nopf with the
-                * pmd_ptl lock still held; this is safe as the PMD is expected
-                * to be none. The pmd entry is then repopulated below.
+                * Re-insert the page table with the cleared entries, but
+                * hold the PTL, such that no one can mess with the re-installed
+                * page table until we updated the temporarily-cleared entries
+                * through map_anon_folio_pte_nopf().
                 */
-               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
-               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+               if (pte_ptl != pmd_ptl)
+                       spin_lock(pte_ptl);
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+               if (pte_ptl != pmd_ptl)
+                       spin_unlock(pte_ptl);
        }
        spin_unlock(pmd_ptl);
 


-- 
Cheers,

David


* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Nam Cao @ 2026-06-01  6:55 UTC
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-6-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> From: Wen Yang <wen.yang@linux.dev>
>
> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
> which may race with the sched_switch handler:
>
>   da_monitor_start()               sched_switch handler
>   -------------------------        ---------------------------------
>   da_mon->monitoring = 1;
>                                    if (da_monitoring(da_mon))  /* true  */
>                                        ha_start_timer_ns(...);
>                                        /* hrtimer->base == NULL, crash */
>   da_monitor_init_hook(da_mon);
>   /* hrtimer_setup() sets base */
>
> Fix the ordering and pair with release/acquire semantics:
>
>   da_monitor_init_hook(da_mon);
>   smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>   return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
>
> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
> the load can observe monitoring=1 while hrtimer->base is still NULL.
> The plain accesses are also data races under KCSAN.
>
> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
> cover the reset path.
>
> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  include/rv/da_monitor.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a7e103654..60dc39f26 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
>  static inline void da_monitor_reset(struct da_monitor *da_mon)
>  {
>  	da_monitor_reset_hook(da_mon);
> -	da_mon->monitoring = 0;
> +	WRITE_ONCE(da_mon->monitoring, 0);
>  	da_mon->curr_state = model_get_initial_state();
>  }

Looking at this again, do you need to change it to

static inline void da_monitor_reset(struct da_monitor *da_mon)
{
	WRITE_ONCE(da_mon->monitoring, 0);
	smp_mb();
	da_monitor_reset_hook(da_mon);
	da_mon->curr_state = model_get_initial_state();
}

To prevent another task from seeing monitoring=1 while the timer is
already cancelled?
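
For reference, the start-side ordering from the quoted commit message
(run the init hook, then publish the flag with release semantics, and
read it with acquire semantics) can be modelled in userspace with C11
atomics. This is only a sketch under that assumption: struct
toy_monitor and the toy_* helpers are made-up names, timer_base stands
in for hrtimer->base, and the kernel uses smp_store_release() and
smp_load_acquire() rather than <stdatomic.h>:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_monitor {
	void *timer_base;	/* stands in for hrtimer->base */
	atomic_int monitoring;	/* publication flag */
};

static int toy_base;	/* dummy object the "timer" points at once set up */

/* Initialise first, then publish: the release store orders the two. */
static void toy_monitor_start(struct toy_monitor *m)
{
	m->timer_base = &toy_base;	/* plays the role of the init hook */
	atomic_store_explicit(&m->monitoring, 1, memory_order_release);
}

/* Acquire load: seeing 1 here implies timer_base is already set. */
static bool toy_monitoring(struct toy_monitor *m)
{
	return atomic_load_explicit(&m->monitoring, memory_order_acquire);
}

/* Event handler: only touch the timer once the flag has been observed. */
static bool toy_handle_event(struct toy_monitor *m)
{
	if (!toy_monitoring(m))
		return false;
	assert(m->timer_base != NULL);
	return true;
}
```

The reset path asked about above is the mirror image, and this sketch
deliberately says nothing about which order is right there.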

Nam


* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-06-01  7:15 UTC
  To: Nam Cao, Wen Yang; +Cc: linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <87ik82ycl3.fsf@yellow.woof>

On Mon, 2026-06-01 at 08:55 +0200, Nam Cao wrote:
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index a7e103654..60dc39f26 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
> >  static inline void da_monitor_reset(struct da_monitor *da_mon)
> >  {
> >  	da_monitor_reset_hook(da_mon);
> > -	da_mon->monitoring = 0;
> > +	WRITE_ONCE(da_mon->monitoring, 0);
> >  	da_mon->curr_state = model_get_initial_state();
> >  }
> 
> Looking at this again, do you need to change it to
> 
> static inline void da_monitor_reset(struct da_monitor *da_mon)
> {
> 	WRITE_ONCE(da_mon->monitoring, 0);
> 	smp_mb();
> 	da_monitor_reset_hook(da_mon);
> 	da_mon->curr_state = model_get_initial_state();
> }
> 

This order won't work.
Monitor reset relies on monitoring being true to stop the timer. The
same function is also called on monitor start, where the timer may not
have been initialised yet; we use monitoring to discriminate between
the two.

> To prevent another task from seeing monitoring=1 while the timer is
> already cancelled?

I'm not sure what could go wrong in this scenario. Perhaps another
event that could start the monitor would miss the chance because it
doesn't see that a reaction occurred (the monitor still looks like it
is running)?

Or a concurrent event would run on top of a reaction, potentially
moving the state machine further or causing another reaction.

Those aren't really disasters, and I think they could also happen
without enforcing an order with the timer: nothing checks the timer's
status. Following events don't even need to know whether the timer was
armed at all; in the worst case we may be stopping it twice.

Am I missing something here?

Thanks,
Gabriele


* Re: [PATCH v3 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-01  7:24 UTC
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	Masami Hiramatsu, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-7-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> +static bool ha_mon_initializing;

The global variable makes me a bit uncomfortable (a quick google will
tell why this is not the best pattern).

I am sure there are better ways to differentiate when we are
initializing vs destroying. How about the incomplete sketch below? I
doubt it even builds; it is just meant to give an idea.

Nam

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 39765ff6f098..0549d3a35ee0 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -159,9 +159,14 @@ static struct da_monitor *da_get_monitor(void)
 /*
  * da_monitor_reset_all - reset the single monitor
  */
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
-	da_monitor_reset(da_get_monitor());
+	reset(da_get_monitor());
+}
+
+static inline void _da_monitor_init(struct da_monitor *da_mon)
+{
+	memset(da_mon, 0, sizeof(*da_mon));
 }
 
 /*
@@ -169,7 +174,7 @@ static void da_monitor_reset_all(void)
  */
 static inline int da_monitor_init(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_all(_da_monitor_init);
 	return 0;
 }
 
@@ -178,7 +183,7 @@ static inline int da_monitor_init(void)
  */
 static inline void da_monitor_destroy(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_all(da_monitor_reset);
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_CPU
@@ -202,14 +207,14 @@ static struct da_monitor *da_get_monitor(void)
 /*
  * da_monitor_reset_all - reset all CPUs' monitor
  */
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct da_monitor *da_mon;
 	int cpu;
 
 	for_each_cpu(cpu, cpu_online_mask) {
 		da_mon = per_cpu_ptr(&DA_MON_NAME, cpu);
-		da_monitor_reset(da_mon);
+		reset(da_mon);
 	}
 }
 
@@ -267,16 +272,16 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
 	return da_get_target(da_mon)->pid;
 }
 
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct task_struct *g, *p;
 	int cpu;
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, p)
-		da_monitor_reset(da_get_monitor(p));
+		reset(da_get_monitor(p));
 	for_each_present_cpu(cpu)
-		da_monitor_reset(da_get_monitor(idle_task(cpu)));
+		reset(da_get_monitor(idle_task(cpu)));
 	read_unlock(&tasklist_lock);
 }
 
@@ -483,14 +488,14 @@ static inline void da_destroy_storage(da_id_type id)
 	kfree_rcu(mon_storage, rcu);
 }
 
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct da_monitor_storage *mon_storage;
 	int bkt;
 
 	rcu_read_lock();
 	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
-		da_monitor_reset(&mon_storage->rv.da_mon);
+		reset(&mon_storage->rv.da_mon);
 	rcu_read_unlock();
 }
 

* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Nam Cao @ 2026-06-01  7:29 UTC (permalink / raw)
  To: Gabriele Monaco, Wen Yang
  Cc: linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <76628c68336037b0d652456cf2033d3b51399a09.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> This order won't work.
> Monitor reset relies on monitoring to be true to stop the timer. The
> same function is called also on monitor start and there the timer may
> not have been initialised yet, we use monitoring to discriminate the
> two.

Right.

> Am I missing something here?

No, just me being confused.

Nam

* Re: [PATCH v3 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-01  7:31 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	Masami Hiramatsu, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <87fr36yb7g.fsf@yellow.woof>

Nam Cao <namcao@linutronix.de> writes:
> Gabriele Monaco <gmonaco@redhat.com> writes:
>> +static bool ha_mon_initializing;
>
> The global variable makes me a bit uncomfortable (a quick google will
> tell why this is not the best pattern).
>
> I am sure there are better ways to differentiate when we are
> initializing vs destroying. How about the incomplete sketch below? I
> doubt it even builds, just give an idea.

Or, instead of a function pointer, we can also pass a bool flag indicating
whether the timer should be reset.
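
A minimal userspace sketch of the bool-flag variant, not kernel code. All
names here (struct sketch_monitor, sketch_monitor_reset_all(), the
stop_timer flag) are hypothetical stand-ins, and the real da_monitor layout
and timer API are not modelled. It only illustrates how the init path could
skip touching a possibly uninitialised timer while the destroy path stops it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a monitor with an embedded timer. */
struct sketch_monitor {
	bool monitoring;
	bool timer_armed;	/* not trustworthy before the first init */
	int curr_state;
};

/* Counts "cancelled" timers; stands in for a real timer cancel call. */
static int sketch_timer_cancels;

static void sketch_monitor_reset(struct sketch_monitor *mon, bool stop_timer)
{
	assert(mon != NULL);
	/* Only look at the timer when the caller says it was initialised. */
	if (stop_timer && mon->timer_armed)
		sketch_timer_cancels++;
	/* Either way the monitor ends up zeroed. */
	memset(mon, 0, sizeof(*mon));
}

static void sketch_monitor_reset_all(struct sketch_monitor *mons, size_t nr,
				     bool stop_timer)
{
	for (size_t i = 0; i < nr; i++)
		sketch_monitor_reset(&mons[i], stop_timer);
}

/* Init path: timers may never have been set up, so do not stop them. */
static void sketch_monitor_init(struct sketch_monitor *mons, size_t nr)
{
	sketch_monitor_reset_all(mons, nr, false);
}

/* Destroy path: timers were initialised, so stop any that are armed. */
static void sketch_monitor_destroy(struct sketch_monitor *mons, size_t nr)
{
	sketch_monitor_reset_all(mons, nr, true);
}
```

The flag keeps a single reset helper, at the cost of every caller having to
state which phase it is in; the function-pointer variant moves that choice
into the callback instead.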

* Re: [PATCH v3 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Nam Cao @ 2026-06-01  7:39 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-8-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> @@ -123,12 +144,15 @@ static int ha_monitor_init(void)
>  
>  	ha_mon_initializing = true;
>  	ret = da_monitor_init();
> +	if (ret == 0)
> +		ha_monitor_enable_hook();
>  	ha_mon_initializing = false;
>  	return ret;
>  }

What if between da_monitor_init() and ha_monitor_enable_hook(), a task
exits while a timer is still active, and then the timer callback is invoked?

Extremely rare, but I think it can be fixed easily by reordering the two functions.

>  static void ha_monitor_destroy(void)
>  {
> +	ha_monitor_disable_hook();
>  	da_monitor_destroy();
>  }

Same here, there is a small window between the two function calls.

Nam

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01  7:49 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <f5d38f64-ab92-496d-afd3-29ccc17fec2b@kernel.org>



On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
>>
>> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>>> On 5/31/26 11:39, Lance Yang wrote:
>>>>
>>>>
>>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>>
>>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>>> tables. The PTE table we are updating is only reachable through the saved
>>>> old PMD value, _pmd, until pmd_populate() below.
>>>>
>>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>>> that hook as:
>>>>
>>>> "
>>>> 	At the end of every page fault, this routine is invoked to tell
>>>> 	the architecture specific code that translations now exists
>>>> 	in the software page tables for address space "vma->vm_mm"
>>>> 	at virtual address "address" for "nr" consecutive pages.
>>>> "
>>>>
>>>> But that does not seem true here yet, since the PTE table is not
>>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>>
>>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>>> table is reinstalled with pmd_populate()?
>>>
>>> I recall that update_mmu_cache* users mostly care about updating folios flags,
>>> for the folio derived from the PTE ... or flushing caches for the user address.
>>>
>>> So intuitively I would say "the architecture code doesn't care that the PMD
>>> table will only be visible to HW shortly after". The important thing should be
>>> that it will definitely happen, and that nothing else is currently there or can be
>>> there?
>>
>> Ah, fair point.
>>
>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>> than only using the pte pointer passed in. For example, mips does:
> 
> Right, a re-walk would be the real problem.
> 
>>
>>    update_mmu_cache_range()
>>      -> __update_tlb()
>>        -> pgd_offset(vma->vm_mm, address)
>>        -> pte_offset_map(...)
>>
>> and __update_tlb() has this assumption:
>>
>> 		/*
>> 		 * update_mmu_cache() is called between pte_offset_map_lock()
>> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
>> 		 * NULL here: and what should be done below if it were NULL?
>> 		 */
>>
>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>> here, could __update_tlb() hit the none PMD, get NULL from
>> pte_offset_map(), and then dereference it?
> 
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
> 
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?

I had Codex do the boring grep-work through the arch update_mmu_cache*
code :D

MIPS doesn't seem to be the only code doing a re-walk, but it is the
only one I found that appears to assume the PMD/PTE walk cannot fail,
without checking whether the PMD is none ...

Cheers, Lance

>>
>> Just wanted to raise it since some arch code may still have assumptions
>> like this, and the always-enable-mTHP work is getting closer ...
> 
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
> 
>>
>> Probably very very very hard to hit, though :)
> 
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
> 
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?
> 
> Thinking out loud:
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>                  map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>          } else {
>                  /*
> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> -                * to be none. The pmd entry is then repopulated below.
> +                * Re-insert the page table with the cleared entries, but
> +                * hold the PTL, such that no one can mess with the re-installed
> +                * page table until we updated the temporarily-cleared entries
> +                * through map_anon_folio_pte_nopf().
>                   */
> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +               if (pte_ptl != pmd_ptl)
> +                       spin_lock(pte_ptl);
>                  pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +               if (pte_ptl != pmd_ptl)
> +                       spin_unlock(pte_ptl);
>          }
>          spin_unlock(pmd_ptl);
>   
> 
> 


* Re: [PATCH v3 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-06-01  7:51 UTC (permalink / raw)
  To: Nam Cao; +Cc: Wen Yang, linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <877boiyaig.fsf@yellow.woof>

On Mon, 2026-06-01 at 09:39 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > @@ -123,12 +144,15 @@ static int ha_monitor_init(void)
> >  
> >  	ha_mon_initializing = true;
> >  	ret = da_monitor_init();
> > +	if (ret == 0)
> > +		ha_monitor_enable_hook();
> >  	ha_mon_initializing = false;
> >  	return ret;
> >  }
> 
> What if between da_monitor_init() and ha_monitor_enable_hook(), a
> task exits while a timer is still active, and then the timer callback
> is invoked?

We are initialising, so timers shouldn't be active: this sits right before
setting up the other hooks, and the exit hook is just the first of them.

By the way, in this case, we likely have a valid reset scenario on an
invalid (uninitialised) timer. This is also what checking the
monitoring flag guards against.
In short, in that handler we really should reset, but need to know
whether we ever initialised in the first place.

> Extremely rare, but I think it can be fixed easily by reordering the
> two functions.
> 
> >  static void ha_monitor_destroy(void)
> >  {
> > +	ha_monitor_disable_hook();
> >  	da_monitor_destroy();
> >  }
> 
> Same here, there is small window between the two function calls.

Likewise here, we removed all hooks, then da_monitor_destroy() is going
to sync with them and clean everything up. Swapping them will expose
more races because we dropped the slot by then.

Am I missing something?

Thanks,
Gabriele

* Re: [PATCH v3 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Nam Cao @ 2026-06-01  7:52 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-9-gmonaco@redhat.com>

>  static bool ha_mon_initializing;
> +static bool ha_mon_destroying;
>  
>  static int ha_monitor_init(void)
>  {
>  	int ret;
>  
> +	WRITE_ONCE(ha_mon_destroying, false);
>  	ha_mon_initializing = true;
>  	ret = da_monitor_init();
>  	if (ret == 0)
> @@ -152,6 +155,7 @@ static int ha_monitor_init(void)
>  
>  static void ha_monitor_destroy(void)
>  {
> +	WRITE_ONCE(ha_mon_destroying, true);
>  	ha_monitor_disable_hook();
>  	da_monitor_destroy();
>  }
> @@ -302,12 +306,30 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>  	return false;
>  }
>  
> +/*
> + * __ha_monitor_timer_callback - generic callback representation
> + *
> + * This callback runs in an RCU read-side critical section to allow the
> + * destruction sequence to easily synchronize_rcu() with all pending timers
> + * after asynchronously disabling them. The ha_mon_destroying check ensures
> + * any callback entering the RCU section after synchronize_rcu() completes
> + * will see the flag and bail out immediately.
> + */
>  static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
>  {
> -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
>  	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> -	u64 time_ns = ha_get_ns();
> -
> +	enum states curr_state;
> +	u64 time_ns;
> +
> +	guard(rcu)();
> +	if (unlikely(READ_ONCE(ha_mon_destroying)))

Instead of this global variable, can we instead use da_mon->monitoring?

> +		return;
> +	/* Ensure consistent curr_state if we race with da_monitor_reset */
> +	curr_state = smp_load_acquire(&ha_mon->da_mon.curr_state);
> +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> +		return;
> +
> +	time_ns = ha_get_ns();
>  	ha_get_env_string(&env_string, ha_mon, time_ns);
>  	ha_react(curr_state, EVENT_NONE, env_string.buffer);
>  	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
> -- 
> 2.54.0

* Re: [PATCH v3 10/13] rv: Use 0 to check preemption enabled in opid
From: Nam Cao @ 2026-06-01  7:56 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	Masami Hiramatsu, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-11-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Tracepoint handlers no longer run with preemption disabled by default
> since a46023d5616 ("tracing: Guard __DECLARE_TRACE() use of
> __DO_TRACE_CALL() with SRCU-fast"), the opid monitor should now count 1
> in the preemption count as preemption disabled.
>
> Change the rule for preempt_off to preempt > 0.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

* Re: [PATCH v3 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Nam Cao @ 2026-06-01  7:58 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-8-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Hybrid automata monitors may start timers, depending on the model, these
> may remain active on an exiting task and cause false positives or even
> access freed memory.
>
> Add an enable/disable hook in the HA code, currently only populated by
> the per-task handler for registration and deregistration.
> This hooks to the sched_process_exit event and ensures the timer is
> stopped for every exiting task. The handler is enabled automatically but
> may be disabled, for instance if the monitor uses the event for another
> purpose (but should still manually ensure timers are stopped).
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

* Re: [PATCH v3 11/13] verification/rvgen: Fix suffix strip in dot2k
From: Nam Cao @ 2026-06-01  8:00 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-12-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> __start_to_invariant_check() and __get_constraint_env() parse the
> environment variable's name from sources that have it padded with the
> monitor name. This is removed using rstrip(), which is not meant to
> strip a substring but rather a set of characters.
>
> Use removesuffix() to actually get rid of the trailing _<monitor name>.
>
> Fixes: a82adadb16894 ("verification/rvgen: Add support for Hybrid Automata")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>
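
The difference between stripping a character set and removing a suffix can
be sketched in C. This is only an analogue of the Python behaviour the
commit message describes, not the rvgen code: the helper names and the
"room_mon" example string are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Analogue of Python's str.rstrip(chars): drop every trailing character
 * that appears anywhere in 'set', regardless of order.
 */
static void sketch_rstrip_set(char *s, const char *set)
{
	size_t len;

	assert(s != NULL && set != NULL);
	len = strlen(s);
	while (len > 0 && strchr(set, s[len - 1]) != NULL)
		len--;
	s[len] = '\0';
}

/*
 * Analogue of Python's str.removesuffix(suffix): drop 'suffix' only if the
 * string ends with exactly that substring.
 */
static void sketch_remove_suffix(char *s, const char *suffix)
{
	size_t len, slen;

	assert(s != NULL && suffix != NULL);
	len = strlen(s);
	slen = strlen(suffix);
	if (slen > 0 && len >= slen && strcmp(s + len - slen, suffix) == 0)
		s[len - slen] = '\0';
}
```

With the hypothetical name "room_mon" and the monitor suffix "_mon", the
set-based strip keeps eating the 'm' and 'o' characters of "room" and leaves
"r", while the suffix removal leaves "room".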

* Re: [PATCH v3 13/13] verification/rvgen: Generate cleanup hook for per-obj monitor
From: Nam Cao @ 2026-06-01  8:01 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-14-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> Per-object monitors can allocate memory dynamically and such memory is
> required for the lifetime of the object, then it should be freed with
> the appropriate call.
>
> Force the generation scripts to add a cleanup function the user will
> need to wire to the appropriate event (e.g. sched_process_exit for
> tasks). This can be safely removed if the object will never cease to
> exist before disabling the monitor (e.g. if following only static
> variables).
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

* Re: [PATCH v3 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-06-01  8:05 UTC (permalink / raw)
  To: Nam Cao; +Cc: Wen Yang, linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <874ijmy9xj.fsf@yellow.woof>

On Mon, 2026-06-01 at 09:52 +0200, Nam Cao wrote:
> > +/*
> > + * __ha_monitor_timer_callback - generic callback representation
> > + *
> > + * This callback runs in an RCU read-side critical section to allow the
> > + * destruction sequence to easily synchronize_rcu() with all pending timers
> > + * after asynchronously disabling them. The ha_mon_destroying check ensures
> > + * any callback entering the RCU section after synchronize_rcu() completes
> > + * will see the flag and bail out immediately.
> > + */
> >  static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
> >  {
> > -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> >  	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> > -	u64 time_ns = ha_get_ns();
> > -
> > +	enum states curr_state;
> > +	u64 time_ns;
> > +
> > +	guard(rcu)();
> > +	if (unlikely(READ_ONCE(ha_mon_destroying)))
> > 
> Instead of this global variable, can we instead use da_mon->monitoring?
> 

We already do in da_monitor_handling_event(); this is guarding against the
very unlikely scenario of:

1. timer callback firing (just before guard(rcu))
2. sync rcu, return before waiting for that one
3. reset and free memory
4. guard(rcu) now and check the state in da_mon (freed)

I'm not too confident saying this cannot happen under PREEMPT_RT.

I know the easy alternative would be to synchronously stop timers on
destruction, but that gets complicated with per-task monitors, where we
iterate over tasks while holding a read lock.

Am I just too paranoid?

Thanks,
Gabriele

> > +		return;
> > +	/* Ensure consistent curr_state if we race with da_monitor_reset */
> > +	curr_state = smp_load_acquire(&ha_mon->da_mon.curr_state);
> > +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> > +		return;
> > +
> > +	time_ns = ha_get_ns();
> >  	ha_get_env_string(&env_string, ha_mon, time_ns);
> >  	ha_react(curr_state, EVENT_NONE, env_string.buffer);
> >  	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
> > -- 
> > 2.54.0


* Re: [PATCH v3 06/13] rv: Do not rely on clean monitor when initialising HA
From: Gabriele Monaco @ 2026-06-01  8:07 UTC (permalink / raw)
  To: Nam Cao
  Cc: Wen Yang, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
In-Reply-To: <87a4teyavy.fsf@yellow.woof>

On Mon, 2026-06-01 at 09:31 +0200, Nam Cao wrote:
> Nam Cao <namcao@linutronix.de> writes:
> > Gabriele Monaco <gmonaco@redhat.com> writes:
> > > +static bool ha_mon_initializing;
> > 
> > The global variable makes me a bit uncomfortable (a quick google
> > will tell why this is not the best pattern).
> > 
> > I am sure there are better ways to differentiate when we are
> > initializing vs destroying. How about the incomplete sketch below?
> > I doubt it even builds, just give an idea.
> 
> Or instead of function pointer, we can also pass a bool flag whether
> the timer should be reset.

Good point, we could probably do a bit better than the current approach in
separating initialisation and destruction/reaction. I'm going to give it
some thought.
We are also protecting against tasks where the monitor never started,
so they never got initialised before destruction. This makes it harder
to distinguish.

One thing to note is that we should probably keep the reset signature
constant as that's a member in struct rv_monitor. But that's probably
not a big issue.

Thanks,
Gabriele

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: David Hildenbrand (Arm) @ 2026-06-01  8:11 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-12-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:

Finally time for the core piece :)

> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
> 
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
> 
> To enable mTHP collapse we make the following changes:
> 
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
> 
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search, which also
> avoids recursing when kernel stack space is limited. The algorithm
> recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
> 
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        satisfies the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
> 
> Below is a diagram representing the algorithm and stack items:
> 
>                             offset   mid_offset
>                             |        |
>                             |        |
>                             v        v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
> 			    <-------><------->
>                              order-1  order-1


Reading this, it is unclear why exactly we need the stack.

Why can't you work with offset + cur_order?

Initially,

	offset = 0;
	cur_order = HPAGE_PMD_ORDER;

If collapse succeeded, advance to next range.
If collapse failed, try next smaller order, keeping offset unchanged.

	if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
		/* Try next smaller order. */
		cur_order = cur_order - 1;
	} else {
		/* Skip to next chunk. */
		offset += 1 << cur_order;
		cur_order = max_order_from_offset(offset);
	}

Of course, disabled orders need handling as well. max_order_from_offset() is
rather trivial (natural buddy order, capped at HPAGE_PMD_ORDER).

What's the benefit of the stack?
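
A minimal userspace sketch of the stack-free walk, under stated assumptions:
SKETCH_PMD_ORDER is fixed at 9 (512 PTEs per PMD, as with 4K pages on
x86-64), the collapse attempt is a hypothetical callback, disabled orders
are not handled, and __builtin_ctz() is the GCC/Clang builtin. It is not the
patch's code, only a model of the offset + cur_order iteration:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PMD_ORDER	9	/* assumption: 512 PTEs per PMD */
#define SKETCH_MIN_ORDER	2	/* mirrors KHUGEPAGED_MIN_MTHP_ORDER */

/* Hypothetical collapse attempt: returns nonzero on success. */
typedef int (*sketch_try_collapse_fn)(unsigned int offset, unsigned int order,
				      void *ctx);

/* Natural (buddy) order of an offset, capped at the PMD order. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order;

	if (offset == 0)
		return SKETCH_PMD_ORDER;
	order = (unsigned int)__builtin_ctz(offset);
	return order < SKETCH_PMD_ORDER ? order : SKETCH_PMD_ORDER;
}

/* Walk one PMD range; returns the number of PTEs covered by collapses. */
static unsigned int sketch_walk_pmd(sketch_try_collapse_fn try_collapse,
				    void *ctx)
{
	const unsigned int nr_ptes = 1u << SKETCH_PMD_ORDER;
	unsigned int offset = 0, cur_order = SKETCH_PMD_ORDER;
	unsigned int collapsed = 0;

	assert(try_collapse != NULL);
	while (offset < nr_ptes) {
		int ok = try_collapse(offset, cur_order, ctx);

		if (ok)
			collapsed += 1u << cur_order;
		if (!ok && cur_order > SKETCH_MIN_ORDER) {
			/* Try the next smaller order at the same offset. */
			cur_order--;
		} else {
			/* Skip to the next chunk at its natural order. */
			offset += 1u << cur_order;
			cur_order = max_order_from_offset(offset);
		}
	}
	return collapsed;
}

/* Example policy: succeed only at or below the order stored in ctx. */
static int sketch_accept_up_to(unsigned int offset, unsigned int order,
			       void *ctx)
{
	(void)offset;
	return order <= *(const unsigned int *)ctx;
}
```

With the example policy capped at order 4, the walk fails at orders 9 down
to 5 for offset 0, then covers the whole range with order-4 chunks, retrying
larger orders whenever the offset becomes suitably aligned.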

> 
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
> 
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> 
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
> 
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
> 
> Once we determine which mTHP size fits best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> 
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
> 
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 172 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>  
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.

I was confused there for a second why you mention ilog2, when it's really "We
cannot use HPAGE_PMD_ORDER.".

Best to simplify to:

"Note that we cannot use HPAGE_PMD_ORDER, because it is variable on some
architectures".

> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> @@ -110,6 +134,12 @@ struct collapse_control {
>  
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);

This should just be called something like "present_ptes"

> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];

This is really just a temporary bitmap used for collapse_mthp_count_present()
only. Either rename it, or better, avoid it completely.

>  };
>  
>  /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>  
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);

You really just want to count the number of set bits? You don't need a temporary
bitmap for that.

Assume you want to check an order-2 (4 bits), bitmap_weight_and() would check
all bits ...

I'd suggest starting simple here, and avoiding the temporary bitmap.

Can we simply use bitmap_weight_from(cc->mthp_bitmap, offset, nr_ptes)?
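
A minimal userspace sketch of counting the set bits in a sub-range without a
temporary mask bitmap. sketch_weight_range() is a hypothetical helper written
for illustration; it does not claim to match the signature or semantics of
any kernel bitmap helper, and it relies on the GCC/Clang builtin
__builtin_popcountl():

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG	((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Count set bits in [start, start + nbits) of 'bitmap', word by word. */
static unsigned int sketch_weight_range(const unsigned long *bitmap,
					unsigned int start, unsigned int nbits)
{
	const unsigned int end = start + nbits;
	unsigned int count = 0;

	assert(bitmap != NULL);
	while (start < end) {
		unsigned int bit = start % SKETCH_BITS_PER_LONG;
		unsigned int span = SKETCH_BITS_PER_LONG - bit;
		/* Shift the range's first bit down to bit 0 of 'word'. */
		unsigned long word = bitmap[start / SKETCH_BITS_PER_LONG] >> bit;

		if (span > end - start) {
			/* Range ends inside this word: mask off the tail. */
			span = end - start;
			word &= (1UL << span) - 1;
		}
		count += (unsigned int)__builtin_popcountl(word);
		start += span;
	}
	return count;
}
```

Only the words that overlap the requested range are touched, so an order-2
check looks at a single word rather than the whole bitmap.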

> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int referenced, int unmapped,
> +		struct collapse_control *cc, unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if ((BIT(order) - 1) & enabled_orders) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
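
For readers following the stack walk, the descent can be modelled in
userspace roughly as below. This is only a sketch under assumptions: a toy
"PMD" of 16 PTEs (TOP_ORDER 4), every attempted collapse succeeding, and the
none-PTE limit scaled down by a plain shift per order. plan_collapse(),
count_present(), struct range and the scaling rule are hypothetical and are
not taken from the patch.

```c
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TOP_ORDER 4	/* assumption: toy page table with 16 entries */
#define STACK_MAX 32	/* far more than the TOP_ORDER + 1 entries needed */

struct range {
	unsigned int offset;	/* first PTE index of the candidate */
	unsigned int order;	/* candidate covers 1 << order PTEs */
};

/* Count set bits in [start, start + nr) without a temporary bitmap. */
static unsigned int count_present(const unsigned long *map,
				  unsigned int start, unsigned int nr)
{
	unsigned int i, count = 0;

	for (i = start; i < start + nr; i++)
		if (map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
			count++;
	return count;
}

/*
 * Record in out[] the ranges that would be collapsed and return the number
 * of PTEs they cover. out[] must have room for 1 << TOP_ORDER entries.
 * max_none_top must not exceed 1 << TOP_ORDER.
 */
static unsigned int plan_collapse(const unsigned long *map,
				  unsigned long enabled_orders,
				  unsigned int max_none_top,
				  struct range *out, unsigned int *nr_out)
{
	struct range stack[STACK_MAX];
	unsigned int size = 0, covered = 0;

	*nr_out = 0;
	stack[size++] = (struct range){ 0, TOP_ORDER };

	while (size) {
		struct range r = stack[--size];
		unsigned int nr_ptes = 1U << r.order;
		/* Assumed scaling of the none-PTE limit for smaller orders. */
		unsigned int max_none = max_none_top >> (TOP_ORDER - r.order);

		if ((enabled_orders & (1UL << r.order)) &&
		    count_present(map, r.offset, nr_ptes) >= nr_ptes - max_none) {
			out[(*nr_out)++] = r;
			covered += nr_ptes;
			continue;
		}

		/* Split into halves only if some lower order is enabled. */
		if (enabled_orders & ((1UL << r.order) - 1)) {
			unsigned int mid = r.offset + nr_ptes / 2;

			/* Push the right half first so the left is popped first. */
			stack[size++] = (struct range){ mid, r.order - 1 };
			stack[size++] = (struct range){ r.offset, r.order - 1 };
		}
	}
	return covered;
}
```

With the low 8 bits set, orders 2 to 4 enabled and a limit of 0, the model
rejects order 4, accepts the left order-3 half, and finds nothing to collapse
in the right half.
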
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>  	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>  
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */

You should note here that we re-verify in mthp_collapse().

But the question is, whether we should relocate the check completely into
mthp_collapse(), instead of conditionally duplicating it.

What speaks against always populating the bitmap and making the decision in
mthp_collapse()?

Sure, we might scan a page table a bit longer, but the code gets clearer ... and
I am not sure if scanning some more page table entries is really that critical here.


> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>  
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>  		cc->progress++;
>  
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>  
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	if (result == SCAN_SUCCEED) {
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> -		/* collapse_huge_page will return with the mmap_lock released */
> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> +					     unmapped, cc, enabled_orders);
> +		/* mmap_lock was released above, set lock_dropped */
>  		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;

As Lance says, this error handling likely needs some thought.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-01  8:15 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <fb8f24b1-ce8e-4f06-bbd8-f148a9bcaeee@linux.dev>

On 6/1/26 09:49, Lance Yang wrote:
> 
> 
> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>> On 6/1/26 05:28, Lance Yang wrote:
>>>
>>>
>>> Ah, fair point.
>>>
>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>> than only using the pte pointer passed in. For example, mips does:
>>
>> Right, a re-walk would be the real problem.
>>
>>>
>>>    update_mmu_cache_range()
>>>      -> __update_tlb()
>>>        -> pgd_offset(vma->vm_mm, address)
>>>        -> pte_offset_map(...)
>>>
>>> and __update_tlb() has this assumption:
>>>
>>>         /*
>>>          * update_mmu_cache() is called between pte_offset_map_lock()
>>>          * and pte_unmap_unlock(), so we can assume that ptep is not
>>>          * NULL here: and what should be done below if it were NULL?
>>>          */
>>>
>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>> here, could __update_tlb() hit the none PMD, get NULL from
>>> pte_offset_map(), and then dereference it?
>>
>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>> even spells that out. :(
>>
>> Do you know about other code like that, or is MIPS the only one doing a
>> re-walk and crossing fingers?
> 
> I had Codex do the boring grep-work through the arch update_mmu_cache*
> code :D
> 
> MIPS doesn't seem to be the only code doing a re-walk, but it is the
> only one I found that appears to assume the PMD/PTE walk cannot fail,
> without checking whether the PMD is none ...

Okay, but likely the other code that tries to handle it is also problematic.

Best to make sure the page table is already installed when updating the entries.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCHv4 02/13] uprobes/x86: Remove struct uprobe_trampoline object
From: Jiri Olsa @ 2026-06-01  8:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: bot+bpf-ci, oleg, peterz, mingo, mhiramat, andrii, bpf,
	linux-trace-kernel, ast, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <ahbAXMH-R9Sk5N_3@krava>

On Wed, May 27, 2026 at 11:58:52AM +0200, Jiri Olsa wrote:

SNIP

> > Although the old destroy_uprobe_trampoline only freed the struct (not the
> > underlying VMA), the new code appears to introduce a VMA leak: the freshly
> > mapped PAGE_SIZE special mapping in the user's address space stays mapped
> > even though optimization failed. arch_uprobe_optimize() then sets
> > ARCH_UPROBE_FLAG_OPTIMIZE_FAIL so subsequent calls won't retry, leaving the
> > orphan trampoline mapping in the address space until exit_mmap() reaps it at
> > process teardown.
> > 
> > The commit message mentions: "Note the original code called
> > destroy_uprobe_trampoline if the optimiation failed, but it only freed the
> > struct uprobe_trampoline object, not the vma. The new vma leak is fixed in
> > following change."
> > 
> > Is the VMA leak addressed in the subsequent commit in this series?
> 
> yes, in:
> 
>       [1] uprobes/x86: Unmap trampoline vma object in case it's unused
> 
> > 
> > A secondary behaviour change is that 'return WARN_ON_ONCE(swbp_optimize(...))'
> > now returns the boolean truth value of the error (0 or 1) instead of the
> > original errno. While the current caller (arch_uprobe_optimize) only treats
> > the value as boolean, could this surprise a future caller that propagates the
> > return code?
> 
> ah ok, this is actualy 'fixed' in [1] above, but yea we should
> fix that directly in this change, will do

nah, it's ok, the caller does not care about the exact error
value, just 0 or 1 is fine
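
For anyone skimming the thread, the behaviour change being discussed looks
roughly like this in a userspace model. MODEL_WARN_ON_ONCE() is a
hypothetical stand-in that reproduces only the "evaluates to the truth value
of the condition" part described above, and failing_op() is a made-up
function that fails with -EINVAL.

```c
#include <errno.h>

/*
 * Hypothetical stand-in: models only the value the warning macro evaluates
 * to (the truth value of the condition), not the warning itself.
 */
#define MODEL_WARN_ON_ONCE(cond) (!!(cond))

/* Made-up operation that always fails with a negative errno. */
static int failing_op(void)
{
	return -EINVAL;
}

/* Old shape: the errno is propagated to the caller unchanged. */
static int returns_errno(void)
{
	return failing_op();
}

/* New shape: the caller only sees 0 (success) or 1 (failure). */
static int returns_bool(void)
{
	return MODEL_WARN_ON_ONCE(failing_op());
}
```

A caller that only tests the result for non-zero behaves the same in both
cases; only a caller that propagates the value would notice the difference.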

jirka

^ permalink raw reply

* Re: [PATCHv4 13/13] selftests/bpf: Add tests for forked/cloned optimized uprobes
From: Jiri Olsa @ 2026-06-01  8:31 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <87ldd3665m.fsf@cloudflare.com>

On Thu, May 28, 2026 at 03:00:05PM +0200, Jakub Sitnicki wrote:
> On Tue, May 26, 2026 at 10:58 PM +02, Jiri Olsa wrote:
> > Adding tests for forked/cloned optimized uprobes and make
> > sure the child can properly execute optimized probe for
> > both fork (dups mm) and clone with CLONE_VM.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 88 +++++++++++++++++++
> >  1 file changed, 88 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index efff0c515184..033d32b4cc27 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -4,6 +4,8 @@
> >  
> >  #ifdef __x86_64__
> >  
> > +#define _GNU_SOURCE
> > +#include <sched.h>
> >  #include <unistd.h>
> >  #include <asm/ptrace.h>
> >  #include <linux/compiler.h>
> > @@ -936,6 +938,88 @@ static void test_uprobe_error(void)
> >  	ASSERT_EQ(errno, EPROTO, "errno");
> >  }
> >  
> > +__attribute__((aligned(16)))
> > +__nocf_check __weak __naked void uprobe_fork_test(void)
> > +{
> > +	asm volatile (
> > +		".byte 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
> > +		"ret\n"
> > +	);
> > +}
> > +
> > +static int child_func(void *arg)
> 
> Nit: Could annotate with noreturn:
> 
> #include <stdnoreturn.h>
> 
> /* ... */
> 
> static noreturn int child_func(void *arg)

yep, will change, thanks

jirka

> 
> > +{
> > +	struct uprobe_syscall_executed *skel = arg;
> > +
> > +	/* Make sure the child's probe is still there and optimized.. */
> > +	if (memcmp(uprobe_fork_test, lea_rsp, sizeof(lea_rsp)))
> > +		_exit(1);
> > +
> > +	skel->bss->pid = getpid();
> > +
> > +	/* .. and it executes properly. */
> > +	uprobe_fork_test();
> > +
> > +	if (skel->bss->executed != 3)
> > +		_exit(2);
> > +
> > +	_exit(0);
> > +}
> 
> [...]
> 
> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01  8:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <1c99294c-9ebe-4856-bfde-09801701d75c@kernel.org>



On 2026/6/1 16:15, David Hildenbrand (Arm) wrote:
> On 6/1/26 09:49, Lance Yang wrote:
>>
>>
>> On 2026/6/1 14:54, David Hildenbrand (Arm) wrote:
>>> On 6/1/26 05:28, Lance Yang wrote:
>>>>
>>>>
>>>> Ah, fair point.
>>>>
>>>> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
>>>> than only using the pte pointer passed in. For example, mips does:
>>>
>>> Right, a re-walk would be the real problem.
>>>
>>>>
>>>>     update_mmu_cache_range()
>>>>       -> __update_tlb()
>>>>         -> pgd_offset(vma->vm_mm, address)
>>>>         -> pte_offset_map(...)
>>>>
>>>> and __update_tlb() has this assumption:
>>>>
>>>>          /*
>>>>           * update_mmu_cache() is called between pte_offset_map_lock()
>>>>           * and pte_unmap_unlock(), so we can assume that ptep is not
>>>>           * NULL here: and what should be done below if it were NULL?
>>>>           */
>>>>
>>>> So if khugepaged happens to run with current->active_mm == vma->vm_mm
>>>> here, could __update_tlb() hit the none PMD, get NULL from
>>>> pte_offset_map(), and then dereference it?
>>>
>>> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
>>> even spells that out. :(
>>>
>>> Do you know about other code like that, or is MIPS the only one doing a
>>> re-walk and crossing fingers?
>>
>> I had Codex do the boring grep-work through the arch update_mmu_cache*
>> code :D
>>
>> MIPS doesn't seem to be the only code doing a re-walk, but it is the
>> only one I found that appears to assume the PMD/PTE walk cannot fail,
>> without checking whether the PMD is none ...
> 
> Okay, but likely the other code that tries to handle it is also problematic.
> 
> Best to make sure the page table is already installed when updating the entries.

Neat, makes sense to me :D

That way the page table is back in place before any arch hook gets to
look at it :)


^ permalink raw reply
