Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lance Yang @ 2026-05-31  7:31 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260522150009.121603-13-npache@redhat.com>


On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>There are cases where, if an attempted collapse fails, all subsequent
>orders are guaranteed to also fail. Avoid these collapse attempts by
>bailing out early.
>
>Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index d3d7db8be26c..15b7298bc225 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> 			collapse_address = address + offset * PAGE_SIZE;
> 			ret = collapse_huge_page(mm, collapse_address, referenced,
> 						 unmapped, cc, order);
>-			if (ret == SCAN_SUCCEED) {
>+
>+			switch (ret) {
>+			/* Cases where we continue to next collapse candidate */
>+			case SCAN_SUCCEED:
> 				collapsed += nr_ptes;
>+				fallthrough;
>+			case SCAN_PTE_MAPPED_HUGEPAGE:
> 				continue;
>+			/* Cases where lower orders might still succeed */
>+			case SCAN_LACK_REFERENCED_PAGE:
>+			case SCAN_EXCEED_NONE_PTE:
>+			case SCAN_EXCEED_SWAP_PTE:
>+			case SCAN_EXCEED_SHARED_PTE:
>+			case SCAN_PAGE_LOCK:
>+			case SCAN_PAGE_COUNT:
>+			case SCAN_PAGE_NULL:
>+			case SCAN_DEL_PAGE_LRU:
>+			case SCAN_PTE_NON_PRESENT:
>+			case SCAN_PTE_UFFD_WP:
>+			case SCAN_ALLOC_HUGE_PAGE_FAIL:

Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
here?

If charging the current order fails, a smaller order might still fit :)

Cheers, Lance

>+			case SCAN_PAGE_LAZYFREE:
>+				goto next_order;
>+			/* Cases where no further collapse is possible */
>+			default:
>+				return collapsed;
> 			}
> 		}
> 
>-- 
>2.54.0
>
>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-05-31  8:48 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260531071845.10875-1-lance.yang@linux.dev>



On 2026/5/31 15:18, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
>> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> 	if (result == SCAN_SUCCEED) {
>> 		/* collapse_huge_page expects the lock to be dropped before calling */
>> 		mmap_read_unlock(mm);
>> -		result = collapse_huge_page(mm, start_addr, referenced,
>> -					    unmapped, cc, HPAGE_PMD_ORDER);
>> -		/* collapse_huge_page will return with the mmap_lock released */
>> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>> +					     unmapped, cc, enabled_orders);
>> +		/* mmap_lock was released above, set lock_dropped */
>> 		*lock_dropped = true;
>> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> 
> Hmm ... don't we lose the allocation-failure result here?
> 
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
> 
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Looks like this is a more general issue with mthp_collapse() only
returning nr_collapsed.

For example, SCAN_PMD_MAPPED used to be propagated too, and
madvise_collapse() treats that as success. With the new code, if
nothing was collapsed by this call, that can also become SCAN_FAIL ...

So I think we should keep both.

Cheers, Lance


^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-05-31  9:39 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>


On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>Pass an order and offset to collapse_huge_page to support collapsing anon
>memory to arbitrary orders within a PMD. order indicates what mTHP size we
>are attempting to collapse to, and offset indicates were in the PMD to
>start the collapse attempt.
>
>For non-PMD collapse we must leave the anon VMA write locked until after
>we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>the mTHP case this is not true, and we must keep the lock to prevent
>access/changes to the page tables. This can happen if the rmap walkers hit
>a pmd_none while the PMD entry is currently unavailable due to being
>temporarily removed during the collapse phase.
>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index fab35d318641..d64f42f66236 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>  */
>-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>-		int referenced, int unmapped, struct collapse_control *cc)
>+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned int order)
> {
>+	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>+	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> 	LIST_HEAD(compound_pagelist);
> 	pmd_t *pmd, _pmd;
>-	pte_t *pte;
>+	pte_t *pte = NULL;
> 	pgtable_t pgtable;
> 	struct folio *folio;
> 	spinlock_t *pmd_ptl, *pte_ptl;
> 	enum scan_result result = SCAN_FAIL;
> 	struct vm_area_struct *vma;
> 	struct mmu_notifier_range range;
>+	bool anon_vma_locked = false;
> 
>-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>-
>-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>+	result = alloc_charge_folio(&folio, mm, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_nolock;
> 
> 	mmap_read_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
> 	}
> 
>-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>+	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> 	if (result != SCAN_SUCCEED) {
> 		mmap_read_unlock(mm);
> 		goto out_nolock;
>@@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 * released when it fails. So we jump out_nolock directly in
> 		 * that case.  Continuing to collapse causes inconsistency.
> 		 */
>-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>-						     referenced, HPAGE_PMD_ORDER);
>+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>+						     referenced, order);
> 		if (result != SCAN_SUCCEED)
> 			goto out_nolock;
> 	}
>@@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * mmap_lock.
> 	 */
> 	mmap_write_lock(mm);
>-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>-					 HPAGE_PMD_ORDER);
>+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>+					 &vma, cc, order);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 	/* check if the pmd is still valid */
> 	vma_start_write(vma);
>-	result = check_pmd_still_valid(mm, address, pmd);
>+	result = check_pmd_still_valid(mm, pmd_addr, pmd);
> 	if (result != SCAN_SUCCEED)
> 		goto out_up_write;
> 
> 	anon_vma_lock_write(vma->anon_vma);
>+	anon_vma_locked = true;
> 
>-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>-				address + HPAGE_PMD_SIZE);
>+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>+				end_addr);
> 	mmu_notifier_invalidate_range_start(&range);
> 
> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>@@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
> 	 * it detects PMD is changed.
> 	 */
>-	_pmd = pmdp_collapse_flush(vma, address, pmd);
>+	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> 	spin_unlock(pmd_ptl);
> 	mmu_notifier_invalidate_range_end(&range);
> 	tlb_remove_table_sync_one();
> 
>-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> 	if (pte) {
>-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>-						      HPAGE_PMD_ORDER,
>-						      &compound_pagelist);
>+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>+						      order, &compound_pagelist);
> 		spin_unlock(pte_ptl);
> 	} else {
> 		result = SCAN_NO_PTE_TABLE;
> 	}
> 
> 	if (unlikely(result != SCAN_SUCCEED)) {
>-		if (pte)
>-			pte_unmap(pte);
> 		spin_lock(pmd_ptl);
>-		BUG_ON(!pmd_none(*pmd));
>+		WARN_ON_ONCE(!pmd_none(*pmd));
> 		/*
> 		 * We can only use set_pmd_at when establishing
> 		 * hugepmds and never for establishing regular pmds that
>@@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 */
> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> 		spin_unlock(pmd_ptl);
>-		anon_vma_unlock_write(vma->anon_vma);
> 		goto out_up_write;
> 	}
> 
> 	/*
>-	 * All pages are isolated and locked so anon_vma rmap
>-	 * can't run anymore.
>+	 * For PMD collapse all pages are isolated and locked so anon_vma
>+	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>+	 * removed and not all pages are isolated and locked, so we must hold
>+	 * the lock to prevent neighboring folios from attempting to access
>+	 * this PMD until its reinstalled.
> 	 */
>-	anon_vma_unlock_write(vma->anon_vma);
>+	if (is_pmd_order(order)) {
>+		anon_vma_unlock_write(vma->anon_vma);
>+		anon_vma_locked = false;
>+	}
> 
> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>-					   vma, address, pte_ptl,
>-					   HPAGE_PMD_ORDER,
>-					   &compound_pagelist);
>-	pte_unmap(pte);
>+					   vma, start_addr, pte_ptl,
>+					   order, &compound_pagelist);
> 	if (unlikely(result != SCAN_SUCCEED))
> 		goto out_up_write;
> 
>@@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	 * write.
> 	 */
> 	__folio_mark_uptodate(folio);
>-	pgtable = pmd_pgtable(_pmd);
>-
> 	spin_lock(pmd_ptl);
>-	BUG_ON(!pmd_none(*pmd));
>-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>+	WARN_ON_ONCE(!pmd_none(*pmd));
>+	if (is_pmd_order(order)) {
>+		pgtable = pmd_pgtable(_pmd);
>+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>+	} else {
>+		/*
>+		 * set_ptes is called in map_anon_folio_pte_nopf with the
>+		 * pmd_ptl lock still held; this is safe as the PMD is expected
>+		 * to be none. The pmd entry is then repopulated below.
>+		 */
>+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);

Emm ... is it safe to use map_anon_folio_pte_nopf() here?

At this point pmdp_collapse_flush() has cleared the PMD from the page
tables. The PTE table we are updating is only reachable through the saved
old PMD value, _pmd, until pmd_populate() below.

map_anon_folio_pte_nopf() does set_ptes() and then calls
update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
that hook as:

"
	At the end of every page fault, this routine is invoked to tell
	the architecture specific code that translations now exists
	in the software page tables for address space "vma->vm_mm"
	at virtual address "address" for "nr" consecutive pages.
"

But that does not seem true here yet, since the PTE table is not
reachable from vma->vm_mm when update_mmu_cache_range() is called.

Should we avoid calling update_mmu_cache_range() until after the PTE
table is reinstalled with pmd_populate()?

Cheers, Lance

>+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>+	}
> 	spin_unlock(pmd_ptl);
> 
> 	folio = NULL;
> 
> 	result = SCAN_SUCCEED;
> out_up_write:
>+	if (anon_vma_locked)
>+		anon_vma_unlock_write(vma->anon_vma);
>+	if (pte)
>+		pte_unmap(pte);
> 	mmap_write_unlock(mm);
> out_nolock:
> 	if (folio)
>@@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
> 		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc);
>+					    unmapped, cc, HPAGE_PMD_ORDER);
> 		/* collapse_huge_page will return with the mmap_lock released */
> 		*lock_dropped = true;
> 	}
>-- 
>2.54.0
>
>

^ permalink raw reply

* Re: [PATCH v2 07/12] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Wen Yang @ 2026-05-31 14:54 UTC (permalink / raw)
  To: Nam Cao, Gabriele Monaco, linux-kernel, Steven Rostedt,
	linux-trace-kernel
In-Reply-To: <877bonoq70.fsf@yellow.woof>



On 5/28/26 17:09, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
>> From: Wen Yang <wen.yang@linux.dev>
>>
>> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
>> may racing with the sched_switch handler:
>>
>>    da_monitor_start()               sched_switch handler
>>    -------------------------        ---------------------------------
>>    da_mon->monitoring = 1;
>>                                     if (da_monitoring(da_mon))  /* true  */
>>                                         ha_start_timer_ns(...);
>>                                         /* hrtimer->base == NULL, crash */
>>    da_monitor_init_hook(da_mon);
>>    /* hrtimer_setup() sets base */
>>
>> Fix the ordering and pair with release/acquire semantics:
>>
>>    da_monitor_init_hook(da_mon);
>>    smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>>    return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
>>
>> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
>> the load can observe monitoring=1 while hrtimer->base is still NULL.
>> The plain accesses are also data races under KCSAN.
>>
>> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
>> cover the reset path.
>>
>> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> 
> Looks correct to me.
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> 
> Wen, I am curious, how did you find this issue?
> 

It was caught during development of the tlob monitor by a KUnit test 

(since moved to selftests) that monitored a kthread running concurrently 

on another CPU.  On a multi-core box the window between setting 

monitoring=1 and da_monitor_init_hook() was wide enough to trigger the 

crash, and KCSAN flagged the plain monitoring accesses as data 
                                                races as well.

--
Best wishes,
Wen



^ permalink raw reply

* Re: [PATCH v3 00/13] rv: Fixes on Deterministic and Hybrid Automata
From: Wen Yang @ 2026-05-31 15:17 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel; +Cc: Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <20260530141652.58084-1-gmonaco@redhat.com>


Thank you for this series -- it addresses several issues in the RV 
framework. Our pending series [PATCH v3 tlob] also depends on it; we 
would appreciate if it could be picked up for the next merge window.


--
Best wishes,
Wen


On 5/30/26 22:16, Gabriele Monaco wrote:
> Fix issues that were reported by bots or visible only after integration:
>   * Make sure timers are always terminated and waited for when disabling
>     the monitor or when the target terminates
>   * Run per-cpu monitors with migration disabled since preemption is now
>     enabled from tracepoints
>   * Fix a wrong __user specifier in a helper function
>   * Other cleanup and concurrency issues
> 
> Differences since V2 [1]:
> * Applied from reviews changes to commit messages
> * Rearranged order to put non-fixes and not-reviewed patches in the end
> * Synchronise monitors before resetting them to avoid rearming
> * Protect against racing timer callbacks during destruction
> 
> Differences since V1 [2]:
> * Fix memory consistency with timer callbacks racing with resets
> * Add per-obj deallocation hook in rvgen generated code
> * Do not rely on clean monitor when initialising HA
> * Add tracepoint synchronisation before returning per-task slots
> * Fix suffix strip in dot2k
> * Generate stub deallocation hooks instead of failing build when per-obj
>    miss those
> 
> [1] - https://lore.kernel.org/lkml/20260527062313.39908-1-gmonaco@redhat.com
> [2] - https://lore.kernel.org/lkml/20260512140250.262190-1-gmonaco@redhat.com
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Nam Cao <namcao@linutronix.de>
> Cc: Wen Yang <wen.yang@linux.dev>
> Cc: linux-trace-kernel@vger.kernel.org
> 
> Gabriele Monaco (12):
>    rv: Fix __user specifier usage in extract_params()
>    rv: Reset per-task DA monitors before releasing the slot
>    rv: Prevent in-flight per-task handlers from using invalid slots
>    rv: Ensure all pending probes terminate on per-obj monitor destroy
>    rv: Do not rely on clean monitor when initialising HA
>    rv: Add automatic cleanup handlers for per-task HA monitors
>    rv: Ensure synchronous cleanup for HA monitors
>    rv: Prevent task migration while handling per-CPU events
>    rv: Use 0 to check preemption enabled in opid
>    verification/rvgen: Fix suffix strip in dot2k
>    rv: Fix read_lock scope in per-task DA cleanup
>    verification/rvgen: Generate cleanup hook for per-obj monitor
> 
> Wen Yang (1):
>    rv: Fix monitor start ordering and memory ordering for monitoring flag
> 
>   include/rv/da_monitor.h                       |  67 ++++++++---
>   include/rv/ha_monitor.h                       | 104 +++++++++++++++++-
>   include/rv/ltl_monitor.h                      |   1 +
>   kernel/trace/rv/monitors/deadline/deadline.h  |   3 +-
>   kernel/trace/rv/monitors/nomiss/nomiss.c      |   4 +-
>   kernel/trace/rv/monitors/opid/opid.c          |  12 +-
>   kernel/trace/rv/monitors/stall/stall.c        |   4 +-
>   tools/verification/rvgen/rvgen/dot2k.py       |  19 +++-
>   .../rvgen/rvgen/templates/dot2k/main.c        |   4 +-
>   9 files changed, 180 insertions(+), 38 deletions(-)
> 
> 
> base-commit: 8fde5d1d47f69db6082dfa34500c27f8485389a5

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-05-31 20:00 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531093942.19644-1-lance.yang@linux.dev>

On 5/31/26 11:39, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates were in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fab35d318641..d64f42f66236 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>  */
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -		int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned int order)
>> {
>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>> 	LIST_HEAD(compound_pagelist);
>> 	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>> +	pte_t *pte = NULL;
>> 	pgtable_t pgtable;
>> 	struct folio *folio;
>> 	spinlock_t *pmd_ptl, *pte_ptl;
>> 	enum scan_result result = SCAN_FAIL;
>> 	struct vm_area_struct *vma;
>> 	struct mmu_notifier_range range;
>> +	bool anon_vma_locked = false;
>>
>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_nolock;
>>
>> 	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> 	}
>>
>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 * released when it fails. So we jump out_nolock directly in
>> 		 * that case.  Continuing to collapse causes inconsistency.
>> 		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -						     referenced, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +						     referenced, order);
>> 		if (result != SCAN_SUCCEED)
>> 			goto out_nolock;
>> 	}
>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * mmap_lock.
>> 	 */
>> 	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>> 	/* check if the pmd is still valid */
>> 	vma_start_write(vma);
>> -	result = check_pmd_still_valid(mm, address, pmd);
>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>>
>> 	anon_vma_lock_write(vma->anon_vma);
>> +	anon_vma_locked = true;
>>
>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -				address + HPAGE_PMD_SIZE);
>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +				end_addr);
>> 	mmu_notifier_invalidate_range_start(&range);
>>
>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>> 	 * it detects PMD is changed.
>> 	 */
>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>> 	spin_unlock(pmd_ptl);
>> 	mmu_notifier_invalidate_range_end(&range);
>> 	tlb_remove_table_sync_one();
>>
>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>> 	if (pte) {
>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -						      HPAGE_PMD_ORDER,
>> -						      &compound_pagelist);
>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +						      order, &compound_pagelist);
>> 		spin_unlock(pte_ptl);
>> 	} else {
>> 		result = SCAN_NO_PTE_TABLE;
>> 	}
>>
>> 	if (unlikely(result != SCAN_SUCCEED)) {
>> -		if (pte)
>> -			pte_unmap(pte);
>> 		spin_lock(pmd_ptl);
>> -		BUG_ON(!pmd_none(*pmd));
>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>> 		/*
>> 		 * We can only use set_pmd_at when establishing
>> 		 * hugepmds and never for establishing regular pmds that
>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 */
>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> 		spin_unlock(pmd_ptl);
>> -		anon_vma_unlock_write(vma->anon_vma);
>> 		goto out_up_write;
>> 	}
>>
>> 	/*
>> -	 * All pages are isolated and locked so anon_vma rmap
>> -	 * can't run anymore.
>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +	 * removed and not all pages are isolated and locked, so we must hold
>> +	 * the lock to prevent neighboring folios from attempting to access
>> +	 * this PMD until its reinstalled.
>> 	 */
>> -	anon_vma_unlock_write(vma->anon_vma);
>> +	if (is_pmd_order(order)) {
>> +		anon_vma_unlock_write(vma->anon_vma);
>> +		anon_vma_locked = false;
>> +	}
>>
>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -					   vma, address, pte_ptl,
>> -					   HPAGE_PMD_ORDER,
>> -					   &compound_pagelist);
>> -	pte_unmap(pte);
>> +					   vma, start_addr, pte_ptl,
>> +					   order, &compound_pagelist);
>> 	if (unlikely(result != SCAN_SUCCEED))
>> 		goto out_up_write;
>>
>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * write.
>> 	 */
>> 	__folio_mark_uptodate(folio);
>> -	pgtable = pmd_pgtable(_pmd);
>> -
>> 	spin_lock(pmd_ptl);
>> -	BUG_ON(!pmd_none(*pmd));
>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>> +	if (is_pmd_order(order)) {
>> +		pgtable = pmd_pgtable(_pmd);
>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +	} else {
>> +		/*
>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>> +		 * to be none. The pmd entry is then repopulated below.
>> +		 */
>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> 
> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> 
> At this point pmdp_collapse_flush() has cleared the PMD from the page
> tables. The PTE table we are updating is only reachable through the saved
> old PMD value, _pmd, until pmd_populate() below.
> 
> map_anon_folio_pte_nopf() does set_ptes() and then calls
> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> that hook as:
> 
> "
> 	At the end of every page fault, this routine is invoked to tell
> 	the architecture specific code that translations now exists
> 	in the software page tables for address space "vma->vm_mm"
> 	at virtual address "address" for "nr" consecutive pages.
> "
> 
> But that does not seem true here yet, since the PTE table is not
> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> 
> Should we avoid calling update_mmu_cache_range() until after the PTE
> table is reinstalled with pmd_populate()?

I recall that update_mmu_cache* users mostly care about updating folios flags,
for the folio derived from the PTE ... or flushing caches for the user address.

So intuitively I would say "the architecture code doesn't care that the PMD
table will only be visible to HW shortly after". The important thing should be
that it will definetly happen, and that nothing else is curently there or can be
there?



-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: David Hildenbrand (Arm) @ 2026-05-31 20:02 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531073102.20318-1-lance.yang@linux.dev>

On 5/31/26 09:31, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>> There are cases where, if an attempted collapse fails, all subsequent
>> orders are guaranteed to also fail. Avoid these collapse attempts by
>> bailing out early.
>>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3d7db8be26c..15b7298bc225 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>> 			collapse_address = address + offset * PAGE_SIZE;
>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>> 						 unmapped, cc, order);
>> -			if (ret == SCAN_SUCCEED) {
>> +
>> +			switch (ret) {
>> +			/* Cases where we continue to next collapse candidate */
>> +			case SCAN_SUCCEED:
>> 				collapsed += nr_ptes;
>> +				fallthrough;
>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>> 				continue;
>> +			/* Cases where lower orders might still succeed */
>> +			case SCAN_LACK_REFERENCED_PAGE:
>> +			case SCAN_EXCEED_NONE_PTE:
>> +			case SCAN_EXCEED_SWAP_PTE:
>> +			case SCAN_EXCEED_SHARED_PTE:
>> +			case SCAN_PAGE_LOCK:
>> +			case SCAN_PAGE_COUNT:
>> +			case SCAN_PAGE_NULL:
>> +			case SCAN_DEL_PAGE_LRU:
>> +			case SCAN_PTE_NON_PRESENT:
>> +			case SCAN_PTE_UFFD_WP:
>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> 
> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
> here?
> 
> If charging the current order fails, a smaller order might still fit :)

I think the reasoning was here, that if we are already that close to our mem
limit, we should just give up instead of trying to squeeze it in .. :)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: David Hildenbrand (Arm) @ 2026-05-31 20:09 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-9-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> 	encountering a swap PTE.
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> We will add support for mTHP collapse for anonymous pages next; lets also
> track when this happens at the PMD level within the per-mTHP stats.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>  
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED)

[...]

>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

This comment actually belongs into an earlier patch, no?

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Can be done as a later cleanup, but having a single function that obtains an
order and knows which stats to update would be cleaner (and a good preparation
for shmem mTHP collapse support).

Nothing jumped at me, so

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: David Hildenbrand (Arm) @ 2026-05-31 20:18 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-11-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.

It would have been good to describe here that, for now, it only ever returns
PMDs, and that it will be extended next.

Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit

... and there, it is clear that we don't even want to know the orders?

So can we just call this function

"collapse_possible" and make it return a boolean?

> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4534025bc81d..64ceebc9d8a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>  
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +		vm_flags_t vm_flags, enum tva_type tva_flags)
> +{
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			cc->progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
>  			cc->progress++;
>  			continue;
>  		}
> @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
>  
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
>  		return -EINVAL;
>  
>  	cc = kmalloc_obj(*cc);

Having a simple

static bool collapse_possible(...)
{
	return collapse_allowable_orders(...)
}

Would make the above slightly more readable.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Masami Hiramatsu @ 2026-05-31 23:40 UTC (permalink / raw)
  To: Tengda Wu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel
In-Reply-To: <94179dab-ffb7-4fab-af45-b20bfb686ab3@huaweicloud.com>

On Fri, 29 May 2026 11:39:49 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> > We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
> > 
> > BTW, should task_on_cpu() use READ_ONCE() etc?
> > wait_task_inactive() seems a bit fragile.
> > 
> > Thanks,
> > 
> 
> It seems that using task_on_cpu() is not necessary here because:
> 
> 1. It requires an additional 'rq' parameter not available in the rethook context.
> 2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.

OK. We may need to cleanup task_on_cpu() to remove unused rq, which
was historically introduced by commit 029632fbb7b7 ("sched: Make
separate sched*.c translation units") as a parameter for "task_running()"
because it called task_current(), but was cleaned by commit 0b9d46fc5ef7
("sched: Rename task_running() to task_on_cpu()") and now it is not
used.

> 
> /* file: kernel/sched/sched.h */
> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
> {
> 	return p->on_cpu;
> }
> 
> Given these constraints, staying with tsk->on_cpu seems more straightforward
> for the rethook context.

The reason I asked you to use a wrapper function instead of directly
accessing that on_cpu is to ensure that this part isn't overlooked when
changing the scheduler's behavior in the future.

They may be equivalent at this point in time, but that doesn't necessarily
mean they will remain so in the future.

IMHO, an API interface should represent the behavior and resources
required for that operation, and when accessing it from outside the
subsystem, we should use that API whenever possible. Otherwise, even
if we want to update the internal implementation of one subsystem,
we will have to make changes to other subsystems as well.

Peter, is it OK to drop @rq from task_on_cpu()? Then we can use
it from rethook.

Thank you,

> 
> Thanks,
> Tengda
> 
> >>
> >> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> >> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> >> ---
> >>  kernel/trace/rethook.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> >> index 5a8bdf88999a..bd5e5f455e85 100644
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> >>  	if (WARN_ON_ONCE(!cur))
> >>  		return 0;
> >>  
> >> -	if (tsk != current && task_is_running(tsk))
> >> +	if (tsk != current && tsk->on_cpu)
> >>  		return 0;
> >>  
> >>  	do {
> >> -- 
> >> 2.34.1
> >>
> >>
> > 
> > 
> 

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] ring-buffer: Better comment the use of RB_MISSED_EVENTS
From: Masami Hiramatsu @ 2026-05-31 23:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260528223738.41276c0e@fedora>

On Thu, 28 May 2026 22:37:38 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> If the persistent ring buffer is detected on boot up to have a corrupted
> sub-buffer, that sub-buffer is cleared to zero and its commit value has
> the RB_MISSED_EVENTS bit set. That bit is to allow the "trace",
> "trace_pipe" and "trace_pipe_raw" files know that events were dropped by
> outputting "[LOST EVENTS]".
> 
> Only in this case does that bit get set in the writeable portion of the
> ring buffer. When events are dropped in the normal ring buffer, that
> information is stored in the cpu_buffer descriptor and the
> RB_MISSED_EVENTS is set in the buffer page at the time the page is
> consumed. It is never set in the writeable portion of the buffer.
> 
> Add comments to describe this better as it can be confusing to know when
> the RB_MISSED_EVENTS are set in the commit portion of the buffer page.
> 
> Link: https://lore.kernel.org/all/20260529001500.14178455a046a5cbc6180861@kernel.org/
> 

This looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks, 

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 910f6b3adf74..06fb365bb86e 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1929,6 +1929,12 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
>  	 */
>  	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
>  		local_set(&bpage->entries, 0);
> +		/*
> +		 * Note, the RB_MISSED_EVENTS is only set inside the main write
> +		 * buffer by this verification logic. The normal ring buffer
> +		 * has this bit set when the page is read and passed to the
> +		 * consumers.
> +		 */
>  		local_set(&dpage->commit, RB_MISSED_EVENTS);
>  		dpage->time_stamp = prev_ts ? prev_ts : next_ts;
>  		ret = -1;
> @@ -7232,6 +7238,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  			local_add(RB_MISSED_STORED, &dpage->commit);
>  			size += sizeof(missed_events);
>  		}
> +		/*
> +		 * Note, for the persistent ring buffer, the RB_MISSED_EVENTS
> +		 * may have been set in the main buffer via the verification code.
> +		 * But here, dpage is a copy of that page and has not yet had
> +		 * the RB_MISSED_EVENTS set. As for the normal buffers,
> +		 * the main write buffer does not set these bits and it needs
> +		 * to be set here.
> +		 */
>  		local_add(RB_MISSED_EVENTS, &dpage->commit);
>  	}
>  
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-06-01  0:58 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel
In-Reply-To: <20260601084001.9566b443746447ec2bb1a9fb@kernel.org>



On 2026/6/1 7:40, Masami Hiramatsu wrote:
> On Fri, 29 May 2026 11:39:49 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
> 
>>> We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
>>>
>>> BTW, should task_on_cpu() use READ_ONCE() etc?
>>> wait_task_inactive() seems a bit fragile.
>>>
>>> Thanks,
>>>
>>
>> It seems that using task_on_cpu() is not necessary here because:
>>
>> 1. It requires an additional 'rq' parameter not available in the rethook context.
>> 2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.
> 
> OK. We may need to cleanup task_on_cpu() to remove unused rq, which
> was historically introduced by commit 029632fbb7b7 ("sched: Make
> separate sched*.c translation units") as a parameter for "task_running()"
> because it called task_current(), but was cleaned by commit 0b9d46fc5ef7
> ("sched: Rename task_running() to task_on_cpu()") and now it is not
> used.
> 
>>
>> /* file: kernel/sched/sched.h */
>> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
>> {
>> 	return p->on_cpu;
>> }
>>
>> Given these constraints, staying with tsk->on_cpu seems more straightforward
>> for the rethook context.
> 
> The reason I asked you to use a wrapper function instead of directly
> accessing that on_cpu is to ensure that this part isn't overlooked when
> changing the scheduler's behavior in the future.
> 
> They may be equivalent at this point in time, but that doesn't necessarily
> mean they will remain so in the future.
> 
> IMHO, an API interface should represent the behavior and resources
> required for that operation, and when accessing it from outside the
> subsystem, we should use that API whenever possible. Otherwise, even
> if we want to update the internal implementation of one subsystem,
> we will have to make changes to other subsystems as well.
> 
> Peter, is it OK to drop @rq from task_on_cpu()? Then we can use
> it from rethook.
> 
> Thank you,
> 

Thank you for the detailed explanation -- that makes perfect sense.

I looked into task_on_cpu() and saw it has many callers.
Once it's confirmed that the parameter can be safely dropped, I'll
follow up with my patch to use the wrapper instead of directly
accessing on_cpu.

Best,
Tengda

>>>>
>>>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>>>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>>>> ---
>>>>  kernel/trace/rethook.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>>>> index 5a8bdf88999a..bd5e5f455e85 100644
>>>> --- a/kernel/trace/rethook.c
>>>> +++ b/kernel/trace/rethook.c
>>>> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>>>  	if (WARN_ON_ONCE(!cur))
>>>>  		return 0;
>>>>  
>>>> -	if (tsk != current && task_is_running(tsk))
>>>> +	if (tsk != current && tsk->on_cpu)
>>>>  		return 0;
>>>>  
>>>>  	do {
>>>> -- 
>>>> 2.34.1
>>>>
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply

* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lance Yang @ 2026-06-01  1:53 UTC (permalink / raw)
  To: David Hildenbrand (Arm), npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <65e201dd-10d6-43f6-8758-0fc313158fe7@kernel.org>



On 2026/6/1 04:02, David Hildenbrand (Arm) wrote:
> On 5/31/26 09:31, Lance Yang wrote:
>>
>> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>>> There are cases where, if an attempted collapse fails, all subsequent
>>> orders are guaranteed to also fail. Avoid these collapse attempts by
>>> bailing out early.
>>>
>>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>>> Acked-by: Usama Arif <usama.arif@linux.dev>
>>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d3d7db8be26c..15b7298bc225 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>>> 			collapse_address = address + offset * PAGE_SIZE;
>>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>>> 						 unmapped, cc, order);
>>> -			if (ret == SCAN_SUCCEED) {
>>> +
>>> +			switch (ret) {
>>> +			/* Cases where we continue to next collapse candidate */
>>> +			case SCAN_SUCCEED:
>>> 				collapsed += nr_ptes;
>>> +				fallthrough;
>>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>>> 				continue;
>>> +			/* Cases where lower orders might still succeed */
>>> +			case SCAN_LACK_REFERENCED_PAGE:
>>> +			case SCAN_EXCEED_NONE_PTE:
>>> +			case SCAN_EXCEED_SWAP_PTE:
>>> +			case SCAN_EXCEED_SHARED_PTE:
>>> +			case SCAN_PAGE_LOCK:
>>> +			case SCAN_PAGE_COUNT:
>>> +			case SCAN_PAGE_NULL:
>>> +			case SCAN_DEL_PAGE_LRU:
>>> +			case SCAN_PTE_NON_PRESENT:
>>> +			case SCAN_PTE_UFFD_WP:
>>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
>>
>> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
>> here?
>>
>> If charging the current order fails, a smaller order might still fit :)
> 
> I think the reasoning was here, that if we are already that close to our mem
> limit, we should just give up instead of trying to squeeze it in .. :)

Fair point. Just a nit, nevermind :)

^ permalink raw reply

* [PATCH v3] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Tianchen Ding @ 2026-06-01  2:32 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260527095438.1794905-1-dtcccc@linux.alibaba.com>

On ARM64 kernels with 64K pages, the trace_marker_raw test fails because
bash's printf builtin uses stdio buffering which splits output into
multiple small write() calls to the tracefs file. Since each individual
write is within TRACE_MARKER_MAX_SIZE (4096), they all succeed, causing
the "too big" write test to incorrectly pass.

Fix by writing through dd with iflag=fullblock to guarantee a single
atomic write() syscall to trace_marker_raw.

Fixes: 37f46601383a ("selftests/tracing: Add basic test for trace_marker_raw file")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v3:
Measure the binary length via wc -c instead of hard-coding "size + 4".

v2:
Update comment about 64K pages.
---
 .../ftrace/test.d/00basic/trace_marker_raw.tc      | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
index 8e905d4fe6dd..f985ff391463 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -36,15 +36,23 @@ make_str() {
 
 	data=`printf -- 'X%.0s' $(seq $cnt)`
 
-	printf "${val}${data}"
+	# Return escape-sequence text (e.g. "\003\000..."); the caller
+	# converts to binary. Shell command substitution strips NUL bytes,
+	# so the binary form cannot survive being captured into a variable.
+	printf '%s' "${val}${data}"
 }
 
 write_buffer() {
 	id=$1
 	size=$2
 
-	# write the string into the raw marker
-	make_str $id $size > trace_marker_raw
+	str=`make_str $id $size`
+	len=`printf "$str" | wc -c`
+	# Pipe through dd to ensure a single atomic write() syscall
+	# on architectures with 64K pages, where shell's printf builtin
+	# uses stdio buffering which may split the output into multiple
+	# writes.
+	printf "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
 }
 
 
-- 
2.39.3


^ permalink raw reply related

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01  3:28 UTC (permalink / raw)
  To: david
  Cc: lance.yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <2024af56-5e99-4799-a586-e9ba756cecb9@kernel.org>


On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>On 5/31/26 11:39, Lance Yang wrote:
>> 
>> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>>> Pass an order and offset to collapse_huge_page to support collapsing anon
>>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>>> are attempting to collapse to, and offset indicates were in the PMD to
>>> start the collapse attempt.
>>>
>>> For non-PMD collapse we must leave the anon VMA write locked until after
>>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>>> the mTHP case this is not true, and we must keep the lock to prevent
>>> access/changes to the page tables. This can happen if the rmap walkers hit
>>> a pmd_none while the PMD entry is currently unavailable due to being
>>> temporarily removed during the collapse phase.
>>>
>>> Acked-by: Usama Arif <usama.arif@linux.dev>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fab35d318641..d64f42f66236 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>>  */
>>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>> -		int referenced, int unmapped, struct collapse_control *cc)
>>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>>> +		int referenced, int unmapped, struct collapse_control *cc,
>>> +		unsigned int order)
>>> {
>>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>>> 	LIST_HEAD(compound_pagelist);
>>> 	pmd_t *pmd, _pmd;
>>> -	pte_t *pte;
>>> +	pte_t *pte = NULL;
>>> 	pgtable_t pgtable;
>>> 	struct folio *folio;
>>> 	spinlock_t *pmd_ptl, *pte_ptl;
>>> 	enum scan_result result = SCAN_FAIL;
>>> 	struct vm_area_struct *vma;
>>> 	struct mmu_notifier_range range;
>>> +	bool anon_vma_locked = false;
>>>
>>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>> -
>>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_nolock;
>>>
>>> 	mmap_read_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> 	}
>>>
>>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		mmap_read_unlock(mm);
>>> 		goto out_nolock;
>>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 * released when it fails. So we jump out_nolock directly in
>>> 		 * that case.  Continuing to collapse causes inconsistency.
>>> 		 */
>>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>>> -						     referenced, HPAGE_PMD_ORDER);
>>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>>> +						     referenced, order);
>>> 		if (result != SCAN_SUCCEED)
>>> 			goto out_nolock;
>>> 	}
>>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * mmap_lock.
>>> 	 */
>>> 	mmap_write_lock(mm);
>>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>>> -					 HPAGE_PMD_ORDER);
>>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>>> +					 &vma, cc, order);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>> 	/* check if the pmd is still valid */
>>> 	vma_start_write(vma);
>>> -	result = check_pmd_still_valid(mm, address, pmd);
>>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>>> 	if (result != SCAN_SUCCEED)
>>> 		goto out_up_write;
>>>
>>> 	anon_vma_lock_write(vma->anon_vma);
>>> +	anon_vma_locked = true;
>>>
>>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>> -				address + HPAGE_PMD_SIZE);
>>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>>> +				end_addr);
>>> 	mmu_notifier_invalidate_range_start(&range);
>>>
>>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>>> 	 * it detects PMD is changed.
>>> 	 */
>>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>>> 	spin_unlock(pmd_ptl);
>>> 	mmu_notifier_invalidate_range_end(&range);
>>> 	tlb_remove_table_sync_one();
>>>
>>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>>> 	if (pte) {
>>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>>> -						      HPAGE_PMD_ORDER,
>>> -						      &compound_pagelist);
>>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>>> +						      order, &compound_pagelist);
>>> 		spin_unlock(pte_ptl);
>>> 	} else {
>>> 		result = SCAN_NO_PTE_TABLE;
>>> 	}
>>>
>>> 	if (unlikely(result != SCAN_SUCCEED)) {
>>> -		if (pte)
>>> -			pte_unmap(pte);
>>> 		spin_lock(pmd_ptl);
>>> -		BUG_ON(!pmd_none(*pmd));
>>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>>> 		/*
>>> 		 * We can only use set_pmd_at when establishing
>>> 		 * hugepmds and never for establishing regular pmds that
>>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 		 */
>>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> 		spin_unlock(pmd_ptl);
>>> -		anon_vma_unlock_write(vma->anon_vma);
>>> 		goto out_up_write;
>>> 	}
>>>
>>> 	/*
>>> -	 * All pages are isolated and locked so anon_vma rmap
>>> -	 * can't run anymore.
>>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>>> +	 * removed and not all pages are isolated and locked, so we must hold
>>> +	 * the lock to prevent neighboring folios from attempting to access
>>> +	 * this PMD until its reinstalled.
>>> 	 */
>>> -	anon_vma_unlock_write(vma->anon_vma);
>>> +	if (is_pmd_order(order)) {
>>> +		anon_vma_unlock_write(vma->anon_vma);
>>> +		anon_vma_locked = false;
>>> +	}
>>>
>>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>>> -					   vma, address, pte_ptl,
>>> -					   HPAGE_PMD_ORDER,
>>> -					   &compound_pagelist);
>>> -	pte_unmap(pte);
>>> +					   vma, start_addr, pte_ptl,
>>> +					   order, &compound_pagelist);
>>> 	if (unlikely(result != SCAN_SUCCEED))
>>> 		goto out_up_write;
>>>
>>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>> 	 * write.
>>> 	 */
>>> 	__folio_mark_uptodate(folio);
>>> -	pgtable = pmd_pgtable(_pmd);
>>> -
>>> 	spin_lock(pmd_ptl);
>>> -	BUG_ON(!pmd_none(*pmd));
>>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>>> +	if (is_pmd_order(order)) {
>>> +		pgtable = pmd_pgtable(_pmd);
>>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>>> +	} else {
>>> +		/*
>>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>>> +		 * to be none. The pmd entry is then repopulated below.
>>> +		 */
>>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>> 
>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>> 
>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>> tables. The PTE table we are updating is only reachable through the saved
>> old PMD value, _pmd, until pmd_populate() below.
>> 
>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>> that hook as:
>> 
>> "
>> 	At the end of every page fault, this routine is invoked to tell
>> 	the architecture specific code that translations now exists
>> 	in the software page tables for address space "vma->vm_mm"
>> 	at virtual address "address" for "nr" consecutive pages.
>> "
>> 
>> But that does not seem true here yet, since the PTE table is not
>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>> 
>> Should we avoid calling update_mmu_cache_range() until after the PTE
>> table is reinstalled with pmd_populate()?
>
>I recall that update_mmu_cache* users mostly care about updating folios flags,
>for the folio derived from the PTE ... or flushing caches for the user address.
>
>So intuitively I would say "the architecture code doesn't care that the PMD
>table will only be visible to HW shortly after". The important thing should be
>that it will definetly happen, and that nothing else is curently there or can be
>there?

Ah, fair point.

I was mostly worried about arch hooks that walk vma->vm_mm again, rather
than only using the pte pointer passed in. For example, mips does:

  update_mmu_cache_range()
    -> __update_tlb()
      -> pgd_offset(vma->vm_mm, address)
      -> pte_offset_map(...)

and __update_tlb() has this assumption:

		/*
		 * update_mmu_cache() is called between pte_offset_map_lock()
		 * and pte_unmap_unlock(), so we can assume that ptep is not
		 * NULL here: and what should be done below if it were NULL?
		 */

So if khugepaged happens to run with current->active_mm == vma->vm_mm
here, could __update_tlb() hit the none PMD, get NULL from
pte_offset_map(), and then dereference it?

Just wanted to raise it since some arch code may still have assumptions
like this, and the always-enable-mTHP work is getting closer ...

Probably very very very hard to hit, though :)

Cheers, Lance

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Masami Hiramatsu @ 2026-06-01  4:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: CaoRuichuang, Shuah Khan, mhiramat, mathieu.desnoyers, shuah,
	linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407203727.442b583c@gandalf.local.home>

On Tue, 7 Apr 2026 20:37:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> Shuah,
> 
> Care to take this through your tree. Probably could even add:
> 
> Cc: stable@vger.kernel.org
> Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")
> 
> As well as:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 

Shuah, here is my ack too.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

BTW, to avoid similar issue, 

Thanks, 

> -- Steve
> 
> 
> On Tue,  7 Apr 2026 18:26:13 +0800
> CaoRuichuang <create0818@163.com> wrote:
> 
> > From: Cao Ruichuang <create0818@163.com>
> > 
> > test_ownership.tc is sourced by ftracetest under /bin/sh.
> > 
> > The script currently declares mount_point with local at file scope,
> > which makes /bin/sh abort with "local: not in a function" before the
> > test can reach the eventfs ownership checks.
> > 
> > Replace the top-level local declaration with a normal shell variable so
> > kernels that support the gid= tracefs mount option can run the test at
> > all.
> > 
> > Signed-off-by: Cao Ruichuang <create0818@163.com>
> > ---
> >  tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > index e71cc3ad0..6d00d3c0f 100644
> > --- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > +++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > @@ -6,7 +6,7 @@
> >  original_group=`stat -c "%g" .`
> >  original_owner=`stat -c "%u" .`
> >  
> > -local mount_point=$(get_mount_point)
> > +mount_point=$(get_mount_point)
> >  
> >  mount_options=$(get_mnt_options "$mount_point")
> >  
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Masami Hiramatsu @ 2026-06-01  4:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260528163633.5650f3d6@gandalf.local.home>

On Thu, 28 May 2026 16:36:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 24 May 2026 10:44:39 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > If the get_boot_id() is accepted by the random folks, then I'm fine with
> > > this change.  
> > 
> > Yeah, BTW, Sashiko found this can be initialized before we get enough
> > entropy for random seed. Maybe we need one more delay.
> 
> Well, maybe for adding the boot_id later, but the code that initializes the
> buffers needs to stay early. With the backup instance, the persistent ring
> buffer can restart tracing immediately.

Agreed, so the buffer will be made in early stage without initializing
the boot_id field, and it will be updated when user reads the boot_id
from kernel. Anyway, without reading boot_id from user space, it is
meaningless.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Shuai Xue @ 2026-06-01  6:17 UTC (permalink / raw)
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-2-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> RISC-V uses -fpatchable-function-entry=8,4 when the compressed ISA is
> enabled and -fpatchable-function-entry=4,2 otherwise. In both cases, the
> patchable NOP area starts 8 bytes before the function symbol address.
> The __mcount_loc entries therefore point at the patchable NOP area
> associated with a function, while nm reports the function symbol at the
> entry address used for the function range check.
> 
> After RISC-V selected HAVE_BUILDTIME_MCOUNT_SORT, sorttable started
> applying that range check at build time. Without allowing entries just
> before the reported function address, the mcount sorter treats valid
> RISC-V ftrace callsites as invalid weak-function entries and writes
> them back as zero. The resulting kernel boots with no ftrace entries,
> breaking dynamic ftrace and users such as livepatch.
> 
> The failure is silent during the final link because zeroing weak-function
> entries is an expected sorttable operation. At boot, those zero entries
> are skipped by ftrace_process_locs(), so the only obvious symptom is that
> the vmlinux ftrace table has lost valid callsites and ftrace users cannot
> attach to them.
> 
> CONFIG_FTRACE_SORT_STARTUP_TEST also reports the table as sorted in this
> state: it only checks that the __mcount_loc entries are in ascending
> order, which a fully zeroed table trivially satisfies. The original
> commit relied on this check and did not see the regression.
> 
> On an affected RISC-V QEMU boot with both CONFIG_FTRACE_SORT_STARTUP_TEST
> and CONFIG_FTRACE_STARTUP_TEST enabled, the sort check still passes
> while ftrace reports zero usable entries and the early selftests fail:
> 
>    [    0.000000] ftrace section at ffffffff8101da98 sorted properly
>    [    0.000000] ftrace: allocating 0 entries in 128 pages
>    [    0.054999] Testing tracer function: .. no entries found ..FAILED!
>    [    0.172407] tracer: function failed selftest, disabling
>    [    0.178186] Failed to init function_graph tracer, init returned -19
> 
> Handle RISC-V like arm64 for the function-range check and allow
> patchable entries up to 8 bytes before the function address.
> 
> With this fix, a RISC-V QEMU smoke boot with ftrace startup tests shows
> the vmlinux ftrace table is populated and dynamic ftrace still works:
> 
>    [    0.000000] ftrace: allocating 46749 entries in 184 pages
>    [    0.051115] Testing tracer function: PASSED
>    [    1.283782] Testing dynamic ftrace: PASSED
>    [    6.275456] Testing tracer function_graph: PASSED
> 
> Fixes: 0ca1724b56af ("riscv: ftrace: select HAVE_BUILDTIME_MCOUNT_SORT")
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Link: https://lore.kernel.org/all/20260527113028.4b21a5de@fedora/
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   scripts/sorttable.c | 10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/sorttable.c b/scripts/sorttable.c
> index e8ed11c680c6..4c10e85bb5af 100644
> --- a/scripts/sorttable.c
> +++ b/scripts/sorttable.c
> @@ -891,17 +891,21 @@ static int do_file(char const *const fname, void *addr)
>   	table_sort_t custom_sort = NULL;
>   
>   	switch (elf_map_machine(ehdr)) {
> -	case EM_AARCH64:
>   #ifdef MCOUNT_SORT_ENABLED
> +	case EM_AARCH64:
>   		sort_reloc = true;
>   		rela_type = 0x403;
> -		/* arm64 uses patchable function entry placing before function */
> +		/* fallthrough */
> +	case EM_RISCV:
> +		/* arm64 and RISC-V place patchable entries before the function */
>   		before_func = 8;

Nit: The shared comment now sits under `case EM_RISCV:` but the two
lines above it (sort_reloc / rela_type = 0x403) are strictly
arm64-only — they configure the RELA-based weak-function fixup that
RISC-V does not need. On a quick read it is easy to wonder if RISC-V
is implicitly inheriting that path. Splitting the comments would
help, e.g.:

        case EM_AARCH64:
            /* arm64 needs RELA-based weak-function fixup */
            sort_reloc = true;
            rela_type = 0x403;
            /* fallthrough */
        case EM_RISCV:
            /* arm64 and RISC-V place patchable entries before the function */
            before_func = 8;


> +#else
> +	case EM_AARCH64:
> +	case EM_RISCV:
>   #endif
>   		/* fallthrough */
>   	case EM_386:
>   	case EM_LOONGARCH:
> -	case EM_RISCV:
>   	case EM_S390:
>   	case EM_X86_64:
>   		custom_sort = sort_relative_table_with_data;

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-01  6:54 UTC (permalink / raw)
  To: Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260601032804.96122-1-lance.yang@linux.dev>

On 6/1/26 05:28, Lance Yang wrote:
> 
> On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/31/26 11:39, Lance Yang wrote:
>>>
>>>
>>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
>>>
>>> At this point pmdp_collapse_flush() has cleared the PMD from the page
>>> tables. The PTE table we are updating is only reachable through the saved
>>> old PMD value, _pmd, until pmd_populate() below.
>>>
>>> map_anon_folio_pte_nopf() does set_ptes() and then calls
>>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
>>> that hook as:
>>>
>>> "
>>> 	At the end of every page fault, this routine is invoked to tell
>>> 	the architecture specific code that translations now exists
>>> 	in the software page tables for address space "vma->vm_mm"
>>> 	at virtual address "address" for "nr" consecutive pages.
>>> "
>>>
>>> But that does not seem true here yet, since the PTE table is not
>>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
>>>
>>> Should we avoid calling update_mmu_cache_range() until after the PTE
>>> table is reinstalled with pmd_populate()?
>>
>> I recall that update_mmu_cache* users mostly care about updating folios flags,
>> for the folio derived from the PTE ... or flushing caches for the user address.
>>
>> So intuitively I would say "the architecture code doesn't care that the PMD
>> table will only be visible to HW shortly after". The important thing should be
>> that it will definetly happen, and that nothing else is curently there or can be
>> there?
> 
> Ah, fair point.
> 
> I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> than only using the pte pointer passed in. For example, mips does:

Right, a re-walk would be the real problem.

> 
>   update_mmu_cache_range()
>     -> __update_tlb()
>       -> pgd_offset(vma->vm_mm, address)
>       -> pte_offset_map(...)
> 
> and __update_tlb() has this assumption:
> 
> 		/*
> 		 * update_mmu_cache() is called between pte_offset_map_lock()
> 		 * and pte_unmap_unlock(), so we can assume that ptep is not
> 		 * NULL here: and what should be done below if it were NULL?
> 		 */
> 
> So if khugepaged happens to run with current->active_mm == vma->vm_mm
> here, could __update_tlb() hit the none PMD, get NULL from
> pte_offset_map(), and then dereference it?

Likely yes -- that MIPS code is horrible. And the comment in MIPS code
even spells that out. :(

Do you know about other code like that, or is MIPS the only one doing a
re-walk and crossing fingers?

> 
> Just wanted to raise it since some arch code may still have assumptions
> like this, and the always-enable-mTHP work is getting closer ...

Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.

> 
> Probably very very very hard to hit, though :)

Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
nobody can interfere in the meantime ... and the PMD lock will not be sufficient.

Maybe we could reinstall the page table with the cleared (none) entries while
still holding the PTL?

Thinking out loud:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5ba298d420b7..e39b750b1e6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
                map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
        } else {
                /*
-                * set_ptes is called in map_anon_folio_pte_nopf with the
-                * pmd_ptl lock still held; this is safe as the PMD is expected
-                * to be none. The pmd entry is then repopulated below.
+                * Re-insert the page table with the cleared entries, but
+                * hold the PTL, such that no one can mess with the re-installed
+                * page table until we updated the temporarily-cleared entries
+                * through map_anon_folio_pte_nopf().
                 */
-               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
-               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+               if (pte_ptl != pmd_ptl)
+                       spin_lock(pte_ptl);
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+               if (pte_ptl != pmd_ptl)
+                       spin_unlock(pte_ptl);
        }
        spin_unlock(pmd_ptl);
 


-- 
Cheers,

David

^ permalink raw reply related

* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Nam Cao @ 2026-06-01  6:55 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-6-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> From: Wen Yang <wen.yang@linux.dev>
>
> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
> may racing with the sched_switch handler:
>
>   da_monitor_start()               sched_switch handler
>   -------------------------        ---------------------------------
>   da_mon->monitoring = 1;
>                                    if (da_monitoring(da_mon))  /* true  */
>                                        ha_start_timer_ns(...);
>                                        /* hrtimer->base == NULL, crash */
>   da_monitor_init_hook(da_mon);
>   /* hrtimer_setup() sets base */
>
> Fix the ordering and pair with release/acquire semantics:
>
>   da_monitor_init_hook(da_mon);
>   smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>   return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
>
> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
> the load can observe monitoring=1 while hrtimer->base is still NULL.
> The plain accesses are also data races under KCSAN.
>
> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
> cover the reset path.
>
> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  include/rv/da_monitor.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a7e103654..60dc39f26 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
>  static inline void da_monitor_reset(struct da_monitor *da_mon)
>  {
>  	da_monitor_reset_hook(da_mon);
> -	da_mon->monitoring = 0;
> +	WRITE_ONCE(da_mon->monitoring, 0);
>  	da_mon->curr_state = model_get_initial_state();
>  }

Looking at this again, do you need to change it to

static inline void da_monitor_reset(struct da_monitor *da_mon)
{
	WRITE_ONCE(da_mon->monitoring, 0);
        smp_mb();
	da_monitor_reset_hook(da_mon);
	da_mon->curr_state = model_get_initial_state();
}

To prevent another task from seeing monitoring=1 while the timer is
already cancelled?

Nam

^ permalink raw reply

* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-06-01  7:15 UTC (permalink / raw)
  To: Nam Cao, Wen Yang; +Cc: linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <87ik82ycl3.fsf@yellow.woof>

On Mon, 2026-06-01 at 08:55 +0200, Nam Cao wrote:
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index a7e103654..60dc39f26 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum
> > events event)
> >  static inline void da_monitor_reset(struct da_monitor *da_mon)
> >  {
> >  	da_monitor_reset_hook(da_mon);
> > -	da_mon->monitoring = 0;
> > +	WRITE_ONCE(da_mon->monitoring, 0);
> >  	da_mon->curr_state = model_get_initial_state();
> >  }
> 
> Looking at this again, do you need to change it to
> 
> static inline void da_monitor_reset(struct da_monitor *da_mon)
> {
> 	WRITE_ONCE(da_mon->monitoring, 0);
>         smp_mb();
> 	da_monitor_reset_hook(da_mon);
> 	da_mon->curr_state = model_get_initial_state();
> }
> 

This order won't work.
Monitor reset relies on monitoring to be true to stop the timer. The
same function is called also on monitor start and there the timer may
not have been initialised yet, we use monitoring to discriminate the
two.

> To prevent another task from seeing monitoring=1 while the timer is
> already cancelled?

I'm not sure what could go wrong in this scenario. Perhaps another
event that could start the monitor would miss the chance because it
doesn't see a reaction occurred (monitor looks like still running)?

Or a concurring event would run on top of a reaction, potentially
moving the state machine further or causing another reaction.

Those aren't really disasters and I think could happen also without
enforcing an order with the timer, nothing checks the timer's status.
Following events don't even need to know whether the timer was armed at
all, in the worst case we may be stopping twice.

Am I missing something here?

Thanks,
Gabriele

^ permalink raw reply

* Re: [PATCH v3 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-01  7:24 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	Masami Hiramatsu, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-7-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> +static bool ha_mon_initializing;

The global variable makes me a bit uncomfortable (a quick google will
tell why this is not the best pattern).

I am sure there are better ways to differentiate when we are
initializing vs destroying. How about the incomplete sketch below? I
doubt it even builds, just give an idea.

Nam

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 39765ff6f098..0549d3a35ee0 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -159,9 +159,14 @@ static struct da_monitor *da_get_monitor(void)
 /*
  * da_monitor_reset_all - reset the single monitor
  */
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
-	da_monitor_reset(da_get_monitor());
+	fn(da_get_monitor());
+}
+
+static inline void _da_monitor_init(struct da_monitor *da_mon)
+{
+	memset(da_monitor, 0, sizeof(*da_mon));
 }
 
 /*
@@ -169,7 +174,7 @@ static void da_monitor_reset_all(void)
  */
 static inline int da_monitor_init(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_all(_da_monitor_init);
 	return 0;
 }
 
@@ -178,7 +183,7 @@ static inline int da_monitor_init(void)
  */
 static inline void da_monitor_destroy(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_all(da_monitor_reset);
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_CPU
@@ -202,14 +207,14 @@ static struct da_monitor *da_get_monitor(void)
 /*
  * da_monitor_reset_all - reset all CPUs' monitor
  */
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct da_monitor *da_mon;
 	int cpu;
 
 	for_each_cpu(cpu, cpu_online_mask) {
 		da_mon = per_cpu_ptr(&DA_MON_NAME, cpu);
-		da_monitor_reset(da_mon);
+		reset(da_mon);
 	}
 }
 
@@ -267,16 +272,16 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
 	return da_get_target(da_mon)->pid;
 }
 
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct task_struct *g, *p;
 	int cpu;
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, p)
-		da_monitor_reset(da_get_monitor(p));
+		reset(da_get_monitor(p));
 	for_each_present_cpu(cpu)
-		da_monitor_reset(da_get_monitor(idle_task(cpu)));
+		reset(da_get_monitor(idle_task(cpu)));
 	read_unlock(&tasklist_lock);
 }
 
@@ -483,14 +488,14 @@ static inline void da_destroy_storage(da_id_type id)
 	kfree_rcu(mon_storage, rcu);
 }
 
-static void da_monitor_reset_all(void)
+static void da_monitor_reset_all(void (*reset)(struct da_monitor *da_mon))
 {
 	struct da_monitor_storage *mon_storage;
 	int bkt;
 
 	rcu_read_lock();
 	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
-		da_monitor_reset(&mon_storage->rv.da_mon);
+		reset(&mon_storage->rv.da_mon);
 	rcu_read_unlock();
 }
 

^ permalink raw reply related

* Re: [PATCH v3 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Nam Cao @ 2026-06-01  7:29 UTC (permalink / raw)
  To: Gabriele Monaco, Wen Yang
  Cc: linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <76628c68336037b0d652456cf2033d3b51399a09.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> This order won't work.
> Monitor reset relies on monitoring to be true to stop the timer. The
> same function is called also on monitor start and there the timer may
> not have been initialised yet, we use monitoring to discriminate the
> two.

Right.

> Am I missing something here?

No, just me being confused.

Nam

^ permalink raw reply

* Re: [PATCH v3 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-01  7:31 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	Masami Hiramatsu, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <87fr36yb7g.fsf@yellow.woof>

Nam Cao <namcao@linutronix.de> writes:
> Gabriele Monaco <gmonaco@redhat.com> writes:
>> +static bool ha_mon_initializing;
>
> The global variable makes me a bit uncomfortable (a quick google will
> tell why this is not the best pattern).
>
> I am sure there are better ways to differentiate when we are
> initializing vs destroying. How about the incomplete sketch below? I
> doubt it even builds, just give an idea.

Or instead of function pointer, we can also pass a bool flag whether the
timer should be reset.

^ permalink raw reply

* Re: [PATCH v3 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Nam Cao @ 2026-06-01  7:39 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260530141652.58084-8-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> @@ -123,12 +144,15 @@ static int ha_monitor_init(void)
>  
>  	ha_mon_initializing = true;
>  	ret = da_monitor_init();
> +	if (ret == 0)
> +		ha_monitor_enable_hook();
>  	ha_mon_initializing = false;
>  	return ret;
>  }

What if between da_monitor_init() and ha_monitor_enable_hook(), a task
exits while a timer is still active, and then the timer callback is invoked?

Extremely rare, but I think it can be fixed easily by reordering the two functions.

>  static void ha_monitor_destroy(void)
>  {
> +	ha_monitor_disable_hook();
>  	da_monitor_destroy();
>  }

Same here, there is small window between the two function calls.

Nam

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox