Linux Trace Kernel
 help / color / mirror / Atom feed
* [PATCH v4 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

From: Wen Yang <wen.yang@linux.dev>

da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
may racing with the sched_switch handler:

  da_monitor_start()               sched_switch handler
  -------------------------        ---------------------------------
  da_mon->monitoring = 1;
                                   if (da_monitoring(da_mon))  /* true  */
                                       ha_start_timer_ns(...);
                                       /* hrtimer->base == NULL, crash */
  da_monitor_init_hook(da_mon);
  /* hrtimer_setup() sets base */

Fix the ordering and pair with release/acquire semantics:

  da_monitor_init_hook(da_mon);
  smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
  return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */

On ARM64 a plain STR + LDR does not form a release-acquire pair, so
the load can observe monitoring=1 while hrtimer->base is still NULL.
The plain accesses are also data races under KCSAN.

Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
cover the reset path.

Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a7e103654..60dc39f26 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
 static inline void da_monitor_reset(struct da_monitor *da_mon)
 {
 	da_monitor_reset_hook(da_mon);
-	da_mon->monitoring = 0;
+	WRITE_ONCE(da_mon->monitoring, 0);
 	da_mon->curr_state = model_get_initial_state();
 }
 
@@ -95,8 +95,9 @@ static inline void da_monitor_reset(struct da_monitor *da_mon)
 static inline void da_monitor_start(struct da_monitor *da_mon)
 {
 	da_mon->curr_state = model_get_initial_state();
-	da_mon->monitoring = 1;
 	da_monitor_init_hook(da_mon);
+	/* Pairs with smp_load_acquire in da_monitoring(). */
+	smp_store_release(&da_mon->monitoring, 1);
 }
 
 /*
@@ -104,7 +105,8 @@ static inline void da_monitor_start(struct da_monitor *da_mon)
  */
 static inline bool da_monitoring(struct da_monitor *da_mon)
 {
-	return da_mon->monitoring;
+	/* Pairs with smp_store_release in da_monitor_start(). */
+	return smp_load_acquire(&da_mon->monitoring);
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 04/13] rv: Ensure all pending probes terminate on per-obj monitor destroy
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

The monitor disable/destroy sequence detaches all probes and resets the
monitor's data, however it doesn't wait for pending probes. This is an
issue with per-object monitors, which free the monitor storage.

Call tracepoint_synchronize_unregister() to make sure to wait for all
pending probes before destroying the monitor storage.

Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index cc97cc5df..a7e103654 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -511,9 +511,10 @@ static inline void da_monitor_destroy(void)
 	struct hlist_node *tmp;
 	int bkt;
 
+	tracepoint_synchronize_unregister();
 	/*
-	 * This function is called after all probes are disabled, we need only
-	 * worry about concurrency against old events.
+	 * This function is called after all probes are disabled and no longer
+	 * pending, we can safely assume no concurrent user.
 	 */
 	synchronize_rcu();
 	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 01/13] rv: Fix __user specifier usage in extract_params()
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	linux-trace-kernel
  Cc: kernel test robot, Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

The attributes variables extracted from syscalls in the helper are both
defined with the __user specifier although only the actual pointer to
user data should be marked.

Remove the __user specifier from attr.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 0bbfd2543..78fca873d 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
 static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
 {
 	size_t size = offsetofend(struct sched_attr, sched_flags);
-	struct sched_attr __user *uattr, attr;
+	struct sched_attr __user *uattr;
+	struct sched_attr attr;
 	int new_policy = -1, ret;
 	unsigned long args[6];
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 02/13] rv: Reset per-task DA monitors before releasing the slot
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: stable, Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Per-task monitors use task_mon_slot to determine which slot in the array
to use for the monitor. During destruction, this slot is returned but
this is done before resetting the monitor. As a result, the monitor's
reset is in fact resetting a slot that is outside of the array
(RV_PER_TASK_MONITOR_INIT).

Release the slot only after the reset to avoid out-of-bound memory
access.

Fixes: f5587d1b6ec93 ("rv: Add Hybrid Automata monitor type")
Cc: stable@vger.kernel.org
Suggested-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 39765ff6f..1459fb3df 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -309,10 +309,11 @@ static inline void da_monitor_destroy(void)
 		WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
 		return;
 	}
-	rv_put_task_monitor_slot(task_mon_slot);
-	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
 
 	da_monitor_reset_all();
+
+	rv_put_task_monitor_slot(task_mon_slot);
+	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_OBJ
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 00/13] rv: Fixes on Deterministic and Hybrid Automata
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gabriele Monaco, Steven Rostedt, Nam Cao, Wen Yang,
	linux-trace-kernel

Fix issues that were reported by bots or visible only after integration:
 * Make sure timers are always terminated and waited for when disabling
   the monitor or when the target terminates
 * Run per-cpu monitors with migration disabled since preemption is now
   enabled from tracepoints
 * Fix a wrong __user specifier in a helper function
 * Other cleanup and concurrency issues

Differences since V3 [1]:
* Use separate reset functions for HA init/reset to avoid globals
* Reset monitor before sync stopping the timer on HA per-task exit hook

Differences since V2 [2]:
* Applied from reviews changes to commit messages
* Rearranged order to put non-fixes and not-reviewed patches in the end
* Synchronise monitors before resetting them to avoid rearming
* Protect against racing timer callbacks during destruction

Differences since V1 [3]:
* Fix memory consistency with timer callbacks racing with resets
* Add per-obj deallocation hook in rvgen generated code
* Do not rely on clean monitor when initialising HA
* Add tracepoint synchronisation before returning per-task slots
* Fix suffix strip in dot2k
* Generate stub deallocation hooks instead of failing build when per-obj
  miss those

[1] - https://lore.kernel.org/lkml/20260530141652.58084-1-gmonaco@redhat.com
[2] - https://lore.kernel.org/lkml/20260527062313.39908-1-gmonaco@redhat.com
[3] - https://lore.kernel.org/lkml/20260512140250.262190-1-gmonaco@redhat.com

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Wen Yang <wen.yang@linux.dev>
Cc: linux-trace-kernel@vger.kernel.org

Gabriele Monaco (12):
  rv: Fix __user specifier usage in extract_params()
  rv: Reset per-task DA monitors before releasing the slot
  rv: Prevent in-flight per-task handlers from using invalid slots
  rv: Ensure all pending probes terminate on per-obj monitor destroy
  rv: Do not rely on clean monitor when initialising HA
  rv: Add automatic cleanup handlers for per-task HA monitors
  rv: Ensure synchronous cleanup for HA monitors
  rv: Prevent task migration while handling per-CPU events
  rv: Use 0 to check preemption enabled in opid
  verification/rvgen: Fix suffix strip in dot2k
  rv: Fix read_lock scope in per-task DA cleanup
  verification/rvgen: Generate cleanup hook for per-obj monitor

Wen Yang (1):
  rv: Fix monitor start ordering and memory ordering for monitoring flag

 include/rv/da_monitor.h                       | 145 ++++++++++++++----
 include/rv/ha_monitor.h                       |  91 ++++++++++-
 include/rv/ltl_monitor.h                      |   1 +
 kernel/trace/rv/monitors/deadline/deadline.h  |   3 +-
 kernel/trace/rv/monitors/nomiss/nomiss.c      |   4 +-
 kernel/trace/rv/monitors/opid/opid.c          |  12 +-
 kernel/trace/rv/monitors/stall/stall.c        |   4 +-
 tools/verification/rvgen/rvgen/dot2k.py       |  19 ++-
 .../rvgen/rvgen/templates/dot2k/main.c        |   4 +-
 9 files changed, 234 insertions(+), 49 deletions(-)


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.54.0


^ permalink raw reply

* Re: [PATCH 0/5] x86/xen: Get rid of Xen private lazy MMU mode tracking
From: David Hildenbrand (Arm) @ 2026-06-01 15:08 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, x86, linux-trace-kernel, linux-mm,
	virtualization
  Cc: Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, xen-devel, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list
In-Reply-To: <20260526150514.129330-1-jgross@suse.com>

On 5/26/26 17:05, Juergen Gross wrote:
> With generic lazy MMU mode tracking being available, there is no real
> need for having Xen PV specific code to track lazy MMU mode, too.
> 
> This makes it possible to drop two paravirt hooks.

Nice!

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-06-01 15:05 UTC (permalink / raw)
  To: Nico Pache
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <CAA1CXcAeEGOsqp-ywAQ7GMYQzXEeco-rUxUkk2hEF69HybC4=w@mail.gmail.com>

On 6/1/26 17:00, Nico Pache wrote:
> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 6/1/26 12:47, Lance Yang wrote:
>>>
>>>
>>>
>>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>>
>>> So yeah, no extra smp_wmb() needed here!
>>
>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>> draft.
> 
> Okay, I read the above and did some investigating.
> 
> I will try to implement and verify the changes you suggested :)
> 
> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
> before walking a PTE table?

But how would they update the cache then correctly?

I'm too non-MIPS to know the answer :)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-01 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <06d9b665-945f-4967-9ed9-b06514478996@kernel.org>

On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 6/1/26 12:47, Lance Yang wrote:
> >
> >
> > On 2026/6/1 18:23, David Hildenbrand (Arm) wrote:
> >> On 6/1/26 11:08, Lance Yang wrote:
> >>>
> >>>
> >>>
> >>> One small thing, I think we should probably keep the smp_wmb(), and just
> >>> move it before the earlier pmd_populate().
> >>>
> >>> IIUC, the ordering we want is still:
> >>>
> >>>    clear old PTEs
> >>>    smp_wmb()
> >>>    pmd_populate()
> >>>
> >>> so another CPU cannot walk through the re-installed PMD and still observe
> >>> the old PTEs, right?
> >>
> >> There is a smp_wmb() in __folio_mark_uptodate(), that should be sufficient?
> >
> > Ah, cool! __folio_mark_uptodate() already does the job :P
> >
> > So yeah, no extra smp_wmb() needed here!
>
> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
> draft.

Okay, I read the above and did some investigating.

I will try to implement and verify the changes you suggested :)

Or an even crazier idea... what if we ensure MIPS checks for PMD_none
before walking a PTE table?

-- Nico

>
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: David Hildenbrand (Arm) @ 2026-06-01 14:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <ah2UhZb9EYgYxj1F@lucifer>

On 6/1/26 16:35, Lorenzo Stoakes wrote:
> On Sun, May 31, 2026 at 10:18:16PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/22/26 17:00, Nico Pache wrote:
>>> Add collapse_allowable_orders() to generalize THP order eligibility. The
>>> function determines which THP orders are permitted based on collapse
>>> context (khugepaged vs madv_collapse).
>>>
>>> This consolidates collapse configuration logic and provides a clean
>>> interface for future mTHP collapse support where the orders may be
>>> different.
>>
>> It would have been good to describe here that, for now, it only ever returns
>> PMDs, and that it will be extended next.
>>
>> Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit
>>
>> ... and there, it is clear that we don't even want to know the orders?
>>
>> So can we just call this function
>>
>> "collapse_possible" and make it return a boolean?

FWIW, I realized later that #11 has

enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);

and forgot to delete that comment.

So we must have a variant (for #11) that returns the enabled orders.

> 
> Yeah agreed.
> 
> But I also don't love the naming, we now have thp_vma_allowable_orders(),
> __thp_vma_allowable_orders(), and then collapse_allowable_orders() and we also
> have 3 different ways of collapsing, one of which we call MADV_... COLLAPSE,
> and the other khugepaged + fault-in too for laughs.
> 
> It's like a big circle of confusion.
> 
> And of course we call THP collapse 'collapse' in general, so :)
> 
> Anyway I'm fine with collapse_possible() so we can move on and then maybe
> cleanup later.

We could simply have

collapse_possible_orders()

and

collapse_possible()

the latter being a simple wrapper around collapse_possible_orders().

> 
> Also - if I look at khugepaged.c I see thp_vma_allowable_orders() still used in
> hugepage_vma_revalidate().
> 
> Wouldn't it be better then to do the abstraction once mTHP order checking is
> properly introduced and change this also and have _every_ order check be
> consistent in khugepaged.c?

That'd also be nice, if easily possible.

-- 
Cheers,

David

^ permalink raw reply

* Re: PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-01 14:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260530110754.14870622@gandalf.local.home>

On Sat, 30 May 2026 11:07:54 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 30 May 2026 23:14:27 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > Thanks, and Sashiko reviewed this. 
> > 
> > https://sashiko.dev/#/patchset/20260529110442.0967a64c%40fedora
> > 
> > I think both comments are valid, especially, second one is important.
> 
> Hmm, comment below.
> 
> > 
> > BTW, I updated probes/fixes. Could you also update this and rebase
> > on probes/fixes branch?
> > 
> > I'm working on the nesting and container_of support patches which
> > are on top of this.
> 
> OK, I'll make sure to use your probes branch. Which one should I use? your
> probes/for-next or probes/urgent?

I pushed forced-updated probes/for-next on probes/fixes.
So you can use  probes/for-next now.

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: Lorenzo Stoakes @ 2026-06-01 14:35 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <b7ddff01-9adf-4ff7-84a6-021087936c34@kernel.org>

On Sun, May 31, 2026 at 10:18:16PM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
> > Add collapse_allowable_orders() to generalize THP order eligibility. The
> > function determines which THP orders are permitted based on collapse
> > context (khugepaged vs madv_collapse).
> >
> > This consolidates collapse configuration logic and provides a clean
> > interface for future mTHP collapse support where the orders may be
> > different.
>
> It would have been good to describe here that, for now, it only ever returns
> PMDs, and that it will be extended next.
>
> Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit
>
> ... and there, it is clear that we don't even want to know the orders?
>
> So can we just call this function
>
> "collapse_possible" and make it return a boolean?

Yeah agreed.

But I also don't love the naming, we now have thp_vma_allowable_orders(),
__thp_vma_allowable_orders(), and then collapse_allowable_orders() and we also
have 3 different ways of collapsing, one of which we call MADV_... COLLAPSE,
and the other khugepaged + fault-in too for laughs.

It's like a big circle of confusion.

And of course we call THP collapse 'collapse' in general, so :)

Anyway I'm fine with collapse_possible() so we can move on and then maybe
cleanup later.

Also - if I look at khugepaged.c I see thp_vma_allowable_orders() still used in
hugepage_vma_revalidate().

Wouldn't it be better then to do the abstraction once mTHP order checking is
properly introduced and change this also and have _every_ order check be
consistent in khugepaged.c?

>
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 15 ++++++++++++---
> >  1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4534025bc81d..64ceebc9d8a7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
> >  		wake_up_interruptible(&khugepaged_wait);
> >  }
> >
> > +/* Check what orders are allowed based on the vma and collapse type */

I'd expand this comment to explain that it's explicitly for accounting for
whether mTHP is used, but that also argues for this to be moved to a later
commit as David says.

Otherwise the comment is useless.

> > +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> > +		vm_flags_t vm_flags, enum tva_type tva_flags)
> > +{
> > +	unsigned long orders = BIT(HPAGE_PMD_ORDER);

Could be a const also.

> > +
> > +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> > +}
> > +
> >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> >  			  vm_flags_t vm_flags)
> >  {
> >  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> >  	    hugepage_pmd_enabled()) {
> > -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))

I hate that we separate out the VMA flags like this just for this case, but
that's something for a follow up probably from me as part of a VMA flags
conversion series...

> >  			__khugepaged_enter(vma->vm_mm);
> >  	}
> >  }
> > @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
> >  			cc->progress++;
> >  			break;
> >  		}
> > -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> > +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
> >  			cc->progress++;
> >  			continue;
> >  		}
> > @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  	BUG_ON(vma->vm_start > start);
> >  	BUG_ON(vma->vm_end < end);
> >
> > -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
> >  		return -EINVAL;
> >
> >  	cc = kmalloc_obj(*cc);
>
> Having a simple
>
> static bool collapse_possible(...)
> {
> 	return collapse_allowable_orders(...)
> }
>
> Would make the above slightly more readable.

Yup.

>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> --
> Cheers,
>
> David

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Lorenzo Stoakes @ 2026-06-01 14:13 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260522150009.121603-9-npache@redhat.com>

On Fri, May 22, 2026 at 09:00:03AM -0600, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to
> 	encountering a swap PTE.
>
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
>
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
>
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
>
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
>
> We will add support for mTHP collapse for anonymous pages next; lets also
> track when this happens at the PMD level within the per-mTHP stats.
>
> Signed-off-by: Nico Pache <npache@redhat.com>

Logic LGTM, small comment below, but otherwise:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>
>  static struct attribute *anon_stats_attrs[] = {
>  	&anon_fault_alloc_attr.attr,
> @@ -719,6 +723,9 @@ static struct attribute *anon_stats_attrs[] = {
>  	&split_deferred_attr.attr,
>  	&nr_anon_attr.attr,
>  	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>  	NULL,
>  };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 928e32a0d4d7..fff6a8fbf1d4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -649,7 +649,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out;
>  			}
>  			continue;
> @@ -683,9 +685,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

Seems weird to add this comment in a stats update patch?

Not a massively big deal but maybe worth moving to the relevant commit.

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out_unmap;
>  			}
>  		}
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: Lorenzo Stoakes @ 2026-06-01 14:07 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260522150009.121603-6-npache@redhat.com>

On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> Currently the collapse_huge_page function requires the mmap_read_lock to
> enter with it held, and exit with it dropped. This function moves the
> unlock into its parent caller, and changes this semantic to requiring it
> to enter/exit with it always unlocked.
>
> In future patches, we need this expectation, as for in mTHP collapse, we
> may have already have dropped the lock, and do not want to conditionally
> check for this by passing through the lock_dropped variable.
>
> No functional change is expected as one of the first things the
> collapse_huge_page function does is drop this lock before allocating the
> hugepage.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Nico Pache <npache@redhat.com>

One small nit below, otherwise LGTM, so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e98ba5b15163..fab35d318641 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  	return SCAN_SUCCEED;
>  }
>
> +/*
> + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> + * will always return with the lock unlocked, to avoid holding the mmap_lock
> + * while allocating a THP, as that could trigger direct reclaim/compaction.
> + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> + */
>  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		int referenced, int unmapped, struct collapse_control *cc)
>  {
> @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -	/*
> -	 * Before allocating the hugepage, release the mmap_lock read lock.
> -	 * The allocation can take potentially a long time if it involves
> -	 * sync compaction, and we do not need to hold the mmap_lock during
> -	 * that. We will recheck the vma after taking it again in write mode.
> -	 */
> -	mmap_read_unlock(mm);
> -

NIT: Maybe worth an mmap_assert_locked()?

>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
> @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> +		/* collapse_huge_page expects the lock to be dropped before calling */
> +		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
>  					    unmapped, cc);
>  		/* collapse_huge_page will return with the mmap_lock released */
> --
> 2.54.0
>

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-06-01 14:04 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260522150009.121603-5-npache@redhat.com>

On Fri, May 22, 2026 at 08:59:59AM -0600, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
>
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduces
> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
>
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
>
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
>
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
>
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
>
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

One small nit below, and with the fixpatch assumed applied, LGTM so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 33 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
>   * the shared zeropage for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
>   */
>  static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -		struct vm_area_struct *vma)
> +		struct vm_area_struct *vma, unsigned int order)
>  {
> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
> +
>  	if (vma && userfaultfd_armed(vma))
>  		return 0;
>  	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> -	/* For all other cases respect the user defined maximum */
> -	return khugepaged_max_ptes_none;
> +	/* for PMD collapse, respect the user defined maximum */
> +	if (is_pmd_order(order))
> +		return max_ptes_none;
> +	/*
> +	 * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * scale the maximum number of PTEs to the order of the collapse.
> +	 */
> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +		return (1 << order) - 1;
> +	if (!max_ptes_none)
> +		return 0;
> +	/*
> +	 * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
> +	 * emit a warning and return 0.
> +	 */
> +	pr_warn_once("mTHP collapse does not support max_ptes_none values"
> +		     " other than 0 or %u, defaulting to 0.\n",
> +		     KHUGEPAGED_MAX_PTES_LIMIT);
> +	return 0;
>  }
>
>  /**
>   * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>   * anonymous pages for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of PTEs that map shared anonymous pages for the
>   * collapse operation
>   */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> @@ -384,6 +406,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/*
> +	 * for mTHP collapse do not allow collapsing anonymous memory pages that
> +	 * are shared between processes.
> +	 */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_shared;
>  }
>
> @@ -391,11 +420,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>   * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>   * maximum allowed non-present pagecache entries for the given collapse operation.
>   * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>   * pagecache entries for the collapse operation.
>   */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> +		unsigned int order)
>  {
>  	/*
>  	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> @@ -403,6 +434,10 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>  	 */
>  	if (!cc->is_khugepaged)
>  		return HPAGE_PMD_NR;
> +	/* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
> +	if (!is_pmd_order(order))
> +		return 0;
> +	/* for PMD collapse, respect the user defined maximum */
>  	return khugepaged_max_ptes_swap;
>  }
>
> @@ -596,10 +631,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> -		struct list_head *compound_pagelist)
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +	const unsigned long nr_pages = 1UL << order;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
> @@ -607,7 +643,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	int none_or_zero = 0, shared = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> @@ -740,18 +776,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  }
>
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> -						struct vm_area_struct *vma,
> -						unsigned long address,
> -						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +		struct vm_area_struct *vma, unsigned long address,
> +		spinlock_t *ptl, unsigned int order,
> +		struct list_head *compound_pagelist)
>  {
> -	unsigned long end = address + HPAGE_PMD_SIZE;
> +	const unsigned long nr_pages = 1UL << order;
> +	unsigned long end = address + (PAGE_SIZE << order);

Nit: could be address + PAGE_SIZE * nr_pages, be a little nicer as you already
did the << order operation above.

>  	struct folio *src, *tmp;
>  	pte_t pteval;
>  	pte_t *_pte;
>  	unsigned int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
>  	     address += nr_ptes * PAGE_SIZE) {
>  		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
> @@ -804,11 +840,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  }
>
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> -					     pmd_t *pmd,
> -					     pmd_t orig_pmd,
> -					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> +		unsigned int order, struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	spinlock_t *pmd_ptl;
>
>  	/*
> @@ -824,7 +859,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
>  }
>
>  /*
> @@ -844,16 +879,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   */
>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> -		unsigned long address, spinlock_t *ptl,
> +		unsigned long address, spinlock_t *ptl, unsigned int order,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned long nr_pages = 1UL << order;
>  	unsigned int i;
>  	enum scan_result result = SCAN_SUCCEED;
>
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -872,10 +908,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    order, compound_pagelist);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 order, compound_pagelist);
>
>  	return result;
>  }
> @@ -1042,16 +1078,20 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
>   *
> + * For mTHP orders the function bails on the first swap entry, because
> + * faulting pages back in during collapse could re-populate PTEs that
> + * push a later scan over the threshold for a higher-order collapse.
> + *
>   * Called and returns without pte mapped or spinlocks held.
>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */
>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> -		int referenced)
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		pmd_t *pmd, int referenced, unsigned int order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
>  	enum scan_result result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1083,6 +1123,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		    pte_present(vmf.orig_pte))
>  			continue;
>
> +		/*
> +		 * TODO: Support swapin without leading to further mTHP
> +		 * collapses. Currently bringing in new pages via swapin may
> +		 * cause a future higher order collapse on a rescan of the same
> +		 * range.
> +		 */
> +		if (!is_pmd_order(order)) {
> +			pte_unmap(pte);
> +			mmap_read_unlock(mm);
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}
> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1203,7 +1256,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
>  		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced);
> +						     referenced, HPAGE_PMD_ORDER);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1251,6 +1304,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> +						      HPAGE_PMD_ORDER,
>  						      &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
> @@ -1281,6 +1335,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> +					   HPAGE_PMD_ORDER,
>  					   &compound_pagelist);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
> @@ -1316,9 +1371,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> -	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2372,8 +2427,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> -	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Steven Rostedt @ 2026-06-01 13:57 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, Masami Hiramatsu, Mark Rutland, Catalin Marinas,
	Chen Pei, Andy Chiu, Björn Töpel, Deepak Gupta,
	Puranjay Mohan, Conor Dooley, Josh Poimboeuf, Jiri Kosina,
	Miroslav Benes, Petr Mladek, Joe Lawrence, Shuah Khan,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <0a913398-3d0c-472e-89c4-062052eae04d@linux.alibaba.com>

On Mon, 1 Jun 2026 14:17:08 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> > diff --git a/scripts/sorttable.c b/scripts/sorttable.c
> > index e8ed11c680c6..4c10e85bb5af 100644
> > --- a/scripts/sorttable.c
> > +++ b/scripts/sorttable.c
> > @@ -891,17 +891,21 @@ static int do_file(char const *const fname, void *addr)
> >   	table_sort_t custom_sort = NULL;
> >   
> >   	switch (elf_map_machine(ehdr)) {
> > -	case EM_AARCH64:
> >   #ifdef MCOUNT_SORT_ENABLED
> > +	case EM_AARCH64:
> >   		sort_reloc = true;
> >   		rela_type = 0x403;
> > -		/* arm64 uses patchable function entry placing before function */
> > +		/* fallthrough */
> > +	case EM_RISCV:
> > +		/* arm64 and RISC-V place patchable entries before the function */
> >   		before_func = 8;  
> 
> Nit: The shared comment now sits under `case EM_RISCV:` but the two
> lines above it (sort_reloc / rela_type = 0x403) are strictly
> arm64-only — they configure the RELA-based weak-function fixup that
> RISC-V does not need. On a quick read it is easy to wonder if RISC-V
> is implicitly inheriting that path. Splitting the comments would
> help, e.g.:
> 
>         case EM_AARCH64:
>             /* arm64 needs RELA-based weak-function fixup */
>             sort_reloc = true;
>             rela_type = 0x403;
>             /* fallthrough */
>         case EM_RISCV:
>             /* arm64 and RISC-V place patchable entries before the function */
>             before_func = 8;

Makes sense.

Care to send a v3?

-- Steve

^ permalink raw reply

* Re: [PATCH v3] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Steven Rostedt @ 2026-06-01 13:51 UTC (permalink / raw)
  To: Tianchen Ding, Shuah Khan
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260601023251.1916483-1-dtcccc@linux.alibaba.com>


Shuah,

can you take this?

Reviewed-by: Steven Rostedt <rostedt@goodmis.org.

Tianchen,

for future patches, please send new versions as a separate thread and
not a reply to the thread of the previous version.

Thanks,

-- Steve


On Mon,  1 Jun 2026 10:32:51 +0800
Tianchen Ding <dtcccc@linux.alibaba.com> wrote:

> On ARM64 kernels with 64K pages, the trace_marker_raw test fails because
> bash's printf builtin uses stdio buffering which splits output into
> multiple small write() calls to the tracefs file. Since each individual
> write is within TRACE_MARKER_MAX_SIZE (4096), they all succeed, causing
> the "too big" write test to incorrectly pass.
> 
> Fix by writing through dd with iflag=fullblock to guarantee a single
> atomic write() syscall to trace_marker_raw.
> 
> Fixes: 37f46601383a ("selftests/tracing: Add basic test for trace_marker_raw file")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> ---
> v3:
> Measure the binary length via wc -c instead of hard-coding "size + 4".
> 
> v2:
> Update comment about 64K pages.
> ---
>  .../ftrace/test.d/00basic/trace_marker_raw.tc      | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> index 8e905d4fe6dd..f985ff391463 100644
> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> @@ -36,15 +36,23 @@ make_str() {
>  
>  	data=`printf -- 'X%.0s' $(seq $cnt)`
>  
> -	printf "${val}${data}"
> +	# Return escape-sequence text (e.g. "\003\000..."); the caller
> +	# converts to binary. Shell command substitution strips NUL bytes,
> +	# so the binary form cannot survive being captured into a variable.
> +	printf '%s' "${val}${data}"
>  }
>  
>  write_buffer() {
>  	id=$1
>  	size=$2
>  
> -	# write the string into the raw marker
> -	make_str $id $size > trace_marker_raw
> +	str=`make_str $id $size`
> +	len=`printf "$str" | wc -c`
> +	# Pipe through dd to ensure a single atomic write() syscall
> +	# on architectures with 64K pages, where shell's printf builtin
> +	# uses stdio buffering which may split the output into multiple
> +	# writes.
> +	printf "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
>  }
>  
>  


^ permalink raw reply

* Re: [PATCH RESEND] rtla: Fix output files in source tree
From: Tomas Glozar @ 2026-06-01 13:43 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Steven Rostedt, linux-trace-kernel, Ian Rogers
In-Reply-To: <ahMmN4PdsJzr_Va-@decadent.org.uk>

ne 24. 5. 2026 v 18:24 odesílatel Ben Hutchings <benh@debian.org> napsal:
>
> Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
> example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
> currently generated in the source tree, preventing a fully out-of-tree
> build.  To fix this:
>
> - Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
>   create subdirectories as needed
> - Add $(OUTPUT)src to the include path
> - Add ${OUTPUT} to the BPF object filename in tests/timerlat.t
>
> Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
> Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
> Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
> Signed-off-by: Ben Hutchings <benh@debian.org>
> Reviewed-by: Ian Rogers <irogers@google.com>
> ---
>  tools/tracing/rtla/Makefile         | 31 ++++++++++++++++++-----------
>  tools/tracing/rtla/tests/timerlat.t |  4 ++--
>  2 files changed, 21 insertions(+), 14 deletions(-)
>

It seems that I did take building out-of-tree into account in my
latest build changes (in the libsubcmd migration; actually I even
fixed some smaller bits in v3 after seeing your patch), but not for
earlier generated files. Thanks for catching that.

Can you please rebase this on top of linux-trace/tools/for-next [1]?
There have been quite a lot of changes queued to the next cycle that
conflict now with this patch, namely:

1. changes to the Makefile coming from [3], conflicting around
INCLUDES and RTLA_IN rule
2. changes in runtime tests (highlighted below)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/log/?h=tools/for-next

> diff --git a/tools/tracing/rtla/Makefile b/tools/tracing/rtla/Makefile
> index 45690ee14544..f54da7be735d 100644
> --- a/tools/tracing/rtla/Makefile
> +++ b/tools/tracing/rtla/Makefile
> @@ -66,30 +66,37 @@ ifeq ($(config),1)
>    include Makefile.config
>  endif
>
> +INCLUDES       = -I$(OUTPUT)src
> +

INCLUDES is currently unset by the Makefile, and there are users that
pass it to "make", e.g. [2].

Ideally, the name for user-specific includes would be EXTRA_INCLUDES
or something like that, and INCLUDES would be used for the generated
include files, but the name has already been set since the beginning
of rtla, so this has to be named "GEN_INCLUDES" (or similar) instead.
Otherwise, a user setting INCLUDES will override the -I$(OUTPUT)src
and out-of-tree build would fail.

[2] https://gitlab.com/cki-project/kernel-ark/-/blob/4dda3a32c5ef22768f6daa126a595974955e3c10/redhat/kernel.spec.template#L3375

>  CFLAGS         += $(INCLUDES) $(LIB_INCLUDES)
>
>  export CFLAGS OUTPUT srctree
>
>  ifeq ($(BUILD_BPF_SKEL),1)
> -src/timerlat.bpf.o: src/timerlat.bpf.c
> +$(OUTPUT)src/timerlat.bpf.o: src/timerlat.bpf.c
> +       mkdir -p $(@D)

The mkdir will always echo when building the file, it'd be better to
use $(Q)$(MKDIR) here (and in the other rules below) like in the new
LIBSUBCMD_DIR and LIB_DIR rules [3].

Also, could the mkdir be separated into a target here, too? E.g.:

SRC_OUTPUT = $(OUTPUT)src/
$(SRC_OUTPUT):
  $(Q)$(MKDIR) -p $@

and then include it as order-only prerequisite:

$(SRC_OUTPUT)timerlat.bpf.o: src/timerlat.bpf.c | $(SRC_OUTPUT)
  $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@

That will remove the need to duplicate the mkdir command, plus the
SRC_OUTPUT variable can be used in place of $(OUTPUT)src for all
generated / non-build rules.

[3] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/tree/tools/tracing/rtla/Makefile?h=tools/for-next&id=db956bcf8d681b5a01ebe04c79f6a7b29b9934f9#n144

>         $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
>
> -src/timerlat.skel.h: src/timerlat.bpf.o
> +$(OUTPUT)src/timerlat.skel.h: $(OUTPUT)src/timerlat.bpf.o
> +       mkdir -p $(@D)
>         $(QUIET_GENSKEL)$(SYSTEM_BPFTOOL) gen skeleton $< > $@
>
> -example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
> +$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
> +       mkdir -p $(@D)
>         $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
>
> -tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
> +$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
> +       mkdir -p $(@D)
>         $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
>  else
> -src/timerlat.skel.h:
> -       $(Q)echo '/* BPF skeleton is disabled */' > src/timerlat.skel.h
> +$(OUTPUT)src/timerlat.skel.h:
> +       mkdir -p $(@D)
> +       $(Q)echo '/* BPF skeleton is disabled */' > $@
>
> -example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
> +$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
>         $(Q)echo "BPF skeleton support is disabled, skipping example/timerlat_bpf_action.o"
>
> -tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
> +$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
>         $(Q)echo "BPF skeleton support is disabled, skipping tests/bpf/bpf_action_map.o"
>  endif
>
> @@ -103,7 +110,7 @@ static: $(RTLA_IN)
>  rtla.%: fixdep FORCE
>         make -f $(srctree)/tools/build/Makefile.build dir=. $@
>
> -$(RTLA_IN): fixdep FORCE src/timerlat.skel.h
> +$(RTLA_IN): fixdep FORCE $(OUTPUT)src/timerlat.skel.h
>         make $(build)=rtla
>
>  clean: doc_clean fixdep-clean
> @@ -111,10 +118,10 @@ clean: doc_clean fixdep-clean
>         $(Q)find . -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
>         $(Q)rm -f rtla rtla-static fixdep FEATURE-DUMP rtla-*
>         $(Q)rm -rf feature
> -       $(Q)rm -f src/timerlat.bpf.o src/timerlat.skel.h example/timerlat_bpf_action.o
> +       $(Q)rm -f $(OUTPUT)src/timerlat.bpf.o $(OUTPUT)src/timerlat.skel.h $(OUTPUT)example/timerlat_bpf_action.o

Maybe instead of adding $(OUTPUT) to all of these files, $(OUTPUT)
could be passed to the "find" command on the first line? Like this:

$(Q)find $(OUTPUT) -name '*.o' -delete -o -name '\.*.cmd' -delete -o
-name '\.*.d' -delete

That will also remove the necessity of manually running "rm" on
src/timerlat.bpf.o and src/timerlat.skel.h, only
$(OUTPUT)src/timerlat.skel.h has to stay.

>         $(Q)rm -f $(UNIT_TESTS)

This is correct. $(UNIT_TESTS) is defined in Makefile.unit, like this:

UNIT_TESTS := $(OUTPUT)unit_tests

The cleanup of the newly added LIB_OUTPUT and LIBSUBCMD_OUTPUT should
already be implemented correctly in the target lib_clean /
libsubcmd_clean, no need to change anything for that, either.

>
> -check: $(RTLA) tests/bpf/bpf_action_map.o
> +check: $(RTLA) $(OUTPUT)tests/bpf/bpf_action_map.o
>         RTLA=$(RTLA) BPFTOOL=$(SYSTEM_BPFTOOL) prove -o -f -v tests/
> -examples: example/timerlat_bpf_action.o
> +examples: $(OUTPUT)example/timerlat_bpf_action.o
>  .PHONY: FORCE clean check
> diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
> index fd4935fd7b49..e0f3fc4df655 100644
> --- a/tools/tracing/rtla/tests/timerlat.t
> +++ b/tools/tracing/rtla/tests/timerlat.t
> @@ -74,12 +74,12 @@ then
>         # Test BPF action program properly in BPF mode
>         [ -z "$BPFTOOL" ] && BPFTOOL=bpftool
>         check "hist with BPF action program (BPF mode)" \
> -               "timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
> +               "timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
>                 2 '"value": 42'
>  else
>         # Test BPF action program failure in non-BPF mode
>         check "hist with BPF action program (non-BPF mode)" \
> -               "timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o" \
> +               "timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o" \
>                 1 "BPF actions are not supported in tracefs-only mode"

As mentioned above, this conflicts with [4] from
linux-trace/tools/for-next, as check was replaced with
check_top_q_hist and "timerlat hist" with "timerlat TOOL".

Also note that another patch [5] changes the path to use $testdir
instead of tests, as tests now run in a temporary directory as CWD,
however ${OUTPUT}tests is still the correct path, not $testdir, as it
references a built artifact (unlike the other uses of $testdir, which
reference auxiliary scripts).

[4] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/tools/tracing/rtla?h=tools/for-next&id=c15c55c01e48e4639c715370f41d9d4c490f5d23
[5] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/tools/tracing/rtla?h=tools/for-next&id=ad5b50a0959fb67100ddb93aeb0ea40201272cb2

>  fi
>  done

Thanks,
Tomas


^ permalink raw reply

* Re: [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Lorenzo Stoakes @ 2026-06-01 13:26 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <20260522150009.121603-4-npache@redhat.com>

On Fri, May 22, 2026 at 08:59:58AM -0600, Nico Pache wrote:
> The following cleanup reworks all the max_ptes_* handling into helper
> functions. This increases the code readability and will later be used to
> implement the mTHP handling of these variables.
>
> With these changes we abstract all the madvise_collapse() special casing
> (do not respect the sysctls) away from the functions that utilize them.
> And will be used later in this series to cleanly restrict the mTHP
> collapse behavior.
>
> No functional change is intended; however, we are now only reading the
> sysfs variables once per scan, whereas before these variables were being
> read on each loop iteration.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>

Had a read through, this is nice, all LGTM, so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  mm/khugepaged.c | 120 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 84 insertions(+), 36 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 13d82993755f..116f39518948 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -348,6 +348,64 @@ static bool pte_none_or_zero(pte_t pte)
>  	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>  }
>
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs mapping
> + * the shared zeropage for the given collapse operation.
> + * @cc: The collapse control struct
> + * @vma: The vma to check for userfaultfd
> + *
> + * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
> + */
> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> +		struct vm_area_struct *vma)
> +{
> +	if (vma && userfaultfd_armed(vma))
> +		return 0;
> +	/* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	/* For all other cases respect the user defined maximum */
> +	return khugepaged_max_ptes_none;
> +}
> +
> +/**
> + * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> + * anonymous pages for the given collapse operation.
> + * @cc: The collapse control struct
> + *
> + * Return: Maximum number of PTEs that map shared anonymous pages for the
> + * collapse operation
> + */
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +{
> +	/*
> +	 * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> +	 * anonymous pages.
> +	 */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	return khugepaged_max_ptes_shared;
> +}
> +
> +/**
> + * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> + * maximum allowed non-present pagecache entries for the given collapse operation.
> + * @cc: The collapse control struct
> + *
> + * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> + * pagecache entries for the collapse operation.
> + */
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +{
> +	/*
> +	 * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> +	 * pagecache entries that are non-present.
> +	 */
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	return khugepaged_max_ptes_swap;
> +}
> +
>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     vm_flags_t *vm_flags, int advice)
>  {
> @@ -540,6 +598,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>  		struct list_head *compound_pagelist)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
> @@ -551,16 +611,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out;
>  			}
> +			continue;
>  		}
>  		if (!pte_present(pteval)) {
>  			result = SCAN_PTE_NON_PRESENT;
> @@ -591,9 +647,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>  				goto out;
> @@ -1262,6 +1316,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
>  	int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -1295,36 +1352,29 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> +			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out_unmap;
>  			}
> +			continue;
>  		}
>  		if (!pte_present(pteval)) {
> -			++unmapped;
> -			if (!cc->is_khugepaged ||
> -			    unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp_any(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
> -				continue;
> -			} else {
> +			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>  				goto out_unmap;
>  			}
> +			/*
> +			 * Always be strict with uffd-wp
> +			 * enabled swap entries.  Please see
> +			 * comment below for pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp_any(pteval)) {
> +				result = SCAN_PTE_UFFD_WP;
> +				goto out_unmap;
> +			}
> +			continue;
>  		}
>  		if (pte_uffd_wp(pteval)) {
>  			/*
> @@ -1367,9 +1417,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		 * is shared.
>  		 */
>  		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>  				goto out_unmap;
> @@ -2324,6 +2372,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		unsigned long addr, struct file *file, pgoff_t start,
>  		struct collapse_control *cc)
>  {
> +	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> +	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	XA_STATE(xas, &mapping->i_pages, start);
> @@ -2342,8 +2392,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>
>  		if (xa_is_value(folio)) {
>  			swap += 1 << xas_get_order(&xas);
> -			if (cc->is_khugepaged &&
> -			    swap > khugepaged_max_ptes_swap) {
> +			if (swap > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>  				break;
> @@ -2414,8 +2463,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>  		cc->progress += HPAGE_PMD_NR;
>
>  	if (result == SCAN_SUCCEED) {
> -		if (cc->is_khugepaged &&
> -		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +		if (present < HPAGE_PMD_NR - max_ptes_none) {
>  			result = SCAN_EXCEED_NONE_PTE;
>  			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  		} else {
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v8 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: David Hildenbrand (Arm) @ 2026-06-01 13:24 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Andrew Morton, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260527-ecc_panic-v8-3-9ea0cfa16bb0@debian.org>

On 5/27/26 16:06, Breno Leitao wrote:
> The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
> for stable unhandlable kernel pages (PG_reserved, slab, page tables,
> large-kmalloc).  memory_failure() still folds every negative return
> into MF_MSG_GET_HWPOISON, so callers that want to react to the
> unrecoverable cases (a panic option, smarter logging) cannot tell
> them apart from transient page-allocator races.
> 
> Turn the post-call branch into a switch over the get_hwpoison_page()
> return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
> negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
> free-buddy / kernel-high-order handling and case 1 falls through to
> the rest of memory_failure() unchanged.
> 
> The MF_MSG_KERNEL label and tracepoint string are kept as
> "reserved kernel page" to avoid breaking userspace tools that match
> on those literals; the enum value still adequately tags the failure
> even though it now also covers slab, page tables and large-kmalloc
> pages.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8f63bdfeff8f..14c0a958638c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2426,7 +2426,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>  	 */
>  	res = get_hwpoison_page(p, flags);
> -	if (!res) {
> +	switch (res) {
> +	case 0:
>  		if (is_free_buddy_page(p)) {
>  			if (take_page_off_buddy(p)) {
>  				page_ref_inc(p);
> @@ -2445,7 +2446,19 @@ int memory_failure(unsigned long pfn, int flags)
>  			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
>  		}
>  		goto unlock_mutex;
> -	} else if (res < 0) {
> +	case 1:
> +		/* Got a refcount on a handlable page. */
> +		break;
> +	case -ENOTRECOVERABLE:
> +		/*
> +		 * Stable unhandlable kernel-owned page (PG_reserved,
> +		 * slab, page tables, large-kmalloc).
> +		 * No recovery possible.
> +		 */
> +		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		goto unlock_mutex;
> +	default:
> +		/* Transient lifecycle race with the page allocator. */
>  		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>  		goto unlock_mutex;
>  	}
> 

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-01 13:22 UTC (permalink / raw)
  To: Miaohe Lin, Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <19f968f5-1289-f573-4406-e5c91dcd8923@huawei.com>

On 6/1/26 14:28, Miaohe Lin wrote:
> On 2026/5/27 22:06, Breno Leitao wrote:
>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>> -> retry path.  That is correct for the transient case (a userspace
>> folio briefly off LRU during migration or compaction, which a later
>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>> page-table, large-kmalloc and PG_reserved pages will never become
>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>> -EIO loses the "this is structurally unrecoverable" information.
>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>> panic-on-unrecoverable sysctl deliberately does not act on.
>>
>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>> identifies pages the hwpoison handler cannot recover from:
>>
>>   HWPoisonKernelOwned(p, flags) :=
>>       !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>       (PageReserved(p) || PageSlab(p) ||
>>        PageTable(p)    || PageLargeKmalloc(p))
>>
>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>> migrate movable_ops pages even though they are not on the LRU, and
>> we must not pre-empt that with an unrecoverable verdict.
>>
>> The list is intentionally not exhaustive.  vmalloc and kernel-stack
>> pages, for example, do not carry a page_type bit and would need a
>> different oracle; they keep going through the existing retry path
>> unchanged.  This is the smallest set we can identify with certainty
>> by page type.
>>
>> Wire the helper into the top of get_any_page() to short-circuit
>> those pages before the retry loop runs.  On a hit, drop the caller's
>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>> straight away.  Pages outside the helper's positive list still take
>> the existing retry path and return -EIO, leaving operator-visible
>> behaviour for those cases unchanged.
>>
>> Extend the unhandlable-page pr_err() to fire for either errno and
>> update the get_hwpoison_page() kerneldoc to document the new return.
>>
>> memory_failure() still folds every negative return into
>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>> this patch on its own only changes the errno that soft_offline_page()
>> can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
>> through memory_failure() and reports MF_MSG_KERNEL for the
>> unrecoverable cases, which is what the
>> panic_on_unrecoverable_memory_failure sysctl observes.
> 
> Thanks for your patch.
> 
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>>  mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index f4d3e6e20e13..8f63bdfeff8f 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>  	return PageLRU(page) || is_free_buddy_page(page);
>>  }
>>  
>> +/*
>> + * Positive identification of pages the hwpoison handler cannot recover.
>> + * These page types are owned by kernel internals (no userspace mapping
>> + * to unmap, no file mapping to invalidate, no migration target), so the
>> + * shake_page() / retry loop in get_any_page() can never turn them into
>> + * something HWPoisonHandlable() will accept.  Short-circuit them to
>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>> + * spinning through retries that exit as a transient-looking -EIO.
>> + *
>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>> + * pages even though they are not on the LRU.
>> + */
>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>> +{
>> +	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>> +		return false;
>> +
>> +	return PageReserved(page) || PageSlab(page) ||
> 
> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
> 
>> +	       PageTable(page) || PageLargeKmalloc(page);
> 
> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
> PageTable and PageLargeKmalloc without extra page refcnt?

Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
PageLargeKmalloc).
Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
allow checking it on compound pages.

For PageLargeKmalloc, we would want to check the head page, though. The page
type is only stored for the head page.

So maybe we want to lookup the compound head (if any) and perform the type
checks against that?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: David Hildenbrand (Arm) @ 2026-06-01 13:15 UTC (permalink / raw)
  To: Nico Pache, Usama Arif, usamaarif642
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcBg1su-bk3i_H+TW4-nTgvGSGqRNeC9MpQo7sGeH8ejnA@mail.gmail.com>

>>
>> Reading this, it is unclear why exactly do we need the stack.
> 
> So I looked into your items below. It seems logical, and I think it
> works the same way; however, your method seems slightly harder to
> understand due to all the edge cases and more error-prone to future
> changes (the stack holds implicit knowledge of the offset/order that
> must now be tracked in the edge cases).
> 
> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> worth saving that small amount of memory. Although we would also be
> getting rid of (3?) functions, so both approaches have pros and cons.

I consider a simple forward loop over the offset ... less complexity compared to
a stack structure :)

> 
> I will implement a patch comparing your solution against mine and send
> it here, then we can decide which approach is better.

Right, throw it over the fence and I'll see how to improve it further.

[...]

>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
>>>       nodes_clear(cc->alloc_nmask);
>>> +
>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>> +
>>> +     /*
>>> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>>> +      * scan all pages to populate the bitmap for mTHP collapse.
>>> +      */
>>
>> You should note here, that we re-verify in mthp_collapse().
>>
>> But the question is, whether we should relocate the check completely into
>> mthp_collapse(), instead of conditionally duplicating it.
>>
>> What speaks against always populating the bitmap and making the decision in
>> mthp_collapse()?
>>
>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
>> I am not sure if scanning some more page table entries is really that critical here.
> 
> Someone asked me to preserve the legacy behavior (PMD only). Although
> rather trivial, if you set max_ptes_none=0 for example, we'd still
> have to do 511 iterations for no reason if PMD collapse is the only
> enabled order rather than bailing immediately.
> 
> I'm ok with dropping it, but I think its the correct approach (despite
> the extra complexity). @Usama Arif brought up this point here
> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/

We talk about regressions, but I am not sure if we care about scanning speed
within a page table that much?

After all, we locked it and already read some entries.

Having the same check at two places to optimize for PMD order might right now
feel like a good optimization, but likely an irrelevant one in a near future?

Anyhow, won't push back, as long as we document why we are special casing things
here.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-01 12:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Usama Arif, usamaarif642
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <b8380eb3-096a-49f1-9ace-99c1e75888b4@kernel.org>

On Mon, Jun 1, 2026 at 2:11 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)

*music intensifies* :p

>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enabled mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then moved on the consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),
> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        statisfy the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order
> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> >                           <-------><------->
> >                              order-1  order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.

So I looked into your items below. It seems logical, and I think it
works the same way; however, your method seems slightly harder to
understand due to all the edge cases and more error-prone to future
changes (the stack holds implicit knowledge of the offset/order that
must now be tracked in the edge cases).

Given the stack is 24 bytes, I'm not sure if the extra complexity is
worth saving that small amount of memory. Although we would also be
getting rid of (3?) functions, so both approaches have pros and cons.

I will implement a patch comparing your solution against mine and send
it here, then we can decide which approach is better.


>
> Why can't you work with offset + cur_order?
>
> Initially,
>
>         offset = 0;
>         cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
>         if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
>                 /* Try next smaller order. */
>                 cur_order = cur_order - 1;
>         } else {
>                 /* Skip to next chunk. */
>                 offset += 1 << cur_order;
>                 cur_order = max_order_from_offset(offset);
>         }
>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> What's the benefit of the stack?
>
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order mTHP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > restrictions above.
> >
> > We currently only support mTHP collapse for max_ptes_none values of 0
> > and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >
> >     - max_ptes_none=0: Never introduce new empty pages during collapse
> >     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >       available mTHP order
> >
> > Any other max_ptes_none value will emit a warning and default mTHP
> > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > collapse.
> >
> > Once we determine what mTHP sizes fits best in that PMD range a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 172 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 64ceebc9d8a7..d3d7db8be26c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +#define KHUGEPAGED_MIN_MTHP_ORDER    2
> > +/*
> > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > + * size needed for a DFS on a binary tree is height + 1, where
> > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > + *
> > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>
> I was confused there for a second why you mention ilog2, when it's really "We
> cannot use HPAGE_PMD_ORDER.".
>
> Best to simplify to:
>
> "Note that we cannot use HPAGE_PMD_ORDER, because it is variable on some
> architectures".

Ok thank you i can clear that up.

>
> > + */
> > +#define MTHP_STACK_SIZE      (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > +
> > +/*
> > + * Defines a range of PTE entries in a PTE page table which are being
> > + * considered for mTHP collapse.
> > + *
> > + * @offset: the offset of the first PTE entry in a PMD range.
> > + * @order: the order of the PTE entries being considered for collapse.
> > + */
> > +struct mthp_range {
> > +     u16 offset;
> > +     u8 order;
> > +};
> > +
> >  struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -110,6 +134,12 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* Each bit represents a single occupied (!none/zero) page. */
> > +     DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>
> This should just be called something like "present_ptes"

yeah not a bad idea.

>
> > +     /* A mask of the current range being considered for mTHP collapse. */
> > +     DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>
> This is really just a temporary bitmap used for collapse_mthp_count_present()
> only. Either rename it, or better, avoid it completely.

yeah when i first started this we didnt have bitmap_weight_from()
thanks for the pointer to that, I no longer need a temp bitmap.

>
> >  };
> >
> >  /**
> > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >       return result;
> >  }
> >
> > +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> > +                                  u16 offset, u8 order)
> > +{
> > +     const int size = *stack_size;
> > +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > +
> > +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > +     stack->order = order;
> > +     stack->offset = offset;
> > +     (*stack_size)++;
> > +}
> > +
> > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> > +                                              int *stack_size)
> > +{
> > +     const int size = *stack_size;
> > +
> > +     VM_WARN_ON_ONCE(size <= 0);
> > +     (*stack_size)--;
> > +     return cc->mthp_bitmap_stack[size - 1];
> > +}
> > +
> > +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> > +                                             u16 offset, unsigned int nr_ptes)
> > +{
> > +     bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>
> You really just want to count the number of set bits? You don't need a temporary
> bitmap for that.
>
> Assume you want to check an order-2 (4 bits), bitmap_weight_and() would check
> all bits ...
>
> I'd suggest starting simple here, and avoiding the temporary bitmap.
>
> Can we simply use bitmap_weight_from(cc->mthp_bitmap, offset, nr_ptes)?

Yes! Thank you :)

>
> > +}
> > +
> > +/*
> > + * mthp_collapse() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > + * variables (offset, order), indicating the number of PTEs from the start of
> > + * the PMD, and the order of the potential collapse candidate respectively. We
> > + * start at the PMD order and check if it is eligible for collapse; if not, we
> > + * add two entries to the stack at a lower order to represent the left and right
> > + * halves of the PTE page table we are examining.
> > + *
> > + *                         offset       mid_offset
> > + *                         |         |
> > + *                         |         |
> > + *                         v         v
> > + *      --------------------------------------
> > + *      |          cc->mthp_bitmap            |
> > + *      --------------------------------------
> > + *                         <-------><------->
> > + *                          order-1  order-1
> > + *
> > + * For each of these, we determine how many PTE entries are occupied in the
> > + * range of PTE entries we propose to collapse, then we compare this to a
> > + * threshold number of PTE entries which would need to be occupied for a
> > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > + *
> > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > + * mTHP.
> > + */
> > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > +             unsigned long address, int referenced, int unmapped,
> > +             struct collapse_control *cc, unsigned long enabled_orders)
> > +{
> > +     unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > +     int collapsed = 0, stack_size = 0;
> > +     unsigned long collapse_address;
> > +     struct mthp_range range;
> > +     u16 offset;
> > +     u8 order;
> > +
> > +     collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > +
> > +     while (stack_size) {
> > +             range = collapse_mthp_stack_pop(cc, &stack_size);
> > +             order = range.order;
> > +             offset = range.offset;
> > +             nr_ptes = 1UL << order;
> > +
> > +             if (!test_bit(order, &enabled_orders))
> > +                     goto next_order;
> > +
> > +             max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > +
> > +             nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > +                                                            nr_ptes);
> > +
> > +             if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > +                     int ret;
> > +
> > +                     collapse_address = address + offset * PAGE_SIZE;
> > +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> > +                                              unmapped, cc, order);
> > +                     if (ret == SCAN_SUCCEED) {
> > +                             collapsed += nr_ptes;
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +next_order:
> > +             if ((BIT(order) - 1) & enabled_orders) {
> > +                     const u8 next_order = order - 1;
> > +                     const u16 mid_offset = offset + (nr_ptes / 2);
> > +
> > +                     collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > +                                              next_order);
> > +                     collapse_mthp_stack_push(cc, &stack_size, offset,
> > +                                              next_order);
> > +             }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               bool *lock_dropped, struct collapse_control *cc)
> >  {
> > -     const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> > +     unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > +     enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >       pmd_t *pmd;
> > -     pte_t *pte, *_pte;
> > -     int none_or_zero = 0, shared = 0, referenced = 0;
> > +     pte_t *pte, *_pte, pteval;
> > +     int i;
> > +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long addr;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> > +
> > +     /*
> > +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > +      * scan all pages to populate the bitmap for mTHP collapse.
> > +      */
>
> You should note here, that we re-verify in mthp_collapse().
>
> But the question is, whether we should relocate the check completely into
> mthp_collapse(), instead of conditionally duplicating it.
>
> What speaks against always populating the bitmap and making the decision in
> mthp_collapse()?
>
> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> I am not sure if scanning some more page table entries is really that critical here.

Someone asked me to preserve the legacy behavior (PMD only). Although
rather trivial, if you set max_ptes_none=0 for example, we'd still
have to do 511 iterations for no reason if PMD collapse is the only
enabled order rather than bailing immediately.

I'm ok with dropping it, but I think its the correct approach (despite
the extra complexity). @Usama Arif brought up this point here
https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/

>
>
> > +     if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > +             max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> > +
> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >       if (!pte) {
> >               cc->progress++;
> > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> > -     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, addr += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             _pte = pte + i;
> > +             addr = start_addr + i * PAGE_SIZE;
> > +             pteval = ptep_get(_pte);
> > +
> >               cc->progress++;
> >
> > -             pte_t pteval = ptep_get(_pte);
> >               if (pte_none_or_zero(pteval)) {
> >                       if (++none_or_zero > max_ptes_none) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >
> > +             /* Set bit for occupied pages */
> > +             __set_bit(i, cc->mthp_bitmap);
> >               /*
> >                * Record which node the original page is from and save this
> >                * information to cc->node_load[].
> > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> > -             result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc, HPAGE_PMD_ORDER);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > +             nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > +                                          unmapped, cc, enabled_orders);
> > +             /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> > +             result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> As Lance says, this error handling likely needs some thought.

Yes I agree, I'm working on that now. I had a WIP patch for this some
time ago, but I never gave it the full thought it needed, and it fell
into the abyss.

Thanks :)
-- Nico

>
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-01 12:28 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-2-9ea0cfa16bb0@debian.org>

On 2026/5/27 22:06, Breno Leitao wrote:
> get_any_page() collapses every HWPoisonHandlable() rejection into a
> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
> -> retry path.  That is correct for the transient case (a userspace
> folio briefly off LRU during migration or compaction, which a later
> shake can drag back), but wrong for stable kernel-owned pages: slab,
> page-table, large-kmalloc and PG_reserved pages will never become
> HWPoisonHandlable(), so the retry loop is wasted work and the final
> -EIO loses the "this is structurally unrecoverable" information.
> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
> panic-on-unrecoverable sysctl deliberately does not act on.
> 
> Introduce HWPoisonKernelOwned(), a small predicate that positively
> identifies pages the hwpoison handler cannot recover from:
> 
>   HWPoisonKernelOwned(p, flags) :=
>       !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>       (PageReserved(p) || PageSlab(p) ||
>        PageTable(p)    || PageLargeKmalloc(p))
> 
> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
> same exception in HWPoisonHandlable(): soft-offline is allowed to
> migrate movable_ops pages even though they are not on the LRU, and
> we must not pre-empt that with an unrecoverable verdict.
> 
> The list is intentionally not exhaustive.  vmalloc and kernel-stack
> pages, for example, do not carry a page_type bit and would need a
> different oracle; they keep going through the existing retry path
> unchanged.  This is the smallest set we can identify with certainty
> by page type.
> 
> Wire the helper into the top of get_any_page() to short-circuit
> those pages before the retry loop runs.  On a hit, drop the caller's
> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
> straight away.  Pages outside the helper's positive list still take
> the existing retry path and return -EIO, leaving operator-visible
> behaviour for those cases unchanged.
> 
> Extend the unhandlable-page pr_err() to fire for either errno and
> update the get_hwpoison_page() kerneldoc to document the new return.
> 
> memory_failure() still folds every negative return into
> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> this patch on its own only changes the errno that soft_offline_page()
> can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
> through memory_failure() and reports MF_MSG_KERNEL for the
> unrecoverable cases, which is what the
> panic_on_unrecoverable_memory_failure sysctl observes.

Thanks for your patch.

> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Suggested-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f4d3e6e20e13..8f63bdfeff8f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>  	return PageLRU(page) || is_free_buddy_page(page);
>  }
>  
> +/*
> + * Positive identification of pages the hwpoison handler cannot recover.
> + * These page types are owned by kernel internals (no userspace mapping
> + * to unmap, no file mapping to invalidate, no migration target), so the
> + * shake_page() / retry loop in get_any_page() can never turn them into
> + * something HWPoisonHandlable() will accept.  Short-circuit them to
> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
> + * spinning through retries that exit as a transient-looking -EIO.
> + *
> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
> + * pages even though they are not on the LRU.
> + */
> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
> +{
> +	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
> +		return false;
> +
> +	return PageReserved(page) || PageSlab(page) ||

Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?

> +	       PageTable(page) || PageLargeKmalloc(page);

I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
PageTable and PageLargeKmalloc without extra page refcnt?

Apart from the above nits, this patch looks good to me.

Thanks.
.

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: David Hildenbrand (Arm) @ 2026-06-01 12:06 UTC (permalink / raw)
  To: Nico Pache, Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcCfww9X-f2Vb=ipz8tY2nMNnMx5_58Ozz63tvZTz_wpOw@mail.gmail.com>

On 6/1/26 14:01, Nico Pache wrote:
> On Sun, May 31, 2026 at 2:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/5/31 15:18, Lance Yang wrote:
>>>
>>> [...]
>>>
>>> Hmm ... don't we lose the allocation-failure result here?
>>>
>>> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
>>> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
>>> in khugepaged_do_scan().
>>>
>>> Now if allocation fails and nr_collapsed stays 0, we just return
>>> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
>>
>> Looks like this is a more general issue with mthp_collapse() only
>> returning nr_collapsed.
>>
>> For example, SCAN_PMD_MAPPED used to be propagated too, and
>> madvise_collapse() treats that as success. With the new code, if
>> nothing was collapsed by this call, that can also become SCAN_FAIL ...
>>
>> So I think we should keep both.
> 
> Yeah I thought about this before, but more regarding the "incorrect"
> propagation of errors; I didn't consider that those results were
> actually being considered.
> 
> I actually had a patch to track the last_failure (with some
> prioritization on certain results). I think that would solve this
> issue.
> 
> Thanks for reminding me to improve this.
> 
> Depending on how the rest of the reviews go, I can either send up a
> follow up series to do some more cleanups and improvements of the
> current approach or we can send out a v19.

Let's do a v19.

Patch #11 might need a bit of work, we can discuss offline if you want.

If we could get a v19 by the end of the week, that would be nice.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-01 12:01 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <6a9f062c-8376-4f83-90a9-8b167f925dc6@linux.dev>

On Sun, May 31, 2026 at 2:48 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/31 15:18, Lance Yang wrote:
> >
> > On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> > [...]
> >> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >>      if (result == SCAN_SUCCEED) {
> >>              /* collapse_huge_page expects the lock to be dropped before calling */
> >>              mmap_read_unlock(mm);
> >> -            result = collapse_huge_page(mm, start_addr, referenced,
> >> -                                        unmapped, cc, HPAGE_PMD_ORDER);
> >> -            /* collapse_huge_page will return with the mmap_lock released */
> >> +            nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >> +                                         unmapped, cc, enabled_orders);
> >> +            /* mmap_lock was released above, set lock_dropped */
> >>              *lock_dropped = true;
> >> +            result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >
> > Hmm ... don't we lose the allocation-failure result here?
> >
> > Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> > collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> > in khugepaged_do_scan().
> >
> > Now if allocation fails and nr_collapsed stays 0, we just return
> > SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
>
> Looks like this is a more general issue with mthp_collapse() only
> returning nr_collapsed.
>
> For example, SCAN_PMD_MAPPED used to be propagated too, and
> madvise_collapse() treats that as success. With the new code, if
> nothing was collapsed by this call, that can also become SCAN_FAIL ...
>
> So I think we should keep both.

Yeah I thought about this before, but more regarding the "incorrect"
propagation of errors; I didn't consider that those results were
actually being considered.

I actually had a patch to track the last_failure (with some
prioritization on certain results). I think that would solve this
issue.

Thanks for reminding me to improve this.

Depending on how the rest of the reviews go, I can either send up a
follow up series to do some more cleanups and improvements of the
current approach or we can send out a v19.

Cheers,
-- Nico

>
> Cheers, Lance
>


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox