* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-12 8:27 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> The following two paths race:
>
> CPU 0 (disable_stall/__rv_disable_monitor) CPU 1 (wwnr probe handler)
^ did you mean stall?
> ------------------------------------------ -----------------------------
> disable_stall()
> da_monitor_destroy()
> da_monitor_reset_all() <------ [task T: monitoring=0]
> da_monitor_start(&T->rv[n])
> /* no timer_setup */
> monitoring=1 <----
> tracepoint_synchronize_unregister()
> // CPU 1 probe has already returned; sync returns
>
> Later, enable_stall() acquires the same slot and calls da_monitor_init():
>
> da_monitor_reset_all()
> da_monitor_reset(&T->rv[slot]) // monitoring=1, timer.function==0
> ha_monitor_reset_env()
> ha_cancel_timer()
> timer_delete(&ha_mon->timer) // ODEBUG: timer never initialised
>
> ODEBUG: assert_init not available (active state 0)
> object type: timer_list
> Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
>
> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
> before da_monitor_reset_all(). The unregister_trace_xxx() calls in the
> monitor's disable() have already disconnected the tracepoints; the sync
> here drains any handler still in flight, so no new monitoring=1 can
> appear after da_monitor_reset_all() clears the slot.
>
> Also fix the slot release ordering: release the slot only after
> reset_all() to avoid accessing rv[] with an out-of-bounds index.
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
Thanks for the fix, I have a similar one waiting for submission.
These are technically 2 separate fixes though: the ordering with unset
task_mon_slot (independent on HA) and the synchronisation with pending
tracepoints. They probably deserve separate patches and visibility, the first
has always been around and we're technically overwriting who knows what.
The explanation above is a bit hard to follow though, are you talking about a
handler for the same (stall) monitor running after the reset, effectively
undoing it by setting the monitoring flag?
Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
environment.
So that's basically what you'd see now much more often because in fact we don't
reset the right slot (though, again, that's a different issue).
Calling tracepoint_synchronize_unregister() there too would surely fix, but it
used to be kinda slow. But it's probably gotten faster since now tracepoints use
SRCU, so we can wait for a dedicated grace period.
I liked the idea to wait cumulatively in the end, but that's just making things
harder.. Let's do like this:
Prepare 2 separate patches as fixes, put the task slot one first (would ease
backporting), mention this issue with the race condition only in the second.
You can send them independently and I'll add them to the tree as urgent.
I'm soon going to send my set of fixes that will also include the task slot
patch (not removing to ease my life with conflicts).
Thanks,
Gabriele
> include/rv/da_monitor.h | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 00ded3d5ab3f..d04bb3229c75 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>
> /*
> * da_monitor_destroy - return the allocated slot
> + *
> + * Call tracepoint_synchronize_unregister() before reset_all() to close
> + * the race where an in-flight non-HA probe handler sets monitoring=1
> + * (without calling timer_setup()) after da_monitor_reset_all() has
> + * already cleared the slot but before the caller's own sync completes.
> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> + * the same slot would call timer_delete() on a never-initialised
> + * timer_list, triggering ODEBUG warnings.
> + *
> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> + * The caller's own __rv_disable_monitor() issues a second sync after
> + * returning from disable(); that redundant call is harmless on the
> + * infrequent admin (enable/disable) path.
> */
> static inline void da_monitor_destroy(void)
> {
> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> WARN_ONCE(1, "Disabling a disabled monitor: "
> __stringify(MONITOR_NAME));
> return;
> }
> + tracepoint_synchronize_unregister();
> + da_monitor_reset_all();
> rv_put_task_monitor_slot(task_mon_slot);
> task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> -
> - da_monitor_reset_all();
> }
>
> #elif RV_MON_TYPE == RV_MON_PER_OBJ
^ permalink raw reply
* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-12 9:09 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <f654a17c671469fd8fc9ea438daf2266d05068d4.camel@redhat.com>
On Tue, 2026-05-12 at 10:27 +0200, Gabriele Monaco wrote:
> On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> > From: Wen Yang <wen.yang@linux.dev>
> >
> > The following two paths race:
> >
> > CPU 0 (disable_stall/__rv_disable_monitor) CPU 1 (wwnr probe handler)
> ^ did you mean stall?
Ok I got it now, so essentially you'd reproduce it like:
* start a DA per-task monitor (no timer)
* stop it, a handler is still running after reset, it sets monitoring back to 1
* start an HA per-task monitor
that would use the same slot that is now looking like:
{ monitoring = 1, timer.function = NULL }
because it was not initialised as HA but monitoring was reset in the race.
Thinking about this again, it isn't just an issue with per-task monitors, all
monitors reusing slots would suffer from it.
Besides, relying on monitoring can be fragile when using LTL monitors on the
same task (those don't even have monitoring).
Perhaps the solution isn't that trivial, I'm going to give one more thought on
it, but thanks again for bringing this up!
Gabriele
> > ------------------------------------------ -----------------------------
> > disable_stall()
> > da_monitor_destroy()
> > da_monitor_reset_all() <------ [task T: monitoring=0]
> > da_monitor_start(&T->rv[n])
> > /* no timer_setup */
> > monitoring=1 <----
> > tracepoint_synchronize_unregister()
> > // CPU 1 probe has already returned; sync returns
> >
> > Later, enable_stall() acquires the same slot and calls da_monitor_init():
> >
> > da_monitor_reset_all()
> > da_monitor_reset(&T->rv[slot]) // monitoring=1, timer.function==0
> > ha_monitor_reset_env()
> > ha_cancel_timer()
> > timer_delete(&ha_mon->timer) // ODEBUG: timer never initialised
> >
> > ODEBUG: assert_init not available (active state 0)
> > object type: timer_list
> > Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
> >
> > Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
> > before da_monitor_reset_all(). The unregister_trace_xxx() calls in the
> > monitor's disable() have already disconnected the tracepoints; the sync
> > here drains any handler still in flight, so no new monitoring=1 can
> > appear after da_monitor_reset_all() clears the slot.
> >
> > Also fix the slot release ordering: release the slot only after
> > reset_all() to avoid accessing rv[] with an out-of-bounds index.
> >
> > Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> > Signed-off-by: Wen Yang <wen.yang@linux.dev>
> > ---
>
> Thanks for the fix, I have a similar one waiting for submission.
>
> These are technically 2 separate fixes though: the ordering with unset
> task_mon_slot (independent on HA) and the synchronisation with pending
> tracepoints. They probably deserve separate patches and visibility, the first
> has always been around and we're technically overwriting who knows what.
>
>
> The explanation above is a bit hard to follow though, are you talking about a
> handler for the same (stall) monitor running after the reset, effectively
> undoing it by setting the monitoring flag?
>
> Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
> environment.
>
> So that's basically what you'd see now much more often because in fact we
> don't
> reset the right slot (though, again, that's a different issue).
>
>
> Calling tracepoint_synchronize_unregister() there too would surely fix, but it
> used to be kinda slow. But it's probably gotten faster since now tracepoints
> use
> SRCU, so we can wait for a dedicated grace period.
>
> I liked the idea to wait cumulatively in the end, but that's just making
> things
> harder.. Let's do like this:
>
> Prepare 2 separate patches as fixes, put the task slot one first (would ease
> backporting), mention this issue with the race condition only in the second.
> You can send them independently and I'll add them to the tree as urgent.
>
>
> I'm soon going to send my set of fixes that will also include the task slot
> patch (not removing to ease my life with conflicts).
>
> Thanks,
> Gabriele
>
> > include/rv/da_monitor.h | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index 00ded3d5ab3f..d04bb3229c75 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> >
> > /*
> > * da_monitor_destroy - return the allocated slot
> > + *
> > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > + * already cleared the slot but before the caller's own sync completes.
> > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > + * the same slot would call timer_delete() on a never-initialised
> > + * timer_list, triggering ODEBUG warnings.
> > + *
> > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > + * The caller's own __rv_disable_monitor() issues a second sync after
> > + * returning from disable(); that redundant call is harmless on the
> > + * infrequent admin (enable/disable) path.
> > */
> > static inline void da_monitor_destroy(void)
> > {
> > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> > WARN_ONCE(1, "Disabling a disabled monitor: "
> > __stringify(MONITOR_NAME));
> > return;
> > }
> > + tracepoint_synchronize_unregister();
> > + da_monitor_reset_all();
> > rv_put_task_monitor_slot(task_mon_slot);
> > task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > -
> > - da_monitor_reset_all();
> > }
> >
> > #elif RV_MON_TYPE == RV_MON_PER_OBJ
^ permalink raw reply
* Re: [PATCH 07/13] rv: Simply hybrid automata monitors's clock variables
From: Gabriele Monaco @ 2026-05-12 9:31 UTC (permalink / raw)
To: Nam Cao
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <87wlxaupmz.fsf@yellow.woof>
On Mon, 2026-05-11 at 13:55 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > Well, this is roughly what we discussed in [1].
> > Now, I didn't submit the throttle monitor yet because it depends on unacked
> > tracepoints.
>
> Thanks for bringing that up. I had no memory of that discussion.
>
> > In short, this works with the assumption that the expires value you pass to
> > ha_check_invariant() is the same you used to arm the timer.
> >
> > That's true for constant values only (the deadline) but not for something
> > like
> > the runtime. I couldn't think of a way to rearrange that model not to
> > require
> > the runtime left field.
>
> I believe you are referring to this:
>
> |
> |
> dl_replenish;reset(clk) v
> sched_switch_in #=========================# sched_switch_in;
> +--------------- H H reset(clk)
> | H H <----------------+
> +--------------> H running H |
> dl_throttle;reset(clk) H clk < runtime_left_ns() H |
> +--------------------------- H H sched_switch_out |
> | +------------------> H H -------------+ |
> | dl_replenish;reset(clk) #=========================# | |
> | | | ^ | |
> v | dl_defer_arm | | |
>
> Now that I stared at this again, I think we already deviate from theory
> here. Our documentation mentions that the invariant must be in the form
>
> g = v < c | true
>
> with "c [being] a numerical value".
>
> The invariant "clk < runtime_left_ns()" means clk must not exceed the
> remaining runtime, which is sampled by calling runtime_left_ns() when
> the state is entered. This is not in the theory. Additionally, I think
> this interpretation is ambiguous; one could also interpret that as "the
> clk value must never exceed the *current* value returned by
> runtime_left_ns()".
Well, that's a fair point. Using functions here is kind of pushing it, but if we
assume the runtime constant for the duration of the invariant (which is what
happens in practice), I don't see that huge difference. Then sure, I'm still
twisting the theory here.
But that's right, it's quite ambiguous. The function is technically syntactic
sugar in RV to allow monitor-specific values, I should probably make it clear it
doesn't make it a dynamic value (at least within the same constraint
validation).
> I digged into the cited academic papers, but couldn't find anything that
> can describe this. The closest I see is the "init" label for states, but
> that is the condition for entering the states.
>
> > Otherwise.. We could read the remaining time in the timer, but we wouldn't
> > be
> > able to simulate ns precision when using the timer wheel.
> >
> > Now if we really wanted to go down that path, we are using a union to
> > allocate
> > either timer or hrtimer, the latter is larger, so we /could/ add a u64
> > expire_ns
> > field to the ha_monitor struct without needing additional memory.
> >
> > If that doesn't sound too wild to you, I may try and sketch something up to
> > see
> > if that's viable. Then this patch could go through as is and I would add the
> > extension together with the submission of throttle.
>
> That can work, but not ideal, because hrtimer will not be usable.
Why not? If we have HA_TIMER_WHEEL , we'd use timer and expire, if we have
HA_TIMER_HRTIMER we'd only need hrtimer with it's hrtimer_get_expires():
union {
struct hrtimer hrtimer;
struct {
struct timer_list timer;
u64 expire; /* Explicitly store the armed budget */
};
we already can't use timer and hrtimer interchangeably.
What am I missing here?
> Looking at the throttle monitor again, is it possible to rewrite
> runtime_left_ns() to read .dl_runtime instead of .runtime? I don't know
> the deadline schedule very well, but I think .dl_runtime is not changing
> like .runtime?
In theory yes, but since the runtime is consumed only when running, we cannot
just set the timeout once. We either save how much was consumed somewhere or do
some start/pause mechanism.
Neither looks simpler to me.
Thanks,
Gabriele
^ permalink raw reply
* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Oleg Nesterov @ 2026-05-12 11:29 UTC (permalink / raw)
To: Rosen Penev
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Masami Hiramatsu,
open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <20260511225648.27886-1-rosenp@gmail.com>
On 05/11, Rosen Penev wrote:
>
> struct xol_area {
> wait_queue_head_t wq; /* if all slots are busy */
> - unsigned long *bitmap; /* 0 = free slot */
>
> struct page *page;
> /*
> @@ -117,6 +116,7 @@ struct xol_area {
> * the vma go away, and we must handle that reasonably gracefully.
> */
> unsigned long vaddr; /* Page(s) of instruction slots */
> + unsigned long bitmap[]; /* 0 = free slot */
> };
>
> static void uprobe_warn(struct task_struct *t, const char *msg)
> @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> struct xol_area *area;
> void *insns;
>
> - area = kzalloc_obj(*area);
> + area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
almost half of the allocated memory won't be used...
But technically the patch looks correct so I won't argue.
Oleg.
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Lance Yang @ 2026-05-12 12:48 UTC (permalink / raw)
To: david
Cc: leitao, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
vbabka, rppt, surenb, mhocko, shuah, rostedt, mhiramat,
mathieu.desnoyers, liam, linux-mm, linux-kernel, linux-doc,
linux-kselftest, linux-trace-kernel, kernel-team, lance.yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>
On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>On 5/11/26 17:38, Breno Leitao wrote:
>> When get_hwpoison_page() returns a negative value, distinguish
>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>> and should be classified accordingly for proper handling.
>>
>> Sample PG_reserved before the get_hwpoison_page() call. In the
>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>> reference before returning -EIO, after which the underlying page may
>> have been freed and reallocated with page->flags reset; reading
>> PageReserved(p) at that point would observe stale or unrelated state.
>> The pre-call snapshot reflects what the page actually was at the
>> time of the failure event.
>>
>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>> mm/memory-failure.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 866c4428ac7ef..f112fb27a8ff6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>> unsigned long page_flags;
>> bool retry = true;
>> int hugetlb = 0;
>> + bool is_reserved;
>>
>> if (!sysctl_memory_failure_recovery)
>> panic("Memory failure on page %lx", pfn);
>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>> * In fact it's dangerous to directly bump up page count from 0,
>> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>> */
>> + /*
>> + * Pages with PG_reserved set are not currently managed by the
>> + * page allocator (memblock-reserved memory, driver reservations,
>> + * etc.), so classify them as kernel-owned for reporting.
>> + *
>> + * Sample the flag before get_hwpoison_page(): in the
>> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>> + * reference before returning -EIO, after which page->flags may
>> + * have been reset by the allocator.
>> + */
>> + is_reserved = PageReserved(p);
>> +
>> res = get_hwpoison_page(p, flags);
>> if (!res) {
>> if (is_free_buddy_page(p)) {
>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>> }
>> goto unlock_mutex;
>> } else if (res < 0) {
>> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>> + if (is_reserved)
>> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>> + else
>> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
>> + MF_IGNORED);
>> goto unlock_mutex;
>> }
>>
>>
>
>It's a bit odd that we need this handling when we already have handling for
>reserved pages in error_states[].
>
>HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>__get_hwpoison_page() ... would always fail? Making
>get_hwpoison_page()->get_any_page() always fail?
>
>But then, we never call identify_page_state()? And never call me_kernel()?
Looks like we never get that far ...
>This all looks very odd.
>
>Why would you even want to call get_hwpoison_page() in the first place if you
>find PageReserved?
Ah, I see :)
For a PG_reserved page, I would not expect PageLRU to be set, nor would
I expect it to be in the buddy allocator.
include/linux/page-flags.h also says:
"
Once (if ever) freed, PG_reserved is cleared and they will be given to
the page allocator.
"
So maybe special-case PageReserved() before get_hwpoison_page()?
Something like:
if (PageReserved(p)){
res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
goto unlock_mutex;
}
res = get_hwpoison_page(p, flags, &gp_status);
Cheers, Lance
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Breno Leitao @ 2026-05-12 13:04 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
kernel-team, Lance Yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>
On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
> > @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
> > unsigned long page_flags;
> > bool retry = true;
> > int hugetlb = 0;
> > + bool is_reserved;
> >
> > if (!sysctl_memory_failure_recovery)
> > panic("Memory failure on page %lx", pfn);
> > @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
> > * In fact it's dangerous to directly bump up page count from 0,
> > * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
> > */
> > + /*
> > + * Pages with PG_reserved set are not currently managed by the
> > + * page allocator (memblock-reserved memory, driver reservations,
> > + * etc.), so classify them as kernel-owned for reporting.
> > + *
> > + * Sample the flag before get_hwpoison_page(): in the
> > + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
> > + * reference before returning -EIO, after which page->flags may
> > + * have been reset by the allocator.
> > + */
> > + is_reserved = PageReserved(p);
> > +
> > res = get_hwpoison_page(p, flags);
> > if (!res) {
> > if (is_free_buddy_page(p)) {
> > @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
> > }
> > goto unlock_mutex;
> > } else if (res < 0) {
> > - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> > + if (is_reserved)
> > + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> > + else
> > + res = action_result(pfn, MF_MSG_GET_HWPOISON,
> > + MF_IGNORED);
> > goto unlock_mutex;
> > }
> >
> >
>
> It's a bit odd that we need this handling when we already have handling for
> reserved pages in error_states[].
>
> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
> __get_hwpoison_page() ... would always fail? Making
> get_hwpoison_page()->get_any_page() always fail?
>
> But then, we never call identify_page_state()? And never call me_kernel()?
From what I read, it seems that error_states[0] = { reserved, reserved, MF_MSG_KERNEL, me_kernel }
has been effectively dead code on the hwpoison-from-MCE path for a
while.
My v6 patch relabels the failure-path output to match what me_kernel() would
have reported anyway.
> This all looks very odd.
>
> Why would you even want to call get_hwpoison_page() in the first place if you
> find PageReserved?
Are you suggesting we should all the page action as soon as we detect the page
is reserved and get out?
Something as:
if (PageReserved(p)) {
res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
goto unlock_mutex;
}
res = get_hwpoison_page(p, flags);
Thanks for the review,
--breno
^ permalink raw reply
* Re: [PATCH v6 3/4] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-12 13:05 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
kernel-team
In-Reply-To: <8d4940bc-d8c4-4e7f-a35d-979e6a781966@kernel.org>
On Tue, May 12, 2026 at 10:22:38AM +0200, David Hildenbrand (Arm) wrote:
>
> > @@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
> > ++mf_stats->total;
> > }
> >
> > +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> > + enum mf_result result)
> > +{
> > + if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> > + return false;
> > +
> > + if (type == MF_MSG_KERNEL)
> > + return true;
> > +
> > + return false;
>
> return type == MF_MSG_KERNEL;
>
> might be simpler.
Ack, I will update once we decide about the other pendencies.
^ permalink raw reply
* Re: [RFC PATCH v2 06/10] rvgen: support reset() on the __init arrow for global-window HA clocks
From: Gabriele Monaco @ 2026-05-12 13:25 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <aa156a1c7696e054f8db57c48a26fa6ec1e17395.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> rvgen rejects a state invariant when its env is never reset on any
> state-transition edge. This prevents expressing monitors where a clock
> tracks the full monitoring window — reset once at object creation,
> active in all states.
>
> Allow reset() annotations on the __init_STATE -> STATE arrow.
> automata.py adds listed envs to the new env_init_started set (and to
> env_stored so the HA framework allocates per-object storage). dot2k.py
> uses env_init_started for three purposes:
>
> - Generate a handle_monitor_start() skeleton that resets the env and
> arms the timer after the caller sets up DA storage and initial state.
>
> - Guard ha_inv_to_guard calls with !ha_monitor_env_invalid() for these
> envs: a concurrent DA event between da_handle_start_event() and
> ha_reset_env() would otherwise store U64_MAX - BUDGET as the guard
> anchor, silently disabling enforcement.
>
> - Always generate ha_verify_guards() for monitors with invariants,
> providing a stable extension point for future per-event guards.
>
> Models without __init resets (e.g. stall.dot) are unaffected.
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
Mmh, I see that's an issue, but technically the init arrow isn't a real state
transition. In your case, you have a start condition that you labelled
"switch_in_tlob" although it isn't a switch in.
Why don't you make it a separate event (e.g. "start_tlob"), it seems to me you
cannot really have that occur multiple times, so doing the following wouldn't
harm and you wouldn't need to change how HA monitors work:
|
|
v
+-----------------------------+ start;reset(clk)
| running | -------------------+
switch_in | clk < BUDGET_NS() | |
+---------> | | <------------------+
| +-----------------------------+
| | |
| | sleep |
| v |
| |
| sleeping |
| clk < BUDGET_NS() |
| | preempt
| | |
| | wakeup |
| v |
| |
| waiting |
+---------- clk < BUDGET_NS() <+
Then you also wouldn't need to call reset() and start_timer() manually.
Isn't that feasible?
Thanks,
Gabriele
> ---
> tools/verification/rvgen/rvgen/automata.py | 26 ++++++
> tools/verification/rvgen/rvgen/dot2k.py | 100 +++++++++++++++++++--
> 2 files changed, 119 insertions(+), 7 deletions(-)
>
> diff --git a/tools/verification/rvgen/rvgen/automata.py
> b/tools/verification/rvgen/rvgen/automata.py
> index b9f8149f7118..178a1a4ffd8a 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -69,15 +69,41 @@ class Automata:
> self.states, self.initial_state, self.final_states =
> self.__get_state_variables()
> self.env_types = {}
> self.env_stored = set()
> + self.env_init_started = set()
> self.constraint_vars = set()
> self.self_loop_reset_events = set()
> self.events, self.envs = self.__get_event_variables()
> + self.__parse_init_resets()
> self.function, self.constraints = self.__create_matrix()
> self.events_start, self.events_start_run = self.__store_init_events()
> self.env_stored = sorted(self.env_stored)
> + self.env_init_started = sorted(self.env_init_started)
> self.constraint_vars = sorted(self.constraint_vars)
> self.self_loop_reset_events = sorted(self.self_loop_reset_events)
>
> + def __parse_init_resets(self) -> None:
> + """Parse reset() annotations on the __init_STATE -> STATE arrow.
> +
> + Adds each listed env to env_stored (HA framework allocates per-object
> + storage) and env_init_started (ha2k generates
> handle_monitor_start()).
> + """
> + init_prefix = f'"{self.init_marker}'
> + for line in map(str.lstrip, self.__dot_lines):
> + if not line.startswith(init_prefix):
> + continue
> + split_line = line.split()
> + if len(split_line) < 3 or split_line[1] != "->":
> + continue
> + if "label" not in line:
> + continue
> + label = "".join(split_line[split_line.index("label") + 2:-
> 1]).replace('"', '')
> + for part in label.split(";"):
> + reset_m = self.constraint_reset.search(part.strip())
> + if reset_m:
> + env = reset_m["env"]
> + self.env_stored.add(env)
> + self.env_init_started.add(env)
> +
> def __get_model_name(self) -> str:
> basename = ntpath.basename(self.__dot_path)
> if not basename.endswith(".dot") and not basename.endswith(".gv"):
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py
> b/tools/verification/rvgen/rvgen/dot2k.py
> index e6f476b903b0..e8066260c0af 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -366,7 +366,18 @@ f"""static inline void ha_convert_inv_guard(struct
> ha_monitor *ha_mon,
> conf_g = [e for s, e in conflict_guards if s == state]
> if not conf_i and not conf_g:
> continue
> - buff.append(f"\t{_else}if (curr_state ==
> {self.states[state]}{self.enum_suffix})")
> +
> + state_name = f"{self.states[state]}{self.enum_suffix}"
> + env_full = self.__get_constraint_env(constr)
> + env_bare = env_full[:-len(self.enum_suffix)]
> + if env_bare in self.env_init_started:
> + # env_store is ENV_INVALID_VALUE until
> handle_monitor_start();
> + # skip ha_inv_to_guard during the init race window.
> + cont = "\t\t " if _else else "\t "
> + buff.append(f"\t{_else}if (curr_state == {state_name} &&")
> + buff.append(f"{cont}!ha_monitor_env_invalid(ha_mon,
> {env_full}))")
> + else:
> + buff.append(f"\t{_else}if (curr_state == {state_name})")
>
> buff.append(f"\t\t{self.__start_to_conv(constr)};")
> _else = "else "
> @@ -376,16 +387,22 @@ f"""static inline void ha_convert_inv_guard(struct
> ha_monitor *ha_mon,
>
> def __fill_verify_guards_func(self) -> list[str]:
> buff = []
> - if not self.guards:
> + # Always generate for monitors with invariants: stable extension
> + # point for future guard conditions.
> + if not self.guards and not self.invariants:
> return []
>
> buff.append(
> f"""static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> \t\t\t\t enum {self.enum_states_def} curr_state, enum
> {self.enum_events_def} event,
> \t\t\t\t enum {self.enum_states_def} next_state, u64 time_ns)
> -{{
> -\tbool res = true;
> -""")
> +{{""")
> +
> + if not self.guards:
> + buff.append("\treturn true;\n}\n")
> + return buff
> +
> + buff.append("\tbool res = true;\n")
>
> _else = ""
> for edge, constr in sorted(self.guards.items()):
> @@ -522,7 +539,7 @@ f"""static bool ha_verify_constraint(struct ha_monitor
> *ha_mon,
> buff.append("\tha_convert_inv_guard(ha_mon, curr_state, event, "
> "next_state, time_ns);\n")
>
> - if self.guards:
> + if self.guards or self.invariants:
> buff.append("\tif (!ha_verify_guards(ha_mon, curr_state, event, "
> "next_state, time_ns))\n\t\treturn false;\n")
>
> @@ -599,8 +616,77 @@ f"""static bool ha_verify_constraint(struct ha_monitor
> *ha_mon,
> buff.append("}\n")
> return buff
>
> + def __fill_init_start_helper(self) -> list[str]:
> + """Generate handle_monitor_start() for envs reset on the __init
> arrow.
> +
> + env_store is invalid inside da_handle_start_event(); this helper must
> + be called after DA storage is allocated and initial state is set.
> + """
> + if not self.env_init_started:
> + return []
> +
> + # Collect the ha_start_timer call for each init-started env from the
> + # first state invariant that references it.
> + timer_calls: dict[str, str] = {}
> + for env in self.env_init_started:
> + env_full = f"{env}{self.enum_suffix}"
> + for constr in self.invariants.values():
> + if env_full in constr:
> + timer_calls[env] = constr
> + break
> +
> + buff = []
> + buff.append(
> +"""/*
> + * handle_monitor_start - reset per-object clock(s) and arm the timer.
> + *
> + * env_store is invalid inside da_handle_start_event(); call this helper
> + * after allocating DA storage and setting the initial DA state.
> + *
> + * XXX: replace the placeholders with the actual logic for your monitor.
> + */""")
> +
> + if self.monitor_type == "per_obj":
> + buff.append("static int handle_monitor_start(int id,
> monitor_target t)")
> + buff.append("{")
> + buff.append("\tstruct ha_monitor *ha_mon;")
> + buff.append("\tu64 time_ns = ktime_get_ns();\n")
> + buff.append("\t/* XXX: allocate DA storage, e.g.
> da_create_or_get(id, t) */")
> + buff.append("\t/* XXX: set initial DA state, e.g.
> da_handle_start_event(id, t, <event>) */")
> + buff.append("\tha_mon = /* XXX: retrieve ha_monitor for (id, t)
> */;")
> + elif self.monitor_type == "per_task":
> + buff.append("static int handle_monitor_start(struct task_struct
> *p)")
> + buff.append("{")
> + buff.append("\tstruct ha_monitor *ha_mon;")
> + buff.append("\tu64 time_ns = ktime_get_ns();\n")
> + buff.append("\t/* XXX: allocate DA storage, e.g.
> da_create_or_get(p->pid, p) */")
> + buff.append("\t/* XXX: set initial DA state, e.g.
> da_handle_start_event(p->pid, p, <event>) */")
> + buff.append("\tha_mon = /* XXX: retrieve ha_monitor for p */;")
> + else:
> + buff.append("static int handle_monitor_start(void)")
> + buff.append("{")
> + buff.append("\tstruct ha_monitor *ha_mon;")
> + buff.append("\tu64 time_ns = ktime_get_ns();\n")
> + buff.append("\tha_mon = /* XXX: retrieve global ha_monitor */;")
> +
> + buff.append("\tif (!ha_mon)")
> + buff.append("\t\treturn -ENOENT;")
> +
> + for env in self.env_init_started:
> + buff.append(f"\tha_reset_env(ha_mon, {env}{self.enum_suffix},
> time_ns);")
> + if env in timer_calls:
> + buff.append(f"\t{timer_calls[env]};")
> + else:
> + buff.append(f"\t/* XXX: arm timer for {env} */")
> +
> + buff.append("\treturn 0;")
> + buff.append("}\n")
> + return buff
> +
> def _fill_hybrid_definitions(self) -> list[str]:
> - return self.__fill_hybrid_get_reset_functions() +
> self.__fill_constr_func()
> + return (self.__fill_hybrid_get_reset_functions() +
> + self.__fill_init_start_helper() +
> + self.__fill_constr_func())
>
> def _fill_timer_type(self) -> list:
> if self.invariants:
^ permalink raw reply
* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: Breno Leitao @ 2026-05-12 13:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
kernel-team, Lance Yang
In-Reply-To: <28b01c14-3d87-4cab-b695-5b9015578785@kernel.org>
On Tue, May 12, 2026 at 10:21:50AM +0200, David Hildenbrand (Arm) wrote:
>
> > }
> > goto unlock_mutex;
> > } else if (res < 0) {
> > - if (is_reserved)
> > + /*
> > + * Promote a stable unhandlable kernel page diagnosed by
> > + * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
> > + * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
> > + */
> > + if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
> > res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>
>
> It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
> is unhandable if it is PG_reserved?
Are you saying that we should identify if the page is PG_reserved in
get_hwpoison_page() instead of in memory_failure(), as done in the
previous patch ("mm/memory-failure: report MF_MSG_KERNEL for reserved
pages") ?
> Why can't we just return a special error code from get_hwpoison_page()? We ahve
> plenty of errno values to chose from.
Something like:
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 866c4428ac7ef..0a6d83575833e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -878,7 +878,7 @@ static const char *action_name[] = {
};
static const char * const action_page_types[] = {
- [MF_MSG_KERNEL] = "reserved kernel page",
+ [MF_MSG_KERNEL] = "unrecoverable kernel page",
[MF_MSG_KERNEL_HIGH_ORDER] = "high-order kernel page",
[MF_MSG_HUGE] = "huge page",
[MF_MSG_FREE_HUGE] = "free huge page",
@@ -1394,6 +1394,21 @@ static int get_any_page(struct page *p, unsigned long flags)
int ret = 0, pass = 0;
bool count_increased = false;
+ if (PageReserved(p)) {
+ ret = -ENOTRECOVERABLE;
+ goto out;
+ }
+
if (flags & MF_COUNT_INCREASED)
count_increased = true;
@@ -1422,7 +1437,7 @@ static int get_any_page(struct page *p, unsigned long flags)
shake_page(p);
goto try_again;
}
- ret = -EIO;
+ ret = -ENOTRECOVERABLE;
goto out;
}
}
@@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
goto try_again;
}
put_page(p);
- ret = -EIO;
+ ret = -ENOTRECOVERABLE;
}
out:
- if (ret == -EIO)
+ if (ret == -EIO || ret == -ENOTRECOVERABLE)
pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
return ret;
@@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
}
goto unlock_mutex;
+ } else if (res == -ENOTRECOVERABLE) {
+ res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+ goto unlock_mutex;
} else if (res < 0) {
res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
goto unlock_mutex;
If that is what you are suggestion, maybe we can create another
MF_MSG_RESERVED? and another return value for get_any_page() to track
the reserve pages ?
Thanks for the review and suggestions,
--breno
^ permalink raw reply related
* Re: [PATCH] rtla: Stop the record trace on interrupt
From: Tomas Glozar @ 2026-05-12 13:50 UTC (permalink / raw)
To: Crystal Wood
Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
Wander Lairson Costa
In-Reply-To: <20260511223509.1483323-1-crwood@redhat.com>
út 12. 5. 2026 v 0:35 odesílatel Crystal Wood <crwood@redhat.com> napsal:
>
> Before, when rtla gets a signal, it stopped the main trace but not the
> record trace. save_trace_to_file() could also fail to keep up on a debug
> kernel -- and in any case, it adds post-stoppage noise to the trace file.
>
This affects only the "new" (since 6.19) --on-end trace option, right?
The regular --trace/--on-threshold trace should already disable the
instance in-kernel, as timerlat disables all instances set to the
tracer on stop_tracing threshold.
If so, that should be reflected in the commit message.
> Signed-off-by: Crystal Wood <crwood@redhat.com>
> ---
> tools/tracing/rtla/src/common.c | 19 +++++++++++--------
> tools/tracing/rtla/src/common.h | 1 -
> tools/tracing/rtla/src/timerlat.c | 2 +-
> 3 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
> index 35e3d3aa922e..effad523e8cf 100644
> --- a/tools/tracing/rtla/src/common.c
> +++ b/tools/tracing/rtla/src/common.c
> @@ -10,7 +10,7 @@
>
> #include "common.h"
>
> -struct trace_instance *trace_inst;
> +struct osnoise_tool *trace_tool;
> volatile int stop_tracing;
> int nr_cpus;
>
> @@ -21,12 +21,16 @@ static void stop_trace(int sig)
> * Stop requested twice in a row; abort event processing and
> * exit immediately
> */
> - tracefs_iterate_stop(trace_inst->inst);
> + if (trace_tool)
> + tracefs_iterate_stop(trace_tool->trace.inst);
Can this condition be ever false? The signal should be attached after
trace_inst is initialized. Of course, it doesn't hurt to have it there
for safety, as long as we keep in mind that it does not protect from
the trace instance being freed (which should be fixed by commit
be8058f31b4e already).
> return;
> }
> stop_tracing = 1;
> - if (trace_inst)
> - trace_instance_stop(trace_inst);
> + if (trace_tool) {
> + trace_instance_stop(&trace_tool->trace);
> + if (trace_tool->record)
> + trace_instance_stop(&trace_tool->record->trace);
> + }
> }
>
Otherwise this makes sense. The reason for doing trace_instance_stop()
in stop_trace() and not waiting for cleanup is that in tracefs mode,
the loop would get struck in tracefs_iterate_raw_events() if it cannot
process the events faster than they are created (see commit
c73cab9dbed and a4dfce7559d respectively). But it can cause a
discrepancy in the trace output, as you point out.
> /*
> @@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
> tool->params = params;
>
> /*
> - * Save trace instance into global variable so that SIGINT can stop
> - * the timerlat tracer.
> + * Expose the tool to signal handlers so they can stop the trace.
> * Otherwise, rtla could loop indefinitely when overloaded.
> */
> - trace_inst = &tool->trace;
> + trace_tool = tool;
>
> retval = ops->apply_config(tool);
> if (retval) {
> @@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
> goto out_free;
> }
>
> - retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
> + retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
> if (retval) {
> err_msg("Failed to enable %s tracer\n", ops->tracer);
> goto out_free;
> diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
> index 51665db4ffce..eba40b6d9504 100644
> --- a/tools/tracing/rtla/src/common.h
> +++ b/tools/tracing/rtla/src/common.h
> @@ -54,7 +54,6 @@ struct osnoise_context {
> int opt_workload;
> };
>
> -extern struct trace_instance *trace_inst;
> extern volatile int stop_tracing;
>
This is removed as it is not needed now after the consolidation I
assume, since all users are now in common.c? That could also be
mentioned in the commit message.
> struct hist_params {
> diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
> index f8c057518d22..637f68d684f5 100644
> --- a/tools/tracing/rtla/src/timerlat.c
> +++ b/tools/tracing/rtla/src/timerlat.c
> @@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
> * If the trace did not stop with --aa-only, at least print
> * the max known latency.
> */
> - max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
> + max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
Not sure why this used trace_inst instead of tool which is right
there, maybe again refactoring of the code?
> if (max_lat) {
> printf(" Max latency was %s\n", max_lat);
> free(max_lat);
> --
> 2.54.0
>
Otherwise, LGTM. You can also add:
Fixes: c73cab9dbed0 ("rtla/timerlat_hist: Stop timerlat tracer on signal")
Fixes: a4dfce7559d7 ("rtla/timerlat_top: Stop timerlat tracer on signal")
Fixes: 3aadb65db5d6 ("rtla/timerlat: Add action on end feature")
as this fixes two bugs, one with the post-stoppage noise (that is
present since the trace_instance_stop() was added), one with the
save_trace_to_file() not keeping up (which should be since the
--on-end option was added).
Tomas
^ permalink raw reply
* Re: [PATCH] rtla: Stop the record trace on interrupt
From: Tomas Glozar @ 2026-05-12 13:51 UTC (permalink / raw)
To: Crystal Wood
Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
Wander Lairson Costa
In-Reply-To: <CAP4=nvQEnCG2vabuA52p3bmGo7f1_-pZJ2b6rr5N=bKmaBuqxg@mail.gmail.com>
út 12. 5. 2026 v 15:50 odesílatel Tomas Glozar <tglozar@redhat.com> napsal:
>
> This affects only the "new" (since 6.19) --on-end trace option, right?
s/6.19/6.17/
^ permalink raw reply
* [PATCH v3] tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
From: David Carlier @ 2026-05-12 13:54 UTC (permalink / raw)
To: catalin.marinas, will
Cc: rostedt, mhiramat, mathieu.desnoyers, vdonnefort, ryan.roberts,
maz, linux-arm-kernel, linux-trace-kernel, linux-kernel,
David Carlier
nr_subbufs in the ring buffer metadata is always initialized to zero
because it is assigned from cpu_buffer->nr_pages before the page
initialization loop has run. While nr_subbufs is not currently read
by the kernel, it should reflect the actual buffer geometry in the
meta page for correctness.
Move the assignment after the page loop so that cpu_buffer->nr_pages
holds the final count.
Fixes: 34e5b958bdad ("tracing: Introduce simple_ring_buffer")
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: David Carlier <devnexen@gmail.com>
---
kernel/trace/simple_ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/simple_ring_buffer.c b/kernel/trace/simple_ring_buffer.c
index 02af2297ae5a..f731f14d0ff7 100644
--- a/kernel/trace/simple_ring_buffer.c
+++ b/kernel/trace/simple_ring_buffer.c
@@ -395,7 +395,6 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
memset(cpu_buffer->meta, 0, sizeof(*cpu_buffer->meta));
cpu_buffer->meta->meta_page_size = PAGE_SIZE;
- cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages;
/* The reader page is not part of the ring initially */
page = load_page(desc->page_va[0]);
@@ -437,6 +436,7 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
return ret;
}
+ cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages;
/* Close the ring */
bpage->link.next = &cpu_buffer->tail_page->link;
cpu_buffer->tail_page->link.prev = &bpage->link;
--
2.53.0
^ permalink raw reply related
* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Breno Leitao @ 2026-05-12 13:54 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Dan Carpenter, kernel-janitors, Linux Trace Kernel, linux-kernel
In-Reply-To: <20260512091638.8b95253ab022d7dabf473465@kernel.org>
On Tue, May 12, 2026 at 09:16:38AM +0900, Masami Hiramatsu wrote:
> Hi Dan,
>
> Thanks for reporting. A similar problem is pointed by Sashiko [1].
>
> [1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
>
> On Fri, 8 May 2026 20:07:25 +0300
> Dan Carpenter <error27@gmail.com> wrote:
>
> > Hello Masami Hiramatsu,
> >
> > Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> > for kernel command line") from Jan 11, 2020 (linux-next), leads to
> > the following Smatch static checker warning:
> >
> > init/main.c:368 xbc_snprint_cmdline()
For your information, I am moving this function to lib/bootconfig.
https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org/
I understand that no action is required on this report, correct?
Thanks,
--breno
^ permalink raw reply
* [PATCH 0/9] rv: Fixes on Deterministic and Hybrid Automata
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Gabriele Monaco, Nam Cao, Wen Yang, linux-trace-kernel
Fix issues that were reported by bots or visible only after integration:
* Make sure timers are always terminated and waited for when disabling
the monitor or when the target terminates
* Run per-cpu monitors with migration disabled since preemption is now
enabled from tracepoints
* Fix a wrong __user specifier in a helper function
* Other cleanup and concurrency issues
Cc: Nam Cao <namcao@linutronix.de>
Cc: Wen Yang <wen.yang@linux.dev>
Cc: linux-trace-kernel@vger.kernel.org
Gabriele Monaco (9):
rv: Fix __user specifier usage in extract_params()
rv: Fix read_lock scope in per-task DA cleanup
rv: Reset per-task DA monitors before releasing the slot
rv: Prevent task migration while handling per-CPU events
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Ensure synchronous cleanup for HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Mandate deallocation for per-obj monitors
include/rv/da_monitor.h | 66 ++++++++++---
include/rv/ha_monitor.h | 93 ++++++++++++++++++-
kernel/trace/rv/monitors/deadline/deadline.h | 8 +-
kernel/trace/rv/monitors/nomiss/nomiss.c | 4 +-
kernel/trace/rv/monitors/opid/opid.c | 4 +-
kernel/trace/rv/monitors/stall/stall.c | 4 +-
.../rvgen/rvgen/templates/dot2k/main.c | 4 +-
7 files changed, 157 insertions(+), 26 deletions(-)
base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
--
2.54.0
^ permalink raw reply
* [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
Nam Cao, linux-trace-kernel
Cc: kernel test robot, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
The attributes variables extracted from syscalls in the helper are both
defined with the __user specifier although only the actual pointer to
user data should be marked.
Remove the __user specifier from attr.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 0bbfd2543329..78fca873d61e 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
{
size_t size = offsetofend(struct sched_attr, sched_flags);
- struct sched_attr __user *uattr, attr;
+ struct sched_attr __user *uattr;
+ struct sched_attr attr;
int new_policy = -1, ret;
unsigned long args[6];
--
2.54.0
^ permalink raw reply related
* [PATCH 2/9] rv: Fix read_lock scope in per-task DA cleanup
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
The da_monitor_reset_all() function for per-task monitors takes
tasklist_lock while iterating over tasks, then keeps it also while
iterating over idle tasks (one per CPU). The latter is not necessary
since the lock needs to guard only for_each_process_thread().
Use a scoped_guard for more compact syntax and adjust the scope only
where the lock is necessary.
Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
Fixes: 8259cb14a7068 ("rv: Reset per-task monitors also for idle tasks")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 39765ff6f098..250888812125 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -272,12 +272,12 @@ static void da_monitor_reset_all(void)
struct task_struct *g, *p;
int cpu;
- read_lock(&tasklist_lock);
- for_each_process_thread(g, p)
- da_monitor_reset(da_get_monitor(p));
+ scoped_guard(read_lock, &tasklist_lock) {
+ for_each_process_thread(g, p)
+ da_monitor_reset(da_get_monitor(p));
+ }
for_each_present_cpu(cpu)
da_monitor_reset(da_get_monitor(idle_task(cpu)));
- read_unlock(&tasklist_lock);
}
/*
--
2.54.0
^ permalink raw reply related
* [PATCH 3/9] rv: Reset per-task DA monitors before releasing the slot
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
Per-task monitors use task_mon_slot to determine which slot in the array
to use for the monitor. During destruction, this slot is returned but
this is done before resetting the monitor. As a result, the monitor's
reset is in fact resetting a slot that is outside of the array
(RV_PER_TASK_MONITOR_INIT).
Release the slot only after the reset to avoid out-of-bound memory
access.
Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
Fixes: 792575348ff70 ("rv/include: Add deterministic automata monitor definition via C macros")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 250888812125..0b7028df08fb 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -309,10 +309,11 @@ static inline void da_monitor_destroy(void)
WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
return;
}
- rv_put_task_monitor_slot(task_mon_slot);
- task_mon_slot = RV_PER_TASK_MONITOR_INIT;
da_monitor_reset_all();
+
+ rv_put_task_monitor_slot(task_mon_slot);
+ task_mon_slot = RV_PER_TASK_MONITOR_INIT;
}
#elif RV_MON_TYPE == RV_MON_PER_OBJ
--
2.54.0
^ permalink raw reply related
* [PATCH 4/9] rv: Prevent task migration while handling per-CPU events
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
Cc: Nam Cao, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
Tracepoint handlers are now fully preemptible. When a per-CPU monitor
handles an event, it retrieves the monitor state using a per-CPU
pointer. If the event itself doesn't disable preemption, the task can
migrate to a different CPU and we risk updating the wrong monitor.
Mitigate this by explicitly disabling task migration before acquiring
the monitor pointer. This cannot guarantee the monitor runs on the
correct CPU but reduces the race condition window and prevents warnings.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 0b7028df08fb..a9fd284195ee 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -181,6 +181,10 @@ static inline void da_monitor_destroy(void)
da_monitor_reset_all();
}
+#ifndef da_implicit_guard
+#define da_implicit_guard()
+#endif
+
#elif RV_MON_TYPE == RV_MON_PER_CPU
/*
* Functions to define, init and get a per-cpu monitor.
@@ -230,6 +234,10 @@ static inline void da_monitor_destroy(void)
da_monitor_reset_all();
}
+#ifndef da_implicit_guard
+#define da_implicit_guard() guard(migrate)()
+#endif
+
#elif RV_MON_TYPE == RV_MON_PER_TASK
/*
* Functions to define, init and get a per-task monitor.
@@ -677,6 +685,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
*/
static inline void da_handle_event(enum events event)
{
+ da_implicit_guard();
__da_handle_event(da_get_monitor(), event, 0);
}
@@ -692,6 +701,7 @@ static inline void da_handle_event(enum events event)
*/
static inline bool da_handle_start_event(enum events event)
{
+ da_implicit_guard();
return __da_handle_start_event(da_get_monitor(), event, 0);
}
@@ -703,6 +713,7 @@ static inline bool da_handle_start_event(enum events event)
*/
static inline bool da_handle_start_run_event(enum events event)
{
+ da_implicit_guard();
return __da_handle_start_run_event(da_get_monitor(), event, 0);
}
--
2.54.0
^ permalink raw reply related
* [PATCH 5/9] rv: Ensure all pending probes terminate on per-obj monitor destroy
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
The monitor disable/destroy sequence detaches all probes and resets the
monitor's data, however it doesn't wait for pending probes. This is an
issue with per-object monitors, which free the monitor storage.
Call tracepoint_synchronize_unregister() to make sure to wait for all
pending probes before destroying the monitor storage.
Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a9fd284195ee..a4a13b62d1a4 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -515,9 +515,10 @@ static inline void da_monitor_destroy(void)
struct hlist_node *tmp;
int bkt;
+ tracepoint_synchronize_unregister();
/*
- * This function is called after all probes are disabled, we need only
- * worry about concurrency against old events.
+ * This function is called after all probes are disabled and no longer
+ * pending, we can safely assume no concurrent user.
*/
synchronize_rcu();
hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
--
2.54.0
^ permalink raw reply related
* [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
HA monitors may start timers, all cleanup functions currently stop the
timers asynchronously to avoid sleeping in the wrong context.
Nothing makes sure running callbacks terminate on cleanup.
Run the entire HA timer callback in an RCU read-side critical section,
this way we can simply synchronize_rcu() with any pending timer and are
sure any cleanup using kfree_rcu() runs after callbacks terminated.
Additionally make sure any unlikely callback running late won't run any
code if the monitor is marked as disabled.
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 23 +++++++++++++++++++----
include/rv/ha_monitor.h | 18 ++++++++++++++++--
2 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a4a13b62d1a4..402d3b935c08 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
#define da_monitor_reset_hook(da_mon)
#endif
+/*
+ * Hook to allow the implementation of hybrid automata: define it with a
+ * function that waits for the termination of all monitors background
+ * activities (e.g. all timers). This hook can sleep.
+ */
+#ifndef da_monitor_sync_hook
+#define da_monitor_sync_hook()
+#endif
+
/*
* Type for the target id, default to int but can be overridden.
* A long type can work as hash table key (PER_OBJ) but will be downgraded to
@@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
static inline void da_monitor_destroy(void)
{
da_monitor_reset_all();
+ da_monitor_sync_hook();
}
#ifndef da_implicit_guard
@@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
static inline void da_monitor_destroy(void)
{
da_monitor_reset_all();
+ da_monitor_sync_hook();
}
#ifndef da_implicit_guard
@@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
}
da_monitor_reset_all();
+ da_monitor_sync_hook();
rv_put_task_monitor_slot(task_mon_slot);
task_mon_slot = RV_PER_TASK_MONITOR_INIT;
@@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
struct da_monitor_storage *mon_storage;
int bkt;
- rcu_read_lock();
+ guard(rcu)();
hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
da_monitor_reset(&mon_storage->rv.da_mon);
- rcu_read_unlock();
}
static inline int da_monitor_init(void)
@@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
int bkt;
tracepoint_synchronize_unregister();
+ scoped_guard(rcu) {
+ hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
+ da_monitor_reset_hook(&mon_storage->rv.da_mon);
+ }
+ }
+ da_monitor_sync_hook();
/*
* This function is called after all probes are disabled and no longer
* pending, we can safely assume no concurrent user.
*/
- synchronize_rcu();
hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
- da_monitor_reset_hook(&mon_storage->rv.da_mon);
hash_del_rcu(&mon_storage->node);
kfree(mon_storage);
}
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index d59507e8cb30..47ff1a41febe 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
#define da_monitor_event_hook ha_monitor_handle_constraint
#define da_monitor_init_hook ha_monitor_init_env
#define da_monitor_reset_hook ha_monitor_reset_env
+#define da_monitor_sync_hook() synchronize_rcu()
#include <rv/da_monitor.h>
#include <linux/seq_buf.h>
@@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
return false;
}
+/*
+ * __ha_monitor_timer_callback - generic callback representation
+ *
+ * This callback runs in an RCU read-side critical section to allow the
+ * destruction sequence to easily synchronize_rcu() with all pending timer
+ * after asynchronously disabling them.
+ */
static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
{
- enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
- u64 time_ns = ha_get_ns();
+ enum states curr_state;
+ u64 time_ns;
+
+ if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
+ return;
+ guard(rcu)();
+ curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
+ time_ns = ha_get_ns();
ha_get_env_string(&env_string, ha_mon, time_ns);
ha_react(curr_state, EVENT_NONE, env_string.buffer);
ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
--
2.54.0
^ permalink raw reply related
* [PATCH 7/9] rv: Do not rely on clean monitor when initialising HA
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
Nam Cao, linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
Hybrid Automata monitors hook into the DA implementation when doing
da_monitor_reset(). This function is called both on initialisation and
teardown, HA monitors try to cancel a timer only when it's initialised
relying on the da_mon->monitoring flag. This flag could however be
corrupted during initialisation. This happens for instance on per-task
monitors that share the same storage with different type of monitors
like LTL or in case of races during a previous teardown.
Stop relying on the monitoring flag during initialisation, assume that
can have any value, so skip timer cancellation in any case when a local
flag is set. New monitors (e.g. new tasks) are always zero-initialised
so they are safe.
Reported-by: Wen Yang <wen.yang@linux.dev>
Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/ha_monitor.h | 31 ++++++++++++++++++-
kernel/trace/rv/monitors/nomiss/nomiss.c | 4 +--
kernel/trace/rv/monitors/opid/opid.c | 4 +--
kernel/trace/rv/monitors/stall/stall.c | 4 +--
.../rvgen/rvgen/templates/dot2k/main.c | 4 +--
5 files changed, 38 insertions(+), 9 deletions(-)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 47ff1a41febe..11ae85bad492 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -116,6 +116,35 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
#define ha_get_ns() 0
#endif /* HA_CLK_NS */
+static bool ha_mon_initializing;
+
+static int ha_monitor_init(void)
+{
+ int ret;
+
+ ha_mon_initializing = true;
+ ret = da_monitor_init();
+ ha_mon_initializing = false;
+ return ret;
+}
+
+static void ha_monitor_destroy(void)
+{
+ da_monitor_destroy();
+}
+
+/*
+ * ha_monitor_uninitialized - are fields like the timer initialized?
+ *
+ * On a clean monitor, we can assume an active monitor (monitoring) is
+ * initialized, however the monitoring field cannot be trusted during
+ * initialization.
+ */
+static inline bool ha_monitor_uninitialized(struct da_monitor *da_mon)
+{
+ return ha_mon_initializing || !da_monitoring(da_mon);
+}
+
/* Should be supplied by the monitor */
static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env, u64 time_ns);
static bool ha_verify_constraint(struct ha_monitor *ha_mon,
@@ -160,7 +189,7 @@ static inline void ha_monitor_reset_env(struct da_monitor *da_mon)
struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
/* Initialisation resets the monitor before initialising the timer */
- if (likely(da_monitoring(da_mon)))
+ if (likely(!ha_monitor_uninitialized(da_mon)))
ha_cancel_timer(ha_mon);
}
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
index 31f90f3638d8..8ead8783c29f 100644
--- a/kernel/trace/rv/monitors/nomiss/nomiss.c
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -227,7 +227,7 @@ static int enable_nomiss(void)
{
int retval;
- retval = da_monitor_init();
+ retval = ha_monitor_init();
if (retval)
return retval;
@@ -263,7 +263,7 @@ static void disable_nomiss(void)
rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
- da_monitor_destroy();
+ ha_monitor_destroy();
}
static struct rv_monitor rv_this = {
diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 4594c7c46601..2922318c6112 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -73,7 +73,7 @@ static int enable_opid(void)
{
int retval;
- retval = da_monitor_init();
+ retval = ha_monitor_init();
if (retval)
return retval;
@@ -90,7 +90,7 @@ static void disable_opid(void)
rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
- da_monitor_destroy();
+ ha_monitor_destroy();
}
/*
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
index 9ccfda6b0e73..3c38fb1a0159 100644
--- a/kernel/trace/rv/monitors/stall/stall.c
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -103,7 +103,7 @@ static int enable_stall(void)
{
int retval;
- retval = da_monitor_init();
+ retval = ha_monitor_init();
if (retval)
return retval;
@@ -120,7 +120,7 @@ static void disable_stall(void)
rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
- da_monitor_destroy();
+ ha_monitor_destroy();
}
static struct rv_monitor rv_this = {
diff --git a/tools/verification/rvgen/rvgen/templates/dot2k/main.c b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
index bf0999f6657a..889446760e3c 100644
--- a/tools/verification/rvgen/rvgen/templates/dot2k/main.c
+++ b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
@@ -35,7 +35,7 @@ static int enable_%%MODEL_NAME%%(void)
{
int retval;
- retval = da_monitor_init();
+ retval = %%MONITOR_CLASS%%_monitor_init();
if (retval)
return retval;
@@ -50,7 +50,7 @@ static void disable_%%MODEL_NAME%%(void)
%%TRACEPOINT_DETACH%%
- da_monitor_destroy();
+ %%MONITOR_CLASS%%_monitor_destroy();
}
/*
--
2.54.0
^ permalink raw reply related
* [PATCH 8/9] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
Hybrid automata monitors may start timers, depending on the model, these
may remain active on an exiting task and cause false positives or even
access freed memory.
Add an enable/disable hook in the HA code, currently only populated by
the per-task handler for registration and deregistration.
This hooks to the sched_process_exit event and ensures the timer is
stopped for every exiting task. The handler is enabled automatically but
may be disabled, for instance if the monitor uses the event for another
purpose (but should still manually ensure timers are stopped).
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/ha_monitor.h | 44 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 11ae85bad492..1bdf866e9c63 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
static inline void ha_setup_timer(struct ha_monitor *ha_mon);
static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
enum states curr_state,
enum events event,
@@ -38,6 +39,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
#define da_monitor_reset_hook ha_monitor_reset_env
#define da_monitor_sync_hook() synchronize_rcu()
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+/*
+ * Automatic cleanup handlers for per-task HA monitors, only skip if you know
+ * what you are doing (e.g. you want to implement cleanup manually in another
+ * handler doing more things).
+ */
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+ bool group_dead);
+
+#define ha_monitor_enable_hook() \
+ rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+ ha_handle_sched_process_exit)
+#define ha_monitor_disable_hook() \
+ rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+ ha_handle_sched_process_exit)
+#else
+#define ha_monitor_enable_hook()
+#define ha_monitor_disable_hook()
+#endif
+
#include <rv/da_monitor.h>
#include <linux/seq_buf.h>
@@ -124,12 +145,14 @@ static int ha_monitor_init(void)
ha_mon_initializing = true;
ret = da_monitor_init();
+ ha_monitor_enable_hook();
ha_mon_initializing = false;
return ret;
}
static void ha_monitor_destroy(void)
{
+ ha_monitor_disable_hook();
da_monitor_destroy();
}
@@ -230,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
{
CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
}
+
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+ bool group_dead)
+{
+ struct da_monitor *da_mon = da_get_monitor(p);
+
+ if (likely(!ha_monitor_uninitialized(da_mon)))
+ ha_cancel_timer_sync(to_ha_monitor(da_mon));
+}
+#endif
+
#endif /* RV_MON_TYPE */
/*
@@ -455,6 +490,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return timer_delete(&ha_mon->timer);
}
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+ timer_delete_sync(&ha_mon->timer);
+}
#elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
/*
* Helper functions to handle the monitor timer.
@@ -506,6 +545,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
}
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+ hrtimer_cancel(&ha_mon->hrtimer);
+}
#else /* HA_TIMER_NONE */
/*
* Start function is intentionally not defined, monitors using timers must
@@ -516,6 +559,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
{
return false;
}
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
#endif
#endif
--
2.54.0
^ permalink raw reply related
* [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
linux-trace-kernel
Cc: Nam Cao, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>
The per-object monitors use a hash tables and dynamic allocation of the
monitor storage, functions to clean a monitor that is no longer needed
are provided but nothing ensures the monitor actually uses them.
Remove the inline specifier on the deallocation function to let the
compiler warn in case it isn't referenced. If the monitor really doesn't
need one (for instance because instances will never cease to exist
before disabling the monitor), the da_skip_deallocation() helper macro
can be used to silence the warning.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/rv/da_monitor.h | 14 +++++++++++++-
kernel/trace/rv/monitors/deadline/deadline.h | 5 ++++-
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 402d3b935c08..378d23ab7dfb 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -489,8 +489,11 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
* locks.
* This function includes an RCU read-side critical section to synchronise
* against da_monitor_destroy().
+ * NOTE: inline is omitted on purpose to let the compiler warn if this function
+ * is never referenced. For monitors that don't require a deallocation hook,
+ * da_skip_deallocation() can be used.
*/
-static inline void da_destroy_storage(da_id_type id)
+static void da_destroy_storage(da_id_type id)
{
struct da_monitor_storage *mon_storage;
@@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
kfree_rcu(mon_storage, rcu);
}
+/*
+ * da_skip_deallocation - explicitly mark a deallocation function as not required
+ *
+ * Only use when you are absolutely sure the monitor doesn't require a
+ * deallocation hook (i.e. it's not possible for an object to finish existing
+ * when the monitor is still running).
+ */
+#define da_skip_deallocation(hook) ((void)hook)
+
static void da_monitor_reset_all(void)
{
struct da_monitor_storage *mon_storage;
diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 78fca873d61e..c39fd79148c2 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data, struct task_struct *task,
da_create_storage(EXPAND_ID_TASK(task), NULL);
}
-static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
+/*
+ * Deallocation hook, use da_skip_deallocation() when not necessary
+ */
+static void handle_exit(void *data, struct task_struct *p, bool group_dead)
{
if (p->policy == SCHED_DEADLINE)
da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));
--
2.54.0
^ permalink raw reply related
* [PATCH] tracing: Fix unload_page for simple_ring_buffer init rollback
From: Vincent Donnefort @ 2026-05-12 14:16 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
Cc: kernel-team, linux-kernel, Vincent Donnefort
The unload_page callback expects the return value of load_page() as its
argument: ret = load_page(va); unload(ret). Fix the rollback code in
simple_ring_buffer_init_mm() where the descriptor's VA is used instead
of the loaded page address.
Fixes: 635923081c79 ("tracing: load/unload page callbacks for simple_ring_buffer")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/simple_ring_buffer.c b/kernel/trace/simple_ring_buffer.c
index 02af2297ae5a..38cf9abe0be8 100644
--- a/kernel/trace/simple_ring_buffer.c
+++ b/kernel/trace/simple_ring_buffer.c
@@ -431,7 +431,7 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
if (ret) {
for (i--; i >= 0; i--)
- unload_page((void *)desc->page_va[i]);
+ unload_page(bpages[i].page);
unload_page(cpu_buffer->meta);
return ret;
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [RFC PATCH v2 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-12 14:36 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
linux-trace-kernel
TL; DR
======
Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring. In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and
PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.
Background: High Cost of Page Level Properties Monitoring
=========================================================
DAMON is initially introduced as a Data Access MONitor. It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS). But still the monitoring part is only for
data accesses.
Data access patterns is good information, but some users need more
holistic views. Particularly, users want to show the access pattern
information together with the types of the memory. For example, users
who work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages. Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.
For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14. Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters. Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.
This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size. It was useful for test or debugging
purposes on a small number of machines. But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches. For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet. If the runtime and the size of the
fleet is long and big enough, it should provide statistically meaningful
data.
But users are too busy to implement such controls on their own.
Data Attributes Monitoring
==========================
Extend DAMON to monitor not only data accesses, but also general data
attributes. Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.
Allow users to specify what data attributes in addition to the data
access they want to monitor. Users can install one 'data probe' per
data attribute of their interest for this purpose. The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute. E.g., if memory of physical
address 42 belongs to cgroup A. Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.
When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered.
Same to the number of access check-positive samples accounting
(nr_accesses), it accounts the number of each data probe-positive
samples in another per-region counters array, namely 'probe_hits'. When
DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.
Users can read 'probe_hits' just before the values are reset. In this
way, users can know how many hot/cold memory regions have data
attributes of their interest. E.g., 30 percent of this system's hot
memory is belonging to cgroup A, and 80 percent of the cgroup
A-belonging hot memory is backed by huge pages.
Patches Sequence
================
First eight patches implement the core feature, interface and the
working support. Patch 1 introduces data probe data structure, namely
damon_probe. Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter. Patch 4 updates damon_ctx commit function
to handle the probes. Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits. Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation. Patch 7 updates kdamond_fn() to invoke the probes
applying callback. Patch 8 finally implements the probes support on
paddr ops.
Ten changes for user interface (patches 9-18) come next. Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively. Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively. Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.
Patch 19 adds a selftest for the sysfs files.
Patches 20 and 21 documents the design and usage of the new feature,
respectively.
Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow. Depending on the feedback, this part might be separated
to another series in future. Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG. Patch 23 add the
support on paddr ops. Patch 24 updates the sysfs interface for setup of
the target memcg. Patch 25 move code for easy reuse of the filter
target memcg setup. Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.
Discussions
===========
This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads. Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimum.
Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information. When in
doubt of the sampling based information, running DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.
Plan for Dropping RFC tag
=========================
I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).
Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.
I'm currently hoping to drop the RFC tag by 7.2-rc1.
Future Works: Mid Term
========================
This version of implementation is limiting the maximum number of data
probes to four. I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.
Future Works: Long Term
=======================
There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring. For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive. Actually there is
another ongoing work [3] for extending DAMON with PMU events. The
motivation of the work is reducing the overhead, though.
In my work [2], I was introducing a new interface for access sampling
primitives control. Now I think this data probe interface can be used
for that, too. That is, data access becomes just one type of data
attribute. Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.
The regions adjustment mechanism is currently working based on the
access information. That's because DAMON is designed for data access
monitoring. That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.
Once data access becomes just one of data attributes, there is no reason
to think data access that special. There might be some users not
interested in access at all but want to know the location of memory of
specific type. Data probes interface will allow doing that. Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute. Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.
DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes. From this
stage, we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".
[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.
SeongJae Park (28):
mm/damon/core: introduce struct damon_probe
mm/damon/core: embed damon_probe objects in damon_ctx
mm/damon/core: introduce damon_filter
mm/damon/core: commit probes
mm/damon/core: introduce damon_region->probe_hits
mm/damon/core: introduce damon_ops->apply_probes
mm/damon/core: do data attributes monitoring
mm/damon/paddr: support data attributes monitoring
mm/damon/sysfs: implement probes dir
mm/damon/sysfs: implement probe dir
mm/damon/sysfs: implement filters directory
mm/damon/sysfs: implement filter dir
mm/damon/sysfs: implement filter dir files
mm/damon/sysfs: setup probes on DAMON core API parameters
mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
mm/damon/sysfs-schemes: implement probe dir
mm/damon/sysfs-schemes: implement probe/hits file
mm/damon: trace probe_hits
selftests/damon/sysfs.sh: test probes dir
Docs/mm/damon/design: document data attributes monitoring
Docs/admin-guide/mm/damon/usage: document data attributes monitoring
mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
mm/damon/sysfs: add filters/<F>/path file
mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
mm/damon/sysfs: setup damon_filter->memcg_id from path
Docs/mm/damon/design: update for memcg damon filter
Docs/admin-guide/mm/damon/usage: update for memcg damon filter
Documentation/admin-guide/mm/damon/usage.rst | 48 +-
Documentation/mm/damon/design.rst | 39 ++
include/linux/damon.h | 67 +++
include/trace/events/damon.h | 36 ++
mm/damon/core.c | 195 +++++++
mm/damon/paddr.c | 76 +++
mm/damon/sysfs-common.c | 41 ++
mm/damon/sysfs-common.h | 2 +
mm/damon/sysfs-schemes.c | 222 ++++++--
mm/damon/sysfs.c | 557 +++++++++++++++++++
tools/testing/selftests/damon/sysfs.sh | 48 ++
11 files changed, 1280 insertions(+), 51 deletions(-)
base-commit: 610724cfd93c1c413faf9e5bb63926fe54849887
--
2.47.3
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox