Linux Trace Kernel

* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
From: Jesper Dangaard Brouer @ 2026-05-13 15:00 UTC
  To: David Hildenbrand (Arm), Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <d440b00b-6d7b-48cb-b37d-43f6f885ed01@kernel.org>



On 08/05/2026 22.15, David Hildenbrand (Arm) wrote:
> On 5/8/26 18:22, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> Add a selftest to verify the kmem:mm_zone_lock_contended,
>> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
>>
>> The test has two components:
>>
>> zone_lock_contention.c - a workload that spawns threads doing rapid
>> page allocation and freeing to generate zone->lock contention. It
>> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
>> free_pcppages_bulk() and rmqueue_bulk() calls.
>>
>> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
>> exist, have the expected fields, fire under load, and that wait_ns
>> is populated when contention occurs.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   tools/testing/selftests/mm/Makefile           |   2 +
>>   .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>>   .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
> 
> This really looks excessive and ... not really how we usually treat tracepoints?
> 
> I don't know about others, but I don't think this is really what we want as a MM
> selftest.
> 

I wanted a program that tested the code I changed, so I simply had AI
write a verification test and asked it to create it as a selftest, which
I ran to verify that my code change was correctly implemented.  Because
I needed to trigger lock contention, the test is more advanced, but
luckily the AI solved it on only the 2nd attempt.

It makes sense to drop this patch. We shouldn't keep this code in the
kernel tree; it simply verified that my code works. There is little
chance that this test will catch meaningful regressions for these
tracepoints.

--Jesper


* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Steven Rostedt @ 2026-05-13 14:59 UTC
  To: Vincent Donnefort
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Arnd Bergmann, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260323105646.590718-1-arnd@kernel.org>


Vincent,

Is this patch needed? That is, did it fall through the cracks?

-- Steve

On Mon, 23 Mar 2026 11:56:41 +0100
Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> 
> The now more verbose check found another symbol missing from the whitelist:
> 
> Unexpected symbols in kernel/trace/simple_ring_buffer.o:
>          U __aeabi_unwind_cpp_pr1
> 
> Add this to the Makefile.
> 
> Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  kernel/trace/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index d662c1a64cd5..aba6a25db17b 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -169,8 +169,8 @@ targets += undefsyms_base.o
>  # because it is not linked into vmlinux.
>  KASAN_SANITIZE_undefsyms_base.o := y
>  
> -UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> -		      __msan simple_ring_buffer \
> +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
> +		      __x86_indirect_thunk __aeabi_unwind_cpp simple_ring_buffer \
>  		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
>  
>  quiet_cmd_check_undefined = NM      $<



* Re: [PATCH v4 2/3] fgraph: Enhance funcgraph-retval with BTF-based type-aware output
From: Steven Rostedt @ 2026-05-13 14:47 UTC
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-3-dolinux.peng@gmail.com>

On Mon, 15 Dec 2025 11:41:52 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The current funcgraph-retval implementation suffers from two accuracy
> issues:
> 
> 1. Void-returning functions still print a return value, creating
>    misleading noise in the trace output.
> 
> 2. For functions returning narrower types (e.g., char, short), the
>    displayed value can be incorrect because high bits of the register
>    may contain undefined data.
> 
> This patch addresses both problems by leveraging BTF to obtain the exact
> return type of each traced kernel function. The key changes are:
> 
> 1. Void function filtering: Functions with void return type no longer
>    display any return value in the trace output, eliminating unnecessary
>    clutter.
> 
> 2. Type-aware value formatting: The return value is now properly truncated
>    to match the actual width of the return type before being displayed.
>    Additionally, the value is formatted according to its type for better
>    human readability.
> 
> Here is an output comparison:
> 
> Before:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  1)               |   touch_atime() {
>  1)               |     atime_needs_update() {
>  1)   0.069 us    |       make_vfsuid(); /* ret=0x0 */
>  1)   0.067 us    |       make_vfsgid(); /* ret=0x0 */
>  1)               |       current_time() {
>  1)   0.197 us    |         ktime_get_coarse_real_ts64_mg(); /* ret=0x187f886aec3ed6f5 */
>  1)   0.352 us    |       } /* current_time ret=0x69380753 */
>  1)   0.792 us    |     } /* atime_needs_update ret=0x0 */
>  1)   0.937 us    |   } /* touch_atime ret=0x0 */
> 
> After:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  2)               |   touch_atime() {
>  2)               |     atime_needs_update() {
>  2)   0.070 us    |       make_vfsuid(); /* ret=0x0 */
>  2)   0.070 us    |       make_vfsgid(); /* ret=0x0 */
>  2)               |       current_time() {
>  2)   0.162 us    |         ktime_get_coarse_real_ts64_mg();
>  2)   0.312 us    |       } /* current_time ret=0x69380649(trunc) */
>  2)   0.753 us    |     } /* atime_needs_update ret=false */
>  2)   0.899 us    |   } /* touch_atime */
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/trace_functions_graph.c | 124 ++++++++++++++++++++++++---
>  1 file changed, 111 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 17c75cf2348e..46b66b1cfc16 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -15,6 +15,7 @@
>  
>  #include "trace.h"
>  #include "trace_output.h"
> +#include "trace_btf.h"
>  
>  /* When set, irq functions might be ignored */
>  static int ftrace_graph_skip_irqs;
> @@ -120,6 +121,13 @@ enum {
>  	FLAGS_FILL_END   = 3 << TRACE_GRAPH_PRINT_FILL_SHIFT,
>  };
>  
> +enum {
> +	RETVAL_FMT_HEX   = BIT(0),
> +	RETVAL_FMT_DEC   = BIT(1),
> +	RETVAL_FMT_BOOL  = BIT(2),
> +	RETVAL_FMT_TRUNC = BIT(3),
> +};
> +
>  static void
>  print_graph_duration(struct trace_array *tr, unsigned long long duration,
>  		     struct trace_seq *s, u32 flags);
> @@ -865,6 +873,73 @@ static void print_graph_retaddr(struct trace_seq *s, struct fgraph_retaddr_ent_e
>  
>  #if defined(CONFIG_FUNCTION_GRAPH_RETVAL) || defined(CONFIG_FUNCTION_GRAPH_RETADDR)
>  
> +static void trim_retval(unsigned long func, unsigned long *retval, bool *print_retval,
> +			int *fmt)

This function should really be in trace_btf.c, with a stub for when BTF
is not enabled.

-- Steve
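
One common way to arrange that is the header-stub pattern: the caller
stays free of #ifdefs, and the header supplies an empty inline when the
feature is compiled out. A minimal userspace sketch, assuming a
hypothetical HAVE_BTF_SKETCH macro standing in for CONFIG_DEBUG_INFO_BTF
and a hypothetical caller retval_for_print(); only the trim_retval()
signature is taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumption: HAVE_BTF_SKETCH is a made-up stand-in for
 * CONFIG_DEBUG_INFO_BTF; it is not a real kernel symbol.
 */
#ifdef HAVE_BTF_SKETCH
/* The real implementation would live in the BTF-specific source file. */
void trim_retval(unsigned long func, unsigned long *retval,
		 bool *print_retval, int *fmt);
#else
/* Stub: leave every output untouched so callers keep their defaults. */
static inline void trim_retval(unsigned long func, unsigned long *retval,
			       bool *print_retval, int *fmt)
{
	(void)func;
	(void)retval;
	(void)print_retval;
	(void)fmt;
}
#endif

/* Hypothetical caller: identical code whether or not BTF is available. */
static unsigned long retval_for_print(unsigned long func, unsigned long raw,
				      bool *print_retval)
{
	int fmt = 0;

	assert(print_retval != NULL);
	*print_retval = true;		/* default: print the raw value */
	trim_retval(func, &raw, print_retval, &fmt);
	return raw;
}
```

With the stub selected, retval_for_print() returns the raw value and
leaves *print_retval set to true, which matches the behaviour before
the patch.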

> +{
> +	const struct btf_type *t;
> +	char name[KSYM_NAME_LEN];
> +	struct btf *btf;
> +	u32 v, msb;
> +	int kind;
> +
> +	if (!IS_ENABLED(CONFIG_DEBUG_INFO_BTF))
> +		return;
> +
> +	if (lookup_symbol_name(func, name))
> +		return;
> +
> +	t = btf_find_func_proto(name, &btf);
> +	if (IS_ERR_OR_NULL(t))
> +		return;
> +
> +	t = btf_type_skip_modifiers(btf, t->type, NULL);
> +	kind = t ? BTF_INFO_KIND(t->info) : BTF_KIND_UNKN;
> +	switch (kind) {
> +	case BTF_KIND_UNKN:
> +		*print_retval = false;
> +		break;
> +	case BTF_KIND_STRUCT:
> +	case BTF_KIND_UNION:
> +	case BTF_KIND_ENUM:
> +	case BTF_KIND_ENUM64:
> +		if (kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION)
> +			*fmt = RETVAL_FMT_HEX;
> +		else
> +			*fmt = RETVAL_FMT_DEC;
> +
> +		if (t->size > sizeof(unsigned long)) {
> +			*fmt |= RETVAL_FMT_TRUNC;
> +		} else {
> +			msb = BITS_PER_BYTE * t->size - 1;
> +			*retval &= GENMASK(msb, 0);
> +		}
> +		break;
> +	case BTF_KIND_INT:
> +		v = *(u32 *)(t + 1);
> +		if (BTF_INT_ENCODING(v) == BTF_INT_BOOL) {
> +			*fmt = RETVAL_FMT_BOOL;
> +			msb = 0;
> +		} else {
> +			if (BTF_INT_ENCODING(v) == BTF_INT_SIGNED)
> +				*fmt = RETVAL_FMT_DEC;
> +			else
> +				*fmt = RETVAL_FMT_HEX;
> +
> +			if (t->size > sizeof(unsigned long)) {
> +				*fmt |= RETVAL_FMT_TRUNC;
> +				msb = BITS_PER_LONG - 1;
> +			} else {
> +				msb = BTF_INT_BITS(v) - 1;
> +			}
> +		}
> +		*retval &= GENMASK(msb, 0);
> +		break;
> +	default:
> +		*fmt = RETVAL_FMT_HEX;
> +		break;
> +	}
> +}
> +
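
The width masking that trim_retval() performs with GENMASK() can be
tried outside the kernel. A minimal userspace sketch, assuming a 64-bit
raw register value; low_bits_mask() and trim_to_width() are hypothetical
names used only for illustration, not kernel helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's GENMASK(bits - 1, 0). */
static uint64_t low_bits_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so treat it apart. */
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* Keep only the bits that belong to a return type of `size` bytes. */
static uint64_t trim_to_width(uint64_t raw, unsigned int size)
{
	return raw & low_bits_mask(size * 8);
}
```

For a function returning char, whatever the register holds in its upper
56 bits is discarded, so only the low byte is displayed.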


* Re: [PATCH v4 1/3] ftrace: Build trace_btf.c when CONFIG_DEBUG_INFO_BTF is enabled
From: Steven Rostedt @ 2026-05-13 14:42 UTC
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-2-dolinux.peng@gmail.com>


Sorry for the late reply, I've been a bit busy with other things recently.

On Mon, 15 Dec 2025 11:41:51 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The trace_btf.c file provides BTF helper functions used by the ftrace
> subsystem. This change makes its compilation solely dependent on

Nit, change logs should never say "This change". Instead, it should be
worded as:

  "Make the compilation of trace_btf.c solely depend on..."

-- Steve

> CONFIG_DEBUG_INFO_BTF, allowing features like funcgraph-retval to also
> utilize these helpers.
> 
> Additionally, the redundant dependency on CONFIG_PROBE_EVENTS_BTF_ARGS
> is removed, as CONFIG_DEBUG_INFO_BTF already depends on
> CONFIG_BPF_SYSCALL.
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/Kconfig  | 2 +-
>  kernel/trace/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e1214b9dc990..653c1fcefa4c 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -755,7 +755,7 @@ config FPROBE_EVENTS
>  config PROBE_EVENTS_BTF_ARGS
>  	depends on HAVE_FUNCTION_ARG_ACCESS_API
>  	depends on FPROBE_EVENTS || KPROBE_EVENTS
> -	depends on DEBUG_INFO_BTF && BPF_SYSCALL
> +	depends on DEBUG_INFO_BTF
>  	bool "Support BTF function arguments for probe events"
>  	default y
>  	help
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index fc5dcc888e13..6c4bf5a6c4f3 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -116,7 +116,7 @@ obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
>  endif
>  obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
>  obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
> -obj-$(CONFIG_PROBE_EVENTS_BTF_ARGS) += trace_btf.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += trace_btf.o
>  obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
>  obj-$(CONFIG_BOOTTIME_TRACING) += trace_boot.o
>  obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o



* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 14:01 UTC
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>


Sorry for the spam, I accidentally mapped ctrl+Enter to send without
confirmation and can't teach my fingers not to press it instead of ctrl+delete.

> That's definitely a good addition; kmalloc_nolock was not that good already, so
> I tried some way to have a preallocation, though I realise it isn't really
> flexible.
> 
> Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
> 
> Isn't this similar to what you'd do with a kmem_cache? That was my original
> idea, although that uses spinlocks too.
> 
> I quickly tried an implementation like yours using
> mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and
> it still explodes with my monitors, but perhaps now that tracepoints no longer
> disable preemption it could play well with some monitors.
> 
> The selftests with tlob seem to work just the same with this kmem_cache (up to
> the unrelated RCU stall). To be fair since you only allocate from the uprobe
> handler, you'd probably be just fine with kmalloc_nolock, but let's continue
> with the preallocation logic.
> 
> 
> The API is starting to get complex (well, not that it wasn't already).
> We have essentially 3 ways to allocate:
>  * fully automatic with kmalloc_nolock
>  * semi-automatic with pool preallocation
>  * manual with direct storage preallocation
> 
> We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
> DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_ALLOC_POOL also requires
> DA_MON_POOL_SIZE to be defined (force that with an #error).

Anyway, this way you probably wouldn't need to define a different init function,
and everything could be handled more transparently.
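
The strategy selection could look roughly like the following. This is
only a sketch under assumptions: the macro names follow the DA_ALLOC_*
spelling used above, while the numeric values, the default, and the
da_sketch_prealloc_slots() helper are hypothetical and not part of
da_monitor.h:

```c
#include <assert.h>

/* Strategy names follow the proposal; the numeric values are assumptions. */
#define DA_ALLOC_AUTO   0
#define DA_ALLOC_POOL   1
#define DA_ALLOC_MANUAL 2

/* A monitor would define these two before including the shared header. */
#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
#define DA_MON_POOL_SIZE 64

/* Assumed default when a monitor does not choose a strategy. */
#ifndef DA_MON_ALLOCATION_STRATEGY
#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_AUTO
#endif

/* Pool mode without a size is a build error, not a runtime surprise. */
#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL && !defined(DA_MON_POOL_SIZE)
#error "DA_ALLOC_POOL requires DA_MON_POOL_SIZE to be defined"
#endif

/* Hypothetical helper: slots a single init function would preallocate. */
static inline unsigned int da_sketch_prealloc_slots(void)
{
#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL
	assert(DA_MON_POOL_SIZE > 0);
	return DA_MON_POOL_SIZE;
#else
	return 0;
#endif
}
```

Because the choice is made at preprocessing time, one init function can
cover all three strategies, and a pool monitor that forgets its size
fails to build.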

Also you don't need to call da_create_or_get() explicitly,
da_handle_start_event() should do it for you.

The only manual step required is da_create_storage() when you explicitly cannot
lock when calling da_handle_start_event() (that would be DA_ALLOC_MANUAL, you
don't need that).

Hope I didn't create too much confusion.

Thanks,
Gabriele



* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:50 UTC
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.
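
The LIFO free-stack described in the commit message can be sketched in
userspace as follows. This is an illustration under assumptions, not the
patch's code: struct slot, struct pool and the pool_*() functions are
hypothetical names, calloc() stands in for kcalloc()/kmalloc_array(),
and the spinlock the patch takes around the stack is left out, so the
sketch is single-threaded only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct slot {
	int id;
};

struct pool {
	struct slot *storage;      /* backing array, allocated once */
	struct slot **free_stack;  /* pointers to the unused slots */
	unsigned int free_top;     /* number of entries on the stack */
	unsigned int capacity;
};

/* Allocate n slots up front and push every one onto the free stack. */
static int pool_init(struct pool *p, unsigned int n)
{
	if (!n)
		return -1;
	p->storage = calloc(n, sizeof(*p->storage));
	p->free_stack = calloc(n, sizeof(*p->free_stack));
	if (!p->storage || !p->free_stack) {
		free(p->storage);
		free(p->free_stack);
		return -1;
	}
	p->capacity = n;
	p->free_top = 0;
	for (unsigned int i = 0; i < n; i++)
		p->free_stack[p->free_top++] = &p->storage[i];
	return 0;
}

/* Pop a slot; NULL means the pool is exhausted (no fallback allocation). */
static struct slot *pool_get(struct pool *p)
{
	if (!p->free_top)
		return NULL;
	return p->free_stack[--p->free_top];
}

/* Push a slot back; the most recently returned slot is reused first. */
static void pool_put(struct pool *p, struct slot *s)
{
	assert(p->free_top < p->capacity);
	p->free_stack[p->free_top++] = s;
}

static void pool_destroy(struct pool *p)
{
	free(p->storage);
	free(p->free_stack);
	p->storage = NULL;
	p->free_stack = NULL;
	p->free_top = 0;
	p->capacity = 0;
}
```

Both pool_get() and pool_put() are constant time and never call the
allocator, which is the property the patch wants on the hot path; the
cost is that an exhausted pool has to be reported to the caller.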

That's definitely a good addition; kmalloc_nolock was not that good already, so I
tried some way to have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache? That was my original idea,
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's continue
with the preallocation logic.


The API is starting to get complex (well, not that it wasn't already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_MON_AUTO, DA_MON_POOL,
DA_MON_MANUAL} where DA_MON_POOL also requires DA_MON_POOL_SIZE to be defined
(force that with an #error).

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit and
> - * da_create_storage can never occur after that, it's safe to call this without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g. scheduling).



* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:47 UTC
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.

That's definitely a good addition; kmalloc_nolock was not that good already, so I
tried some way to have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache? That was my original idea,
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's ignore this
for now.


That said, the API is starting to get complex (well, not that it wasn't
already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit and
> - * da_create_storage can never occur after that, it's safe to call this without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g. scheduling).
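
For readers following the pool-mode hunks above, here is a minimal userspace
sketch of the free-pointer stack that da_per_obj_pool implements. It is not
the patch's code: struct slot, struct slot_pool and the pool_*() helpers are
hypothetical names, malloc/calloc stand in for the kernel allocators, and the
spinlock and the RCU-deferred return are left out, so the sketch is only
valid single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct da_monitor_storage. */
struct slot {
	int id;
};

/* Same shape as da_per_obj_pool, minus the lock (single-threaded sketch). */
struct slot_pool {
	struct slot *storage;      /* backing array, allocated once */
	struct slot **free_slots;  /* stack of pointers to unused slots */
	unsigned int free_top;     /* number of pointers on the stack */
	unsigned int count;        /* capacity, used only for sanity checks */
};

static int pool_init(struct slot_pool *p, unsigned int count)
{
	p->storage = calloc(count, sizeof(*p->storage));
	if (!p->storage)
		return -1;
	p->free_slots = malloc(count * sizeof(*p->free_slots));
	if (!p->free_slots) {
		free(p->storage);
		p->storage = NULL;
		return -1;
	}
	p->count = count;
	p->free_top = 0;
	for (unsigned int i = 0; i < count; i++)
		p->free_slots[p->free_top++] = &p->storage[i];
	return 0;
}

/* Pop a slot; NULL means exhausted (the patch returns -ENOSPC here). */
static struct slot *pool_get(struct slot_pool *p)
{
	if (!p->free_top)
		return NULL;
	return p->free_slots[--p->free_top];
}

/* Push a slot back; the patch does this from an RCU callback. */
static void pool_put(struct slot_pool *p, struct slot *s)
{
	assert(p->free_top < p->count);  /* more returns than slots exist */
	p->free_slots[p->free_top++] = s;
}

static void pool_destroy(struct slot_pool *p)
{
	free(p->free_slots);
	free(p->storage);
	p->free_slots = NULL;
	p->storage = NULL;
	p->free_top = 0;
}
```

Because get and put only move a pointer and adjust free_top, neither path
allocates, which is the property the patch relies on for contexts where
kmalloc/kfree are a problem.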



* Re: [RFC PATCH v2 01/10] rv/da: fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-05-13 12:39 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
> which may race with the sched_switch handler:
> 
>   da_monitor_start()               sched_switch handler
>   -------------------------        ---------------------------------
>   da_mon->monitoring = 1;
>                                    if (da_monitoring(da_mon))  /* true  */
>                                        ha_start_timer_ns(...);
>                                        /* hrtimer->base == NULL, crash */
>   da_monitor_init_hook(da_mon);
>   /* hrtimer_setup() sets base */
> 
> Fix the ordering and pair with release/acquire semantics:
> 
>   da_monitor_init_hook(da_mon);
>   smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>   return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
> 
> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
> the load can observe monitoring=1 while hrtimer->base is still NULL.
> The plain accesses are also data races under KCSAN.
> 
> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
> cover the reset path.
> 
> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Thanks for the fix!

There are probably more than a few of those bugs, since most monitors are
implicitly serialised because their events are serialised.

Looks good to me.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

See minor comments below:

> ---
>  include/rv/da_monitor.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 39765ff6f098..00ded3d5ab3f 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
>  static inline void da_monitor_reset(struct da_monitor *da_mon)
>  {
>   da_monitor_reset_hook(da_mon);
> - da_mon->monitoring = 0;
> + WRITE_ONCE(da_mon->monitoring, 0);
>   da_mon->curr_state = model_get_initial_state();
>  }
>  
> @@ -95,8 +95,9 @@ static inline void da_monitor_reset(struct da_monitor *da_mon)
>  static inline void da_monitor_start(struct da_monitor *da_mon)
>  {
>   da_mon->curr_state = model_get_initial_state();
> - da_mon->monitoring = 1;
>   da_monitor_init_hook(da_mon);
> + /* Pairs with smp_load_acquire in da_monitoring(). */

I wonder if these comments are really adding value: pairing smp_load_acquire /
smp_store_release is the by-the-book usage and everything is here.

But feel free to leave it if you think it's clearer.

Thanks,
Gabriele

> + smp_store_release(&da_mon->monitoring, 1);
>  }
>  
>  /*
> @@ -104,7 +105,8 @@ static inline void da_monitor_start(struct da_monitor *da_mon)
>   */
>  static inline bool da_monitoring(struct da_monitor *da_mon)
>  {
> - return da_mon->monitoring;
> + /* Pairs with smp_store_release in da_monitor_start(). */
> + return smp_load_acquire(&da_mon->monitoring);
>  }
>  
>  /*
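
As an illustration of the pairing discussed in this patch, here is a minimal
userspace sketch using C11 atomics. It is only an analogy:
atomic_store_explicit(..., memory_order_release) and
atomic_load_explicit(..., memory_order_acquire) stand in for the kernel's
smp_store_release()/smp_load_acquire(), and struct monitor, timer_base and
the function names are hypothetical, not the da_monitor.h definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct da_monitor plus its timer. */
struct monitor {
	void *timer_base;       /* set up by the init hook */
	atomic_int monitoring;  /* published last */
};

static int dummy_base;

/* Writer: initialise everything first, then publish with a release store. */
static void monitor_start(struct monitor *m)
{
	m->timer_base = &dummy_base;  /* stands in for da_monitor_init_hook() */
	atomic_store_explicit(&m->monitoring, 1, memory_order_release);
}

/* Reader: an acquire load that observes 1 also observes timer_base. */
static bool monitor_active(struct monitor *m)
{
	return atomic_load_explicit(&m->monitoring, memory_order_acquire) != 0;
}

/* Handler path: only touch timer_base once the flag has been observed. */
static bool handler_can_arm_timer(struct monitor *m)
{
	if (!monitor_active(m))
		return false;
	assert(m->timer_base != NULL);  /* guaranteed by the pairing */
	return true;
}
```

With the original ordering (flag stored before the init hook) the assertion
in handler_can_arm_timer() could fire on another CPU, which corresponds to
the NULL hrtimer base crash in the commit message.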



* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Steven Rostedt @ 2026-05-13 12:34 UTC (permalink / raw)
  To: Yash Suthar
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me
In-Reply-To: <CAPfzD4kiEMFdbK36X1_f1+Vn=v3nfvUODZJuJ40sNJ_fRr9zKA@mail.gmail.com>

On Tue, 12 May 2026 20:11:08 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:

> Gentle ping.

Hi,

What's the rush? This is just a cleanup change. There's no feature here
that you need, is there?

If you see it in patchwork[1], it's not lost. I just have other things
ahead of it. I usually process cleanup code last.

Thanks,

-- Steve

[1] https://patchwork.kernel.org/project/linux-trace-kernel/patch/20260502174741.39636-1-yashsuthar983@gmail.com/


* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-13 11:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <agMpqhpgmezqnaA_@gmail.com>

On 5/12/26 15:33, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:21:50AM +0200, David Hildenbrand (Arm) wrote:
>>
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		if (is_reserved)
>>> +		/*
>>> +		 * Promote a stable unhandlable kernel page diagnosed by
>>> +		 * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
>>> +		 * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
>>> +		 */
>>> +		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
>>>  			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>
>>
>> It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
> >> is unhandlable if it is PG_reserved?
> 
> Are you saying that we should identify if the page is PG_reserved in
> get_hwpoison_page() instead of in memory_failure(), as done in the
> previous patch ("mm/memory-failure: report MF_MSG_KERNEL for reserved
> pages") ?
> 
> >> Why can't we just return a special error code from get_hwpoison_page()? We have
> >> plenty of errno values to choose from.
> 
> Something like:
> 
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..0a6d83575833e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -878,7 +878,7 @@ static const char *action_name[] = {
>  };
>  
>  static const char * const action_page_types[] = {
> -	[MF_MSG_KERNEL]			= "reserved kernel page",
> +	[MF_MSG_KERNEL]			= "unrecoverable kernel page",
>  	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> @@ -1394,6 +1394,21 @@ static int get_any_page(struct page *p, unsigned long flags)
>  	int ret = 0, pass = 0;
>  	bool count_increased = false;
> 
> +	if (PageReserved(p)) {
> +		ret = -ENOTRECOVERABLE;
> +		goto out;
> +	}
> +
>  	if (flags & MF_COUNT_INCREASED)
>  		count_increased = true;
>  
> @@ -1422,7 +1437,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>  				shake_page(p);
>  				goto try_again;
>  			}
> -			ret = -EIO;
> +			ret = -ENOTRECOVERABLE;
>  			goto out;
>  		}
>  	}
> @@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
>  			goto try_again;
>  		}
>  		put_page(p);
> -		ret = -EIO;
> +		ret = -ENOTRECOVERABLE;
>  	}
>  out:
> -	if (ret == -EIO)
> +	if (ret == -EIO || ret == -ENOTRECOVERABLE)
>  		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>  
>  	return ret;
> @@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
>  			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
>  		}
>  		goto unlock_mutex;
> +	} else if (res == -ENOTRECOVERABLE) {
> +		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		goto unlock_mutex;
>  	} else if (res < 0) {
>  		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>  		goto unlock_mutex;

That would probably read nicer as

switch (res) {
case 0: ...
case 1: ...
case -ENOTRECOVERABLE:  ...
case ...
default:
}
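
Spelled out a little further, the switch could look like the userspace sketch
below. It only classifies the return value; enum mf_outcome and classify()
are hypothetical names standing in for the action_result() calls, and the
mapping of 0 and 1 assumes get_hwpoison_page() keeps returning 0 when no
reference was taken and 1 when one was.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels; the kernel code would call action_result() instead. */
enum mf_outcome {
	OUTCOME_NO_REF,        /* res == 0: free or otherwise unreferenced page */
	OUTCOME_GOT_REF,       /* res == 1: reference taken, keep going */
	OUTCOME_KERNEL,        /* -ENOTRECOVERABLE: report MF_MSG_KERNEL */
	OUTCOME_GET_HWPOISON,  /* any other error: report MF_MSG_GET_HWPOISON */
};

static enum mf_outcome classify(int res)
{
	/* Assumption: only 0, 1, or a negative errno is ever returned. */
	assert(res <= 1);

	switch (res) {
	case 0:
		return OUTCOME_NO_REF;
	case 1:
		return OUTCOME_GOT_REF;
	case -ENOTRECOVERABLE:
		return OUTCOME_KERNEL;
	default:
		return OUTCOME_GET_HWPOISON;
	}
}
```

The default arm keeps today's behaviour for -EIO, -EBUSY and friends, so
only the new error code changes what gets reported.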

> 
> 
> If that is what you are suggesting, maybe we can create another
> MF_MSG_RESERVED, and another return value for get_any_page() to track
> the reserved pages?

I guess "reserved" is really just like most other kernel pages. So I wouldn't
special-case them here.

Or would there be a good reason?

-- 
Cheers,

David


* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-13  9:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko,
	bpf, linux-trace-kernel, Oleg Nesterov, Peter Zijlstra,
	Ingo Molnar
In-Reply-To: <CAEf4Bza4sqLw8GHoq+MFBgsnhiJ_s91UUrMcA-paYeBr7=bz0A@mail.gmail.com>

On Tue, May 12, 2026 at 12:38:34PM -0700, Andrii Nakryiko wrote:
> On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > +       /*
> > > +        * We have nop10 (with first byte overwritten to int3),
> > > +        * change it to:
> > > +        *   lea 0x80(%rsp), %rsp
> > > +        *   call tramp
> > > +        *
> > > +        * The first lea instruction skips the stack redzone so the call
> > > +        * instruction can safely push return address on stack.
> > > +        */
> >
> > typo: lea -128(%rsp), %rsp

ugh, thanks

> >
> > you can also do:
> >
> > add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.
> 
> When I asked AI about this it explained that add instruction modifies
> flags, so it's not a good fit here. lea doesn't touch flags.
> 
> >
> > Initially I didn't like this approach, since we just introduced
> > usdt nop5 and now need to recompile everything again,
> > but looking at the fix it's definitely simpler than alternatives
> > and doesn't have annoying limitations.
> 
> 
> yeah, limitations are annoying, especially with those global "DO NOT
> OPTIMIZE" flags... Jiri, let's polish your version and land it?

ok, will send it out

jirka
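
To make the byte counts in this exchange concrete, here is a small sketch
that writes out the encodings as I understand them and adds up the lengths.
The byte values are my own reading of the x86-64 encoding rules, not taken
from the patch, and the call displacement is a zeroed placeholder; the array
and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* lea -0x80(%rsp), %rsp : REX.W, opcode 0x8d, ModRM, SIB, disp8 */
static const unsigned char lea_skip_redzone[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* add $-0x80, %rsp : REX.W, opcode 0x83 /0, ModRM, imm8 */
static const unsigned char add_skip_redzone[] = { 0x48, 0x83, 0xc4, 0x80 };

/* call rel32 : opcode 0xe8 followed by a 4-byte displacement */
static const unsigned char call_rel32[] = { 0xe8, 0x00, 0x00, 0x00, 0x00 };

static_assert(sizeof(lea_skip_redzone) == 5, "lea with disp8 is 5 bytes");
static_assert(sizeof(add_skip_redzone) == 4, "add with imm8 is 4 bytes");

/* Total length of the lea + call sequence (fills a nop10). */
static size_t lea_variant_len(void)
{
	return sizeof(lea_skip_redzone) + sizeof(call_rel32);
}

/* Total length of the add + call sequence (one byte shorter). */
static size_t add_variant_len(void)
{
	return sizeof(add_skip_redzone) + sizeof(call_rel32);
}
```

The add form saves one byte but writes the arithmetic flags, while lea
leaves them untouched, which is the reason given above for keeping lea.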


* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-13  9:31 UTC (permalink / raw)
  To: Wen Yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <cb929b8b-5bfb-4afe-ba50-45620c38ea96@linux.dev>

On Wed, 2026-05-13 at 13:32 +0800, Wen Yang wrote:
> Thanks for both messages.  Two patches are ready; let me address
> your follow-up concerns before sending.
> 
>    1. "all monitors reusing slots would suffer from it"
> 
>       Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
>       pool.  RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
>       storage (a single static variable and a per-cpu variable) and
>       never share slots across monitor types.  The race is exclusive
>       to PER_TASK, so fixing that variant's da_monitor_destroy() is
>       the correct scope.
> 
>    2. "LTL monitors don't even have monitoring"
> 
>       tracepoint_synchronize_unregister() does not rely on the
>       monitoring flag at all.  It is a system-wide barrier — it
>       calls synchronize_rcu_tasks_trace() followed by
>       synchronize_srcu(&tracepoint_srcu) — draining every in-flight
>       tracepoint handler on every CPU regardless of which monitor
>       dispatched it.  LTL handlers are covered without any special
>       treatment.
> 
> The slot-ordering issue (patch 1) affects all per-task DA monitors,
> not only HA ones — "independent on HA" — because
> RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
> past the end of rv[]), so da_monitor_reset_all() overwrites whatever
> follows rv[] in task_struct whenever any per-task monitor is
> disabled.

Exactly, and since whatever follows .rv is randomised on a task_struct, this can
get quite nasty.

I included my version of the fix in the series in [1], but feel free to send
yours, you got there first ;)

> 
> Also corrected "wwnr probe handler" to "stall probe handler" in
> patch 2 per your annotation.
> 

While tracepoint_synchronize_unregister() does fix the race, I still see a time
bomb in the way we do ha_monitor_reset_env().

Since we reuse the same slots for per-task monitors (not for the others, you're
right, I was brainfarting), we essentially don't know what happened before we do
da_monitor_init(): the same slot could have been used by an LTL monitor, which
cannot even reliably clear the byte used by the monitoring flag.

Now, we either mandate all monitors to memset the entire slot (union
rv_task_monitor) or we don't assume anything about the slot's state during
initialisation. Any middle ground could reveal pesky bugs as soon as we refactor
the structs.

The latter idea is what I did in [1]. I believe that would make the
synchronisation superfluous.

What do you think?

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/20260512140250.262190-8-gmonaco@redhat.com
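
A toy userspace illustration of the "don't assume anything about the slot's
state" idea, for readers who have not opened [1]. This is not the code from
that series: union task_slot, struct da_state, struct ltl_state and
da_slot_init() are hypothetical, and the only point is that init writes every
field it will later read instead of trusting what the previous user left.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-task state of a DA monitor. */
struct da_state {
	bool monitoring;
	unsigned int curr_state;
};

/* Hypothetical per-task state of an LTL monitor (no monitoring flag). */
struct ltl_state {
	unsigned long atoms;
	unsigned long states;
};

/* One slot shared by whichever monitor type currently owns it. */
union task_slot {
	struct da_state da;
	struct ltl_state ltl;
};

/* Overwrite every DA field; nothing from the previous owner is read. */
static void da_slot_init(union task_slot *slot)
{
	slot->da.monitoring = false;
	slot->da.curr_state = 0;
}

/* A slot last used by an LTL monitor is handed to a DA monitor. */
static void reuse_demo(void)
{
	union task_slot slot;

	memset(&slot, 0xff, sizeof(slot));  /* stale contents from the LTL user */
	da_slot_init(&slot);
	assert(!slot.da.monitoring);
	assert(slot.da.curr_state == 0);
}
```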

> Please let me know if the above reasoning addresses your concerns.
> 
> 
> --
> Best wishes,
> Wen
> 
> > > 
> > > >   include/rv/da_monitor.h | 18 ++++++++++++++++--
> > > >   1 file changed, 16 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index 00ded3d5ab3f..d04bb3229c75 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> > > >   
> > > >   /*
> > > >    * da_monitor_destroy - return the allocated slot
> > > > + *
> > > > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > > > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > > > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > > > + * already cleared the slot but before the caller's own sync completes.
> > > > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > > > + * the same slot would call timer_delete() on a never-initialised
> > > > + * timer_list, triggering ODEBUG warnings.
> > > > + *
> > > > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > > > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > > > + * The caller's own __rv_disable_monitor() issues a second sync after
> > > > + * returning from disable(); that redundant call is harmless on the
> > > > + * infrequent admin (enable/disable) path.
> > > >    */
> > > >   static inline void da_monitor_destroy(void)
> > > >   {
> > > > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> > > >   		WARN_ONCE(1, "Disabling a disabled monitor: "
> > > > __stringify(MONITOR_NAME));
> > > >   		return;
> > > >   	}
> > > > +	tracepoint_synchronize_unregister();
> > > > +	da_monitor_reset_all();
> > > >   	rv_put_task_monitor_slot(task_mon_slot);
> > > >   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > -
> > > > -	da_monitor_reset_all();
> > > >   }
> > > >   
> > > >   #elif RV_MON_TYPE == RV_MON_PER_OBJ
> > 



* Re: [RFC PATCH v2 03/10] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: Gabriele Monaco @ 2026-05-13  8:32 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <7368ee25b1b45c92beb14c05be366b71da585ca4.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> verificationtest-ktap used a CWD-relative path (../ftrace/ftracetest)
> and a relative argument (../verification) for --rv.  This works when
> the shell changes into the verification directory first, but breaks
> when the script is invoked directly - e.g. by the kselftest runner or
> vng - because the working directory is the kernel source root, not the
> script's own directory.
> 
> Fix this by computing the script's directory from $0 with cd/dirname/pwd
> and using absolute paths for both the ftracetest invocation and the --rv
> argument.  Also export the directory to PATH so that check_requires in
> the ftracetest framework can locate helper binaries.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Just out of curiosity, how do you run the selftests?
Are you calling the script directly just to run /some/ of them?

The officially supported way is through make [1]:

  make -C tools/testing/selftests TARGETS=verification run_tests

(though I find it faster to omit TARGETS and just do make -C
tools/testing/selftests/verification).

Calling with make should set up all paths as needed.

> ---
>  tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/verification/verificationtest-ktap b/tools/testing/selftests/verification/verificationtest-ktap
> index 18f7fe324e2f..456b8578a307 100755
> --- a/tools/testing/selftests/verification/verificationtest-ktap
> +++ b/tools/testing/selftests/verification/verificationtest-ktap
> @@ -5,4 +5,6 @@
>  #
>  # Copyright (C) Arm Ltd., 2023
>  
> -../ftrace/ftracetest -K -v --rv ../verification
> +dir=$(cd "$(dirname "$0")" && pwd)
> +export PATH="$dir:$PATH"

Then if you really really need to call it directly, do you need to override
PATH?

And isn't it clearer to do:

  dir=$(realpath "$(dirname "$0")")

Thanks,
Gabriele

[1] - https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html

> +"$dir/../ftrace/ftracetest" -K -v --rv "$dir"



* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:54 UTC (permalink / raw)
  To: Lance Yang
  Cc: leitao, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
	vbabka, rppt, surenb, mhocko, shuah, rostedt, mhiramat,
	mathieu.desnoyers, liam, linux-mm, linux-kernel, linux-doc,
	linux-kselftest, linux-trace-kernel, kernel-team
In-Reply-To: <20260512124837.38883-1-lance.yang@linux.dev>

On 5/12/26 14:48, Lance Yang wrote:
> 
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>  mm/memory-failure.c | 19 ++++++++++++++++++-
>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	unsigned long page_flags;
>>>  	bool retry = true;
>>>  	int hugetlb = 0;
>>> +	bool is_reserved;
>>>  
>>>  	if (!sysctl_memory_failure_recovery)
>>>  		panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	 * In fact it's dangerous to directly bump up page count from 0,
>>>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>  	 */
>>> +	/*
>>> +	 * Pages with PG_reserved set are not currently managed by the
>>> +	 * page allocator (memblock-reserved memory, driver reservations,
>>> +	 * etc.), so classify them as kernel-owned for reporting.
>>> +	 *
>>> +	 * Sample the flag before get_hwpoison_page(): in the
>>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +	 * reference before returning -EIO, after which page->flags may
>>> +	 * have been reset by the allocator.
>>> +	 */
>>> +	is_reserved = PageReserved(p);
>>> +
>>>  	res = get_hwpoison_page(p, flags);
>>>  	if (!res) {
>>>  		if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +		if (is_reserved)
>>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +		else
>>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +					    MF_IGNORED);
>>>  		goto unlock_mutex;
>>>  	}
>>>  
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> Looks like we never get that far ...

Right, likely that should be removed+cleaned up then.

-- 
Cheers,

David


* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:53 UTC (permalink / raw)
  To: jane.chu, Breno Leitao, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Jonathan Corbet, Shuah Khan, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <816e3d8e-22d2-49a4-92ae-981568f38792@oracle.com>

On 5/12/26 19:58, jane.chu@oracle.com wrote:
> 
> 
> On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
>> On 5/11/26 17:38, Breno Leitao wrote:
>>> When get_hwpoison_page() returns a negative value, distinguish
>>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>>> and should be classified accordingly for proper handling.
>>>
>>> Sample PG_reserved before the get_hwpoison_page() call. In the
>>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>>> reference before returning -EIO, after which the underlying page may
>>> have been freed and reallocated with page->flags reset; reading
>>> PageReserved(p) at that point would observe stale or unrelated state.
>>> The pre-call snapshot reflects what the page actually was at the
>>> time of the failure event.
>>>
>>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>   mm/memory-failure.c | 19 ++++++++++++++++++-
>>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 866c4428ac7ef..f112fb27a8ff6 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>       unsigned long page_flags;
>>>       bool retry = true;
>>>       int hugetlb = 0;
>>> +    bool is_reserved;
>>>         if (!sysctl_memory_failure_recovery)
>>>           panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>        * In fact it's dangerous to directly bump up page count from 0,
>>>        * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>        */
>>> +    /*
>>> +     * Pages with PG_reserved set are not currently managed by the
>>> +     * page allocator (memblock-reserved memory, driver reservations,
>>> +     * etc.), so classify them as kernel-owned for reporting.
>>> +     *
>>> +     * Sample the flag before get_hwpoison_page(): in the
>>> +     * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +     * reference before returning -EIO, after which page->flags may
>>> +     * have been reset by the allocator.
>>> +     */
>>> +    is_reserved = PageReserved(p);
>>> +
>>>       res = get_hwpoison_page(p, flags);
>>>       if (!res) {
>>>           if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>           }
>>>           goto unlock_mutex;
>>>       } else if (res < 0) {
>>> -        res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +        if (is_reserved)
>>> +            res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +        else
>>> +            res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +                        MF_IGNORED);
>>>           goto unlock_mutex;
>>>       }
>>>  
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
>>
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
>>
> 
> Ah, good point!
> It seems to me that all unhandlable pages should head out to identify_page_state:
> 
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
>          * In fact it's dangerous to directly bump up page count from 0,
>          * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>          */
> +
> +       if (!HWPoisonHandlable(p, flags))
> +               goto identify_page_state;
> +
>         res = get_hwpoison_page(p, flags);
>         if (!res) {
>                 if (is_free_buddy_page(p)) {

That's one option, or we just let get_hwpoison_page() return clearer error
codes, let it take care of checking PageReserved, and process the error codes
returned by get_hwpoison_page() in a better way.

-- 
Cheers,

David


* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-13  7:53 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <agMj4ukhj1PkXXrN@gmail.com>

On 5/12/26 15:04, Breno Leitao wrote:
> On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	unsigned long page_flags;
>>>  	bool retry = true;
>>>  	int hugetlb = 0;
>>> +	bool is_reserved;
>>>  
>>>  	if (!sysctl_memory_failure_recovery)
>>>  		panic("Memory failure on page %lx", pfn);
>>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	 * In fact it's dangerous to directly bump up page count from 0,
>>>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>>  	 */
>>> +	/*
>>> +	 * Pages with PG_reserved set are not currently managed by the
>>> +	 * page allocator (memblock-reserved memory, driver reservations,
>>> +	 * etc.), so classify them as kernel-owned for reporting.
>>> +	 *
>>> +	 * Sample the flag before get_hwpoison_page(): in the
>>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>>> +	 * reference before returning -EIO, after which page->flags may
>>> +	 * have been reset by the allocator.
>>> +	 */
>>> +	is_reserved = PageReserved(p);
>>> +
>>>  	res = get_hwpoison_page(p, flags);
>>>  	if (!res) {
>>>  		if (is_free_buddy_page(p)) {
>>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>>  		}
>>>  		goto unlock_mutex;
>>>  	} else if (res < 0) {
>>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>>> +		if (is_reserved)
>>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>>> +		else
>>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>>> +					    MF_IGNORED);
>>>  		goto unlock_mutex;
>>>  	}
>>>  
>>>
>>
>> It's a bit odd that we need this handling when we already have handling for
>> reserved pages in error_states[].
>>
>> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>> __get_hwpoison_page() ... would always fail? Making
>> get_hwpoison_page()->get_any_page() always fail?
>>
>> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> From what I read, it seems that error_states[0] = { reserved, reserved, MF_MSG_KERNEL, me_kernel }
> has been effectively dead code on the hwpoison-from-MCE path for a
> while.
> 
> My v6 patch relabels the failure-path output to match what me_kernel() would
> have reported anyway.
> 
>> This all looks very odd.
>>
>> Why would you even want to call get_hwpoison_page() in the first place if you
>> find PageReserved?
> 
> Are you suggesting we should call the page action as soon as we detect the page
> is reserved and get out?
> 
> Something as:
> 
>     if (PageReserved(p)) {
>         res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>         goto unlock_mutex;
>     }
> 
>     res = get_hwpoison_page(p, flags);

Or you combine this patch with the other patch and simply let
get_hwpoison_page() check that, returning an appropriate error code for
unhandleable pages that you can process here?

Like, maybe, returning -EIO directly?


res = get_hwpoison_page(p, flags);
switch (res) {
case 0: /* Success */
	...
	break;
case -EIO: /* Unhandleable kernel page. */
	...
	break;
case -EBUSY: /* Race, try again? */
	...
	break;
case ...
}

You can add more return codes as you see fit.
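
As a self-contained userspace sketch of that dispatch (not kernel code): the
enum sketch_action labels and sketch_dispatch() are hypothetical names, and
the code-to-action mapping simply follows the pseudocode above, so treat it
as an assumption rather than the final behaviour.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum sketch_action {
	SKETCH_CONTINUE,	/* no error, carry on with handling */
	SKETCH_REPORT_KERNEL,	/* unhandleable kernel page: report and ignore */
	SKETCH_RETRY,		/* raced with the page lifecycle: try again */
	SKETCH_REPORT_FAILURE,	/* any other error code */
};

/* Map a return code to the action the caller would take. */
static enum sketch_action sketch_dispatch(int res)
{
	switch (res) {
	case 0:
		return SKETCH_CONTINUE;
	case -EIO:
		return SKETCH_REPORT_KERNEL;
	case -EBUSY:
		return SKETCH_RETRY;
	default:
		return SKETCH_REPORT_FAILURE;
	}
}

/* Small self-check that exercises every branch of the switch. */
static void sketch_self_check(void)
{
	assert(sketch_dispatch(0) == SKETCH_CONTINUE);
	assert(sketch_dispatch(-EIO) == SKETCH_REPORT_KERNEL);
	assert(sketch_dispatch(-EBUSY) == SKETCH_RETRY);
	assert(sketch_dispatch(-EINVAL) == SKETCH_REPORT_FAILURE);
}
```

The default branch is what keeps this extensible: a new return code falls
into the generic failure report until it gets its own case.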

-- 
Cheers,

David


* Re: [RFC PATCH v2 10/10] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-05-13  7:46 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8148267505ef90175b6b69e1ffb3aa560ff42d35.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add selftest coverage for the tlob RV monitor in
> tools/testing/selftests/verification/.
> 
> Two helper binaries are built by tlob/Makefile: tlob_helper for the
> ioctl interface (/dev/rv) and tlob_uprobe_target for the uprobe tests.
> The top-level Makefile delegates to tlob/ via a generic MONITOR_SUBDIRS
> pattern so monitor-specific build details stay within each monitor's
> own subdirectory.
> 
> Eight test files cover the tracefs control interface (tracefs.tc), the
> ioctl self-instrumentation interface (ioctl.tc, 8 scenarios), and the
> uprobe external monitoring interface (uprobe_bind.tc, uprobe_violation.tc,
> uprobe_no_event.tc, uprobe_multi.tc, uprobe_detail_sleeping.tc,
> uprobe_detail_waiting.tc).

Thanks for the deep test suite!

I ran it in a VM (virtme-ng on my x86 16-core Fedora box) and it hangs at
step 9 (as you can see, 8 is ok, and afterwards I get an RCU splat):

$ sudo vng -v -- make -C tools/testing/selftests/verification run_tests
...
# ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
# ok 6 Test tlob monitor tracefs interface (enable/disable and files)
# ok 7 Test uprobe binding (visible in monitor file, removable, duplicate rejected)
# ok 8 Test uprobe detail sleeping (sleeping_ns dominates when task blocks between probes)
[   53.989561] tlob_target (1756) used greatest stack depth: 11792 bytes left
[   75.100818] rcu: INFO: rcu_preempt self-detected stall on CPU
[   75.100825] rcu: 	0-...!: (26082 ticks this GP) idle=a8e4/1/0x4000000000000000 softirq=0/0 fqs=13 rcuc=26078 jiffies(starved)
[   75.100833] rcu: 	(t=26000 jiffies g=17333 q=146 ncpus=16)
[   75.100836] rcu: rcu_preempt kthread timer wakeup didn't happen for 24040 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   75.100839] rcu: 	Possible timer handling issue on cpu=7 timer-softirq=317
[   75.100840] rcu: rcu_preempt kthread starved for 24043 jiffies! g17333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=7
[   75.100843] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   75.100843] rcu: RCU grace-period kthread stack dump:
[   75.100845] task:rcu_preempt     state:I stack:14104 pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00080000
[   75.100856] Call Trace:
[   75.100859]  <TASK>
[   75.100870]  __schedule+0x4f1/0x1490
[   75.100890]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100898]  schedule+0x5b/0x210
[   75.100901]  ? schedule_timeout+0xae/0x130
[   75.100905]  schedule_timeout+0xae/0x130
[   75.100911]  ? __pfx_process_timeout+0x10/0x10
[   75.100925]  rcu_gp_fqs_loop+0x114/0x880
[   75.100933]  ? lock_release+0x2ea/0x4a0
[   75.100945]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100948]  rcu_gp_kthread+0x26b/0x320
[   75.100951]  ? preempt_count_sub+0x5f/0x80
[   75.100963]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   75.100966]  kthread+0xf3/0x130
[   75.100970]  ? __pfx_kthread+0x10/0x10
[   75.100978]  ret_from_fork+0x3b4/0x420
[   75.100984]  ? __pfx_kthread+0x10/0x10
[   75.100989]  ret_from_fork_asm+0x1a/0x30
[   75.101018]  </TASK>
[   75.101019] rcu: Stack dump where RCU GP kthread last ran:
[   75.101021] Sending NMI from CPU 0 to CPUs 7:
[   75.101106] NMI backtrace for cpu 7
[   75.101118] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)} 
[   75.101124] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   75.101128] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   75.101139] Code: 75 70 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 25 6e 1c 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   75.101142] RSP: 0018:ffffd22ec0103eb8 EFLAGS: 00000296
[   75.101147] RAX: 00000000000529f3 RBX: 0000000000000000 RCX: ffffffff8ca56131
[   75.101170] RDX: ffff8de4c185c280 RSI: 0000000000000000 RDI: ffffffff8ca56131
[   75.101172] RBP: ffff8de4c185c280 R08: 0000000000000000 R09: 0000000000000000
[   75.101174] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000007
[   75.101176] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   75.101373] FS:  0000000000000000(0000) GS:ffff8de56a091000(0000) knlGS:0000000000000000
[   75.101379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.101381] CR2: 00007fb886a53f98 CR3: 000000003be5c002 CR4: 0000000000770ef0
[   75.101383] PKRU: 55555554
[   75.101384] Call Trace:
[   75.101388]  <TASK>
[   75.101389]  default_idle+0x9/0x10
[   75.101397]  default_idle_call+0x85/0x240
[   75.101404]  do_idle+0x291/0x300
[   75.101412]  ? schedule_idle+0x22/0x40
[   75.101415]  cpu_startup_entry+0x29/0x30
[   75.101418]  start_secondary+0xf8/0x100
[   75.101424]  common_startup_64+0x12c/0x138
[   75.101435]  </TASK>
[   75.102036] CPU: 0 UID: 0 PID: 1758 Comm: sh Not tainted 7.1.0-rc2+ #160 PREEMPT_{RT,(lazy)} 
[   75.102040] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   75.102042] RIP: 0033:0x556458604e3f
[   75.102049] Code: 3c 18 4e 8d 04 3f 42 c6 04 21 00 0f b6 01 4c 89 7d b0 4c 89 c3 e9 bf ed ff ff 90 41 0f b6 c1 48 8d 15 c5 3f 11 00 80 3c 02 00 <0f> 84 a9 f0 ff ff 48 8b 45 80 f6 40 08 50 0f 85 9b f0 ff ff e9 78
[   75.102051] RSP: 002b:00007ffc7ac46e30 EFLAGS: 00000246
[   75.102054] RAX: 0000000000000074 RBX: 0000000000000074 RCX: 000055646adb8a60
[   75.102056] RDX: 0000556458718e00 RSI: 0000000000000018 RDI: 0000000000000000
[   75.102057] RBP: 00007ffc7ac46f20 R08: 000055646adc3100 R09: 0000000000000074
[   75.102058] R10: 0000000000000021 R11: 0000000000000001 R12: 0000000000000000
[   75.102059] R13: 0000000000000070 R14: 000055646adb9cf0 R15: 0000000000000000
[   75.102061] FS:  00007f832822b740 GS:  0000000000000000


Did you see that? Am I doing something wrong?

Thanks,
Gabriele

> 
> Tested on x86_64 with vng (virtme-ng):
> 
>   TAP version 13
>   1..12
>   ok 1 Test monitor enable/disable
>   ok 2 Test monitor reactor setting
>   ok 3 Check available monitors
>   ok 4 Test wwnr monitor with printk reactor
>   ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
>   ok 6 Test tlob monitor tracefs interface (enable/disable and files)
>   ok 7 uprobe binding: visible in monitor file, removable, duplicate offset
> rejected
>   ok 8 uprobe detail sleeping: sleeping_ns dominates when task blocks between
> probes
>   ok 9 uprobe detail waiting: waiting_ns dominates when task is preempted
> between probes
>   ok 10 Two bindings on same binary with different offsets and budgets fire
> independently
>   ok 11 Verify no spurious error_env_tlob events without an active uprobe
> binding
>   ok 12 uprobe violation: error_env_tlob and detail_env_tlob fire with correct
> fields
>   # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  tools/testing/selftests/verification/Makefile |  21 +-
>  .../verification/test.d/tlob/ioctl.tc         |  36 +
>  .../verification/test.d/tlob/tracefs.tc       |  17 +
>  .../verification/test.d/tlob/uprobe_bind.tc   |  34 +
>  .../test.d/tlob/uprobe_detail_sleeping.tc     |  47 ++
>  .../test.d/tlob/uprobe_detail_waiting.tc      |  60 ++
>  .../verification/test.d/tlob/uprobe_multi.tc  |  60 ++
>  .../test.d/tlob/uprobe_no_event.tc            |  19 +
>  .../test.d/tlob/uprobe_violation.tc           |  60 ++
>  .../selftests/verification/tlob/Makefile      |  21 +
>  .../selftests/verification/tlob/tlob_ioctl.c  | 626 ++++++++++++++++++
>  .../selftests/verification/tlob/tlob_target.c | 138 ++++
>  12 files changed, 1138 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/ioctl.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/tracefs.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
>  create mode 100644
> tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
>  create mode 100644 tools/testing/selftests/verification/tlob/Makefile
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_ioctl.c
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_target.c
> 
> diff --git a/tools/testing/selftests/verification/Makefile
> b/tools/testing/selftests/verification/Makefile
> index aa8790c22a71..b5584fd3762d 100644
> --- a/tools/testing/selftests/verification/Makefile
> +++ b/tools/testing/selftests/verification/Makefile
> @@ -1,8 +1,27 @@
>  # SPDX-License-Identifier: GPL-2.0
> -all:
>  
>  TEST_PROGS := verificationtest-ktap
>  TEST_FILES := test.d settings
>  EXTRA_CLEAN := $(OUTPUT)/logs/*
>  
> +# Subdirectories that provide helper binaries for the test runner.
> +# Each entry must contain a Makefile that accepts OUTDIR= and deposits
> +# its binaries there; verificationtest-ktap adds OUTDIR to PATH so
> +# the ftracetest require-checks resolve the binaries by name.
> +MONITOR_SUBDIRS := tlob
> +
>  include ../lib.mk
> +
> +# Build and clean each monitor subdirectory.
> +all: $(patsubst %,_build_%,$(MONITOR_SUBDIRS))
> +
> +clean: $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +.PHONY: $(patsubst %,_build_%,$(MONITOR_SUBDIRS)) \
> +        $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +$(patsubst %,_build_%,$(MONITOR_SUBDIRS)): _build_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" TOOLS_INCLUDES="$(TOOLS_INCLUDES)"
> +
> +$(patsubst %,_clean_%,$(MONITOR_SUBDIRS)): _clean_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" clean
> diff --git a/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> new file mode 100644
> index 000000000000..54ae249af9a6
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> @@ -0,0 +1,36 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob ioctl self-instrumentation (within/over-budget,
> error paths)
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +
> +[ -c /dev/rv ] || exit_unsupported
> +
> +echo 1 > monitors/tlob/enable
> +
> +# within budget: 50 ms threshold, 10 ms workload
> +"$TLOB_HELPER" within_budget
> +
> +# over budget in running state: 1 ms threshold, 100 ms busy-spin
> +"$TLOB_HELPER" over_budget_running
> +
> +# over budget in sleeping state: 3 ms threshold, 50 ms sleep
> +"$TLOB_HELPER" over_budget_sleeping
> +
> +# over budget in waiting state: 1 us threshold, sched_yield
> +"$TLOB_HELPER" over_budget_waiting
> +
> +# error paths
> +"$TLOB_HELPER" double_start
> +"$TLOB_HELPER" stop_no_start
> +
> +# per-thread isolation
> +"$TLOB_HELPER" multi_thread
> +
> +# bind against disabled monitor must return ENODEV, not crash
> +echo 0 > monitors/tlob/enable
> +"$TLOB_HELPER" not_enabled
> +echo 1 > monitors/tlob/enable
> +
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> new file mode 100644
> index 000000000000..5d1e7cc02498
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> @@ -0,0 +1,17 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob monitor tracefs interface (enable/disable and files)
> +# requires: tlob:monitor
> +
> +check_requires monitors/tlob/enable monitors/tlob/desc monitors/tlob/monitor
> +
> +# enable / disable via the enable file
> +echo 1 > monitors/tlob/enable
> +grep -q 1 monitors/tlob/enable
> +echo "tlob" >> enabled_monitors
> +grep -q tlob enabled_monitors
> +
> +echo 0 > monitors/tlob/enable
> +grep -q 0 monitors/tlob/enable
> +echo "!tlob" >> enabled_monitors
> +! grep -q "^tlob$" enabled_monitors
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> new file mode 100644
> index 000000000000..41e20d593855
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> @@ -0,0 +1,34 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe binding (visible in monitor file, removable,
> duplicate rejected)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > monitors/tlob/enable
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=5000000" >
> "$TLOB_MONITOR"
> +
> +# Binding must appear in monitor file with canonical hex-offset format.
> +grep -qE "^p ${UPROBE_TARGET}:0x[0-9a-f]+ 0x[0-9a-f]+ threshold=[0-9]+$"
> "$TLOB_MONITOR"
> +grep -q "threshold=5000000" "$TLOB_MONITOR"
> +
> +# Duplicate offset_start must be rejected.
> +! echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=9999" >
> "$TLOB_MONITOR" 2>/dev/null
> +
> +# Remove the binding; it must no longer appear.
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR"
> +! grep -q "^p .*:0x${busy_offset#0x} " "$TLOB_MONITOR"
> +
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > monitors/tlob/enable
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> new file mode 100644
> index 000000000000..2b8656e0fef1
> --- /dev/null
> +++
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> @@ -0,0 +1,47 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail sleeping (sleeping_ns dominates when task
> blocks between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done
> 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 5000 sleep &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 50 ms budget; task sleeps 200 ms per iteration -> sleeping_ns dominates.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=50000" >
> "$TLOB_MONITOR"
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$sleeping" -gt "$((running + waiting))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> new file mode 100644
> index 000000000000..0705854f24df
> --- /dev/null
> +++
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail waiting (waiting_ns dominates when task is
> preempted between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +command -v chrt    > /dev/null || exit_unsupported
> +command -v taskset > /dev/null || exit_unsupported
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET"
> tlob_preempt_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ]  || exit_unsupported
> +
> +cpu=0
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Register probe before the target starts so the start uprobe fires on the
> +# first entry to tlob_preempt_work. Budget: 500 ms.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=500000" >
> "$TLOB_MONITOR"
> +
> +# Target starts; start probe fires on tlob_preempt_work entry.
> +taskset -c "$cpu" "$UPROBE_TARGET" 5000 preempt &
> +busy_pid=$!
> +sleep 0.05
> +
> +# RT hog on the same CPU preempts the target; target stays in waiting state
> +# (runnable, off-CPU) until the budget expires -> waiting_ns dominates.
> +chrt -f 99 taskset -c "$cpu" sh -c 'while true; do :; done' 2>/dev/null &
> +hog_pid=$!
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$hog_pid" 2>/dev/null; wait "$hog_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +[ "$waiting" -gt "$((running + sleeping))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> new file mode 100644
> index 000000000000..c4b8f7108ae9
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test two uprobe bindings on same binary (different offsets
> fire independently)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +busy_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +sleep_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work
> 2>/dev/null)
> +sleep_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ]  || exit_unsupported
> +[ -n "$busy_stop" ]    || exit_unsupported
> +[ -n "$sleep_offset" ] || exit_unsupported
> +[ -n "$sleep_stop" ]   || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &       # busy mode: tlob_busy_work fires every 200 ms
> +busy_pid=$!
> +"$UPROBE_TARGET" 30000 sleep & # sleep mode: tlob_sleep_work fires every 200
> ms
> +sleep_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Binding A: 5 s budget on the busy probe - must not fire in 200 ms loops.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${busy_stop} threshold=5000000" >
> "$TLOB_MONITOR"
> +# Binding B: 10 us budget on the sleep probe - fires on first invocation.
> +echo "p ${UPROBE_TARGET}:${sleep_offset} ${sleep_stop} threshold=10" >
> "$TLOB_MONITOR"
> +
> +# Wait up to 2 s for error_env_tlob from binding B.
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "error_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +echo "-${UPROBE_TARGET}:${sleep_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$sleep_pid" 2>/dev/null; wait "$sleep_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +
> +[ "$found" = "1" ]
> +# error_env_tlob payload: label and clock variable must be present.
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "budget_exceeded"
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "clk_elapsed="
> +# detail_env_tlob must appear alongside the error.
> +grep -q "detail_env_tlob" /sys/kernel/tracing/trace
> +
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> new file mode 100644
> index 000000000000..4a74853346e3
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> @@ -0,0 +1,19 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test no spurious error_env_tlob events without an active
> uprobe binding
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +sleep 0.5
> +
> +! grep -q "error_env_tlob" /sys/kernel/tracing/trace
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo > /sys/kernel/tracing/trace
> diff --git
> a/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> new file mode 100644
> index 000000000000..624fdb950f6b
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe violation (error_env_tlob and detail_env_tlob fire
> with correct fields)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work
> 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done
> 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 10 us budget - fires almost immediately; task is busy-spinning on-CPU.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=10" >
> "$TLOB_MONITOR"
> +
> +# wait up to 2 s for detail_env_tlob
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1;
> break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +# error_env_tlob event label must be budget_exceeded
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q
> "budget_exceeded"
> +
> +# detail_env_tlob must have all five fields with the correct threshold
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +echo "$line" | grep -q "pid="
> +echo "$line" | grep -q "threshold_us=10"
> +echo "$line" | grep -q "running_ns="
> +echo "$line" | grep -q "waiting_ns="
> +echo "$line" | grep -q "sleeping_ns="
> +
> +# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$running" -gt "$sleeping" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/tlob/Makefile
> b/tools/testing/selftests/verification/tlob/Makefile
> new file mode 100644
> index 000000000000..1bedf946cb34
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/Makefile
> @@ -0,0 +1,21 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Builds tlob selftest helper binaries.
> +#
> +# Invoked by ../Makefile; pass OUTDIR to control the output directory
> +# and TOOLS_INCLUDES for the in-tree UAPI -isystem flag.
> +
> +OUTDIR ?= $(CURDIR)/..
> +CFLAGS += $(TOOLS_INCLUDES)
> +
> +.PHONY: all
> +all: $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> +
> +$(OUTDIR)/tlob_ioctl: tlob_ioctl.c
> +	$(CC) $(CFLAGS) -o $@ $< -lpthread
> +
> +$(OUTDIR)/tlob_target: tlob_target.c
> +	$(CC) $(CFLAGS) -o $@ $<
> +
> +.PHONY: clean
> +clean:
> +	$(RM) $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> diff --git a/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> new file mode 100644
> index 000000000000..abb4e2e80a2c
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> @@ -0,0 +1,626 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_ioctl.c - ioctl test driver and ELF utility for tlob selftests
> + *
> + * Usage: tlob_ioctl <subcommand> [args...]
> + *
> + *   not_enabled          - TRACE_START without monitor enabled -> ENODEV
> + *   within_budget        - sleep within budget -> 0
> + *   over_budget_running  - busy-spin past budget -> EOVERFLOW
> + *   over_budget_sleeping - sleep past budget -> EOVERFLOW
> + *   over_budget_waiting  - sched_yield into waiting state -> EOVERFLOW
> + *   double_start         - two starts without stop -> EALREADY
> + *   stop_no_start        - stop without start -> EINVAL
> + *   multi_thread         - two fds: thread A within budget, thread B over
> + *   bench                - TRACE_START/STOP latency (TAP output, always
> passes)
> + *   sym_offset <binary> <symbol> - print ELF file offset of symbol
> + *
> + * Exit: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <sched.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +
> +	rv_fd = open("/dev/rv", O_RDWR);
> +	if (rv_fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return -1;
> +	}
> +	if (ioctl(rv_fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "bind tlob: %s\n", strerror(errno));
> +		close(rv_fd);
> +		rv_fd = -1;
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < us * 1000UL);
> +}
> +
> +static int trace_start(uint64_t threshold_us)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = threshold_us,
> +	};
> +
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int trace_stop(void)
> +{
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* Synchronous TRACE_START / TRACE_STOP tests */
> +
> +/* Bind to a disabled monitor must return ENODEV without crashing */
> +static int test_not_enabled(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return 2; /* skip */
> +	}
> +
> +	ret = ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> +	close(fd);
> +
> +	if (ret == 0) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got success\n");
> +		return 1;
> +	}
> +	if (errno != ENODEV) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> +	int ret;
> +
> +	/* 50 ms budget */
> +	if (trace_start(50000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(10000); /* 10 ms */
> +	ret = trace_stop();
> +	if (ret != 0) {
> +		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> +			ret, strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_running(void)
> +{
> +	int ret;
> +
> +	/* 1 ms budget */
> +	if (trace_start(1000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	busy_spin_us(100000); /* 100 ms */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_sleeping(void)
> +{
> +	int ret;
> +
> +	/* 3 ms budget */
> +	if (trace_start(3000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(50000); /* 50 ms; sleeping time counts toward budget */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_waiting(void)
> +{
> +	int ret;
> +
> +	/* 1 us budget */
> +	if (trace_start(1) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	sched_yield(); /* running -> waiting -> running */
> +	busy_spin_us(10); /* 10 us >> 1 us budget; hrtimer fires during spin */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Error-handling tests */
> +
> +static int test_double_start(void)
> +{
> +	int ret;
> +
> +	/* 10 s: large enough the hrtimer won't fire during the test */
> +	if (trace_start(10000000ULL) < 0) {
> +		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	ret = trace_start(10000000);
> +	if (ret == 0) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got 0\n");
> +		trace_stop();
> +		return 1;
> +	}
> +	if (errno != EALREADY) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got %s\n",
> +			strerror(errno));
> +		trace_stop();
> +		return 1;
> +	}
> +	trace_stop();
> +	return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> +	int ret;
> +
> +	/* Ensure clean state: ignore error from a stale entry */
> +	trace_stop();
> +
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Two threads, each with its own fd: A within budget, B over budget. */
> +
> +struct mt_thread_args {
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           expect_eoverflow;
> +	int           result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> +	struct mt_thread_args *a = arg;
> +	struct tlob_start_args args = { .threshold_us = a->threshold_us };
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "thread open /dev/rv: %s\n", strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +	if (ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "thread bind tlob: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret < 0) {
> +		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	if (a->expect_eoverflow) {
> +		if (ret == 0 || errno != EOVERFLOW) {
> +			fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	} else {
> +		if (ret != 0) {
> +			fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	}
> +	close(fd);
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> +	pthread_t ta, tb;
> +	struct mt_thread_args a = {
> +		.threshold_us     = 20000,   /* 20 ms */
> +		.workload_us      = 5000,    /* 5 ms sleep -> within budget */
> +		.busy             = 0,
> +		.expect_eoverflow = 0,
> +	};
> +	struct mt_thread_args b = {
> +		.threshold_us     = 3000,    /* 3 ms */
> +		.workload_us      = 30000,   /* 30 ms spin -> over budget */
> +		.busy             = 1,
> +		.expect_eoverflow = 1,
> +	};
> +
> +	pthread_create(&ta, NULL, mt_thread_fn, &a);
> +	pthread_create(&tb, NULL, mt_thread_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/*
> + * Benchmark TRACE_START, TRACE_STOP, and round-trip ioctls.
> + * Output uses TAP '#' prefix; always returns 0.
> + */
> +#define BENCH_WARMUP  32
> +#define BENCH_N      1000
> +
> +static long long timespec_diff_ns(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
> +		+ (b->tv_nsec - a->tv_nsec);
> +}
> +
> +static int test_bench(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 10000000ULL, /* 10 s */
> +	};
> +	struct timespec t0, t1;
> +	long long total_start_ns = 0, total_stop_ns = 0, total_rt_ns = 0;
> +	int i;
> +
> +	/* warm up */
> +	for (i = 0; i < BENCH_WARMUP; i++) {
> +		if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) == 0)
> +			ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* start only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_start_ns += timespec_diff_ns(&t0, &t1);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* stop only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_stop_ns += timespec_diff_ns(&t0, &t1);
> +	}
> +
> +	/* round-trip */
> +	clock_gettime(CLOCK_MONOTONIC, &t0);
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +	clock_gettime(CLOCK_MONOTONIC, &t1);
> +	total_rt_ns = timespec_diff_ns(&t0, &t1);
> +
> +	printf("# start ioctl only:      %lld ns/iter (N=%d, includes syscall)\n",
> +	       total_start_ns / BENCH_N, BENCH_N);
> +	printf("# stop ioctl only:       %lld ns/iter (N=%d, includes syscall)\n",
> +	       total_stop_ns / BENCH_N, BENCH_N);
> +	printf("# start+stop roundtrip:  %lld ns/iter (N=%d, includes 2 syscalls)\n",
> +	       total_rt_ns / BENCH_N, BENCH_N);
> +	return 0;
> +}
> +
> +/*
> + * Print the ELF file offset of <symname> in <binary>.  Walks .symtab
> + * (falling back to .dynsym) and converts vaddr to file offset via PT_LOAD.
> + * Supports 32- and 64-bit ELF.
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> +	int fd;
> +	struct stat st;
> +	void *map;
> +	Elf64_Ehdr *ehdr;
> +	Elf32_Ehdr *ehdr32;
> +	int is64;
> +	uint64_t sym_vaddr = 0;
> +	int found = 0;
> +	uint64_t file_offset = 0;
> +
> +	fd = open(binary, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> +		return 1;
> +	}
> +	if (fstat(fd, &st) < 0) {
> +		close(fd);
> +		return 1;
> +	}
> +	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	ehdr = (Elf64_Ehdr *)map;
> +	ehdr32 = (Elf32_Ehdr *)map;
> +	if (st.st_size < 4 ||
> +	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> +	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> +	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> +	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> +		fprintf(stderr, "%s: not an ELF file\n", binary);
> +		munmap(map, (size_t)st.st_size);
> +		return 1;
> +	}
> +	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> +	if (is64) {
> +		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
> +		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +
> +		/* prefer .symtab; fall back to .dynsym */
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr->e_shnum && !found; si++) {
> +				Elf64_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
> +				uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
> +				uint64_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		/* Convert vaddr to file offset via PT_LOAD segments */
> +		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr->e_phnum; pi++) {
> +			Elf64_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr >= ph->p_vaddr &&
> +			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* 32-bit ELF */
> +		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
> +		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +		uint32_t sym_vaddr32 = 0;
> +
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> +				Elf32_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
> +				uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
> +				uint32_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr32 = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> +			Elf32_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr32 >= ph->p_vaddr &&
> +			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +		sym_vaddr = sym_vaddr32;
> +	}
> +
> +	munmap(map, (size_t)st.st_size);
> +
> +	if (!file_offset && sym_vaddr) {
> +		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> +			(unsigned long)sym_vaddr);
> +		return 1;
> +	}
> +
> +	printf("0x%lx\n", (unsigned long)file_offset);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int rc;
> +
> +	if (argc < 2) {
> +		fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
> +		return 1;
> +	}
> +
> +	/* sym_offset does not need /dev/rv */
> +	if (strcmp(argv[1], "sym_offset") == 0) {
> +		if (argc < 4) {
> +			fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
> +				argv[0]);
> +			return 1;
> +		}
> +		return sym_offset(argv[2], argv[3]);
> +	}
> +
> +	/* not_enabled: monitor is disabled; bind must return ENODEV without open_rv() */
> +	if (strcmp(argv[1], "not_enabled") == 0)
> +		return test_not_enabled();
> +
> +	if (open_rv() < 0)
> +		return 2; /* skip */
> +
> +	if (strcmp(argv[1], "bench") == 0)
> +		rc = test_bench();
> +	else if (strcmp(argv[1], "within_budget") == 0)
> +		rc = test_within_budget();
> +	else if (strcmp(argv[1], "over_budget_running") == 0)
> +		rc = test_over_budget_running();
> +	else if (strcmp(argv[1], "over_budget_sleeping") == 0)
> +		rc = test_over_budget_sleeping();
> +	else if (strcmp(argv[1], "over_budget_waiting") == 0)
> +		rc = test_over_budget_waiting();
> +	else if (strcmp(argv[1], "double_start") == 0)
> +		rc = test_double_start();
> +	else if (strcmp(argv[1], "stop_no_start") == 0)
> +		rc = test_stop_no_start();
> +	else if (strcmp(argv[1], "multi_thread") == 0)
> +		rc = test_multi_thread();
> +	else {
> +		fprintf(stderr, "Unknown test: %s\n", argv[1]);
> +		rc = 1;
> +	}
> +
> +	close(rv_fd);
> +	return rc;
> +}
> diff --git a/tools/testing/selftests/verification/tlob/tlob_target.c b/tools/testing/selftests/verification/tlob/tlob_target.c
> new file mode 100644
> index 000000000000..0fdbc575d71d
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_target.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides three start/stop probe pairs, each designed to exercise a
> + * different dominant component of the detail_env_tlob ns breakdown:
> + *
> + *   tlob_busy_work    / tlob_busy_work_done    - busy-spin: running_ns dominates
> + *   tlob_sleep_work   / tlob_sleep_work_done   - nanosleep: sleeping_ns dominates
> + *   tlob_preempt_work / tlob_preempt_work_done - busy-spin: waiting_ns dominates
> + *                                                (needs an RT competitor on the same CPU)
> + *
> + * Usage: tlob_target <duration_ms> [mode]
> + *
> + * mode is one of: busy (default), sleep, preempt.
> + * Loops in 200 ms iterations until <duration_ms> has elapsed
> + * (0 = run for ~24 hours).
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return a->tv_sec < b->tv_sec ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> +	ts->tv_sec  += ms / 1000;
> +	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> +	if (ts->tv_nsec >= 1000000000L) {
> +		ts->tv_sec++;
> +		ts->tv_nsec -= 1000000000L;
> +	}
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_busy_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; busy-spin so running_ns dominates */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ns);
> +
> +	tlob_busy_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_sleep_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; nanosleep so sleeping_ns dominates */
> +noinline void tlob_sleep_work(unsigned long duration_ms)
> +{
> +	struct timespec ts = {
> +		.tv_sec  = duration_ms / 1000,
> +		.tv_nsec = (long)(duration_ms % 1000) * 1000000L,
> +	};
> +	nanosleep(&ts, NULL);
> +	tlob_sleep_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_preempt_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/*
> + * start probe; busy-spin so an RT competitor on the same CPU drives
> + * waiting_ns (prev_state==0 -> preempt event, task stays runnable off-CPU).
> + */
> +noinline void tlob_preempt_work(unsigned long duration_ms)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ms * 1000000UL);
> +
> +	tlob_preempt_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	unsigned long duration_ms = 0;
> +	const char *mode = "busy";
> +	struct timespec deadline, now;
> +
> +	if (argc >= 2)
> +		duration_ms = strtoul(argv[1], NULL, 10);
> +	if (argc >= 3)
> +		mode = argv[2];
> +
> +	clock_gettime(CLOCK_MONOTONIC, &deadline);
> +	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> +	do {
> +		if (strcmp(mode, "sleep") == 0)
> +			tlob_sleep_work(200);
> +		else if (strcmp(mode, "preempt") == 0)
> +			tlob_preempt_work(200);
> +		else
> +			tlob_busy_work(200 * 1000000UL);
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +	} while (timespec_before(&now, &deadline));
> +
> +	return 0;
> +}



* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Wen Yang @ 2026-05-13  5:32 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8e80cbcf739304de95356f1fac677261628977fa.camel@redhat.com>



On 5/12/26 17:09, Gabriele Monaco wrote:
> On Tue, 2026-05-12 at 10:27 +0200, Gabriele Monaco wrote:
>> On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
>>> From: Wen Yang <wen.yang@linux.dev>
>>>
>>> The following two paths race:
>>>
>>>    CPU 0 (disable_stall/__rv_disable_monitor)  CPU 1 (wwnr probe handler)
>> 							^ did you mean stall?
> 
> Ok I got it now, so essentially you'd reproduce it like:
> 
> * start a DA per-task monitor (no timer)
> * stop it, a handler is still running after reset, it sets monitoring back to 1
> * start an HA per-task monitor
> 
> that would use the same slot that is now looking like:
> 
>   { monitoring = 1, timer.function = NULL }
> 
> because it was not initialised as HA but monitoring was reset in the race.
> 
> Thinking about this again, it isn't just an issue with per-task monitors, all
> monitors reusing slots would suffer from it.
> Besides, relying on monitoring can be fragile when using LTL monitors on the
> same task (those don't even have monitoring).
> 
> Perhaps the solution isn't that trivial, I'm going to give one more thought on
> it, but thanks again for bringing this up!
> 
> Gabriele
> 
>>>    ------------------------------------------  -----------------------------
>>>    disable_stall()
>>>      da_monitor_destroy()
>>>        da_monitor_reset_all()          <------ [task T: monitoring=0]
>>>                                                da_monitor_start(&T->rv[n])
>>>                                                /* no timer_setup */
>>>                                                 monitoring=1  <----
>>>    tracepoint_synchronize_unregister()
>>>    // CPU 1 probe has already returned; sync returns
>>>
>>> Later, enable_stall() acquires the same slot and calls da_monitor_init():
>>>
>>>    da_monitor_reset_all()
>>>      da_monitor_reset(&T->rv[slot])    // monitoring=1, timer.function==0
>>>        ha_monitor_reset_env()
>>>          ha_cancel_timer()
>>>            timer_delete(&ha_mon->timer)  // ODEBUG: timer never initialised
>>>
>>>    ODEBUG: assert_init not available (active state 0)
>>>    object type: timer_list
>>>    Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
>>>
>>> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
>>> before da_monitor_reset_all().  The unregister_trace_xxx() calls in the
>>> monitor's disable() have already disconnected the tracepoints; the sync
>>> here drains any handler still in flight, so no new monitoring=1 can
>>> appear after da_monitor_reset_all() clears the slot.
>>>
>>> Also fix the slot release ordering: release the slot only after
>>> reset_all() to avoid accessing rv[] with an out-of-bounds index.
>>>
>>> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
>>> Signed-off-by: Wen Yang <wen.yang@linux.dev>
>>> ---
>>
>> Thanks for the fix, I have a similar one waiting for submission.
>>
>> These are technically 2 separate fixes though: the ordering with unset
>> task_mon_slot (independent on HA) and the synchronisation with pending
>> tracepoints. They probably deserve separate patches and visibility, the first
>> has always been around and we're technically overwriting who knows what.
>>
>>
>> The explanation above is a bit hard to follow though, are you talking about a
>> handler for the same (stall) monitor running after the reset, effectively
>> undoing it by setting the monitoring flag?
>>
>> Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
>> environment.
>>
>> So that's basically what you'd see now much more often because in fact we
>> don't reset the right slot (though, again, that's a different issue).
>>
>>
>> Calling tracepoint_synchronize_unregister() there too would surely fix, but it
>> used to be kinda slow. But it's probably gotten faster since now tracepoints
>> use SRCU, so we can wait for a dedicated grace period.
>>
>> I liked the idea to wait cumulatively in the end, but that's just making
>> things harder.. Let's do like this:
>>
>> Prepare 2 separate patches as fixes, put the task slot one first (would ease
>> backporting), mention this issue with the race condition only in the second.
>> You can send them independently and I'll add them to the tree as urgent.
>>
>>
>> I'm soon going to send my set of fixes that will also include the task slot
>> patch (not removing to ease my life with conflicts).
>>

Hi Gabriele,

Thanks for both messages.  Two patches are ready; let me address
your follow-up concerns before sending.

   1. "all monitors reusing slots would suffer from it"

      Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
      pool.  RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
      storage (a single static variable and a per-cpu variable,
      respectively) and never share slots across monitor types.  The
      race is exclusive to PER_TASK, so fixing that variant's
      da_monitor_destroy() is the correct scope.

   2. "LTL monitors don't even have monitoring"

      tracepoint_synchronize_unregister() does not rely on the
      monitoring flag at all.  It is a system-wide barrier — it
      calls synchronize_rcu_tasks_trace() followed by
      synchronize_srcu(&tracepoint_srcu) — draining every in-flight
      tracepoint handler on every CPU regardless of which monitor
      dispatched it.  LTL handlers are covered without any special
      treatment.
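
The ordering argument behind the barrier can be illustrated with a
small single-threaded model.  This is only a sketch under stated
assumptions: struct slot_model and every function below are
hypothetical stand-ins, not kernel APIs, and the "in-flight handler"
is modelled as a flag that the barrier resolves rather than as a real
concurrent probe.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one rv[] slot plus one in-flight probe handler. */
struct slot_model {
	bool monitoring;
	bool handler_in_flight;
};

/* Models the tail of a non-HA probe handler: it sets monitoring back to 1. */
static void handler_finish(struct slot_model *s)
{
	if (s->handler_in_flight) {
		s->monitoring = true;
		s->handler_in_flight = false;
	}
}

/* Models the barrier: it returns only once pending handlers have finished. */
static void drain_handlers(struct slot_model *s)
{
	handler_finish(s);
}

/* Models the reset of the slot: monitoring is cleared. */
static void reset_slot(struct slot_model *s)
{
	s->monitoring = false;
}

/* Reset first, barrier second: returns the monitoring flag left behind. */
bool destroy_reset_then_drain(struct slot_model *s)
{
	reset_slot(s);
	drain_handlers(s);
	return s->monitoring;
}

/* Barrier first, reset second: returns the monitoring flag left behind. */
bool destroy_drain_then_reset(struct slot_model *s)
{
	drain_handlers(s);
	reset_slot(s);
	return s->monitoring;
}
```

With a handler in flight, the reset-first order leaves monitoring set
in the released slot, while the drain-first order leaves it clear.
The model never reads the flag to decide when to wait, which is the
property point 2 relies on.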

The slot-ordering issue (patch 1) affects all per-task DA monitors,
not only HA ones, which is what you meant by "independent on HA":
RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
past the end of rv[]), so da_monitor_reset_all() overwrites whatever
follows rv[] in task_struct whenever any per-task monitor is
disabled.
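
A minimal user-space sketch of the slot-ordering problem, assuming a
two-entry array in place of rv[].  NR_SLOTS, SLOT_UNSET, struct
task_model and the functions are hypothetical names chosen for
illustration; the model refuses an out-of-range index instead of
writing through it, whereas the description above says the real code
writes past the array.

```c
#include <assert.h>

#define NR_SLOTS   2          /* stands in for CONFIG_RV_PER_TASK_MONITORS */
#define SLOT_UNSET NR_SLOTS   /* stands in for RV_PER_TASK_MONITOR_INIT */

/* Hypothetical model of the per-task storage and whatever follows it. */
struct task_model {
	int rv[NR_SLOTS];
	int canary;
};

/* Returns 1 if the slot was reset, 0 if the index is out of bounds. */
static int reset_rv_slot(struct task_model *t, int slot)
{
	if (slot < 0 || slot >= NR_SLOTS)
		return 0;	/* the model refuses; it does not write past rv[] */
	t->rv[slot] = 0;
	return 1;
}

/* Release the slot first, then reset: the reset sees the sentinel. */
int destroy_release_first(struct task_model *t, int *slot)
{
	*slot = SLOT_UNSET;
	return reset_rv_slot(t, *slot);
}

/* Reset while the slot is still valid, then release it. */
int destroy_reset_first(struct task_model *t, int *slot)
{
	int ok;

	assert(*slot != SLOT_UNSET);	/* disabling a disabled monitor */
	ok = reset_rv_slot(t, *slot);
	*slot = SLOT_UNSET;
	return ok;
}
```

Calling destroy_release_first() on a task whose monitor holds slot 1
leaves rv[1] untouched because the index has already become the
sentinel; destroy_reset_first() clears rv[1] and only then gives the
slot back.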

Also corrected "wwnr probe handler" to "stall probe handler" in
patch 2 per your annotation.

Please let me know if the above reasoning addresses your concerns.


--
Best wishes,
Wen

>>
>>>   include/rv/da_monitor.h | 18 ++++++++++++++++--
>>>   1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
>>> index 00ded3d5ab3f..d04bb3229c75 100644
>>> --- a/include/rv/da_monitor.h
>>> +++ b/include/rv/da_monitor.h
>>> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>>>   
>>>   /*
>>>    * da_monitor_destroy - return the allocated slot
>>> + *
>>> + * Call tracepoint_synchronize_unregister() before reset_all() to close
>>> + * the race where an in-flight non-HA probe handler sets monitoring=1
>>> + * (without calling timer_setup()) after da_monitor_reset_all() has
>>> + * already cleared the slot but before the caller's own sync completes.
>>> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
>>> + * the same slot would call timer_delete() on a never-initialised
>>> + * timer_list, triggering ODEBUG warnings.
>>> + *
>>> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
>>> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
>>> + * The caller's own __rv_disable_monitor() issues a second sync after
>>> + * returning from disable(); that redundant call is harmless on the
>>> + * infrequent admin (enable/disable) path.
>>>    */
>>>   static inline void da_monitor_destroy(void)
>>>   {
>>> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
>>>   		WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
>>>   		return;
>>>   	}
>>> +	tracepoint_synchronize_unregister();
>>> +	da_monitor_reset_all();
>>>   	rv_put_task_monitor_slot(task_mon_slot);
>>>   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>>> -
>>> -	da_monitor_reset_all();
>>>   }
>>>   
>>>   #elif RV_MON_TYPE == RV_MON_PER_OBJ
> 


* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  3:04 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605131057.E7FZbuAc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605131057.E7FZbuAc-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     358 |         TP_printk("folio=%p lru=%s",
         |                   ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
     219 |         trace_event_printf(iter, print);                                \
         |                                  ^~~~~
   include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
      45 |                              PARAMS(print));                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
     358 |         TP_printk("folio=%p lru=%s",
         |         ^~~~~~~~~
   In file included from include/trace/trace_events.h:256:
   include/trace/events/vmscan.h:358:27: note: format string is defined here
     358 |         TP_printk("folio=%p lru=%s",
         |                          ~^
         |                           |
         |                           void *
         |                          %ld
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
     427 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
     435 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~
   In file included from include/trace/define_trace.h:133:
   include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
      51 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
      67 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~


vim +354 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
 > 354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
   358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	
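
The two diagnostics above come from one mismatch: the entry field is declared as unsigned long, but it is assigned a struct folio * and printed with "%p". As a rough userspace sketch only (the struct and function names below are hypothetical stand-ins, not the kernel macros, and neither option is confirmed as the author's intended fix), the types can be made consistent in either direction. It assumes unsigned long is pointer-sized, as it is on Linux:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct folio { int unused; };	/* stand-in for the kernel type */

/* Option A: keep an integer field, cast on assignment, print as hex. */
struct entry_ulong { unsigned long folio; int lru; };

static void assign_ulong(struct entry_ulong *e, struct folio *folio, int lru)
{
	assert(e != NULL);
	/* The explicit cast is what -Wint-conversion asks for. */
	e->folio = (unsigned long)(uintptr_t)folio;
	e->lru = lru;
}

static int print_ulong(char *buf, size_t len, const struct entry_ulong *e)
{
	/* "%lx" matches unsigned long, so -Wformat stays quiet. */
	return snprintf(buf, len, "folio=%lx lru=%d", e->folio, e->lru);
}

/* Option B: make the field a pointer so that "%p" matches it. */
struct entry_ptr { void *folio; int lru; };

static void assign_ptr(struct entry_ptr *e, struct folio *folio, int lru)
{
	assert(e != NULL);
	e->folio = folio;	/* pointer to void *: no cast needed */
	e->lru = lru;
}

static int print_ptr(char *buf, size_t len, const struct entry_ptr *e)
{
	return snprintf(buf, len, "folio=%p lru=%d", e->folio, e->lru);
}
```

In tracepoint terms, option A would correspond to a cast in TP_fast_assign() plus an integer format in TP_printk(), and option B to declaring the field with a pointer type. Which one suits this event is for the patch author and maintainers to decide.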

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  1:19 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: xtensa-randconfig-001-20260513 (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130942.9wJFWm9M-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130942.9wJFWm9M-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:358:12: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     TP_printk("folio=%p lru=%s",
               ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:27: note: in definition of macro 'DECLARE_EVENT_CLASS'
     trace_event_printf(iter, print);    \
                              ^~~~~
   include/trace/trace_events.h:45:9: note: in expansion of macro 'PARAMS'
            PARAMS(print));         \
            ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(mm_vmscan_write_folio,
    ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:2: note: in expansion of macro 'TP_printk'
     TP_printk("folio=%p lru=%s",
     ^~~~~~~~~
   In file included from include/trace/trace_events.h:256,
                    from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h:358:20: note: format string is defined here
     TP_printk("folio=%p lru=%s",
                      ~^
                      %ld
   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:354:18: warning: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
      __entry->folio = folio;
                     ^
   include/trace/trace_events.h:427:4: note: in definition of macro '__DECLARE_EVENT_CLASS'
     { assign; }       \
       ^~~~~~
   include/trace/trace_events.h:435:9: note: in expansion of macro 'PARAMS'
            PARAMS(assign), PARAMS(print))   \
            ^~~~~~
   include/trace/trace_events.h:40:2: note: in expansion of macro 'DECLARE_EVENT_CLASS'
     DECLARE_EVENT_CLASS(name,          \
     ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:9: note: in expansion of macro 'PARAMS'
            PARAMS(assign),         \
            ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
    TRACE_EVENT(mm_vmscan_write_folio,
    ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:2: note: in expansion of macro 'TP_fast_assign'
     TP_fast_assign(
     ^~~~~~~~~~~~~~


vim +354 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
 > 354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
   358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: kernel test robot @ 2026-05-13  0:45 UTC (permalink / raw)
  To: qiwu.chen, rostedt, mhiramat, akpm, hannes, david, mhocko, willy
  Cc: oe-kbuild-all, linux-trace-kernel, linux-mm, qiwu.chen
In-Reply-To: <20260506083652.100160-1-qiwu.chen@transsion.com>

Hi qiwu.chen,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v7.1-rc3]
[cannot apply to akpm-mm/mm-everything trace/for-next next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/qiwu-chen/mm-vmscan-rework-lru_shrink-and-write_folio-tracepoints/20260513-040720
base:   linus/master
patch link:    https://lore.kernel.org/r/20260506083652.100160-1-qiwu.chen%40transsion.com
patch subject: [PATCH v2] mm: vmscan: rework lru_shrink and write_folio tracepoints
config: sh-defconfig (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260513/202605130842.zWTTtyaL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605130842.zWTTtyaL-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:132,
                    from include/trace/events/vmscan.h:602,
                    from mm/vmscan.c:72:
   include/trace/events/vmscan.h: In function 'trace_raw_output_mm_vmscan_write_folio':
>> include/trace/events/vmscan.h:358:19: warning: format '%p' expects argument of type 'void *', but argument 3 has type 'long unsigned int' [-Wformat=]
     358 |         TP_printk("folio=%p lru=%s",
         |                   ^~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:219:34: note: in definition of macro 'DECLARE_EVENT_CLASS'
     219 |         trace_event_printf(iter, print);                                \
         |                                  ^~~~~
   include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
      45 |                              PARAMS(print));                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:358:9: note: in expansion of macro 'TP_printk'
     358 |         TP_printk("folio=%p lru=%s",
         |         ^~~~~~~~~
   In file included from include/trace/trace_events.h:256:
   include/trace/events/vmscan.h:358:27: note: format string is defined here
     358 |         TP_printk("folio=%p lru=%s",
         |                          ~^
         |                           |
         |                           void *
         |                          %ld
   include/trace/events/vmscan.h: In function 'do_trace_event_raw_event_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/trace_events.h:427:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
     427 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/trace_events.h:435:23: note: in expansion of macro 'PARAMS'
     435 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~
   In file included from include/trace/define_trace.h:133:
   include/trace/events/vmscan.h: In function 'do_perf_trace_mm_vmscan_write_folio':
   include/trace/events/vmscan.h:354:32: error: assignment to 'long unsigned int' from 'struct folio *' makes integer from pointer without a cast [-Wint-conversion]
     354 |                 __entry->folio = folio;
         |                                ^
   include/trace/perf.h:51:11: note: in definition of macro '__DECLARE_EVENT_CLASS'
      51 |         { assign; }                                                     \
         |           ^~~~~~
   include/trace/perf.h:67:23: note: in expansion of macro 'PARAMS'
      67 |                       PARAMS(assign), PARAMS(print))                    \
         |                       ^~~~~~
   include/trace/trace_events.h:40:9: note: in expansion of macro 'DECLARE_EVENT_CLASS'
      40 |         DECLARE_EVENT_CLASS(name,                              \
         |         ^~~~~~~~~~~~~~~~~~~
   include/trace/trace_events.h:44:30: note: in expansion of macro 'PARAMS'
      44 |                              PARAMS(assign),                   \
         |                              ^~~~~~
   include/trace/events/vmscan.h:342:1: note: in expansion of macro 'TRACE_EVENT'
     342 | TRACE_EVENT(mm_vmscan_write_folio,
         | ^~~~~~~~~~~
   include/trace/events/vmscan.h:353:9: note: in expansion of macro 'TP_fast_assign'
     353 |         TP_fast_assign(
         |         ^~~~~~~~~~~~~~


vim +358 include/trace/events/vmscan.h

   343	
   344		TP_PROTO(struct folio *folio),
   345	
   346		TP_ARGS(folio),
   347	
   348		TP_STRUCT__entry(
   349			__field(unsigned long, folio)
   350			__field(int, lru)
   351		),
   352	
   353		TP_fast_assign(
   354			__entry->folio = folio;
   355			__entry->lru = folio_lru_list(folio);
   356		),
   357	
 > 358		TP_printk("folio=%p lru=%s",
   359			__entry->folio,
   360			__print_symbolic(__entry->lru, LRU_NAMES))
   361	);
   362	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH RFC v5 10/53] KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-05-12 22:30 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <1DAB05E2-7F30-45D7-B155-B66C59D31AFF@infradead.org>

"Liam R. Howlett" <liam@infradead.org> writes:

>
> [...snip...]
>
>>
>>The invariant in this maple tree is that contiguous ranges with the same
>>attribute are stored as a single range.
>>
>>The goal of this first part is to get the entry at the index just after
>>the requested range, and see what the attribute there is. If that
>>attribute is what we're about to set, extend the requested range for
>>storing to the end of that range.
>>
>>If there is another range higher than end + 1, with the invariant
>>maintained, that attribute has to be different than the attribute stored
>>at end. Hence, we only want to extend this requested range up till end.
>>
>
> mas_find() will look for an entry at the given address for the first search, and if it is not found it will continue to search upwards.  Since you limit the search to end, it will work as you want and there isn't a bug as I was thinking in my sleep deprived state.
>
> Since you are searching for exactly one address (end), it might serve you better to walk there.  Maybe walking is a better API for what you are doing here?
>

Thanks again for this tip! I'll try the walk API in the next revision
after v6 [1].

[1] https://lore.kernel.org/all/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com/T/

>
>>> Do you have testing of these functions somewhere?
>>>
>>
>>GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4) tests setting
>>attributes in ranges. If test_page is 2,
>>
>>1. [0, 4) starts off shared (4 is the number of pages in the guest_memfd)
>>2. [2, 3) is converted to private
>>    => so the ranges should now be [0, 2), [2, 3), [3, 4)
>>3. [2, 3) is converted back to shared
>>    => so the ranges should now be [0, 4)
>>
>>I verified this by inserting some trace_printk()s and inspecting manually.
>>
>
> Thanks.  I find the exclusive ranges a bit odd to think about in the maple tree context, but this test case makes sense.  This is especially odd to look at a single index entry, at least for me.
>
> I generally have a set of test cases and append any bug reproduces to that list so they are unlikely to reoccur.  My testing is certainly different from what you'll be doing, but this method has done well with the quality of code improving over time, and limited (if any) regressions.
>

I've not worked directly with the maple tree tests but the xarray tests
(similarly set up, I believe) are a joy to work with.

> I actually insist that any fix has a test before I accept them.  There are two reasons for this: 1. Avoiding the regression. 2. People really understand the bug if they can create a reproducer.
>
> I hope this helps.
>
>

The maple tree tests are set up to directly test maple tree code, but
KVM selftests test from the userspace interface, and it's hard to test
this invariant from userspace.
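
For readers following along, the invariant and the indexing test case above can be modeled in plain userspace C. This is only a sketch under stated assumptions: struct range, struct range_store, store_find() and store_set() are hypothetical names, a small sorted array stands in for the maple tree, and none of this is the guest_memfd code. The neighbour checks mirror the quoted logic: look at last + 1 and start - 1, and widen the stored range when the attribute matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 16

struct range {
	size_t start, last;	/* inclusive bounds, like mas->index and mas->last */
	unsigned long attr;
};

struct range_store {
	struct range r[MAX_RANGES];	/* kept sorted and non-overlapping */
	size_t n;
};

/* Return the range covering @index, or NULL if there is none. */
static const struct range *store_find(const struct range_store *s, size_t index)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->r[i].start <= index && index <= s->r[i].last)
			return &s->r[i];
	return NULL;
}

/* Store @attr over [start, last], merging with equal-valued neighbours. */
static int store_set(struct range_store *s, size_t start, size_t last,
		     unsigned long attr)
{
	struct range_store out = { .n = 0 };
	const struct range *nb;

	assert(start <= last);
	if (s->n + 2 > MAX_RANGES)	/* one split can add two ranges */
		return -1;

	if (last < SIZE_MAX) {
		nb = store_find(s, last + 1);
		if (nb && nb->attr == attr)
			last = nb->last;
	}
	if (start > 0) {
		nb = store_find(s, start - 1);
		if (nb && nb->attr == attr)
			start = nb->start;
	}

	/* Keep whatever lies left of the new range, trimming overlaps. */
	for (size_t i = 0; i < s->n; i++) {
		struct range left = s->r[i];

		if (left.start >= start)
			continue;
		if (left.last >= start)
			left.last = start - 1;
		out.r[out.n++] = left;
	}
	out.r[out.n++] = (struct range){ start, last, attr };
	/* Keep whatever lies right of the new range, trimming overlaps. */
	for (size_t i = 0; i < s->n; i++) {
		struct range right = s->r[i];

		if (right.last <= last)
			continue;
		if (right.start <= last)
			right.start = last + 1;
		out.r[out.n++] = right;
	}
	*s = out;
	return 0;
}
```

Replaying the test case with inclusive bounds (store [0, 3] as shared, then [2, 2] as private, then [2, 2] as shared again) should leave one, three, and then one range in the model.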

>>>> +	if (entry && xa_to_value(entry) == attributes)
>>>> +		last = mas->last;
>>>> +
>>>> +	if (start > 0) {
>>>> +		mas_set_range(mas, start - 1, start - 1);
>>>> +		entry = mas_find(mas, start - 1);
>>>> +		if (entry && xa_to_value(entry) == attributes)
>>>> +			start = mas->index;
>>>> +	}
>>>> +
>>>> +	mas_set_range(mas, start, last);
>>>> +	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>>>> +}
>>>> +
>>>>
>>>> [...snip...]
>>>>

^ permalink raw reply

* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Steven Rostedt @ 2026-05-12 19:41 UTC (permalink / raw)
  To: Chen Jun; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260508122623.74290-1-chenjun102@huawei.com>

On Fri, 8 May 2026 20:26:23 +0800
Chen Jun <chenjun102@huawei.com> wrote:

> Low-level functions have many call paths, and sometimes
> we only care about the calls on a specific call path.
> Add a new filter to filter based on the call stack.
> 
> Usage:
> 1. echo 'caller=="$function_name"' > events/../filter
> 
> Only support OP_EQ and OP_NE

Cute.

> 
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
>  include/linux/trace_events.h       |  1 +
>  kernel/trace/trace.h               |  3 ++-
>  kernel/trace/trace_events.c        |  1 +
>  kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
>  4 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 40a43a4c7caf..1f109669a391 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -851,6 +851,7 @@ enum {
>  	FILTER_COMM,
>  	FILTER_CPU,
>  	FILTER_STACKTRACE,
> +	FILTER_CALLER,
>  };
>  
>  extern int trace_event_raw_init(struct trace_event_call *call);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 80fe152af1dd..4e4b92ce264f 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
>  	       field->filter_type == FILTER_RDYN_STRING ||
>  	       field->filter_type == FILTER_STATIC_STRING ||
>  	       field->filter_type == FILTER_PTR_STRING ||
> -	       field->filter_type == FILTER_COMM;
> +	       field->filter_type == FILTER_COMM ||
> +	       field->filter_type == FILTER_CALLER;
>  }
>  
>  static inline bool is_function_field(struct ftrace_event_field *field)
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..6d220d7eec73 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
>  	__generic_field(char *, comm, FILTER_COMM);
>  	__generic_field(char *, stacktrace, FILTER_STACKTRACE);
>  	__generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
> +	__generic_field(char *, caller, FILTER_CALLER);
>  
>  	return ret;
>  }
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 609325f57942..1cf040065abe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -72,6 +72,7 @@ enum filter_pred_fn {
>  	FILTER_PRED_FN_CPUMASK,
>  	FILTER_PRED_FN_CPUMASK_CPU,
>  	FILTER_PRED_FN_FUNCTION,
> +	FILTER_PRED_FN_CALLER,
>  	FILTER_PRED_FN_,
>  	FILTER_PRED_TEST_VISITED,
>  };
> @@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
>  	return pred->op == OP_EQ ? ret : !ret;
>  }
>  
> +/* Filter predicate for caller. */
> +static int filter_pred_caller(struct filter_pred *pred, void *event)
> +{
> +	unsigned long entries[32];

Let's make that only 16 entries. Adding 256 bytes to the stack in
random places may cause an overflow; 128 bytes isn't as bad. Either that,
or we need to preallocate per-cpu memory and use that, but that makes the
patch much more complex. I'd rather just use 16 entries for now. If
we need more, then we can add the extra complexity.
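
To make the size arithmetic and the range check concrete, here is a rough userspace model. It is a sketch only: CALLER_MAX_ENTRIES, caller_stack_bytes() and caller_in_trace() are hypothetical names, and the 128-byte figure assumes an 8-byte unsigned long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CALLER_MAX_ENTRIES 16	/* hypothetical name for the reduced depth */

/* Stack bytes taken by an entries[] buffer of the reduced depth. */
static size_t caller_stack_bytes(void)
{
	return CALLER_MAX_ENTRIES * sizeof(unsigned long);
}

/* True when any saved return address lies in [start, end). */
static bool caller_in_trace(const unsigned long *entries, size_t nr,
			    unsigned long start, unsigned long end)
{
	assert(entries != NULL || nr == 0);

	for (size_t i = 0; i < nr && i < CALLER_MAX_ENTRIES; i++)
		if (start <= entries[i] && entries[i] < end)
			return true;
	return false;
}
```

The half-open interval matches the start address plus the symbol size, so an address equal to the end belongs to the next symbol and does not match.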

Also, you need to update Documentation/trace/events.rst.

Thanks,

-- Steve


> +	unsigned int nr_entries;
> +	int i;
> +
> +	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
> +	for (i = 0; i < nr_entries ; i++)
> +		if (pred->val <= entries[i] && entries[i] < pred->val2)
> +			return !pred->not;
> +
> +	return pred->not;
> +}
> +
>  /*
>   * regex_match_foo - Basic regex callbacks
>   *
> @@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
>  		return filter_pred_cpumask_cpu(pred, event);
>  	case FILTER_PRED_FN_FUNCTION:
>  		return filter_pred_function(pred, event);
> +	case FILTER_PRED_FN_CALLER:
> +		return filter_pred_caller(pred, event);
>  	case FILTER_PRED_TEST_VISITED:
>  		return test_pred_visited_fn(pred, event);
>  	default:
> @@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
>  
>  		} else if (field->filter_type == FILTER_DYN_STRING) {
>  			pred->fn_num = FILTER_PRED_FN_STRLOC;
> -		} else if (field->filter_type == FILTER_RDYN_STRING)
> +		} else if (field->filter_type == FILTER_RDYN_STRING) {
>  			pred->fn_num = FILTER_PRED_FN_STRRELLOC;
> -		else {
> +		} else if (field->filter_type == FILTER_CALLER) {
> +			unsigned long caller;
> +
> +			if (op == OP_GLOB)
> +				goto err_free;
>  
> +			pred->fn_num = FILTER_PRED_FN_CALLER;
> +			caller = kallsyms_lookup_name(pred->regex->pattern);
> +			if (!caller) {
> +				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> +				goto err_free;
> +			}
> +			/* Now find the function start and end address */
> +			if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
> +				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> +				goto err_free;
> +			}
> +			pred->val = caller - offset;
> +			pred->val2 = pred->val + size;
> +		} else {
>  			if (!ustring_per_cpu) {
>  				/* Once allocated, keep it around for good */
>  				ustring_per_cpu = alloc_percpu(struct ustring_buffer);


^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Andrii Nakryiko @ 2026-05-12 19:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Masami Hiramatsu, Andrii Nakryiko, bpf,
	linux-trace-kernel, Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <CAADnVQLfgxEzpSTLhsxN2BYWpSxJ+RYku03UMfrSTi4Abu5SBw@mail.gmail.com>

On Tue, May 12, 2026 at 12:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > +       /*
> > +        * We have nop10 (with first byte overwritten to int3),
> > +        * change it to:
> > +        *   lea 0x80(%rsp), %rsp
> > +        *   call tramp
> > +        *
> > +        * The first lea instruction skips the stack redzone so the call
> > +        * instruction can safely push return address on stack.
> > +        */
>
> typo: lea -128(%rsp), %rsp
>
> you can also do:
>
> add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.

When I asked AI about this, it explained that the add instruction modifies
flags, so it's not a good fit here; lea doesn't touch flags.
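
For reference, the byte counts being compared can be written out as below. This is a sketch: the encodings are hand-assembled for illustration and worth double-checking against an assembler, the call displacement is a zeroed placeholder, and the array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* lea -0x80(%rsp), %rsp: REX.W, opcode 8d, ModRM 64, SIB 24, disp8 80.
 * Leaves the flags untouched. */
static const unsigned char lea_skip_redzone[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* add $-128, %rsp: REX.W, opcode 83 /0, ModRM c4, imm8 80.
 * One byte shorter, but it writes the arithmetic flags. */
static const unsigned char add_skip_redzone[] = { 0x48, 0x83, 0xc4, 0x80 };

/* call rel32: opcode e8 followed by a 4-byte displacement. */
static const unsigned char call_tramp[] = { 0xe8, 0x00, 0x00, 0x00, 0x00 };

/* Total patched length: the stack adjustment plus the call. */
static size_t variant_len(const unsigned char *insn, size_t insn_len)
{
	assert(insn != NULL);
	return insn_len + sizeof(call_tramp);
}

/* Interpret an 8-bit displacement or immediate as a signed value. */
static int disp8_value(unsigned char byte)
{
	return byte >= 0x80 ? (int)byte - 256 : (int)byte;
}
```

The lea form fills the 10-byte nop exactly, while the add form would need 9 bytes at the cost of clobbering flags at an arbitrary probe site.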

>
> Initially I didn't like this approach, since we just introduced
> usdt nop5 and now need to recompile everything again,
> but looking at the fix it's definitely simpler than alternatives
> and doesn't have annoying limitations.


Yeah, limitations are annoying, especially with those global "DO NOT
OPTIMIZE" flags... Jiri, let's polish your version and land it?

^ permalink raw reply

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Alexei Starovoitov @ 2026-05-12 19:27 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel,
	Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <agNeEzjiThzmJHiP@krava>

On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> +       /*
> +        * We have nop10 (with first byte overwritten to int3),
> +        * change it to:
> +        *   lea 0x80(%rsp), %rsp
> +        *   call tramp
> +        *
> +        * The first lea instruction skips the stack redzone so the call
> +        * instruction can safely push return address on stack.
> +        */

typo: lea -128(%rsp), %rsp

you can also do:

add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.

Initially I didn't like this approach, since we just introduced
usdt nop5 and now need to recompile everything again,
but looking at the fix it's definitely simpler than alternatives
and doesn't have annoying limitations.

^ permalink raw reply
