Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/9] rv/da: introduce DA_MON_ALLOCATION_STRATEGY
From: Gabriele Monaco @ 2026-06-15  9:56 UTC (permalink / raw)
  To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <496394879a590b4d7bafdb2f13618d2e30be982f.1780847473.git.wen.yang@linux.dev>

On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:

> +#ifndef DA_MON_ALLOCATION_STRATEGY
> +# define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_AUTO
> +#endif

I'm not sure the space goes there according to kernel coding style, we
don't use it in this file and clang-format removes it. I'd keep
consistency.

...

>  
> +/*
> + * DA_MON_POOL_SIZE must be defined before this header is included

I don't think we need to be this verbose (also, ha_monitor may not even
be included if that's a da_monitor). I would stop at the line above.

> (directly or
> + * transitively via ha_monitor.h) when DA_ALLOC_POOL is selected. 
> In practice
> + * this means defining it after the monitor's model header (which
> supplies the
> + * capacity constant) and before the ha_monitor.h include.
> + */

> +#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL &&
> !defined(DA_MON_POOL_SIZE)
> +# error "DA_ALLOC_POOL requires DA_MON_POOL_SIZE to be defined
> before including this header"

Same here I'd keep consistency and remove the space before error.

> +#endif

...
  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A

Mmh, ⟹  doesn't seem to print that well on my terminal, let's perhaps
use plain old ASCII => . Also I don't find what you put in parentheses
to be adding much value, we could even omit it.

I remember discussing about this so I may have missed your answer, but
why don't we handle this pool as a simple kmem_cache/mempool instead of
implementing a similar logic from scratch?

> monitor
> + * opts into pool mode by defining DA_MON_ALLOCATION_STRATEGY
> DA_ALLOC_POOL
> + * and DA_MON_POOL_SIZE before including this header;
> da_monitor_init() then
> + * pre-allocates the pool internally.
> + *
> + * Because every field is wrapped in this struct and the struct
> itself is a
> + * per-TU static, each monitor that includes this header gets a
> completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool
> monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq (non-PREEMPT_RT) or rcuc
> kthread
> + * (PREEMPT_RT); spin_lock_irqsave handles both.
> + */
> +struct da_per_obj_pool {
> +	struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool
> mode */
> +	struct da_monitor_storage **free;     /* kmalloc'd pointer
> stack */
> +	unsigned int                free_top;
> +	unsigned int                capacity; /* total number of
> slots */
> +	spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> +	.lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};

...

>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with
> locks or
> - * implicitly. For instance, if da_destroy_storage is called at
> sched_exit and
> - * da_create_storage can never occur after that, it's safe to call
> this without
> - * locks.
> - * This function includes an RCU read-side critical section to
> synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -558,7 +670,11 @@ static inline void da_destroy_storage(da_id_type
> id)
>  		return;
>  	da_monitor_reset_hook(&mon_storage->rv.da_mon);
>  	hash_del_rcu(&mon_storage->node);
> +#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL
> +	call_rcu(&mon_storage->rcu, da_pool_return_cb);
> +#else

ifdeffery in functions is discouraged as quite unreadable. Since
DA_MON_ALLOCATION_STRATEGY is guaranteed to be defined, simple C ifs are
going to be mostly equivalent (the compiler will cut instead of the
preprocessor, but still).

>  	kfree_rcu(mon_storage, rcu);
> +#endif
>  }

...

> +
> +/*
> + * da_monitor_init - initialise the per-object monitor
> + *
> + * Selects the allocation path at compile time based on
> DA_MON_ALLOCATION_STRATEGY:
> + *   DA_ALLOC_POOL   - pre-allocates DA_MON_POOL_SIZE storage slots.
> + *   DA_ALLOC_AUTO / DA_ALLOC_MANUAL - initialises the hash table
> only.
> + */
>  static inline int da_monitor_init(void)
>  {
>  	hash_init(da_monitor_ht);
> +#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL
> +	return __da_monitor_init_pool(DA_MON_POOL_SIZE);
> +#else

Same here, use if()

>  	return 0;
> +#endif
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> +	struct da_monitor_storage *ms;
> +	struct hlist_node *tmp;
> +	int bkt;
> +
> +	/*
> +	 * Ensure all in-flight tracepoint handlers that may hold a
> raw pointer
> +	 * to a pool slot (e.g. tlob_stop_task after its RCU guard
> exits) have
> +	 * completed before we begin tearing down the pool.  Mirrors
> the same
> +	 * call in da_monitor_destroy_kmalloc().
> +	 */
> +	tracepoint_synchronize_unregister();
> +

This is common between the pool and kmalloc flavours, you can leave it
in da_monitor_destroy() before branching.

> +	/*
> +	 * Drain any entries that were not stopped before destroy
> (e.g.
> +	 * uprobe-started sessions whose stop probe never fired). 
> Call
> +	 * da_extra_cleanup() before hash_del_rcu() so the hook may
> safely
> +	 * call ha_cancel_timer_sync() while the monitor is still
> reachable.
> +	 */
> +	hash_for_each_safe(da_monitor_ht, bkt, tmp, ms, node) {
> +		da_extra_cleanup(&ms->rv.da_mon);
> +		hash_del_rcu(&ms->node);
> +		call_rcu(&ms->rcu, da_pool_return_cb);
> +	}

Cannot you make also this common? da_extra_cleanup() should be called in
all flavours and the only difference I see here is the rcu callback.

Also do you really need call_rcu() ? Since we should already have waited
for a grace period, you can probably call the function directly. If not,
I'd still try and make both flavours consistent (sync + free OR call_rcu
+ barrier, not both).

> +
> +	/*
> +	 * rcu_barrier() drains every pending call_rcu() callback,
> including
> +	 * both da_pool_return_cb() and any monitor-specific free
> callbacks
> +	 * (e.g. tlob_free_rcu) enqueued by da_extra_cleanup().
> +	 */
> +	rcu_barrier();
> +	kfree(da_pool.storage);
> +	da_pool.storage = NULL;
> +	kfree(da_pool.free);
> +	da_pool.free = NULL;
> +	da_pool.free_top = 0;
> +	da_pool.capacity = 0;

Only this part is really specific to this allocation flavour, if you
want it in a separate function, go ahead, but the rest should probably
share as much code as possible (especially the cleanup/synchronisation
mess we just worked out).

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH v3 4/9] rv/ha: fix ha_invariant_passed_ns silent bypass of invariant check
From: Gabriele Monaco @ 2026-06-15 10:12 UTC (permalink / raw)
  To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <812b7b8e8979b4ab00ac7727e3fea578799f2a8b.1780847473.git.wen.yang@linux.dev>

On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> The function is documented as "prepare the invariant and return the
> time
> since reset", but on the first call (env_store == U64_MAX) it exits
> early without calling ha_set_invariant_ns():
> 
>   if (ha_monitor_env_invalid(ha_mon, env))  /* env_store == U64_MAX
> */
>       return 0;   /* ha_set_invariant_ns skipped, env_store stays
> U64_MAX */
>   ...
>   ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
> 
> This leaves env_store == U64_MAX, so ha_check_invariant_ns() always
> passes on the first activation regardless of elapsed time:
> 
>   return READ_ONCE(ha_mon->env_store[env]) >= time_ns;  /* U64_MAX >=
> any */
> 
> Fix: establish the guard before converting to the invariant:
> 
>   if (ha_monitor_env_invalid(ha_mon, env))
>       ha_reset_clk_ns(ha_mon, env, time_ns); /* guard: env_store =
> time_ns */
>   passed = ha_get_env(ha_mon, env, time_ns);
>   ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
>                                  /* invariant: env_store = time_ns +
> expire */
> 
> Apply the same fix to ha_invariant_passed_jiffy().
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

This is neat and probably works just fine in your case. Anyway I'm
planning on a more seamless solution not involving invalid states at
all. Initialisation would include a reset, which should work better on
monitors doing da_handle_start(), so not running the first event.
That's, by the way, pretty much what you wanted by doing reset() on the
(virtual) init transition.

We first needs Nam's simplification though [1]. We can synchronise
during this development cycle, you probably won't really need to bother
for now.

Thanks,
Gabriele

[1] -
https://lore.kernel.org/lkml/08188c28f274da63a3f8549add3086a92aef45e5.1780908661.git.namcao@linutronix.de

> ---
>  include/rv/ha_monitor.h | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index 28d3c74cabfc..e5860900a337 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -365,16 +365,22 @@ static inline bool ha_check_invariant_ns(struct
> ha_monitor *ha_mon,
>  }
>  /*
>   * ha_invariant_passed_ns - prepare the invariant and return the
> time since reset
> + *
> + * If the env has not been initialised yet (first entry into a state
> with an
> + * invariant), anchor the guard clock at the current time so that
> the full
> + * budget is available from this point.  This preserves the
> documented
> + * guard→invariant ordering: ha_set_invariant_ns() is always
> preceded by a
> + * valid guard representation in env_store.
>   */
>  static inline u64 ha_invariant_passed_ns(struct ha_monitor *ha_mon,
> enum envs env,
>  				   u64 expire, u64 time_ns)
>  {
> -	u64 passed = 0;
> +	u64 passed;
>  
>  	if (env < 0 || env >= ENV_MAX_STORED)
>  		return 0;
>  	if (ha_monitor_env_invalid(ha_mon, env))
> -		return 0;
> +		ha_reset_clk_ns(ha_mon, env, time_ns);
>  	passed = ha_get_env(ha_mon, env, time_ns);
>  	ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
>  	return passed;
> @@ -404,16 +410,19 @@ static inline bool
> ha_check_invariant_jiffy(struct ha_monitor *ha_mon,
>  }
>  /*
>   * ha_invariant_passed_jiffy - prepare the invariant and return the
> time since reset
> + *
> + * Same first-use semantics as ha_invariant_passed_ns(): anchor the
> guard clock
> + * now if the env has not been initialised.
>   */
>  static inline u64 ha_invariant_passed_jiffy(struct ha_monitor
> *ha_mon, enum envs env,
>  				      u64 expire, u64 time_ns)
>  {
> -	u64 passed = 0;
> +	u64 passed;
>  
>  	if (env < 0 || env >= ENV_MAX_STORED)
>  		return 0;
>  	if (ha_monitor_env_invalid(ha_mon, env))
> -		return 0;
> +		ha_reset_clk_jiffy(ha_mon, env);
>  	passed = ha_get_env(ha_mon, env, time_ns);
>  	ha_set_invariant_jiffy(ha_mon, env, expire - passed);
>  	return passed;


^ permalink raw reply

* Re: [PATCH v3 5/9] rv/ha: make da_monitor_reset_hook and EVENT_NONE_LBL overridable
From: Gabriele Monaco @ 2026-06-15 10:16 UTC (permalink / raw)
  To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <13a25b73c9fdebd26c2d4f922a83408dbcfc214d.1780847473.git.wen.yang@linux.dev>

On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Wrap the two definitions with #ifndef guards so that HA-based
> monitors
> can substitute their own implementations before including this
> header:
> 
>   /* in monitor.c, before #include <rv/ha_monitor.h> */
>   #define da_monitor_reset_hook  my_monitor_reset_env
>   #define EVENT_NONE_LBL         "idle"
> 
> No behaviour change for monitors that do not override either macro.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/ha_monitor.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index e5860900a337..610da54c111f 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -36,7 +36,10 @@ static bool ha_monitor_handle_constraint(struct
> da_monitor *da_mon,
>  					 da_id_type id);
>  #define da_monitor_event_hook ha_monitor_handle_constraint
>  #define da_monitor_init_hook ha_monitor_init_env
> +/* Allow monitors to override da_monitor_reset_hook before including
> this header. */

Just a nit: users can do that but shouldn't do it mindlessly, so let's
add a line like:

  "Make sure you still call ha_monitor_reset_env() or reset timers
otherwise."

Other than that looks good

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

Thanks,
Gabriele

> +#ifndef da_monitor_reset_hook
>  #define da_monitor_reset_hook ha_monitor_reset_env
> +#endif
>  #define da_monitor_sync_hook() synchronize_rcu()
>  
>  #if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> @@ -75,7 +78,9 @@ _Static_assert(offsetof(struct ha_monitor, da_mon)
> == 0,
>  #define ENV_INVALID_VALUE U64_MAX
>  /* Error with no event occurs only on timeouts */
>  #define EVENT_NONE EVENT_MAX
> +#ifndef EVENT_NONE_LBL
>  #define EVENT_NONE_LBL "none"
> +#endif
>  #define ENV_BUFFER_SIZE 64
>  
>  #ifdef CONFIG_RV_REACTORS


^ permalink raw reply

* Re: [PATCH v4 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Greg Kroah-Hartman @ 2026-06-15 10:42 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jiri Slaby,
	konrad.dybcio, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <688f0529-44ea-4cdf-bb0f-6c42cb3fa07e@oss.qualcomm.com>

On Mon, Jun 15, 2026 at 03:26:25PM +0530, Praveen Talari wrote:
> HI Steven,
> 
> On 29-05-2026 19:44, Steven Rostedt wrote:
> > On Tue, 26 May 2026 23:07:39 +0530
> > Praveen Talari <praveen.talari@oss.qualcomm.com> wrote:
> > 
> > > +DECLARE_EVENT_CLASS(geni_serial_data,
> > > +		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> > > +		    TP_ARGS(dev, buf, len),
> > > +
> > > +		    TP_STRUCT__entry(__string(name, dev_name(dev))
> > > +				     __field(unsigned int, len)
> > > +				     __dynamic_array(u8, data, len)
> > > +		    ),
> > > +
> > > +		    TP_fast_assign(__assign_str(name);
> > > +				   __entry->len = len;
> > > +				   memcpy(__get_dynamic_array(data), buf, len);
> > > +		    ),
> > > +
> > > +		    TP_printk("%s: len=%u data=%s",
> > > +			      __get_str(name), __entry->len,
> > > +			      __print_hex(__get_dynamic_array(data), __entry->len))
> > > +);
> > No need to save the length of the dynamic array in __entry->len because
> > it's already saved in the metadata of the dynamic array that is stored
> > on the buffer. Instead you can have:
> > 
> > DECLARE_EVENT_CLASS(geni_serial_data,
> > 		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> > 		    TP_ARGS(dev, buf, len),
> > 
> > 		    TP_STRUCT__entry(__string(name, dev_name(dev))
> > 				     __dynamic_array(u8, data, len)
> > 		    ),
> > 
> > 		    TP_fast_assign(__assign_str(name);
> > 				   memcpy(__get_dynamic_array(data), buf, len);
> > 		    ),
> > 
> > 		    TP_printk("%s: len=%u data=%s",
> > 			      __get_str(name), __entry->len,
> > 			      __print_hex(__get_dynamic_array(data),
> > 					__get_dynamic_array_len(data)))
> > );
> > 
> > That will save you 4 bytes per event on the ring buffer. And a few
> > cycles not having to store the redundant information.
> 
> This patch has already been accepted and is available in linux-next.

Great, can you send a fixup for it?  Or want me to revert this instead?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-06-15 12:39 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jiri Slaby,
	konrad.dybcio, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <2026061536-skirmish-nuptials-8fd6@gregkh>

Hi

On 15-06-2026 16:12, Greg Kroah-Hartman wrote:
> On Mon, Jun 15, 2026 at 03:26:25PM +0530, Praveen Talari wrote:
>> HI Steven,
>>
>> On 29-05-2026 19:44, Steven Rostedt wrote:
>>> On Tue, 26 May 2026 23:07:39 +0530
>>> Praveen Talari <praveen.talari@oss.qualcomm.com> wrote:
>>>
>>>> +DECLARE_EVENT_CLASS(geni_serial_data,
>>>> +		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
>>>> +		    TP_ARGS(dev, buf, len),
>>>> +
>>>> +		    TP_STRUCT__entry(__string(name, dev_name(dev))
>>>> +				     __field(unsigned int, len)
>>>> +				     __dynamic_array(u8, data, len)
>>>> +		    ),
>>>> +
>>>> +		    TP_fast_assign(__assign_str(name);
>>>> +				   __entry->len = len;
>>>> +				   memcpy(__get_dynamic_array(data), buf, len);
>>>> +		    ),
>>>> +
>>>> +		    TP_printk("%s: len=%u data=%s",
>>>> +			      __get_str(name), __entry->len,
>>>> +			      __print_hex(__get_dynamic_array(data), __entry->len))
>>>> +);
>>> No need to save the length of the dynamic array in __entry->len because
>>> it's already saved in the metadata of the dynamic array that is stored
>>> on the buffer. Instead you can have:
>>>
>>> DECLARE_EVENT_CLASS(geni_serial_data,
>>> 		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
>>> 		    TP_ARGS(dev, buf, len),
>>>
>>> 		    TP_STRUCT__entry(__string(name, dev_name(dev))
>>> 				     __dynamic_array(u8, data, len)
>>> 		    ),
>>>
>>> 		    TP_fast_assign(__assign_str(name);
>>> 				   memcpy(__get_dynamic_array(data), buf, len);
>>> 		    ),
>>>
>>> 		    TP_printk("%s: len=%u data=%s",
>>> 			      __get_str(name), __entry->len,
>>> 			      __print_hex(__get_dynamic_array(data),
>>> 					__get_dynamic_array_len(data)))
>>> );
>>>
>>> That will save you 4 bytes per event on the ring buffer. And a few
>>> cycles not having to store the redundant information.
>> This patch has already been accepted and is available in linux-next.
> Great, can you send a fixup for it?  Or want me to revert this instead?

can i add fix patch in same series by removing original patch(0/1)?


Thanks,

Praveen Talari

>
> thanks,
>
> greg k-h

^ permalink raw reply

* Re: [PATCH v4] mm/lruvec: trace LRU add drains and drain-all requests
From: Vlastimil Babka (SUSE) @ 2026-06-15 12:48 UTC (permalink / raw)
  To: JP Kobryn, linux-mm, willy, shakeel.butt, usama.arif, akpm,
	mhocko, rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	baohua, axelrasmussen, yuanchu, weixugc, chrisl, shikemeng,
	nphamcs, baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <20260610234808.212397-1-jp.kobryn@linux.dev>

On 6/11/26 01:48, JP Kobryn wrote:
> LRU add batches can be drained before they reach capacity. This can be a
> source of LRU lock contention, but it is not currently possible to
> attribute these drains to callers with existing tracepoints.
> 
> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> lru_add batch is drained. This allows tracing to distinguish full drains
> from partial drains and attribute them to the calling stack.
> 
> Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> whether they set the force flag for all CPUs. The tracepoint resembles
> the signature of the enclosing function, but is needed because of
> potential inlining.
> 
> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

> ---
> v4:
>   - renamed nr_folio_add to nr_folios in lru_add_drain()
>   - renamed nr to nr_folios in tracepoint for consistency
> 
> v3: https://lore.kernel.org/linux-mm/20260610195220.12403-1-jp.kobryn@linux.dev/
>   - restored and renamed tracepoint in __lru_add_drain_all
> 
> v2: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
>   - removed mm_lru_drain_all tracepoint
> 
> v1: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
> 
>  include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
>  mm/swap.c                      |  7 ++++++-
>  2 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
> index 171524d3526d..df6ac4d13dcf 100644
> --- a/include/trace/events/pagemap.h
> +++ b/include/trace/events/pagemap.h
> @@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
>  	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>  );
>  
> +TRACE_EVENT(mm_lru_add_drain,
> +
> +	TP_PROTO(int cpu, unsigned int nr_folios),
> +
> +	TP_ARGS(cpu, nr_folios),
> +
> +	TP_STRUCT__entry(
> +		__field(int,		cpu	)
> +		__field(unsigned int,	nr_folios	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu	= cpu;
> +		__entry->nr_folios	= nr_folios;
> +	),
> +
> +	TP_printk("cpu=%d nr_folios=%u", __entry->cpu, __entry->nr_folios)
> +);
> +
> +TRACE_EVENT(mm_lru_add_drain_all,
> +
> +	TP_PROTO(bool force_all_cpus),
> +
> +	TP_ARGS(force_all_cpus),
> +
> +	TP_STRUCT__entry(
> +		__field(bool,	force_all_cpus	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->force_all_cpus	= force_all_cpus;
> +	),
> +
> +	TP_printk("force_all_cpus=%s",
> +		__entry->force_all_cpus ? "true" : "false")
> +);
> +
>  #endif /* _TRACE_PAGEMAP_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/swap.c b/mm/swap.c
> index 588f50d8f1a8..b506fa912a93 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>  	struct folio_batch *fbatch = &fbatches->lru_add;
> +	unsigned int nr_folios = folio_batch_count(fbatch);
>  
> -	if (folio_batch_count(fbatch))
> +	if (nr_folios) {
>  		folio_batch_move_lru(fbatch, lru_add);
> +		trace_mm_lru_add_drain(cpu, nr_folios);
> +	}
>  
>  	fbatch = &fbatches->lru_move_tail;
>  	/* Disabling interrupts below acts as a compiler barrier. */
> @@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>  
> +	trace_mm_lru_add_drain_all(force_all_cpus);
> +
>  	/*
>  	 * Guarantee folio_batch counter stores visible by this CPU
>  	 * are visible to other CPUs before loading the current drain


^ permalink raw reply

* Re: [PATCH v4 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Greg Kroah-Hartman @ 2026-06-15 13:31 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jiri Slaby,
	konrad.dybcio, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <945821ea-5065-4e20-a1f9-32f7c9adb66a@oss.qualcomm.com>

On Mon, Jun 15, 2026 at 06:09:17PM +0530, Praveen Talari wrote:
> Hi
> 
> On 15-06-2026 16:12, Greg Kroah-Hartman wrote:
> > On Mon, Jun 15, 2026 at 03:26:25PM +0530, Praveen Talari wrote:
> > > HI Steven,
> > > 
> > > On 29-05-2026 19:44, Steven Rostedt wrote:
> > > > On Tue, 26 May 2026 23:07:39 +0530
> > > > Praveen Talari <praveen.talari@oss.qualcomm.com> wrote:
> > > > 
> > > > > +DECLARE_EVENT_CLASS(geni_serial_data,
> > > > > +		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> > > > > +		    TP_ARGS(dev, buf, len),
> > > > > +
> > > > > +		    TP_STRUCT__entry(__string(name, dev_name(dev))
> > > > > +				     __field(unsigned int, len)
> > > > > +				     __dynamic_array(u8, data, len)
> > > > > +		    ),
> > > > > +
> > > > > +		    TP_fast_assign(__assign_str(name);
> > > > > +				   __entry->len = len;
> > > > > +				   memcpy(__get_dynamic_array(data), buf, len);
> > > > > +		    ),
> > > > > +
> > > > > +		    TP_printk("%s: len=%u data=%s",
> > > > > +			      __get_str(name), __entry->len,
> > > > > +			      __print_hex(__get_dynamic_array(data), __entry->len))
> > > > > +);
> > > > No need to save the length of the dynamic array in __entry->len because
> > > > it's already saved in the metadata of the dynamic array that is stored
> > > > on the buffer. Instead you can have:
> > > > 
> > > > DECLARE_EVENT_CLASS(geni_serial_data,
> > > > 		    TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
> > > > 		    TP_ARGS(dev, buf, len),
> > > > 
> > > > 		    TP_STRUCT__entry(__string(name, dev_name(dev))
> > > > 				     __dynamic_array(u8, data, len)
> > > > 		    ),
> > > > 
> > > > 		    TP_fast_assign(__assign_str(name);
> > > > 				   memcpy(__get_dynamic_array(data), buf, len);
> > > > 		    ),
> > > > 
> > > > 		    TP_printk("%s: len=%u data=%s",
> > > > 			      __get_str(name), __entry->len,
> > > > 			      __print_hex(__get_dynamic_array(data),
> > > > 					__get_dynamic_array_len(data)))
> > > > );
> > > > 
> > > > That will save you 4 bytes per event on the ring buffer. And a few
> > > > cycles not having to store the redundant information.
> > > This patch has already been accepted and is available in linux-next.
> > Great, can you send a fixup for it?  Or want me to revert this instead?
> 
> can i add fix patch in same series by removing original patch(0/1)?

We can never rewrite a public git tree.  So either revert and redo it as
a new patch, or send a fix for this one, your choice.

thanks,

greg k-h

^ permalink raw reply

* [PATCH v5 0/2] Add tracepoints support for Qualcomm GENI Serial drivers
From: Praveen Talari @ 2026-06-15 14:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, konrad.dybcio
  Cc: Praveen Talari, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru

Add tracepoints to the Qualcomm GENI (Generic Interface) serial driver.
These trace events enable runtime debugging and performance analysis of
UART operations.

The trace events cover UART termios configuration, clock setup, manual
control state, interrupt status, and actual transmitted/received data in
hexadecimal format.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v5:
- Remove serial trace header patch since it was accepted and merged.
- Added a new patch to fix for remove unwanted variable len.
- Link to v4: https://lore.kernel.org/all/20260526-add-tracepoints-for-qcom-geni-serial-v4-0-e94fbaec0232@oss.qualcomm.com

Changes in v4:
- Rebased patch(02/02) on latest linux-next.
- Link to v3: https://lore.kernel.org/all/20260518-add-tracepoints-for-qcom-geni-serial-v3-0-b4addb151376@oss.qualcomm.com

Changes in v3:
- Removed \n from geni_serial_tx_data and geni_serial_rx_data events.
- Resolved aligment issues in geni_serial_data, geni_serial_tx_data and
  geni_serial_rx_data events.
- Link to v2: https://lore.kernel.org/r/20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com

Changes in v2:
- removed multiple trace events for TX/RX events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-serial-v1-0-544b22612e08@oss.qualcomm.com

To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Jiri Slaby <jirislaby@kernel.org>
To: konrad.dybcio@oss.qualcomm.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-arm-msm@vger.kernel.org
Cc: linux-serial@vger.kernel.org
Cc: mukesh.savaliya@oss.qualcomm.com
Cc: aniket.randive@oss.qualcomm.com
Cc: chandana.chiluveru@oss.qualcomm.com

---
Praveen Talari (2):
      serial: qcom-geni: trace: Drop redundant len field from geni_serial_data
      serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver

 drivers/tty/serial/qcom_geni_serial.c   | 27 +++++++++++++++++++++++----
 include/trace/events/qcom_geni_serial.h |  7 +++----
 2 files changed, 26 insertions(+), 8 deletions(-)
---
base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
change-id: 20260427-add-tracepoints-for-qcom-geni-serial-948777218b7b

Best regards,
--  
Praveen Talari <praveen.talari@oss.qualcomm.com>


^ permalink raw reply

* [PATCH v5 1/2] serial: qcom-geni: trace: Drop redundant len field from geni_serial_data
From: Praveen Talari @ 2026-06-15 14:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, konrad.dybcio
  Cc: Praveen Talari, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <20260615-add-tracepoints-for-qcom-geni-serial-v5-0-2efa4c97e0e2@oss.qualcomm.com>

The dynamic array stored in the ring buffer already carries its own
length in the array metadata. There is no need to also store it as a
separate scalar field in the entry struct.

Drop __field(unsigned int, len) and the corresponding __entry->len
assignment, and use __get_dynamic_array_len(data) in the TP_printk for
both the len=%u format argument and the __print_hex() size argument.
This saves 4 bytes per event on the ring buffer.

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
 include/trace/events/qcom_geni_serial.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
index 417ec01f9fc8..e1aa551d525e 100644
--- a/include/trace/events/qcom_geni_serial.h
+++ b/include/trace/events/qcom_geni_serial.h
@@ -97,18 +97,17 @@ DECLARE_EVENT_CLASS(geni_serial_data,
 		    TP_ARGS(dev, buf, len),
 
 		    TP_STRUCT__entry(__string(name, dev_name(dev))
-				     __field(unsigned int, len)
 				     __dynamic_array(u8, data, len)
 		    ),
 
 		    TP_fast_assign(__assign_str(name);
-				   __entry->len = len;
 				   memcpy(__get_dynamic_array(data), buf, len);
 		    ),
 
 		    TP_printk("%s: len=%u data=%s",
-			      __get_str(name), __entry->len,
-			      __print_hex(__get_dynamic_array(data), __entry->len))
+			      __get_str(name), __get_dynamic_array_len(data),
+			      __print_hex(__get_dynamic_array(data),
+					  __get_dynamic_array_len(data)))
 );
 
 DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,

-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Praveen Talari @ 2026-06-15 14:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, konrad.dybcio
  Cc: Praveen Talari, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, mukesh.savaliya, aniket.randive, chandana.chiluveru
In-Reply-To: <20260615-add-tracepoints-for-qcom-geni-serial-v5-0-2efa4c97e0e2@oss.qualcomm.com>

Add tracing to the Qualcomm GENI serial driver to improve runtime
observability.

Trace hooks are added at key points including termios and clock
configuration, manual control get/set, interrupt handling, and data
TX/RX paths.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v2->v3:
- Updated commit text(removed example as it was available on cover
  letter).
---
 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index d81b539cff7f..4b62e58d4918 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -7,6 +7,9 @@
 /* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
 #define __DISABLE_TRACE_MMIO__
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_serial.h>
+
 #include <linux/clk.h>
 #include <linux/console.h>
 #include <linux/io.h>
@@ -226,7 +229,7 @@ static void qcom_geni_serial_config_port(struct uart_port *uport, int cfg_flags)
 static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 {
 	unsigned int mctrl = TIOCM_DSR | TIOCM_CAR;
-	u32 geni_ios;
+	u32 geni_ios = 0;
 
 	if (uart_console(uport)) {
 		mctrl |= TIOCM_CTS;
@@ -236,6 +239,8 @@ static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 			mctrl |= TIOCM_CTS;
 	}
 
+	trace_geni_serial_get_mctrl(uport->dev, mctrl, geni_ios);
+
 	return mctrl;
 }
 
@@ -254,6 +259,8 @@ static void qcom_geni_serial_set_mctrl(struct uart_port *uport,
 	if (port->manual_flow && !(mctrl & TIOCM_RTS) && !uport->suspended)
 		uart_manual_rfr = UART_MANUAL_RFR_EN | UART_RFR_NOT_READY;
 	writel(uart_manual_rfr, uport->membase + SE_UART_MANUAL_RFR);
+
+	trace_geni_serial_set_mctrl(uport->dev, mctrl, uart_manual_rfr);
 }
 
 static const char *qcom_geni_serial_get_type(struct uart_port *uport)
@@ -684,6 +691,8 @@ static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
 	xmit_size = kfifo_out_linear_ptr(&tport->xmit_fifo, &tail,
 			UART_XMIT_SIZE);
 
+	trace_geni_serial_tx_data(uport->dev, tail, xmit_size);
+
 	qcom_geni_set_rs485_mode(uport, SER_RS485_RTS_ON_SEND);
 
 	qcom_geni_serial_setup_tx(uport, xmit_size);
@@ -910,8 +919,10 @@ static void qcom_geni_serial_handle_rx_dma(struct uart_port *uport, bool drop)
 		return;
 	}
 
-	if (!drop)
+	if (!drop) {
+		trace_geni_serial_rx_data(uport->dev, port->rx_buf, rx_in);
 		handle_rx_uart(uport, rx_in);
+	}
 
 	ret = geni_se_rx_dma_prep(&port->se, port->rx_buf,
 				  DMA_RX_BUF_SIZE,
@@ -1082,6 +1093,10 @@ static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 	dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
 	m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
+
+	trace_geni_serial_irq(uport->dev, m_irq_status, s_irq_status,
+			      dma_tx_status, dma_rx_status);
+
 	writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	writel(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
 	writel(dma_tx_status, uport->membase + SE_DMA_TX_IRQ_CLR);
@@ -1294,8 +1309,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
 		return -EINVAL;
 	}
 
-	dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u, clk_idx = %u\n",
-		baud * sampling_rate, clk_rate, clk_div, clk_idx);
+	trace_geni_serial_clk_cfg(uport->dev, baud * sampling_rate, clk_rate,
+				  clk_div, clk_idx);
 
 	uport->uartclk = clk_rate;
 	port->clk_rate = clk_rate;
@@ -1455,6 +1470,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
 	writel(bits_per_char, uport->membase + SE_UART_TX_WORD_LEN);
 	writel(bits_per_char, uport->membase + SE_UART_RX_WORD_LEN);
 	writel(stop_bit_len, uport->membase + SE_UART_TX_STOP_BIT_LEN);
+
+	trace_geni_serial_set_termios(uport->dev, baud, bits_per_char,
+				      tx_trans_cfg, tx_parity_cfg, rx_trans_cfg,
+				      rx_parity_cfg, stop_bit_len);
 }
 
 #ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE

-- 
2.34.1


^ permalink raw reply related

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-15 14:38 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <aiwl4kCG814dpX7L@gourry-fedora-PF4VCD3F>

On 6/12/26 17:29, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>> > > 
>> > > I understand this question in two ways:
>> > > 
>> > >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>> > 
>> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
>> > me reply to that one below.
>> > 
>> ... snip ...
>> > 
>> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
>> > context might be better. I think there was also talk about how the memalloc_*
>> > interface might be a better way forward. Maybe we would start giving the
>> > allocator more context ("we are allocating a folio").
>> > 
>> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>> >
>> 
>> I will still probably send the next RFC version tomorrow or friday,
>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>> 
>> Also, I made a new `anondax` driver which enables userland testing
>> of this functionality without any specialty hardware.
>> 
> 
> (apologies for the length of this email: this will all be covered in
> the coming cover letter, but I just wanted to share a bit of a preview)
> 
> ===
> 
> Just another small update - I am planning to post the RFC today once i
> get some mild cleanup done.  It will be based on the dax atomic hotplug
> 
> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
> 
> But a couple specific details regarding the memalloc pieces that i've
> learned the past couple of days playing with it.
> 
> 1) memalloc_folio is required to ensure non-folio allocations don't land
>    on the private node, even if it happens within a memalloc_private
>    context.  Since memalloc_folio may be useful in contexts outside of
>    private nodes, I kept this as a separate flag.
> 
>    If we think there will *never* be additional users of memalloc_folio,
>    then we could fold _folio into _private to save the flag for now and
>    add it back when we actually need it.
> 
> 2) memalloc_private is needed to unlock private nodes, but in the
>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
> 
>    This is *highly* restrictive.  I found when playing with mbind that
>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>    implies a bug). 
> 
>    That leads me to #3

I think the memalloc approach is dangerous due to unexpected nesting. There
might be nested page allocations in page allocation itself (due to some
debugging option). But also interrupts do not change what "current" points
to. Suddenly those could start requesting folios and/or private nodes and be
surprised, I'm afraid.

The memalloc scopes only work well when they restrict the context wrt
reclaim, and allocations in IRQ have to be already restricted heavily
(atomic) so further memalloc restrictions don't do anything in practice. But
to make them change other aspects of the allocations like this won't work.

> 3) If a private node is opted into something like Demotion (the node is
>    a demotion target) or mbind(), such that normal kernel operation can
>    place memory there - it's *pseudo-private*, and should actually land
>    in it's own FALLBACK list (reachable without __GFP_THISNODE, but not
>    reachable as a normal fallback allocation target).
> 
> I'm still playing with this, but I think we can even omit the
> __GFP_THISNODE requirement (my initial feeling that __GFP_THISNODE
> didn't buy us anything in particular seems to have panned out).
> 
> At the end of the day, this makes the whole memalloc_private_save()
> pattern a heck of a lot cleaner than trying fiddle with GFP.
> 
> I think you will all enjoy how clean the code ends up, and how easily
> testable it is.
> 
> As a testbed I've implement an anondax (we can discuss naming) that
> adds some sample NODE_PRIVATE_OPT_* flags so you can do the following.
> 
> I'm including this in the next RFC - but we can hack the entire thing
> off (including the OPT flags) if we prefer to just get the base set in
> without a new driver as a start.
> 
> echo 1 > dax0.0/reclaim   # kswapd and reclaim run normally on this node
> echo 1 > dax0.0/demotion  # it is a demotion target
> echo 1 > dax0.0/mbind     # mbind() can target this node for anon-vma's
> echo 1 > dax0.0/madvise   # allow madvise() to operate on its folios
> echo 1 > dax0.0/numa_balance  # allow numa balancing for this node
> echo 1 > dax0.0/ltpin     # allow GUP longterm pin to operate normally
> echo * > dax0.0/adistance # set the adistance for hotplug time
> echo * > dax0.0/hotplug   # same as kmem/hotplug
> 
> This also means *existing hardware* can leverage private nodes if
> they're capable of generating a dax device.
> 
> I've even gotten it such that you can put a private node above dram in
> the adistance heirarchy - which means demotion flows downward from
> device to CPU, but allocations don't default or fallback there.
> 
> This seems *immediately* useful for a variety of use cases.
> 
> ~Gregory
> _______________________________________________
> Lsf-pc mailing list
> Lsf-pc@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf-pc


^ permalink raw reply

* [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Martin Kaiser @ 2026-06-15 14:54 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-trace-kernel, linux-kernel, Martin Kaiser

For a char * element in an event, the FILTER_PTR_STRING filter type is
used. When the event occurs, a pointer is stored in the ringbuffer.

If an eprobe references such a char * element of a "base event" and
decodes the pointer as string, the pointer cannot be dereferenced.

$ echo 'e syscalls.sys_enter_openat $filename:string' > \
		/sys/kernel/tracing/dynamic_events
$ trace-cmd start -e eprobes
$ trace-cmd show
    ... : sys_enter_openat: (syscalls.sys_enter_openat) arg1=(fault)

The problem is in get_event_field

	val = (unsigned long)(*(char *)addr);

addr points to the position in the ringbuffer where the pointer was
stored. We must read the complete pointer, not just the lowest byte.

Fix the assignment, make the example above work.

Signed-off-by: Martin Kaiser <martin@kaiser.cx>
---
 kernel/trace/trace_eprobe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index b66d6196338d..50518b071414 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -315,7 +315,7 @@ get_event_field(struct fetch_insn *code, void *rec)
 			val = (unsigned long)addr;
 			break;
 		case FILTER_PTR_STRING:
-			val = (unsigned long)(*(char *)addr);
+			val = *(unsigned long *)addr;
 			break;
 		default:
 			WARN_ON_ONCE(1);
-- 
2.43.7


^ permalink raw reply related

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:18 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>

On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> ... snip ...
>>>
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>>
>>
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>>
>> ===
>>
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done.  It will be based on the dax atomic hotplug
>>
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>>
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>    on the private node, even if it happens within a memalloc_private
>>    context.  Since memalloc_folio may be useful in contexts outside of
>>    private nodes, I kept this as a separate flag.
>>
>>    If we think there will *never* be additional users of memalloc_folio,
>>    then we could fold _folio into _private to save the flag for now and
>>    add it back when we actually need it.
>>
>> 2) memalloc_private is needed to unlock private nodes, but in the
>>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>
>>    This is *highly* restrictive.  I found when playing with mbind that
>>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>    implies a bug). 
>>
>>    That leads me to #3
> 
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.

Yeah, we'd need some way to distinguish the main allocation from these other
(nested) allocations.


> 
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.

I was assuming that memalloc_pin_save() would already violate that, but really
it only restricts where movable allocations land, and that doesn't matter for
other kernel allocations.

Do you see any other way to make something like an allocation context work, and
avoid introducing more GFP flags?

-- 
Cheers,

David

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:20 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>

On Mon, Jun 15, 2026 at 04:38:43PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
> > 
> > 1) memalloc_folio is required to ensure non-folio allocations don't land
> >    on the private node, even if it happens within a memalloc_private
> >    context.  Since memalloc_folio may be useful in contexts outside of
> >    private nodes, I kept this as a separate flag.
> > 
> >    If we think there will *never* be additional users of memalloc_folio,
> >    then we could fold _folio into _private to save the flag for now and
> >    add it back when we actually need it.
> > 
> > 2) memalloc_private is needed to unlock private nodes, but in the
> >    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
> > 
> >    This is *highly* restrictive.  I found when playing with mbind that
> >    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
> >    implies a bug). 
> > 
> >    That leads me to #3
> 
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
> 
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
>

Reduced to practice I have found success, however what you are
describing could probably be resolved by re-introducing fallback list
isolation.  If private nodes are not in fallback lists, and they're not
N_MEMORY, then they're unreachable via nodemask-fallbacks, and a
specific node has to be requested.  For everything else memalloc locks
them out regardless.

In v5 I actually stripped this all the way back to just memalloc flags
and implemented a bunch of pressure tests to try to detect leakage - and
I was not able to do so - even with all nodes in each other's fallback
lists.

We can tack on both fallback list isolation and __GFP_THISNODE
requirements on top without ABI implications if we find that is
insufficient.

The only place I think this will matter is in the reclaim / demotion
code, would need to rework the allocation code to handle private nodes
more explicitly.  This has no ABI implications AND the entire demotion
logic in vmscan.c is utterly broken anyway and needs a rewrite.

I'm running a mass build test at the moment, and it's looking clean, I'm
expecting to be able to test the new code today or tomorrow.

~Gregory

^ permalink raw reply

* Re: [PATCH v3 6/9] rv/tlob: add tlob hybrid automaton monitor
From: Gabriele Monaco @ 2026-06-15 15:24 UTC (permalink / raw)
  To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <629023dbcc4389fcc6ec46d88c98eb19aa0abc36.1780847473.git.wen.yang@linux.dev>

On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add tlob (task latency over budget), a per-task hybrid automaton RV
> monitor that tracks elapsed wall-clock time across a user-delimited
> code section and emits error_env_tlob when the elapsed time exceeds a
> configurable budget.
> 
> The monitor uses RV_MON_PER_OBJ with three states (running, waiting,
> sleeping) driven by sched_switch and sched_wakeup tracepoints, and a
> single clock invariant clk_elapsed < budget enforced by an hrtimer
> (HRTIMER_MODE_REL_HARD).  On violation, detail_env_tlob provides a
> per-state time breakdown (running_ns, waiting_ns, sleeping_ns).
> 
> Per-task state is managed via DA_ALLOC_POOL to avoid allocation on
> the
> scheduler tracepoint path.  Uprobe pairs are registered through the
> tracefs monitor file as "p PATH:OFFSET_START OFFSET_STOP
> threshold=NS".
> 
> Also adds ha_cancel_timer_sync() to ha_monitor.h, a blocking cancel
> variant needed by tlob's stop_task path to ensure the hrtimer
> callback
> has completed before the per-task monitor state is freed.
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>

I'm not sure this applies to be fair, I'm reviewing the patch and as
part of that I'm suggesting things, but the main idea is yours [1].

Unless you mess up, I'm going to appear as reviewer anyway ;)

[1] - https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---

...

> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
> index e2e0033a00b9..ed2de31d0312 100644
> --- a/kernel/trace/rv/Kconfig
> +++ b/kernel/trace/rv/Kconfig
> @@ -85,6 +85,7 @@ source "kernel/trace/rv/monitors/sleep/Kconfig"
>  source "kernel/trace/rv/monitors/stall/Kconfig"
>  source "kernel/trace/rv/monitors/deadline/Kconfig"
>  source "kernel/trace/rv/monitors/nomiss/Kconfig"
> +source "kernel/trace/rv/monitors/tlob/Kconfig"
>  # Add new deadline monitors here

Your line should be after the marker for deadline monitors, this is
required for the Kconfig to look good with nested monitors.

Just let rvgen handle the way it looks (V2 was fine, just above the
general monitor marker):

 # Add new deadline monitors here

+source "kernel/trace/rv/monitors/tlob/Kconfig"
 # Add new monitors here

>  
>  # Add new monitors here

> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c
> b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000000..d8e0c4794720
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,968 @@
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/tracefs.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +#include <rv/rv_uprobe.h>

You added a few includes I don't believe you need: hrtimer, ktime, sched
(included in events/sched) and tracefs (included in rv).

You can keep the others as they come from the template, all monitors
include them anyway.

> +#include "../../rv.h"

This is <rv.h> the includepath allows that just like rv_trace.

> +
> +#define MODULE_NAME "tlob"
> +
> +#include <trace/events/sched.h>
> +#include <rv_trace.h>

...

> +/* Type for da_monitor_storage.target; must be defined before the
> includes. */
> +typedef struct tlob_task_state *monitor_target;
> +
> +/* Forward-declared so da_monitor_reset_hook works before
> ha_monitor.h. */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon);
> +#define da_monitor_reset_hook tlob_reset_notify
> +
> +/* Override EVENT_NONE_LBL so the timer-fired violation shows
> "budget_exceeded". */
> +#define EVENT_NONE_LBL "budget_exceeded"
> +
> +#include "tlob.h"
> +
> +/*
> + * DA_MON_POOL_SIZE must be defined HERE: after tlob.h (which
> defines
> + * TLOB_MAX_MONITORED) and before #include <rv/ha_monitor.h> (which
> + * transitively includes da_monitor.h and expands
> __da_monitor_init_pool
> + * using this macro).  Placing the define before tlob.h or after
> + * ha_monitor.h both cause a build error.
> + */

I don't think you need this comment here. I would add a comment in case
moving it somewhere else would silently change the behaviour, but here
the compiler is going to tell you what isn't defined if you really want
to move it around.

Same for the one line comments above but those are less invasive.

> +#define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
> +
> +/*
> + * Forward-declare tlob_extra_cleanup so the #define below is valid
> when
> + * da_monitor.h (included via ha_monitor.h) expands da_extra_cleanup
> inside
> + * da_monitor_destroy().  The full definition follows after
> ha_monitor.h.
> + */

I'd argue the same here. Though defining da_extra_cleanup() before is
the actual catch (if you do after, the compiler won't tell you it's
using a no-op). If you really want to leave a line for that, feel free
to but I wouldn't be that verbose.
You can combine it to da_monitor_reset_hook() where the same rules
apply.

> +static inline void tlob_extra_cleanup(struct da_monitor *da_mon);
> +#define da_extra_cleanup tlob_extra_cleanup
> +
> +#include <rv/ha_monitor.h>
> +
> +/*
> + * Called from da_monitor_reset() on both normal stop and hrtimer
> expiry.
> + * On violation (stopping==0), emits detail_env_tlob.
> + */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon)
> +{
> +	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> +	struct tlob_task_state *ws;
> +
> +	ha_monitor_reset_env(da_mon);

Could have the rest run only if trace_detail_env_tlob_enabled() and
later use trace_call__detail_env_tlob() (the raw tracepoint without
enabled check).

So if the tracepoint is disabled, this code almost doesn't exist.

> +
> +	ws = ha_get_target(ha_mon);
> +	if (!ws)
> +		return;
> +
> +	/*
> +	 * Emit per-state breakdown on budget violation only.
> +	 * stopping==0: timer callback owns this path (genuine
> overrun).
> +	 * stopping==1: normal stop claimed ownership first; skip.
> +	 */
> +	if (!atomic_read(&ws->stopping)) {
> +		unsigned int curr_state = READ_ONCE(da_mon-
> >curr_state);
> +		u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
> +		unsigned long flags;
> +
> +		/*
> +		 * Snapshot accumulators; partial_ns covers
> curr_state time
> +		 * not yet folded in (transition-out pending).
> +		 */
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		partial_ns   = ktime_get_ns() - ktime_to_ns(ws-
> >last_ts);
> +		running_ns   = ws->running_ns  +
> +			       (curr_state == running_tlob  ?
> partial_ns : 0);
> +		waiting_ns   = ws->waiting_ns  +
> +			       (curr_state == waiting_tlob  ?
> partial_ns : 0);
> +		sleeping_ns  = ws->sleeping_ns +
> +			       (curr_state == sleeping_tlob ?
> partial_ns : 0);
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +
> +		trace_detail_env_tlob(da_get_id(da_mon), ws-
> >threshold_ns,
> +				      running_ns, waiting_ns,
> sleeping_ns);
> +	}
> +}
> +
> +#define BUDGET_NS(ha_mon) (ha_get_target(ha_mon)->threshold_ns)
> +
> +/* HA constraint functions (called by ha_monitor_handle_constraint)
> */
> +
> +static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env,
> u64 time_ns)
> +{
> +	if (env == clk_elapsed_tlob)
> +		return ha_get_clk_ns(ha_mon, env, time_ns);
> +	return ENV_INVALID_VALUE;
> +}

So here you removed ha_reset_env() because that's done implicitly in the
start. It seems to work, but you're going to need to define it if we
apply an automatic reset at start (what I mentioned in the other review).

It's kinda idiomatic to have resets whenever you have clocks, yours is a
special case because you don't really expect start (the only event with a
reset) to happen again.

It's ok to redefine the functions the way you did, though you risk
leaving them out of sync in case you update the model (rvgen's output is
going to be quite different). Not sure if that's ever going to happen
anyway.

> +
> +/*
> + * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all
> states.
> + *
> + * The invariant is uniform across running/waiting/sleeping; check
> it
> + * unconditionally rather than enumerating each state.
> + */
> +static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum
> events event,
> +					enum states next_state, u64
> time_ns)
> +{

The compiler is probably going to produce something not that different
if you keep the ifs, but this is smaller and cleaner. Again, you only
need to track it manually but it looks alright now.

> +	return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob,
> time_ns);
> +}
> +
> +/*
> + * Convert invariant (deadline) to guard (reset anchor) on state
> transitions.
> + *
> + * The conversion is identical for every departing state; skip only
> self-loops.
> + */
> +static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum
> events event,
> +					enum states next_state, u64
> time_ns)
> +{
> +	if (curr_state != next_state)
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +}
> +
> +/* No per-event guard conditions for tlob; invariants suffice. */
> +static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> +				    enum states curr_state, enum
> events event,
> +				    enum states next_state, u64
> time_ns)
> +{

Yeah you're not quite using a reset and you probably shouldn't so this
is alright too.

> +	return true;
> +}
> +
> +/*
> + * Arm or cancel the HA budget timer on state transitions.
> + *
> + * The timer must run in every monitored state
> (running/waiting/sleeping),
> + * so arm it whenever next_state is any of the three.  On a self-
> loop caused
> + * by a non-start event the timer is already running; skip the
> redundant
> + * restart.  On a true state change the old timer is implicitly
> superseded by
> + * the new ha_start_timer_ns() call.
> + *
> + * Guard on stopping: sched_switch events can arrive after
> ha_cancel_timer_sync,
> + * restarting the timer and triggering an ODEBUG "activate active"
> splat.
> + * The _acquire pairs with the cmpxchg_release in tlob_stop_task.
> + */
> +static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
> +				       enum states curr_state, enum
> events event,
> +				       enum states next_state, u64
> time_ns)
> +{
> +	if (next_state == curr_state && event != start_tlob)
> +		return;
> +

stopping is only ever set in the cleanup phase, right? You don't really
need to cancel timers because you should have done it already, what
about the slightly more readable:

	if (atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
		return;
	if (next_state < state_max_tlob)
		ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
	else
		ha_cancel_timer(ha_mon);

> +	if (next_state < state_max_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)-
> >stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +	} else {
> +		ha_cancel_timer(ha_mon);
> +	}
> +}

...

> +/*
> + * da_extra_cleanup - per-task teardown called by
> da_monitor_destroy().
> + *
> + * Claims cleanup ownership via CAS; cancels the budget timer;
> decrements the
> + * monitored-task counter; and schedules the slab free via
> call_rcu().
> + * Must run before da_monitor_reset() (i.e. before hash_del_rcu())
> so that
> + * ha_cancel_timer_sync() can safely access the still-registered
> ha_monitor.
> + */
> +static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
> +{
> +	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> +	struct tlob_task_state *ws = ha_get_target(ha_mon);
> +
> +	if (!ws)
> +		return;
> +
> +	if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
> +		return;
> +
> +	ha_cancel_timer_sync(ha_mon);

You shouldn't need to do this since that's standard HA functionality and
is stopped already by the reset + sync RCU sequence.

> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu);

And since you can sleep and already waited for a grace period, you can
probably just free now.
This is the same point I had in da_monitor_destroy_pool() vs
da_monitor_destroy_kmalloc(): let's align what we do with sync/RCU.

> +}
> +
> +/*
> + * __tlob_acc - accumulate elapsed ns into one per-state counter.
> + *
> + * Looks up the task's tlob_task_state under RCU, adds the interval
> + * [ws->last_ts, now] to the field at @offset within the state
> struct,
> + * and updates last_ts.  Returns true if the task is monitored.
> + *
> + * entry_lock is a raw spinlock so this is safe from hardirq
> context.
> + */
> +static inline bool __tlob_acc(struct task_struct *task, ktime_t now,
> +			       size_t offset)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +
> +	scoped_guard(rcu) {
> +		ws = da_get_target_by_id(task->pid);
> +		if (!ws)
> +			return false;
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		*(u64 *)((char *)ws + offset) +=
> ktime_to_ns(ktime_sub(now, ws->last_ts));

I don't really like this, you have a few options to avoid raw pointer
arithmetics:

1. a macro:

	#define __tlob_acc(task, now, field) do { \
		...
		ws->field += ktime_stuff(now); \
		...
	}
	static inline bool tlob_acc_running(struct task_struct *task, ktime_t now)
	{
		return __tlob_acc(task, now, running_ns);
	}

2. use an array in the struct and pass the index around:

	enum tlob_acc {
		running,
		waiting,
		sleeping,
		max,
	}

	struct tlob_task_state {
		...
		u64			accs_ns[max];
		...
	};

	bool __tlob_acc(struct task_struct *task, ktime_t now, enum tlob_acc acc)
	{
		...
		ws->accs_ns[acc] += ktime_stuff(now);
		...
	}
	static inline bool tlob_acc_running(struct task_struct *task, ktime_t now)
	{
		return __tlob_acc(task, now, running);
	}


Some folks don't like macros, so perhaps options 2 is cleaner.
You may not even need the helper functions since calling the generic
tlob_acc like this is still readable.

If you go down that path, you may also want to simplify the function
using normal (unscoped) guards for both RCU and the lock, since you take
them until the end of the function (won't work with the macro!).

> +		ws->last_ts = now;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +	return true;
> +}
> +
> +/* Accumulate running_ns for prev; returns true if prev is
> monitored. */
> +static inline bool tlob_acc_running(struct task_struct *task,
> ktime_t now)
> +{
> +	return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, running_ns));
> +}
> +
> +/* Accumulate waiting_ns for next; returns true if next is
> monitored. */
> +static inline bool tlob_acc_waiting(struct task_struct *task,
> ktime_t now)
> +{
> +	return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, waiting_ns));
> +}
> +
> +/*
> + * handle_sched_switch - advance the DA on every context switch.
> + *
> + * Generates three DA events:
> + *   prev, prev_state != 0  -> sleep_tlob    (running -> sleeping)
> + *   prev, prev_state == 0  -> preempt_tlob  (running -> waiting)
> + *   next                   -> switch_in_tlob (waiting -> running)
> + *
> + * A single ktime_get() at handler entry is shared by both acc calls
> so that
> + * prev's running_ns and next's waiting_ns share the same context-
> switch
> + * timestamp; neither absorbs handler overhead into its accumulator.
> + *
> + * No waiting->sleeping edge exists: a task can only block
> voluntarily
> + * (call schedule()) while it is executing on CPU, which corresponds
> to
> + * the running DA state.  A task in the waiting state is
> TASK_RUNNING in
> + * kernel terms (on the runqueue) and cannot block itself.
> + *
> + * da_handle_event() is called unconditionally: it skips tasks that
> have no
> + * monitor entry in the hash table.
> + */
> +static void handle_sched_switch(void *data, bool preempt_unused,
> +				struct task_struct *prev,
> +				struct task_struct *next,
> +				unsigned int prev_state)
> +{
> +	ktime_t now = ktime_get();
> +	bool prev_preempted = (prev_state == 0);
> +
> +	/*
> +	 * No guard on tlob_num_monitored here: da_handle_event()
> internally
> +	 * calls da_monitor_handling_event() which checks both
> rv_monitoring_on()
> +	 * and da_monitoring(da_mon).  The hash lookup inside
> da_get_monitor()
> +	 * simply returns NULL for unmonitored tasks, which is
> equally fast as
> +	 * an atomic_read() guard.  By omitting the guard we avoid
> touching the
> +	 * tlob_num_monitored cacheline on every global context-
> switch.
> +	 */
> +	if (tlob_acc_running(prev, now))
> +		da_handle_event(prev->pid, NULL,
> +				prev_preempted ? preempt_tlob :
> sleep_tlob);
> +	if (tlob_acc_waiting(next, now))
> +		da_handle_event(next->pid, NULL, switch_in_tlob);
> +}
> +
> +/* Accumulate sleeping_ns on wakeup; returns true if task is
> monitored. */
> +static inline bool tlob_acc_sleeping(struct task_struct *task,
> ktime_t now)
> +{
> +	return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, sleeping_ns));
> +}
> +
> +/*
> + * handle_sched_wakeup - sleeping -> waiting transition.
> + *
> + * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires
> for a
> + * task already in running or waiting state.
> + */
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> +	ktime_t now = ktime_get();
> +
> +	/* Same reasoning as handle_sched_switch: rely on hash-
> lookup fast path. */
> +	if (tlob_acc_sleeping(p, now))
> +		da_handle_event(p->pid, NULL, wakeup_tlob);
> +}
> +
> +/*
> + * handle_sched_process_exit - clean up if a task exits without
> TRACE_STOP.
> + *
> + * Called in do_exit() context; the task still has a valid pid here.
> + * tlob_stop_task() returns -ESRCH if the task is not monitored,
> which is fine.
> + */
> +static void handle_sched_process_exit(void *data, struct task_struct
> *p,
> +				       bool group_dead)
> +{
> +	tlob_stop_task(p);
> +}
> +
> +
> +
> +/**
> + * tlob_start_task - begin monitoring @task with budget
> @threshold_ns ns.
> + * @task:         Task to monitor; may be current or another task.
> + * @threshold_ns: Latency budget in nanoseconds (wall-clock; running
> + waiting + sleeping).
> + *                Must be in [1000, TLOB_MAX_THRESHOLD_NS].
> + *
> + * Returns 0, -ENODEV, -ERANGE, -EALREADY, -ENOMEM, or -ENOSPC.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_ns)
> +{
> +	struct tlob_task_state *ws;
> +
> +	if (!da_monitor_enabled())
> +		return -ENODEV;
> +
> +	if (threshold_ns < 1000 || threshold_ns >

Why not some TLOB_MIN_THRESHOLD_NS (that can of course be 1000)?

> TLOB_MAX_THRESHOLD_NS)
> +		return -ERANGE;
> +
> +	/* Serialise duplicate-check + pool-slot claim for the same
> pid. */
> +	guard(mutex)(&tlob_start_mutex);
> +
> +	if (da_get_target_by_id(task->pid))
> +		return -EALREADY;
> +
> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
> +	if (!ws)
> +		return -ENOMEM;
> +
> +	ws->task = task;
> +	get_task_struct(task);
> +	ws->threshold_ns = threshold_ns;
> +	ws->last_ts = ktime_get();
> +	raw_spin_lock_init(&ws->entry_lock);
> +
> +	/*
> +	 * da_handle_start_run_event() claims a pool slot via
> da_prepare_storage(),
> +	 * initialises the monitor, and delivers start_tlob in one
> step: the
> +	 * generated ha_setup_invariants() resets clk_elapsed and
> arms the timer.
> +	 * Returns 0 if the pool is exhausted (-ENOSPC).
> +	 */
> +	if (!da_handle_start_run_event(task->pid, ws, start_tlob)) {
> +		put_task_struct(task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return -ENOSPC;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task.
> + * @task: Task to stop.
> + *
> + * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
> + * the winner cancels the timer synchronously and frees all
> resources.
> + *
> + * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
> + * or -EAGAIN (concurrent caller claimed cleanup).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> +	struct da_monitor *da_mon;
> +	struct ha_monitor *ha_mon;
> +	struct tlob_task_state *ws;
> +	bool budget_exceeded;
> +
> +	scoped_guard(rcu) {
> +		ws = da_get_target_by_id(task->pid);
> +		if (!ws)
> +			return -ESRCH;
> +
> +		da_mon = da_get_monitor(task->pid, NULL);
> +		if (unlikely(!da_mon)) {
> +			/* ws in hash but da_mon gone; internal
> inconsistency. */
> +			WARN_ON_ONCE(1);

	if (unlikely(WARN_ON_ONCE(!da_mon)))

> +			return -ESRCH;
> +		}
> +
> +		ha_mon = to_ha_monitor(da_mon);
> +
> +		/*
> +		 * CAS (0->1) claims cleanup ownership under RCU (ws
> guaranteed valid).
> +		 * _release pairs with atomic_read_acquire in
> ha_setup_invariants.
> +		 */
> +		if (atomic_cmpxchg_release(&ws->stopping, 0, 1) !=
> 0)
> +			return -EAGAIN;
> +	}
> +
> +	/* Wait for in-flight timer callback before reading
> da_monitoring. */
> +	ha_cancel_timer_sync(ha_mon);
> +
> +	/* Timer fired first -> budget exceeded; otherwise reset
> normally. */
> +	scoped_guard(rcu) {
> +		budget_exceeded = !da_monitoring(da_mon);
> +		if (!budget_exceeded)
> +			da_monitor_reset(da_mon);
> +	}
> +	da_destroy_storage(task->pid);
> +
> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu);
> +	return budget_exceeded ? -EOVERFLOW : 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +
> +static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct
> pt_regs *regs,
> +				     __u64 *data)
> +{
> +	struct tlob_uprobe_binding *b = p->priv;
> +
> +	tlob_start_task(current, b->threshold_ns);
> +	return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct
> pt_regs *regs,
> +				    __u64 *data)
> +{
> +	tlob_stop_task(current);
> +	return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
> +			   loff_t offset_start, loff_t offset_stop)
> +{

You can use cleanup helpers here to save a couple of gotos (making the
code less error prone and more readable):

> +	struct tlob_uprobe_binding *b, *tmp_b;

struct tlob_uprobe_binding __free(kfree) *b = NULL, *tmp_b;

> +	char pathbuf[TLOB_MAX_PATH];
> +	struct path path;

struct path path __free(path_put) = {};

> +	char *canon;
> +	int ret;
> +
> +	if (binpath[0] != '/')
> +		return -EINVAL;
> +
> +	b = kzalloc_obj(*b, GFP_KERNEL);
> +	if (!b)
> +		return -ENOMEM;
> +
> +	b->threshold_ns = threshold_ns;
> +	b->offset_start = offset_start;
> +	b->offset_stop  = offset_stop;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
> +	if (ret)
> +		goto err_free;
> +
> +	if (!d_is_reg(path.dentry)) {
> +		ret = -EINVAL;
> +		goto err_path;
> +	}
> +
> +	/* Reject duplicate start offset for the same binary. */
> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> +		if (tmp_b->offset_start == offset_start &&
> +		    tmp_b->start_probe->path.dentry == path.dentry)
> {
> +			ret = -EEXIST;
> +			goto err_path;
> +		}
> +	}
> +
> +	canon = d_path(&path, pathbuf, sizeof(pathbuf));
> +	if (IS_ERR(canon)) {
> +		ret = PTR_ERR(canon);
> +		goto err_path;
> +	}
> +	strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> +	/* Both probes share b (priv) and path; attach_path refs
> path itself. */
> +	b->start_probe = rv_uprobe_attach_path(&path, offset_start,
> +					      
> tlob_uprobe_entry_handler, NULL, b);
> +	if (IS_ERR(b->start_probe)) {
> +		ret = PTR_ERR(b->start_probe);
> +		b->start_probe = NULL;
> +		goto err_path;
> +	}
> +
> +	b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
> +					     
> tlob_uprobe_stop_handler, NULL, b);
> +	if (IS_ERR(b->stop_probe)) {
> +		ret = PTR_ERR(b->stop_probe);
> +		b->stop_probe = NULL;
> +		goto err_start;
> +	}
> +
> +	path_put(&path);
> +	list_add_tail(&b->list, &tlob_uprobe_list);

And finally tell the compiler you want to keep b in the happy path.

list_add_tail(&no_free_ptr(b)->list, &tlob_uprobe_list);

Not tested, but you got the gist, check include/linux/cleanup.h and grep
around for users if it doesn't work as expected.

> +	return 0;
> +
> +err_start:
> +	rv_uprobe_detach(b->start_probe);

This will need to stay, unless you want to DEFINE_FREE() for this, may
be fun but not required.

> +err_path:
> +	path_put(&path);
> +err_free:
> +	kfree(b);
> +	return ret;
> +}

...

> +
> +/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal
> convention). */

This isn't quite a standard function comment (and you don't really need
to make it so, it wouldn't add much value), but I'd say at least make it
a multi-line comment which is more common in this location:

/*
 * Parse ...
 */

...

> +static void __tlob_destroy_monitor(void)
> +{
> +	rv_this.enabled = 0;
> +	/*
> +	 * Remove uprobes first; rv_uprobe_sync() inside ensures all
> in-flight
> +	 * handlers have finished before we proceed.
> +	 */
> +	tlob_remove_all_uprobes();
> +
> +	/*
> +	 * da_monitor_destroy() iterates any remaining entries via
> da_extra_cleanup
> +	 * (tlob_extra_cleanup), cancels their timers, and frees
> their state.
> +	 * rcu_barrier() inside drains both da_pool_return_cb and
> tlob_free_rcu
> +	 * callbacks before the pool arrays are freed.
> +	 */

I'm not sure we need this many comments, the behaviour may change and we
don't need to describe what the function does before calling it.

If anything say /why/ you call a function before another, but I'd say
it's trivial here so you can just drop both comments.

> +	ha_monitor_destroy();
> +	kmem_cache_destroy(tlob_state_cache);
> +	tlob_state_cache = NULL;
> +}

...

> +static int __init register_tlob(void)
> +{
> +	int ret;
> +
> +	ret = rv_register_monitor(&rv_this, NULL);
> +	if (ret)
> +		return ret;
> +
> +	if (rv_this.root_d) {
> +		if (!tracefs_create_file("monitor", 0644,

We have rv_create_file() , it's exactly the same but let's be consistent
so if we ever decide to overload it, this won't stay behind.

> rv_this.root_d, NULL,
> +					 &tlob_monitor_fops)) {
> +			rv_unregister_monitor(&rv_this);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}

> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h
> b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000000..b6724e629c69
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
> @@ -0,0 +1,148 @@

...

> +
> +enum states_tlob {
> +	running_tlob,
> +	waiting_tlob,
> +	sleeping_tlob,
> +	state_max_tlob,
> +};

This is personal preference, so feel free to ignore it. The header file
is (mostly) automatically generated and rvgen currently gives you a
different order for enums/arrays. Maybe just reorder all those according
to what rvgen does (just run it without the -a option and diff what
comes out).

The main reason is again to make it easy to update the file in case the
model/framework changes.

Thanks,
Gabriele


^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-15 15:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>

On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>> On 6/12/26 17:29, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>>> ... snip ...
>>>>
>>>> I will still probably send the next RFC version tomorrow or friday,
>>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>>
>>>> Also, I made a new `anondax` driver which enables userland testing
>>>> of this functionality without any specialty hardware.
>>>>
>>>
>>> (apologies for the length of this email: this will all be covered in
>>> the coming cover letter, but I just wanted to share a bit of a preview)
>>>
>>> ===
>>>
>>> Just another small update - I am planning to post the RFC today once i
>>> get some mild cleanup done.  It will be based on the dax atomic hotplug
>>>
>>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>>
>>> But a couple specific details regarding the memalloc pieces that i've
>>> learned the past couple of days playing with it.
>>>
>>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>>    on the private node, even if it happens within a memalloc_private
>>>    context.  Since memalloc_folio may be useful in contexts outside of
>>>    private nodes, I kept this as a separate flag.
>>>
>>>    If we think there will *never* be additional users of memalloc_folio,
>>>    then we could fold _folio into _private to save the flag for now and
>>>    add it back when we actually need it.
>>>
>>> 2) memalloc_private is needed to unlock private nodes, but in the
>>>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>>
>>>    This is *highly* restrictive.  I found when playing with mbind that
>>>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>>    implies a bug). 
>>>
>>>    That leads me to #3
>> 
>> I think the memalloc approach is dangerous due to unexpected nesting. There
>> might be nested page allocations in page allocation itself (due to some
>> debugging option). But also interrupts do not change what "current" points
>> to. Suddenly those could start requesting folios and/or private nodes and be
>> surprised, I'm afraid.
> 
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.

That goes against the very principle of scopes. And I don't see how, except
via a ... flag to the main allocation :D

>> 
>> The memalloc scopes only work well when they restrict the context wrt
>> reclaim, and allocations in IRQ have to be already restricted heavily
>> (atomic) so further memalloc restrictions don't do anything in practice. But
>> to make them change other aspects of the allocations like this won't work.
> 
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.

Hm yeah its suboptimal, as it can turn a movable allocation unmovable. But
shouldn't cause outright bugs.

> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?

Yeah, the idea of augomenting gfp flags with alloc_flags that are no longer
strictly internal to the page allocator, seems like a way to achieve what we
need.

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Vlastimil Babka (SUSE), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>

On Mon, Jun 15, 2026 at 05:18:55PM +0200, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> > 
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
> 
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.
>
> 
> > 
> > The memalloc scopes only work well when they restrict the context wrt
> > reclaim, and allocations in IRQ have to be already restricted heavily
> > (atomic) so further memalloc restrictions don't do anything in practice. But
> > to make them change other aspects of the allocations like this won't work.
> 
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.
> 
> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?
>

One thought would be a way to switch what fallback list is used, and
then have specific fallback lists for certain contexts.

Right now there is a single example of this: __GFP_THISNODE
  |= __GFP_THISNODE   =>  NOFALLBACK
  &= ~__GFP_THISNODE  =>  FALLBACK

We could add an interface with the desired fallback list based as an
argument, and let get_page_from_freelist to prefer that over the default
global lists.

Omit all special nodes from FALLBACK/NOFALLBACK and make the special
contexts provide the fallback-base that should be used.

On my current branch i think that would include modifying, in totality:

   alloc_folio_mpol()
   alloc_demotion_folio()
   alloc_migration_target()

And i'm pretty sure that all just nests nicely.

We might not even need memalloc... hmmm

~Gregory

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:38 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <94d6c446-a8a6-485e-bb3c-ee809ebb1d3b@kernel.org>

On 6/15/26 17:27, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
>> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>>>
>>> I think the memalloc approach is dangerous due to unexpected nesting. There
>>> might be nested page allocations in page allocation itself (due to some
>>> debugging option). But also interrupts do not change what "current" points
>>> to. Suddenly those could start requesting folios and/or private nodes and be
>>> surprised, I'm afraid.
>>
>> Yeah, we'd need some way to distinguish the main allocation from these other
>> (nested) allocations.
> 
> That goes against the very principle of scopes. And I don't see how, except
> via a ... flag to the main allocation :D

Unless we teach the handful of debug callpaths to set a custom context. Have
some memalloc_context_save/restore.

I'd assume that the number of such nested allocations we can trigger from the
buddy (through some callbacks like kasan and such) should be rather limited.

I also wonder for a second whether we could use of the _noprof vs. !_noprof
mechanism.

Essentially, calling a !_noprof variant ("external allocation interface") could
mean "save/restore folio allocation". Just a thought.

-- 
Cheers,

David

^ permalink raw reply

* [PATCH 1/3] rv/reactors: fix lockdep "Invalid wait context" in rv_react()
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang,
	Thomas Weißschuh
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>

From: Wen Yang <wen.yang@linux.dev>

The DEFINE_WAIT_OVERRIDE_MAP() macro creates a lockdep map with
wait_type_inner = LD_WAIT_CONFIG, which inherits the outer context's
wait type.  When rv_react() is called from a LD_WAIT_FREE context
(e.g., a KUnit test with busy-wait), and the reactor callback triggers
a timer interrupt during the busy-loop, the interrupt exit path attempts
to schedule (preempt_schedule_irq -> __schedule -> rq->__lock), which is
LD_WAIT_SPIN.  Lockdep then reports:

    [ BUG: Invalid wait context ]
    context-{5:5}
    1 lock held by kunit_try_catch/209:
     #0: rv_react_map-wait-type-override at rv_react+0x9d/0xf0

The wait_type_override map allowed the outer LD_WAIT_FREE to propagate
inward, but scheduling from an interrupt is LD_WAIT_SPIN, violating the
constraint.

Fix by explicitly setting wait_type_inner = LD_WAIT_SPIN, which is the
tightest constraint rv_react() callbacks must satisfy: they may not
sleep (LD_WAIT_SLEEP) or use mutexes, but can use spinlocks and be
interrupted. This matches the documented LD_WAIT_FREE constraint.

Fixes: 69d8895cb9a9 ("rv: Add explicit lockdep context for reactors")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 kernel/trace/rv/rv_reactors.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/rv/rv_reactors.c b/kernel/trace/rv/rv_reactors.c
index 460af07f7aba..423f843bbd68 100644
--- a/kernel/trace/rv/rv_reactors.c
+++ b/kernel/trace/rv/rv_reactors.c
@@ -465,7 +465,13 @@ int init_rv_reactors(struct dentry *root_dir)
 
 void rv_react(struct rv_monitor *monitor, const char *msg, ...)
 {
-	static DEFINE_WAIT_OVERRIDE_MAP(rv_react_map, LD_WAIT_FREE);
+#ifdef CONFIG_LOCKDEP
+	static struct lockdep_map rv_react_map = {
+		.name = "rv_react",
+		.wait_type_outer = LD_WAIT_FREE,
+		.wait_type_inner = LD_WAIT_SPIN,
+	};
+#endif
 	va_list args;
 
 	if (!rv_reacting_on() || !monitor->react)
-- 
2.25.1


^ permalink raw reply related

* [PATCH 2/3] rv/reactors: add KUnit tests for reactor_printk
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>

From: Wen Yang <wen.yang@linux.dev>

Add KUnit tests for the printk reactor covering:
- Reactor registration and unregistration lifecycle
- React callback invocation via rv_react()
- Double registration rejection
- Multiple register/unregister cycles

The mock callback calls vprintk_deferred() — the same path as the real
reactor — then busy-waits to simulate I/O back-pressure, exercising the
LD_WAIT_FREE constraint of rv_react() under load.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 kernel/trace/rv/Kconfig                |  10 ++
 kernel/trace/rv/Makefile               |   1 +
 kernel/trace/rv/reactor_printk_kunit.c | 123 +++++++++++++++++++++++++
 3 files changed, 134 insertions(+)
 create mode 100644 kernel/trace/rv/reactor_printk_kunit.c

diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 3884b14df375..ff47895c897f 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -104,6 +104,16 @@ config RV_REACT_PRINTK
 	  Enables the printk reactor. The printk reactor emits a printk()
 	  message if an exception is found.
 
+config RV_REACT_PRINTK_KUNIT
+	bool "KUnit tests for reactor_printk" if !KUNIT_ALL_TESTS
+	depends on RV_REACT_PRINTK && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  This builds KUnit tests for the printk reactor. These are only
+	  for development and testing, not for regular kernel use cases.
+
+	  If unsure, say N.
+
 config RV_REACT_PANIC
 	bool "Panic reactor"
 	depends on RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 94498da35b37..ef0a2dcb927c 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -23,4 +23,5 @@ obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
+obj-$(CONFIG_RV_REACT_PRINTK_KUNIT) += reactor_printk_kunit.o
 obj-$(CONFIG_RV_REACT_PANIC) += reactor_panic.o
diff --git a/kernel/trace/rv/reactor_printk_kunit.c b/kernel/trace/rv/reactor_printk_kunit.c
new file mode 100644
index 000000000000..933aa5602226
--- /dev/null
+++ b/kernel/trace/rv/reactor_printk_kunit.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for reactor_printk
+ *
+ */
+
+#include <kunit/test.h>
+#include <linux/rv.h>
+#include <linux/printk.h>
+#include <linux/sched/clock.h>
+#include <linux/processor.h>
+
+/*
+ * Simulated execution time for mock_printk_react (sched_clock units,
+ * nanoseconds).  Models the time a real printk reactor callback may consume
+ * under I/O pressure, exercising the LD_WAIT_FREE constraint of rv_react().
+ */
+#define MOCK_REACT_DURATION_NS	5000000ULL
+
+/*
+ * Mock react callback mirroring rv_printk_reaction().
+ *
+ * Calls vprintk_deferred() — the same path as the real reactor — then holds
+ * the CPU for MOCK_REACT_DURATION_NS via a sched_clock() timed busy-loop,
+ * simulating a callback that is slow due to I/O back-pressure.
+ * sched_clock() is notrace and lock-free; no sleep or lock acquisition is
+ * performed, satisfying the LD_WAIT_FREE constraint of rv_react().
+ */
+__printf(1, 0) static void mock_printk_react(const char *msg, va_list args)
+{
+	u64 start = sched_clock();
+
+	vprintk_deferred(msg, args);
+
+	while (sched_clock() - start < MOCK_REACT_DURATION_NS)
+		cpu_relax();
+}
+
+static struct rv_reactor mock_printk_reactor = {
+	.name		= "test_printk",
+	.description	= "test printk reactor",
+	.react		= mock_printk_react,
+};
+
+/* Test 1: register and unregister reactor */
+static void test_printk_register_unregister(struct kunit *test)
+{
+	int ret;
+
+	ret = rv_register_reactor(&mock_printk_reactor);
+	KUNIT_EXPECT_EQ(test, ret, 0);
+	KUNIT_EXPECT_STREQ(test, mock_printk_reactor.name, "test_printk");
+
+	rv_unregister_reactor(&mock_printk_reactor);
+}
+
+/* Test 2: react callback is invoked via rv_react() */
+static void test_printk_react_called(struct kunit *test)
+{
+	struct rv_reactor reactor = {
+		.name	= "printk_cb_test",
+		.react	= mock_printk_react,
+	};
+	struct rv_monitor monitor = {
+		.name	= "test_monitor",
+		.reactor = &reactor,
+		.react	= mock_printk_react,
+	};
+
+	rv_react(&monitor, "printk violation message");
+}
+
+/* Test 3: double registration should fail */
+static void test_printk_double_register(struct kunit *test)
+{
+	int ret;
+
+	ret = rv_register_reactor(&mock_printk_reactor);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	ret = rv_register_reactor(&mock_printk_reactor);
+	KUNIT_EXPECT_NE(test, ret, 0);
+
+	rv_unregister_reactor(&mock_printk_reactor);
+}
+
+/* Test 4: register/unregister cycle */
+static void test_printk_register_cycle(struct kunit *test)
+{
+	int ret, i;
+
+	for (i = 0; i < 5; i++) {
+		ret = rv_register_reactor(&mock_printk_reactor);
+		KUNIT_EXPECT_EQ(test, ret, 0);
+
+		rv_unregister_reactor(&mock_printk_reactor);
+	}
+}
+
+/* Test 5: react callback is not NULL (printk reactors must provide react) */
+static void test_printk_react_not_null(struct kunit *test)
+{
+	KUNIT_EXPECT_NOT_NULL(test, mock_printk_reactor.react);
+}
+
+static struct kunit_case reactor_printk_kunit_cases[] = {
+	KUNIT_CASE(test_printk_register_unregister),
+	KUNIT_CASE(test_printk_react_called),
+	KUNIT_CASE(test_printk_double_register),
+	KUNIT_CASE(test_printk_register_cycle),
+	KUNIT_CASE(test_printk_react_not_null),
+	{}
+};
+
+static struct kunit_suite reactor_printk_kunit_suite = {
+	.name		= "rv_reactor_printk",
+	.test_cases	= reactor_printk_kunit_cases,
+};
+
+kunit_test_suite(reactor_printk_kunit_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("KUnit tests for reactor_printk");
-- 
2.25.1


^ permalink raw reply related

* [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang

From: Wen Yang <wen.yang@linux.dev>

We occasionally hit a lockdep "Invalid wait context" warning in production
environments when rv_react() callbacks are interrupted.

The bug is intermittent in production. KUnit tests with busy-wait callbacks
can reproduce it by holding the CPU long enough for a timer interrupt to fire
during rv_react(), exposing the lockdep constraint violation:

[   44.820913] =============================
[   44.820923] [ BUG: Invalid wait context ]
[   44.821137] 7.1.0-rc7-next-20260612-virtme #6 Tainted: G                 N
[   44.821203] -----------------------------
[   44.821211] kunit_try_catch/209 is trying to lock:
[   44.821244] ffff8a743ed3e8a0 (&rq->__lock){-...}-{2:2}, at: __schedule+0x102/0x13d0
[   44.821688] other info that might help us debug this:
[   44.821708] context-{5:5}
[   44.821730] 1 lock held by kunit_try_catch/209:
[   44.821745]  #0: ffffffffb6ba62c0 (rv_react_map-wait-type-override){+.+.}-{1:1}, at: rv_react+0x9d/0xf0
[   44.821803] stack backtrace:
[   44.822110] CPU: 10 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G                 N  7.1.0-rc7-next-20260612-virtme #6 PREEMPT_{RT,(full)}
[   44.822197] Tainted: [N]=TEST
[   44.822210] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   44.822328] Call Trace:
[   44.822377]  <TASK>
[   44.822806]  dump_stack_lvl+0x78/0xe0
[   44.822860]  __lock_acquire+0x926/0x1c90
[   44.822888]  lock_acquire+0xd3/0x310
[   44.822901]  ? __schedule+0x102/0x13d0
[   44.822919]  ? rcu_qs+0x2d/0x1a0
[   44.822954]  _raw_spin_lock_nested+0x36/0x50
[   44.822966]  ? __schedule+0x102/0x13d0
[   44.822979]  __schedule+0x102/0x13d0
[   44.822993]  ? mark_held_locks+0x40/0x70
[   44.823009]  preempt_schedule_irq+0x37/0x70
[   44.823018]  irqentry_exit+0x1da/0x8c0
[   44.823032]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   44.823093] RIP: 0010:mock_printk_react+0x2a/0x50
[   44.823250] Code: f3 0f 1e fa 0f 1f 44 00 00 41 54 49 89 f4 55 48 89 fd 53 e8 18 8b db ff 4c 89 e6 48 89 ef 48 89 c3 e8 fa 8e ed ff eb 02 f3 90 <e8> 01 8b db ff 48 29 d8 48 3d 3f 4b 4c 00 76 ee 5b 5d 41 5c c3 cc
[   44.823303] RSP: 0018:ffffd1c3c0733d38 EFLAGS: 00000297
[   44.823332] RAX: 00000000000119f3 RBX: 0000000a74e60d1c RCX: 000000000000001f
[   44.823342] RDX: 0000000000000000 RSI: 000000003348c8a2 RDI: ffffffffc1abbfd9
[   44.823351] RBP: ffffffffb671b613 R08: 0000000000000002 R09: 0000000000000000
[   44.823359] R10: 0000000000000001 R11: 0000000000000000 R12: ffffd1c3c0733d60
[   44.823367] R13: ffffffffb575a5fd R14: ffffd1c3c0017be8 R15: ffffd1c3c00179f8
[   44.823397]  ? rv_react+0x9d/0xf0
[   44.823437]  ? mock_printk_react+0x2f/0x50
[   44.823448]  rv_react+0xb4/0xf0
[   44.823455]  ? rv_react+0x9d/0xf0
[   44.823476]  test_printk_react_called+0x83/0xb0
[   44.823486]  ? __pfx_mock_printk_react+0x10/0x10
[   44.823502]  ? __pfx_mock_printk_react+0x10/0x10
[   44.823513]  kunit_try_run_case+0x97/0x190
[   44.823534]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[   44.823544]  kunit_generic_run_threadfn_adapter+0x21/0x40
[   44.823551]  kthread+0x124/0x160
[   44.823562]  ? __pfx_kthread+0x10/0x10
[   44.823574]  ret_from_fork+0x291/0x3b0
[   44.823585]  ? __pfx_kthread+0x10/0x10
[   44.823595]  ret_from_fork_asm+0x1a/0x30
[   44.823641]  </TASK>


Patch 1 fixes the lockdep bug by correcting rv_react()'s wait_type_inner
from LD_WAIT_CONFIG (which inherits the outer context) to LD_WAIT_SPIN
(the tightest constraint callbacks must satisfy).

Patch 2 adds KUnit tests for reactor_printk. The busy-wait in the mock
callback reproduces the timer interrupt scenario that exposes the bug.

Patch 3 adds KUnit tests for reactor_panic, exercising the panic notifier
chain without halting the system.

Tested with CONFIG_PROVE_LOCKING=y and CONFIG_KUNIT=y.


Wen Yang (3):
  rv/reactors: fix lockdep "Invalid wait context" in rv_react()
  rv/reactors: add KUnit tests for reactor_printk
  rv/reactors: add KUnit tests for reactor_panic

 kernel/trace/rv/Kconfig                |  20 ++++
 kernel/trace/rv/Makefile               |   2 +
 kernel/trace/rv/reactor_panic_kunit.c  | 106 +++++++++++++++++++++
 kernel/trace/rv/reactor_printk_kunit.c | 123 +++++++++++++++++++++++++
 kernel/trace/rv/rv_reactors.c          |   8 +-
 5 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 kernel/trace/rv/reactor_panic_kunit.c
 create mode 100644 kernel/trace/rv/reactor_printk_kunit.c

-- 
2.25.1


^ permalink raw reply

* [PATCH 3/3] rv/reactors: add KUnit tests for reactor_panic
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>

From: Wen Yang <wen.yang@linux.dev>

Add KUnit tests for the panic reactor covering:
- Reactor registration and unregistration lifecycle
- Panic notifier chain reachability

The real rv_panic_reaction() calls vpanic(), which is __noreturn and
halts the system. KUnit cannot test across that boundary. Instead, the
test drives atomic_notifier_call_chain(&panic_notifier_list, ...) directly
with a high-priority mock notifier that returns NOTIFY_STOP, verifying
that the panic notification propagates without triggering real handlers
(kdump, watchdog, reboot).

The mock notifier busy-waits to simulate real handler execution time
(e.g., crash_save_vmcoreinfo, emergency_restart preamble) under the
panic context constraints.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
 kernel/trace/rv/Kconfig               |  10 +++
 kernel/trace/rv/Makefile              |   1 +
 kernel/trace/rv/reactor_panic_kunit.c | 106 ++++++++++++++++++++++++++
 3 files changed, 117 insertions(+)
 create mode 100644 kernel/trace/rv/reactor_panic_kunit.c

diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index ff47895c897f..6c6c43c5f86c 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -121,3 +121,13 @@ config RV_REACT_PANIC
 	help
 	  Enables the panic reactor. The panic reactor emits a printk()
 	  message if an exception is found and panic()s the system.
+
+config RV_REACT_PANIC_KUNIT
+	bool "KUnit tests for reactor_panic" if !KUNIT_ALL_TESTS
+	depends on RV_REACT_PANIC && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  This builds KUnit tests for the panic reactor. These are only
+	  for development and testing, not for regular kernel use cases.
+
+	  If unsure, say N.
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index ef0a2dcb927c..2ebfe5e5068c 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -25,3 +25,4 @@ obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
 obj-$(CONFIG_RV_REACT_PRINTK_KUNIT) += reactor_printk_kunit.o
 obj-$(CONFIG_RV_REACT_PANIC) += reactor_panic.o
+obj-$(CONFIG_RV_REACT_PANIC_KUNIT) += reactor_panic_kunit.o
diff --git a/kernel/trace/rv/reactor_panic_kunit.c b/kernel/trace/rv/reactor_panic_kunit.c
new file mode 100644
index 000000000000..f9a09ae7aaad
--- /dev/null
+++ b/kernel/trace/rv/reactor_panic_kunit.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for reactor_panic
+ *
+ */
+
+#include <kunit/test.h>
+#include <linux/rv.h>
+#include <linux/panic_notifier.h>
+#include <linux/notifier.h>
+#include <linux/limits.h>
+#include <linux/sched/clock.h>
+#include <linux/processor.h>
+
+/* Simulated execution time for mock panic notifier (nanoseconds). */
+#define RV_PANIC_NOTIFIER_EXEC_NS	2000000ULL
+
+/* Test state */
+static struct {
+	bool notifier_called;
+} panic_test_state;
+
+/*
+ * Mock panic notifier callback.
+ *
+ * Runs at INT_MAX priority and returns NOTIFY_STOP to prevent real panic
+ * handlers (kdump, watchdog) from executing during the test.  Busy-waits
+ * RV_PANIC_NOTIFIER_EXEC_NS to simulate a real handler's execution time.
+ */
+static int mock_panic_notifier_fn(struct notifier_block *nb,
+				  unsigned long action, void *data)
+{
+	char *msg = data;
+	u64 start = sched_clock();
+
+	panic_test_state.notifier_called = true;
+	pr_emerg("KUnit: reactor_panic test intercepted panic notifier: %s\n",
+		 msg ? msg : "(no message)");
+
+	while (sched_clock() - start < RV_PANIC_NOTIFIER_EXEC_NS)
+		cpu_relax();
+
+	return NOTIFY_STOP;
+}
+
+static struct notifier_block mock_panic_nb = {
+	.notifier_call	= mock_panic_notifier_fn,
+	.priority	= INT_MAX,
+};
+
+static struct rv_reactor mock_panic_reactor = {
+	.name		= "test_panic",
+	.description	= "test panic reactor",
+};
+
+static int reactor_panic_kunit_init(struct kunit *test)
+{
+	panic_test_state.notifier_called = false;
+	return 0;
+}
+
+/* Test 1: register and unregister reactor */
+static void test_panic_register_unregister(struct kunit *test)
+{
+	int ret;
+
+	ret = rv_register_reactor(&mock_panic_reactor);
+	KUNIT_EXPECT_EQ(test, ret, 0);
+	KUNIT_EXPECT_STREQ(test, mock_panic_reactor.name, "test_panic");
+
+	rv_unregister_reactor(&mock_panic_reactor);
+}
+
+/*
+ * Test 2: panic notifier chain is reachable.
+ *
+ * vpanic() calls atomic_notifier_call_chain(&panic_notifier_list, ...).
+ * Drive the chain directly to verify panic notifiers receive the notification —
+ * the observable side-effect of reactor_panic without halting the system.
+ */
+static void test_panic_notifier_called(struct kunit *test)
+{
+	atomic_notifier_chain_register(&panic_notifier_list, &mock_panic_nb);
+	atomic_notifier_call_chain(&panic_notifier_list, 0,
+				   "panic violation message");
+	atomic_notifier_chain_unregister(&panic_notifier_list, &mock_panic_nb);
+
+	KUNIT_EXPECT_TRUE(test, panic_test_state.notifier_called);
+}
+
+static struct kunit_case reactor_panic_kunit_cases[] = {
+	KUNIT_CASE(test_panic_register_unregister),
+	KUNIT_CASE(test_panic_notifier_called),
+	{}
+};
+
+static struct kunit_suite reactor_panic_kunit_suite = {
+	.name		= "rv_reactor_panic",
+	.init		= reactor_panic_kunit_init,
+	.test_cases	= reactor_panic_kunit_cases,
+};
+
+kunit_test_suite(reactor_panic_kunit_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("KUnit tests for reactor_panic");
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v4 6/7] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu @ 2026-06-16  1:15 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178148609402.185520.8189233495763938815.stgit@devnote2>

On Mon, 15 Jun 2026 10:14:54 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> +		case FETCH_OP_DEREF_CPU:
> +			val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> +			ret = probe_mem_read(&val, (void *)val, sizeof(val));
> +			break;
> +		case FETCH_OP_CPU_PTR:
> +			val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> +			ret = 0;
> +			break;

Hmm, maybe I can just convert the FETCH_OP_DEREF_CPU to
FETCH_OP_CPU_PTR + FETCH_OP_DEREF to simply the code.

> +		default:
> +			lval = llval;
> +			goto out;
> +		}
>  		if (ret)
>  			return ret;
> +		llval = lval;
>  		code++;
>  	} while (1);
> +out:
>  
>  	s3 = code;
>  stage3:
> @@ -181,6 +195,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
>  	case FETCH_OP_ST_UMEM:
>  		probe_mem_read_user(dest, (void *)val + code->offset, code->size);
>  		break;
> +	case FETCH_OP_ST_CPUMEM:
> +		val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> +		probe_mem_read(dest, (void *)val, code->size);
> +		break;

Then, I can just drop this change.

Thanks,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-16  2:09 UTC (permalink / raw)
  To: Martin Kaiser; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260615145500.2662456-1-martin@kaiser.cx>

On Mon, 15 Jun 2026 16:54:12 +0200
Martin Kaiser <martin@kaiser.cx> wrote:

> For a char * element in an event, the FILTER_PTR_STRING filter type is
> used. When the event occurs, a pointer is stored in the ringbuffer.
> 
> If an eprobe references such a char * element of a "base event" and
> decodes the pointer as string, the pointer cannot be dereferenced.
> 
> $ echo 'e syscalls.sys_enter_openat $filename:string' > \
> 		/sys/kernel/tracing/dynamic_events
> $ trace-cmd start -e eprobes
> $ trace-cmd show
>     ... : sys_enter_openat: (syscalls.sys_enter_openat) arg1=(fault)
> 
> The problem is in get_event_field
> 
> 	val = (unsigned long)(*(char *)addr);
> 
> addr points to the position in the ringbuffer where the pointer was
> stored. We must read the complete pointer, not just the lowest byte.
> 
> Fix the assignment, make the example above work.
> 

Ah, this is a bit complicated. It seems to work with sched_switch event
as commit f04dec93466a ("tracing/eprobes: Fix reading of string fields"):

echo 'e:sw sched/sched_switch comm=$next_comm:string' > dynamic_events

#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-162     [002] d..3.    54.027213: sw: (sched.sched_switch) comm="swapper/2"
          <idle>-0       [007] d..3.    54.034573: sw: (sched.sched_switch) comm="rcu_preempt"
     rcu_preempt-15      [007] d..3.    54.034589: sw: (sched.sched_switch) comm="swapper/7"

Maybe comm is stored as a fixed string information in the event record?

/sys/kernel/tracing # cat events/sched/sched_switch/format 
name: sched_switch
ID: 254
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
	field:int prev_prio;	offset:28;	size:4;	signed:1;
	field:long prev_state;	offset:32;	size:8;	signed:1;
	field:char next_comm[16];	offset:40;	size:16;	signed:0;
	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
	field:int next_prio;	offset:60;	size:4;	signed:1;

But the filename is a pointer.

/sys/kernel/tracing # cat events/syscalls/sys_enter_openat/format 
name: sys_enter_openat
ID: 705
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:int dfd;	offset:16;	size:8;	signed:0;
	field:const char * filename;	offset:24;	size:8;	signed:0;
	field:int flags;	offset:32;	size:8;	signed:0;
	field:umode_t mode;	offset:40;	size:8;	signed:0;
	field:__data_loc char[] __filename_val;	offset:48;	size:4;	signed:0;

In this case, the filename field should use __data_loc directly instead of
pointing data on the ring buffer.

Can you try 

echo 'e syscalls.sys_enter_openat $__filename_val:string' > \
 		/sys/kernel/tracing/dynamic_events

Instead?

I think better solution is fixing sycall tracer.

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [RFC PATCH v4 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-06-16  6:41 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	lipengfei28, zhangbo56

From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Masami, Steven, all,

This is v4 of the ftrace stackmap series. It is sent as a new thread.

v3: https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/

The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).

Rebased onto v7.1-rc5 (e8c2f9fdadee).

Motivation
==========

The target use case is long-duration, from-boot kernel tracing where
the same stacks recur enormously often and the bottleneck is ring
buffer space, not CPU.

Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.

In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.

This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).

Effect on retention
====================

Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:

                  retained events   bytes/event   time span
  stackmap OFF        645,068          ~104 B        15.0 s
  stackmap ON       1,397,741           ~48 B        27.7 s
                     2.17x             2.17x          1.85x

The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.

Changes since v3
================

Correctness:
  - Deep stacks are never silently truncated or merged. A stack deeper
    than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
    trace; ftrace_stackmap_get_id() returns -E2BIG rather than
    truncating, so two distinct stacks sharing their first 64 frames
    can no longer collapse to one stack_id.
  - reset is now genuinely destructive and coherent: under the
    reader_sem write lock it clears the owning trace_array's ring
    buffer (and snapshot) BEFORE clearing the map, and uses
    tracing_reset_all_cpus() rather than _online_cpus() so a
    TRACE_STACK_ID written by a now-offline CPU cannot survive and
    dangle against a cleared map.
  - __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
    slot before inserting into the map, so stack_map_stat counters and
    ref_count stay consistent with what the ring buffer actually
    references (failed reservation -> full stack, map untouched).
  - ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
    instead of wrapping on multi-hour, billions-of-hits traces.

Global-instance gating:
  - Enabling 'stackmap' on a secondary instance via the aggregate
    trace_options file is now rejected, not just hidden in the
    per-instance options/ directory.
  - tracefs init is failure-atomic: the required stack_map file is
    created before the map pointer is published; if it cannot be
    created the map is destroyed and never published. An init-state
    (PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
    the flag before the map exists (hot path falls back until it is
    published) while still rejecting enables after a permanent init
    failure, so options/stackmap never reports an enabled no-op.

ABI / tooling:
  - Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
    upstream ABI). Documentation, tool and selftest updated to match.
  - Text and binary exports now follow the same trampoline-marker and
    trace_adjust_address() handling as the normal stack print path.
  - stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
    trampoline marker only in the resolved 'symbols'.

Selftests:
  - New stackmap-reset.tc: verifies reset clears stale <stack_id N>
    from the trace buffer and checks the stack_map_bin magic/version.
  - stackmap-instance-gate.tc extended to verify the trace_options
    write path is rejected on a secondary instance.
  - stackmap-basic.tc no longer treats a nonzero drops count as a
    failure (drops is a by-design fallback); only zero successes with
    nonzero drops is fatal.

Open questions for maintainers
==============================

Two design points where I would value direction before polishing
further:

1. Eager vs lazy allocation. The element pool is allocated at
   fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
   userspace ever enables the option (~8 MB at the default bits=14,
   up to ~135 MB at bits=18). This keeps the hot path allocation-free
   with no allocation-failure path under tracing pressure. Is eager
   allocation acceptable, or would you prefer lazy allocation on the
   first 'echo 1 > options/stackmap'?

2. Binary ABI now or later. stack_map_bin is a new tracefs binary
   interface (magic 0x46534D42, version 1). Is it acceptable to
   introduce it now, or would you prefer the first version ship with
   the text stack_map interface only and add the binary export once
   trace-cmd / libtraceevent integration is designed?

Test results
============

QEMU (aarch64 virt, v7.1-rc5 + this series), boot to init smoke test:
  - stackmap functional suite: 16/16 PASS, including reset clearing the
    trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
    magic/version, global-vs-secondary instance gating, and the
    trace_options rejection on a secondary instance.
  - boot-time activation (trace_options=stackmap,stacktrace on the
    kernel cmdline): 3/3 PASS -- the option survives the
    pre-initialization window and the map is live once published.
  - ftrace startup self-tests pass with the new TRACE_STACK_ID entry.

Device retention numbers above were collected on a Xiaomi SM8850
(ARM64) running an Android workload, comparing the same buffer with
the option off and on.

Known limitations
=================

- Per-instance stackmap support is not included; the option is gated
  to the global instance (in the tracefs layout and at the
  set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy concurrent
  contention two CPUs racing with the same stack hash may each claim a
  different slot, producing a few duplicate entries; ref_count is then
  split across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
  but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles (see open question 2).

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 177 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          | 216 ++++-
 kernel/trace/trace.h                          |  17 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 889 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 111 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  54 ++
 .../ftrace/test.d/ftrace/stackmap-reset.tc    |  76 ++
 tools/tracing/stackmap_dump.py                | 164 ++++
 15 files changed, 1821 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox