Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Steven Rostedt @ 2026-05-13 12:34 UTC (permalink / raw)
  To: Yash Suthar
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me
In-Reply-To: <CAPfzD4kiEMFdbK36X1_f1+Vn=v3nfvUODZJuJ40sNJ_fRr9zKA@mail.gmail.com>

On Tue, 12 May 2026 20:11:08 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:

> Gentle ping.

Hi,

What's the rush? This is just a clean up change. There's no feature here
that you need is there?

If you see it in patchwork[1], it's not lost. I just have other things
ahead of it. I usually process cleanup code last.

Thanks,

-- Steve

[1] https://patchwork.kernel.org/project/linux-trace-kernel/patch/20260502174741.39636-1-yashsuthar983@gmail.com/

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] rv/da: fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-05-13 12:39 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_monitor_start() set monitoring=1 before calling da_monitor_init_hook(),
> may racing with the sched_switch handler:
> 
>   da_monitor_start()               sched_switch handler
>   -------------------------        ---------------------------------
>   da_mon->monitoring = 1;
>                                    if (da_monitoring(da_mon))  /* true  */
>                                        ha_start_timer_ns(...);
>                                        /* hrtimer->base == NULL, crash */
>   da_monitor_init_hook(da_mon);
>   /* hrtimer_setup() sets base */
> 
> Fix the ordering and pair with release/acquire semantics:
> 
>   da_monitor_init_hook(da_mon);
>   smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
>   return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */
> 
> On ARM64 a plain STR + LDR does not form a release-acquire pair, so
> the load can observe monitoring=1 while hrtimer->base is still NULL.
> The plain accesses are also data races under KCSAN.
> 
> Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
> cover the reset path.
> 
> Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor
> definition via C macros")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Thanks for the fix!

There are probably more than a few of those bugs since most monitors are
implicitly serialised because their events are serialised..

Looks good to me.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

See minor comments below:

> ---
>  include/rv/da_monitor.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 39765ff6f098..00ded3d5ab3f 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
>  static inline void da_monitor_reset(struct da_monitor *da_mon)
>  {
>   da_monitor_reset_hook(da_mon);
> - da_mon->monitoring = 0;
> + WRITE_ONCE(da_mon->monitoring, 0);
>   da_mon->curr_state = model_get_initial_state();
>  }
>  
> @@ -95,8 +95,9 @@ static inline void da_monitor_reset(struct da_monitor
> *da_mon)
>  static inline void da_monitor_start(struct da_monitor *da_mon)
>  {
>   da_mon->curr_state = model_get_initial_state();
> - da_mon->monitoring = 1;
>   da_monitor_init_hook(da_mon);
> + /* Pairs with smp_load_acquire in da_monitoring(). */

I wonder if these comment are really adding value, pairing smp_load_acquire /
smp_store_release is the by-the-book usage and everything is here.

But feel free to leave it if you think it's clearer.

Thanks,
Gabriele

> + smp_store_release(&da_mon->monitoring, 1);
>  }
>  
>  /*
> @@ -104,7 +105,8 @@ static inline void da_monitor_start(struct da_monitor
> *da_mon)
>   */
>  static inline bool da_monitoring(struct da_monitor *da_mon)
>  {
> - return da_mon->monitoring;
> + /* Pairs with smp_store_release in da_monitor_start(). */
> + return smp_load_acquire(&da_mon->monitoring);
>  }
>  
>  /*


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:47 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.

That's definitely a good addition, kmalloc_nolock was not that good already so I
tried some way have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache. That was my original idea
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's ignore this
for now.


That said, the API is starting to get complex (well, not that it wasn't
already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor
> *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target
> da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target
> target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit
> and
> - * da_create_storage can never occur after that, it's safe to call this
> without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that
> da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g.
> scheduling).


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 13:50 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> da_create_empty_storage() uses kmalloc_nolock(), which requires
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE; on UML and some PREEMPT_RT
> configurations it always returns NULL.  Calling kmalloc from scheduler
> tracepoint handlers also adds unwanted latency and can fail under
> memory pressure.
> 
> Add da_monitor_init_prealloc(N) as an opt-in alternative to
> da_monitor_init().  It allocates N da_monitor_storage slots with
> GFP_KERNEL up-front and manages them on a LIFO free-stack protected
> by a spinlock, so da_create_or_get() never calls kmalloc on the hot
> path.
> 
> Monitors that do not call da_monitor_init_prealloc() are unaffected.

That's definitely a good addition, kmalloc_nolock was not that good already so I
tried some way have a preallocation, though I realise it isn't really flexible.

Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

Isn't this similar to what you'd do with a kmem_cache. That was my original idea
although that uses spinlocks too.

I quickly tried an implementation like yours using
mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and it
still explodes with my monitors, but perhaps now that tracepoints no longer
disable preemption it could play well with some monitors.

The selftests with tlob seem to work just the same with this kmem_cache (up to
the unrelated RCU stall). To be fair since you only allocate from the uprobe
handler, you'd probably be just fine with kmalloc_nolock, but let's continue
with the preallocation logic.


The API is starting to get complex (well, not that it wasn't already).
We have essentially 3 ways to allocate:
 * fully automatic with kmalloc_nolock
 * semi-automatic with pool preallocation
 * manual with direct storage preallocation

We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_MON_AUTO, DA_MON_POOL,
DA_MON_MANUAL} where DA_MON_POOL also requires DA_MON_POOL_SIZE to be define
(force that with an #error).

> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  include/rv/da_monitor.h | 208 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 186 insertions(+), 22 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index d04bb3229c75..7d6f62766251 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -433,18 +433,6 @@ static inline da_id_type da_get_id(struct da_monitor
> *da_mon)
>   return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
>  }
>  
> -/*
> - * da_create_or_get - create the per-object storage if not already there
> - *
> - * This needs a lookup so should be guarded by RCU, the condition is checked
> - * directly in da_create_storage()
> - */
> -static inline void da_create_or_get(da_id_type id, monitor_target target)
> -{
> - guard(rcu)();
> - da_create_storage(id, target, da_get_monitor(id, target));
> -}
> -
>  /*
>   * da_fill_empty_storage - store the target in a pre-allocated storage
>   *
> @@ -475,15 +463,121 @@ static inline monitor_target
> da_get_target_by_id(da_id_type id)
>   return mon_storage->target;
>  }
>  
> +/*
> + * Per-object pool state.
> + *
> + * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> + * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> + * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> + *
> + * Because every field is wrapped in this struct and the struct itself is a
> + * per-TU static, each monitor that includes this header gets a completely
> + * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> + * (e.g. tlob) therefore coexist without any interference.
> + *
> + * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> + * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> + * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> + */
> +struct da_per_obj_pool {
> + struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> + struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> + unsigned int                free_top;
> + spinlock_t                  lock;
> +};
> +
> +static struct da_per_obj_pool da_pool = {
> + .lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> +};
> +
> +static void da_pool_return_cb(struct rcu_head *head)
> +{
> + struct da_monitor_storage *ms =
> + container_of(head, struct da_monitor_storage, rcu);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + da_pool.free[da_pool.free_top++] = ms;
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +}
> +
> +/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> +static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +{
> + struct da_monitor_storage *mon_storage;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&da_pool.lock, flags);
> + if (!da_pool.free_top) {
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> + return -ENOSPC;
> + }
> + mon_storage = da_pool.free[--da_pool.free_top];
> + spin_unlock_irqrestore(&da_pool.lock, flags);
> +
> + mon_storage->id = id;
> + mon_storage->target = target;
> + guard(rcu)();
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + return 0;
> +}
> +
> +/*
> + * Tries da_create_storage() first (lock-free via kmalloc_nolock); falls back
> + * to kmalloc(GFP_KERNEL).  Must be called from task context.
> + */
> +static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target
> target)
> +{
> + struct da_monitor_storage *mon_storage;
> +
> + scoped_guard(rcu) {
> + if (da_create_storage(id, target, da_get_monitor(id, target)))
> + return 0;
> + }
> +
> + /*
> + * da_create_storage() failed because kmalloc_nolock() returned NULL.
> + * Allocate with GFP_KERNEL outside the RCU read section: GFP_KERNEL
> + * may sleep for memory reclaim, which is illegal while the RCU read
> + * lock is held (preemption disabled on !PREEMPT_RT).
> + */
> + mon_storage = kmalloc_obj(*mon_storage, GFP_KERNEL | __GFP_ZERO);
> + if (!mon_storage)
> + return -ENOMEM;
> + mon_storage->id = id;
> + mon_storage->target = target;
> +
> + /*
> + * Re-check for a concurrent insertion before linking: another
> + * caller may have succeeded while we slept in kmalloc().
> + * Discard our allocation and let the winner's entry stand.
> + */
> + scoped_guard(rcu) {
> + if (da_get_monitor(id, target)) {
> + kfree(mon_storage);
> + return 0;
> + }
> + hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> + }
> + return 0;
> +}
> +
> +/* Create the per-object storage if not already there. */
> +static inline int da_create_or_get(da_id_type id, monitor_target target)
> +{
> + if (da_pool.storage)
> + return da_create_or_get_pool(id, target);
> + return da_create_or_get_kmalloc(id, target);
> +}
> +
>  /*
>   * da_destroy_storage - destroy the per-object storage
>   *
> - * The caller is responsible to synchronise writers, either with locks or
> - * implicitly. For instance, if da_destroy_storage is called at sched_exit
> and
> - * da_create_storage can never occur after that, it's safe to call this
> without
> - * locks.
> - * This function includes an RCU read-side critical section to synchronise
> - * against da_monitor_destroy().
> + * Pool mode: removes from hash and returns the slot via call_rcu().
> + * Kmalloc mode: removes from hash and frees via kfree_rcu().
> + *
> + * Includes an RCU read-side critical section to synchronise against
> + * da_monitor_destroy().
>   */
>  static inline void da_destroy_storage(da_id_type id)
>  {
> @@ -491,15 +585,17 @@ static inline void da_destroy_storage(da_id_type id)
>  
>   guard(rcu)();
>   mon_storage = __da_get_mon_storage(id);
> -
>   if (!mon_storage)
>   return;
>   da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   hash_del_rcu(&mon_storage->node);
> - kfree_rcu(mon_storage, rcu);
> + if (da_pool.storage)
> + call_rcu(&mon_storage->rcu, da_pool_return_cb);
> + else
> + kfree_rcu(mon_storage, rcu);
>  }
>  
> -static void da_monitor_reset_all(void)
> +static __maybe_unused void da_monitor_reset_all(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   int bkt;
> @@ -510,13 +606,65 @@ static void da_monitor_reset_all(void)
>   rcu_read_unlock();
>  }
>  
> +/*
> + * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> + *
> + * Allocates @prealloc_count storage slots up-front so that
> da_create_or_get()
> + * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> + * of da_monitor_init() for monitors that require pool mode.
> + */
> +static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +{
> + hash_init(da_monitor_ht);
> +
> + da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> +   GFP_KERNEL);
> + if (!da_pool.storage)
> + return -ENOMEM;
> +
> + da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> +      GFP_KERNEL);
> + if (!da_pool.free) {
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + return -ENOMEM;
> + }
> +
> + da_pool.free_top = 0;
> + for (unsigned int i = 0; i < prealloc_count; i++)
> + da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> + return 0;
> +}
> +
> +/*
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> + */
>  static inline int da_monitor_init(void)
>  {
>   hash_init(da_monitor_ht);
>   return 0;
>  }
>  
> -static inline void da_monitor_destroy(void)
> +static inline void da_monitor_destroy_pool(void)
> +{
> + WARN_ON_ONCE(!hash_empty(da_monitor_ht));
> + /*
> + * Wait for all in-flight da_pool_return_cb() callbacks to
> + * complete before freeing da_pool.free.  synchronize_rcu() is
> + * not sufficient: it only waits for callbacks registered before
> + * it was called, but call_rcu() from concurrent da_destroy_storage()
> + * calls may have been enqueued later.  rcu_barrier() drains every
> + * pending callback.
> + */
> + rcu_barrier();
> + kfree(da_pool.storage);
> + da_pool.storage = NULL;
> + kfree(da_pool.free);
> + da_pool.free = NULL;
> + da_pool.free_top = 0;
> +}
> +
> +static inline void da_monitor_destroy_kmalloc(void)
>  {
>   struct da_monitor_storage *mon_storage;
>   struct hlist_node *tmp;
> @@ -534,6 +682,22 @@ static inline void da_monitor_destroy(void)
>   }
>  }
>  
> +/*
> + * da_monitor_destroy - tear down the per-object monitor
> + *
> + * Pool mode: the hash must already be empty (caller must have drained all
> + * tasks first); calls rcu_barrier() to drain all pending da_pool_return_cb()
> + * callbacks before freeing pool arrays.
> + * Kmalloc mode: drains any remaining entries after synchronize_rcu().
> + */
> +static inline void da_monitor_destroy(void)
> +{
> + if (da_pool.storage)
> + da_monitor_destroy_pool();
> + else
> + da_monitor_destroy_kmalloc();
> +}
> +
>  /*
>   * Allow the per-object monitors to run allocation manually, necessary if the
>   * start condition is in a context problematic for allocation (e.g.
> scheduling).


^ permalink raw reply

* Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-13 14:01 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <2774332570ee823be60cfe84ba85e9573b4df478.1778522945.git.wen.yang@linux.dev>


Sorry for the spam, I accidentally mapped ctrl+Enter to send without
confirmation and can't teach my fingers not to press it instead of ctrl+delete..

> That's definitely a good addition, kmalloc_nolock was not that good already so
> I tried some way have a preallocation, though I realise it isn't really
> flexible.
> 
> Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
> 
> Isn't this similar to what you'd do with a kmem_cache. That was my original
> idea although that uses spinlocks too.
> 
> I quickly tried an implementation like yours using
> mempool_create_slab_pool(prealloc_count) and mempool_alloc_preallocated() and
> it still explodes with my monitors, but perhaps now that tracepoints no longer
> disable preemption it could play well with some monitors.
> 
> The selftests with tlob seem to work just the same with this kmem_cache (up to
> the unrelated RCU stall). To be fair since you only allocate from the uprobe
> handler, you'd probably be just fine with kmalloc_nolock, but let's continue
> with the preallocation logic.
> 
> 
> The API is starting to get complex (well, not that it wasn't already).
> We have essentially 3 ways to allocate:
>  * fully automatic with kmalloc_nolock
>  * semi-automatic with pool preallocation
>  * manual with direct storage preallocation
> 
> We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
> DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
> DA_MON_POOL_SIZE to be define (force that with an #error).

Anyway, this way you probably wouldn't need to define a different init function
and let everything handled more transparently.

Also you don't need to call da_create_or_get() explicitly,
da_handle_start_event() should do it for you.

The only manual step required is da_create_storage() when you explicitly cannot
lock when calling da_handle_start_event() (that would be DA_ALLOC_MANUAL, you
don't need that).

Hope I didn't create too much confusion.

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH v4 1/3] ftrace: Build trace_btf.c when CONFIG_DEBUG_INFO_BTF is enabled
From: Steven Rostedt @ 2026-05-13 14:42 UTC (permalink / raw)
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-2-dolinux.peng@gmail.com>


Sorry for the late reply, I've been a bit busy on other things recently.

On Mon, 15 Dec 2025 11:41:51 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The trace_btf.c file provides BTF helper functions used by the ftrace
> subsystem. This change makes its compilation solely dependent on

Nit, change logs should never say "This change". Instead it should be
worded as:

  "Make the compilation of trace_btf.c soley depend on..."

-- Steve

> CONFIG_DEBUG_INFO_BTF, allowing features like funcgraph-retval to also
> utilize these helpers.
> 
> Additionally, the redundant dependency on CONFIG_PROBE_EVENTS_BTF_ARGS
> is removed, as CONFIG_DEBUG_INFO_BTF already depends on
> CONFIG_BPF_SYSCALL.
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/Kconfig  | 2 +-
>  kernel/trace/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e1214b9dc990..653c1fcefa4c 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -755,7 +755,7 @@ config FPROBE_EVENTS
>  config PROBE_EVENTS_BTF_ARGS
>  	depends on HAVE_FUNCTION_ARG_ACCESS_API
>  	depends on FPROBE_EVENTS || KPROBE_EVENTS
> -	depends on DEBUG_INFO_BTF && BPF_SYSCALL
> +	depends on DEBUG_INFO_BTF
>  	bool "Support BTF function arguments for probe events"
>  	default y
>  	help
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index fc5dcc888e13..6c4bf5a6c4f3 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -116,7 +116,7 @@ obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
>  endif
>  obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
>  obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
> -obj-$(CONFIG_PROBE_EVENTS_BTF_ARGS) += trace_btf.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += trace_btf.o
>  obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
>  obj-$(CONFIG_BOOTTIME_TRACING) += trace_boot.o
>  obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o


^ permalink raw reply

* Re: [PATCH v4 2/3] fgraph: Enhance funcgraph-retval with BTF-based type-aware output
From: Steven Rostedt @ 2026-05-13 14:47 UTC (permalink / raw)
  To: Donglin Peng
  Cc: mhiramat, linux-trace-kernel, bpf, linux-kernel, pengdonglin,
	Xiaoqin Zhang
In-Reply-To: <20251215034153.2367756-3-dolinux.peng@gmail.com>

On Mon, 15 Dec 2025 11:41:52 +0800
Donglin Peng <dolinux.peng@gmail.com> wrote:

> From: pengdonglin <pengdonglin@xiaomi.com>
> 
> The current funcgraph-retval implementation suffers from two accuracy
> issues:
> 
> 1. Void-returning functions still print a return value, creating
>    misleading noise in the trace output.
> 
> 2. For functions returning narrower types (e.g., char, short), the
>    displayed value can be incorrect because high bits of the register
>    may contain undefined data.
> 
> This patch addresses both problems by leveraging BTF to obtain the exact
> return type of each traced kernel function. The key changes are:
> 
> 1. Void function filtering: Functions with void return type no longer
>    display any return value in the trace output, eliminating unnecessary
>    clutter.
> 
> 2. Type-aware value formatting: The return value is now properly truncated
>    to match the actual width of the return type before being displayed.
>    Additionally, the value is formatted according to its type for better
>    human readability.
> 
> Here is an output comparison:
> 
> Before:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  1)               |   touch_atime() {
>  1)               |     atime_needs_update() {
>  1)   0.069 us    |       make_vfsuid(); /* ret=0x0 */
>  1)   0.067 us    |       make_vfsgid(); /* ret=0x0 */
>  1)               |       current_time() {
>  1)   0.197 us    |         ktime_get_coarse_real_ts64_mg(); /* ret=0x187f886aec3ed6f5 */
>  1)   0.352 us    |       } /* current_time ret=0x69380753 */
>  1)   0.792 us    |     } /* atime_needs_update ret=0x0 */
>  1)   0.937 us    |   } /* touch_atime ret=0x0 */
> 
> After:
>  # perf ftrace -G vfs_read --graph-opts retval
>  ...
>  2)               |   touch_atime() {
>  2)               |     atime_needs_update() {
>  2)   0.070 us    |       make_vfsuid(); /* ret=0x0 */
>  2)   0.070 us    |       make_vfsgid(); /* ret=0x0 */
>  2)               |       current_time() {
>  2)   0.162 us    |         ktime_get_coarse_real_ts64_mg();
>  2)   0.312 us    |       } /* current_time ret=0x69380649(trunc) */
>  2)   0.753 us    |     } /* atime_needs_update ret=false */
>  2)   0.899 us    |   } /* touch_atime */
> 
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com>
> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
> ---
>  kernel/trace/trace_functions_graph.c | 124 ++++++++++++++++++++++++---
>  1 file changed, 111 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 17c75cf2348e..46b66b1cfc16 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -15,6 +15,7 @@
>  
>  #include "trace.h"
>  #include "trace_output.h"
> +#include "trace_btf.h"
>  
>  /* When set, irq functions might be ignored */
>  static int ftrace_graph_skip_irqs;
> @@ -120,6 +121,13 @@ enum {
>  	FLAGS_FILL_END   = 3 << TRACE_GRAPH_PRINT_FILL_SHIFT,
>  };
>  
> +enum {
> +	RETVAL_FMT_HEX   = BIT(0),
> +	RETVAL_FMT_DEC   = BIT(1),
> +	RETVAL_FMT_BOOL  = BIT(2),
> +	RETVAL_FMT_TRUNC = BIT(3),
> +};
> +
>  static void
>  print_graph_duration(struct trace_array *tr, unsigned long long duration,
>  		     struct trace_seq *s, u32 flags);
> @@ -865,6 +873,73 @@ static void print_graph_retaddr(struct trace_seq *s, struct fgraph_retaddr_ent_e
>  
>  #if defined(CONFIG_FUNCTION_GRAPH_RETVAL) || defined(CONFIG_FUNCTION_GRAPH_RETADDR)
>  
> +static void trim_retval(unsigned long func, unsigned long *retval, bool *print_retval,
> +			int *fmt)

This function should really be in trace_btf.c and a stub when btf is not
enabled.

-- Steve

> +{
> +	const struct btf_type *t;
> +	char name[KSYM_NAME_LEN];
> +	struct btf *btf;
> +	u32 v, msb;
> +	int kind;
> +
> +	if (!IS_ENABLED(CONFIG_DEBUG_INFO_BTF))
> +		return;
> +
> +	if (lookup_symbol_name(func, name))
> +		return;
> +
> +	t = btf_find_func_proto(name, &btf);
> +	if (IS_ERR_OR_NULL(t))
> +		return;
> +
> +	t = btf_type_skip_modifiers(btf, t->type, NULL);
> +	kind = t ? BTF_INFO_KIND(t->info) : BTF_KIND_UNKN;
> +	switch (kind) {
> +	case BTF_KIND_UNKN:
> +		*print_retval = false;
> +		break;
> +	case BTF_KIND_STRUCT:
> +	case BTF_KIND_UNION:
> +	case BTF_KIND_ENUM:
> +	case BTF_KIND_ENUM64:
> +		if (kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION)
> +			*fmt = RETVAL_FMT_HEX;
> +		else
> +			*fmt = RETVAL_FMT_DEC;
> +
> +		if (t->size > sizeof(unsigned long)) {
> +			*fmt |= RETVAL_FMT_TRUNC;
> +		} else {
> +			msb = BITS_PER_BYTE * t->size - 1;
> +			*retval &= GENMASK(msb, 0);
> +		}
> +		break;
> +	case BTF_KIND_INT:
> +		v = *(u32 *)(t + 1);
> +		if (BTF_INT_ENCODING(v) == BTF_INT_BOOL) {
> +			*fmt = RETVAL_FMT_BOOL;
> +			msb = 0;
> +		} else {
> +			if (BTF_INT_ENCODING(v) == BTF_INT_SIGNED)
> +				*fmt = RETVAL_FMT_DEC;
> +			else
> +				*fmt = RETVAL_FMT_HEX;
> +
> +			if (t->size > sizeof(unsigned long)) {
> +				*fmt |= RETVAL_FMT_TRUNC;
> +				msb = BITS_PER_LONG - 1;
> +			} else {
> +				msb = BTF_INT_BITS(v) - 1;
> +			}
> +		}
> +		*retval &= GENMASK(msb, 0);
> +		break;
> +	default:
> +		*fmt = RETVAL_FMT_HEX;
> +		break;
> +	}
> +}
> +

^ permalink raw reply

* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Steven Rostedt @ 2026-05-13 14:59 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Arnd Bergmann, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260323105646.590718-1-arnd@kernel.org>


Vincent,

Is this patch needed? That is, did it fall through the cracks?

-- Steve

On Mon, 23 Mar 2026 11:56:41 +0100
Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> 
> The now more verbose check found another symbol missing from the whitelist:
> 
> Unexpected symbols in kernel/trace/simple_ring_buffer.o:
>          U __aeabi_unwind_cpp_pr1
> 
> Add this to the Makefile.
> 
> Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  kernel/trace/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index d662c1a64cd5..aba6a25db17b 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -169,8 +169,8 @@ targets += undefsyms_base.o
>  # because it is not linked into vmlinux.
>  KASAN_SANITIZE_undefsyms_base.o := y
>  
> -UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> -		      __msan simple_ring_buffer \
> +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
> +		      __x86_indirect_thunk __aeabi_unwind_cpp simple_ring_buffer \
>  		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
>  
>  quiet_cmd_check_undefined = NM      $<


^ permalink raw reply

* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
From: Jesper Dangaard Brouer @ 2026-05-13 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <d440b00b-6d7b-48cb-b37d-43f6f885ed01@kernel.org>



On 08/05/2026 22.15, David Hildenbrand (Arm) wrote:
> On 5/8/26 18:22, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> Add a selftest to verify the kmem:mm_zone_lock_contended,
>> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
>>
>> The test has two components:
>>
>> zone_lock_contention.c - a workload that spawns threads doing rapid
>> page allocation and freeing to generate zone->lock contention. It
>> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
>> free_pcppages_bulk() and rmqueue_bulk() calls.
>>
>> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
>> exist, have the expected fields, fire under load, and that wait_ns
>> is populated when contention occurs.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   tools/testing/selftests/mm/Makefile           |   2 +
>>   .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>>   .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
> 
> This really looks excessive and ... not really how we usually treat tracepoints?
> 
> I don't know about others, but I don't think this is really what we want as a MM
> selftest.
> 

I wanted to have a program that tested the code I changed, so I simply
made AI write a verification test and asked it to create as a selftest,
that I've run to verify my code change was correctly implemented.  As I
needed to trigger lock contention the test is more advanced, but luckily
AI solved it in only the 2nd attempt.

It makes sense to drop this patch. We shouldn't keep this code in the
kernel tree, it simply verified that my code works. There is little
chance that this test will catch meaningful regressions for these
tracepoints.

--Jesper

^ permalink raw reply

* Re: [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: Steven Rostedt @ 2026-05-13 15:02 UTC (permalink / raw)
  To: CaoRuichuang, Shuah Khan
  Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407203727.442b583c@gandalf.local.home>


Shuah,

Can you pull this into your urgent branch?

Thanks,

-- Steve


On Tue, 7 Apr 2026 20:37:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Shuah,
> 
> Care to take this through your tree. Probably could even add:
> 
> Cc: stable@vger.kernel.org
> Fixes: 8b55572e51805 ("tracing/selftests: Add tracefs mount options test")
> 
> As well as:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> -- Steve
> 
> 
> On Tue,  7 Apr 2026 18:26:13 +0800
> CaoRuichuang <create0818@163.com> wrote:
> 
> > From: Cao Ruichuang <create0818@163.com>
> > 
> > test_ownership.tc is sourced by ftracetest under /bin/sh.
> > 
> > The script currently declares mount_point with local at file scope,
> > which makes /bin/sh abort with "local: not in a function" before the
> > test can reach the eventfs ownership checks.
> > 
> > Replace the top-level local declaration with a normal shell variable so
> > kernels that support the gid= tracefs mount option can run the test at
> > all.
> > 
> > Signed-off-by: Cao Ruichuang <create0818@163.com>
> > ---
> >  tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > index e71cc3ad0..6d00d3c0f 100644
> > --- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > +++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
> > @@ -6,7 +6,7 @@
> >  original_group=`stat -c "%g" .`
> >  original_owner=`stat -c "%u" .`
> >  
> > -local mount_point=$(get_mount_point)
> > +mount_point=$(get_mount_point)
> >  
> >  mount_options=$(get_mnt_options "$mount_point")
> >    
> 


^ permalink raw reply

* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: Breno Leitao @ 2026-05-13 15:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <3e5e8fb8-6957-46b7-9777-0ed1bff1d0fb@kernel.org>

On Wed, May 13, 2026 at 01:48:11PM +0200, David Hildenbrand (Arm) wrote:
> > @@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> >  			goto try_again;
> >  		}
> >  		put_page(p);
> > -		ret = -EIO;
> > +		ret = -ENOTRECOVERABLE;
> >  	}
> >  out:
> > -	if (ret == -EIO)
> > +	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> >  		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> >  
> >  	return ret;
> > @@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
> >  			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> >  		}
> >  		goto unlock_mutex;
> > +	} else if (res == -ENOTRECOVERABLE) {
> > +		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> > +		goto unlock_mutex;
> >  	} else if (res < 0) {
> >  		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> >  		goto unlock_mutex;
> 
> That might probably read nicer as
> 
> switch (res) {
> case 0: ...
> case 1: ...
> case -ENOTRECOVERABLE:  ...
> case ...
> default:
> }
> 
> > 
> > 
> > If that is what you are suggestion, maybe we can create another
> > MF_MSG_RESERVED? and another return value for get_any_page() to track
> > the reserve pages ?
> 
> I guess "reserved" is really just like most other kernel pages. So I wouldn't
> special-case them here.
> 
> Or would there be a good reason?

Not really, treating them as MF_MSG_KERNEL is sufficient for my use
case.

Thank you for the review. I'm digesting all the feedback and will send
out a new revision shortly, where we can continue the discussion.

--breno

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Steven Rostedt @ 2026-05-13 15:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQJ5fatNF4auH+a8E39zWMfja3rm4BM_xGcTnLX8uuCQ9Q@mail.gmail.com>

On Sun, 3 May 2026 21:51:49 +0200
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Nack.
> 
> Please stop this spam.
> We're not doing it. These helpers have been around for a long time.
> There was no need to taint then. There is no need to taint now.

Hi Alexei,

I'm wondering if there's a way to see what modifications BPF programs are
doing to the kernel? I try to make it easy to see what modifications ftrace
has done (like the enabled_functions file), because I like to know how my
kernel is modified since boot up.

Thus, it would be nice to know if BPF is modifying anything in user space
or just what BPF programs are loaded.

Note, I'm agnostic to this change, it just brought up a previous concern of
mine when I read it.

Thanks,

-- Steve

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Alexei Starovoitov @ 2026-05-13 15:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <20260513111331.7bede512@gandalf.local.home>

On Wed, May 13, 2026 at 8:13 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 3 May 2026 21:51:49 +0200
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > Nack.
> >
> > Please stop this spam.
> > We're not doing it. These helpers have been around for a long time.
> > There was no need to taint then. There is no need to taint now.
>
> Hi Alexei,
>
> I'm wondering if there's a way to see what modifications BPF programs are
> doing to the kernel? I try to make it easy to see what modifications ftrace
> has done (like the enabled_functions file), because I like to know how my
> kernel is modified since boot up.

It's impossible to track all modifications.
See what sched-ext is doing.
What does it modify? Everything.

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Steven Rostedt @ 2026-05-13 15:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aaron Tomlin, Jonathan Corbet, Song Liu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard,
	Kumar Kartikeya Dwivedi, Masami Hiramatsu, Shuah Khan, Jiri Olsa,
	Martin KaFai Lau, Yonghong Song, Mathieu Desnoyers, Randy Dunlap,
	neelx, sean, chjohnst, steve, mproche, nick.lange,
	open list:DOCUMENTATION, LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQL_sWznA+JJLdzP_ZdUgQeO7p-AGnOtx9=fXjH+PnRJBA@mail.gmail.com>

On Wed, 13 May 2026 08:16:07 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> It's impossible to track all modifications.
> See what sched-ext is doing.
> What does it modify? Everything.

What about just having a list of what BPF programs are loaded, what they
may be attached to, and what kfuncs they are calling?

-- Steve

^ permalink raw reply

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
From: Jesper Dangaard Brouer @ 2026-05-13 15:32 UTC (permalink / raw)
  To: Dmitry Ilvokhin, Vlastimil Babka (SUSE)
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, Steven Rostedt,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
	kernel-team
In-Reply-To: <af4mbDB3-ohO5DCp@shell.ilvokhin.com>



On 08/05/2026 20.07, Dmitry Ilvokhin wrote:
> On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
>>> On 5/8/26 7:29 PM, Andrew Morton wrote:
>>>> e .configOn Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>>>
>>>>> Add tracepoints to the page allocator fast paths that acquire
>>>>> zone->lock, allowing diagnosis of lock contention in production.
>>>>
>>>> Thanks, I'm surprised we haven't done this yet.
>>>
>>> There was a recent attempt [1]. Not being a generic solution wasn't welcome.
>>>
>>> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
>>
>> And this is the generic solution I think?
>>
>> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
> 
> Thanks for cc'ing me, Vlastimil.
> 
> Yes, this is an attempt at a generic solution for tracing contended
> locks, including spinlocks, so it should also cover the use case
> proposed in this patchset.
>

I'm aware of the generic solution and often use `perf lock contention`.
And the tool libbpf-tools/klockstat. My experience is unfortunately that
enabling these tracepoint is prohibitive expensive on production server,
and production suffers when I run these tools.

I'm very happy to see a patchset adding a contended case. But I worry
that tracing all contented locks in the system is also too much to have
enabled continuously for production.

This patch is carefully constructed to minimize overhead, such that I
can enable this continuously on production to catch issues.  If I
identify issue I will use the generic tracpoints for further debugging.


> In fact, zone->lock contention was one of the primary motivations for
> this work.

In the generic solution I'm loosing the "zone" and pages "count".  I
need this information to get the answers I'm looking for.  Specifically
I'm looking at reducing CONFIG_PCP_BATCH_SCALE_MAX, but I want to this
to be a data-driven decision (my first principle is: if you cannot
measure it you cannot improve it).

I'm likely going to apply this patch to our production system, such that
I can get my data-driven decision.  I need to deploy it widely enough to
get enough server experiencing direct-reclaim.  I'll report back if
people are interested in these learning?

--Jesper

^ permalink raw reply

* Re: [PATCH v6 4/7] locking: Factor out queued_spin_release()
From: Steven Rostedt @ 2026-05-13 15:37 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
	linux-arch, linux-mm, linux-trace-kernel, kernel-team,
	Paul E. McKenney
In-Reply-To: <64c202b8a76a7d98515cf10cc1f99ecb0a9a7ccf.1777999826.git.d@ilvokhin.com>

On Tue,  5 May 2026 17:09:33 +0000
Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> Introduce queued_spin_release() as an arch-overridable unlock primitive,
> and make queued_spin_unlock() a generic wrapper around it.
> 
> This is a preparatory refactoring for the next commit, which adds
> contended_release tracepoint instrumentation to queued_spin_unlock().

In change logs, do not use "next commit". Instead say something like:

  In preparation for adding contended_release tracepoint instrumentation to
  queued_spin_unlock(), refactor the code to allow out of line calls when
  the tracepoint is enabled.

Or something like that.

-- Steve



> 
> Rename the existing arch-specific queued_spin_unlock() overrides on
> x86 (paravirt) and MIPS to queued_spin_release().
> 
> No functional change.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

^ permalink raw reply

* [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The first entry of error_states[],

	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },

is unreachable.  identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():

  * memory_failure() reaches identify_page_state() only after
    get_hwpoison_page() returned 1.  get_any_page() reaches that
    return only via __get_hwpoison_page(), which gates the refcount
    on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
    pages, so they fail with -EBUSY/-EIO long before
    identify_page_state() runs.

  * try_memory_failure_hugetlb() reaches identify_page_state() on
    the MF_HUGETLB_IN_USED branch, but the page is necessarily a
    hugetlb folio there.  The first table entry that matches a
    hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
    they dispatch to me_huge_page() before the (now-removed)
    reserved entry would have matched, regardless of whether
    PG_reserved happens to be set on the head page.

me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.

Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.

No functional change.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 866c4428ac7ef..49bcfbd04d213 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -992,17 +992,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
 	return false;
 }
 
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
-	unlock_page(p);
-	return MF_IGNORED;
-}
-
 /*
  * Page in unknown state. Do nothing.
  * This is a catch-all in case we fail to make sense of the page state.
@@ -1211,10 +1200,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 #define mlock		(1UL << PG_mlocked)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
-#define reserved	(1UL << PG_reserved)
 
 static struct page_state error_states[] = {
-	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -1246,7 +1233,6 @@ static struct page_state error_states[] = {
 #undef mlock
 #undef lru
 #undef head
-#undef reserved
 
 static void update_per_node_mf_stats(unsigned long pfn,
 				     enum mf_result result)

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++
 mm/memory-failure.c                     | 85 +++++++++++++++++++++++++--------
 2 files changed, 146 insertions(+), 19 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply

* [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

get_any_page() collapses three different failure modes into a single
-EIO return:

  * the put_page race in the !count_increased path;
  * the HWPoisonHandlable() rejection that bounces out of
    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
  * the HWPoisonHandlable() rejection that goes through the
    count_increased / put_page / shake_page retry loop.

The first is transient (the page is racing with the allocator).  The
second can be either transient (a userspace folio briefly off LRU
during migration/compaction) or stable (slab/vmalloc/page-table/
kernel-stack pages).  The third describes a stable kernel-owned page
that the count_increased=true caller already held a reference on.

Distinguish them on the return path: keep -EIO for both the put_page
race and the -EBUSY-after-retries branch (shake_page() cannot drag a
folio back from active migration, so we cannot prove the page is
permanently kernel-owned from there), keep -EBUSY for the allocation
race (unchanged), and return -ENOTRECOVERABLE only from the
count_increased-true HWPoisonHandlable() rejection that exhausts its
retries -- the caller's reference is structural evidence that the
page is owned by the kernel.

Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc.

memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch is a no-op for users of memory_failure() and only changes
the errno that soft_offline_page() can propagate to its callers.  A
follow-up wires the new return code through memory_failure() and
reports MF_MSG_KERNEL for the unrecoverable cases.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 49bcfbd04d213..bae883df3ccb2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
 				shake_page(p);
 				goto try_again;
 			}
+			/*
+			 * Return -EIO rather than -ENOTRECOVERABLE: this
+			 * branch is also reached for pages that are merely
+			 * off-LRU transiently (e.g. a folio in the middle
+			 * of migration or compaction), which shake_page()
+			 * cannot drag back.  The caller cannot prove the
+			 * page is permanently kernel-owned from here, so
+			 * keep it on the recoverable errno.
+			 */
 			ret = -EIO;
 			goto out;
 		}
@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
 			goto try_again;
 		}
 		put_page(p);
-		ret = -EIO;
+		ret = -ENOTRECOVERABLE;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
 
 	return ret;
@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
  *         operations like allocation and free,
- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
+ *         kernel stacks, and similar non-LRU/non-buddy pages).
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, vmalloc, page
tables, kernel stacks, ...).  memory_failure() still folds every
negative return into MF_MSG_GET_HWPOISON, so callers that want to
react to the unrecoverable cases (a panic option, smarter logging)
cannot tell them apart from transient page-allocator races.

Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.

The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, vmalloc, page tables and kernel
stack pages.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bae883df3ccb2..4b3a5d4190a07 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2410,7 +2410,8 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	res = get_hwpoison_page(p, flags);
-	if (!res) {
+	switch (res) {
+	case 0:
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -2429,7 +2430,19 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
-	} else if (res < 0) {
+	case 1:
+		/* Got a refcount on a handlable page. */
+		break;
+	case -ENOTRECOVERABLE:
+		/*
+		 * Stable unhandlable kernel-owned page (PG_reserved,
+		 * slab, vmalloc, page tables, kernel stacks, ...).
+		 * No recovery possible.
+		 */
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	default:
+		/* Transient lifecycle race with the page allocator. */
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;
 	}

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 4/6] mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

The previous patch already classifies PG_reserved pages as
MF_MSG_KERNEL through the long path: get_hwpoison_page() calls
__get_hwpoison_page() which fails HWPoisonHandlable(), get_any_page()
exhausts its shake_page() retry budget, and the resulting
-ENOTRECOVERABLE is mapped to MF_MSG_KERNEL by the switch.  The
outcome is correct but the work in between is wasted: shake_page()
cannot turn a reserved page into a handlable one.

Detect PG_reserved up front in memory_failure() and report
MF_MSG_KERNEL directly.  put_ref_page() releases the caller's
reference when MF_COUNT_INCREASED is set, which is important on the
MADV_HWPOISON path where get_user_pages_fast() holds a reference
across the call.

Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4b3a5d4190a07..8ba3df21d1270 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2398,6 +2398,19 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * PG_reserved pages are kernel-owned (memblock reservations,
+	 * driver reservations, ...) and cannot be recovered.  Skip the
+	 * get_hwpoison_page() lifecycle dance and report MF_MSG_KERNEL
+	 * straight away; HWPoisonHandlable() would just keep rejecting
+	 * the page through the retry budget anyway.
+	 */
+	if (PageReserved(p)) {
+		put_ref_page(pfn, flags);
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	}
+
 	/*
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 5/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure that triggers a
kernel panic when memory_failure() encounters pages that cannot be
recovered.  This provides a clean crash with useful debug information
rather than allowing silent data corruption or a delayed crash at an
unrelated code path.

Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, vmalloc, page tables,
kernel stacks, ...).

All other action types are excluded:

- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
  transient refcount races with the page allocator (an in-flight buddy
  allocation has refcount 0 and is no longer on the buddy free list,
  briefly), and panicking on them would risk killing the box for what
  is actually a recoverable userspace page.

- MF_MSG_UNKNOWN means identify_page_state() could not classify the
  page; that is precisely the wrong basis for a panic decision.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8ba3df21d1270..cb2965c0ec0b4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1267,6 +1278,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
+		return false;
+
+	return type == MF_MSG_KERNEL;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1284,6 +1304,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 6/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-05-13 15:39 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513-ecc_panic-v7-0-be2e578e61da@debian.org>

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c9..452c2ab25b35e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,85 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting reserved (``PageReserved``) memory: firmware reservations,
+the kernel image, vDSO, the zero page, and similar memblock-reserved
+regions.  These are owned by the kernel, are not managed by the page
+allocator, and cannot be recovered by the memory failure handler.
+
+Other unrecoverable kernel-owned populations (slab, vmalloc, page
+tables, kernel stacks, ...) are not currently covered by this
+sysctl.  The handler cannot reliably distinguish them from a
+userspace folio temporarily off the LRU during migration or
+compaction, and the cost of a false-positive panic on a recoverable
+userspace page is too high.  Such pages still go through the
+standard MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them
+and a delayed crash on the next access remains possible.  Coverage
+may grow in the future as the handler gains stronger
+kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies.  Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+  regularly and post-mortem analysis of an unrelated downstream crash
+  (often seconds to minutes after the original error) consumes
+  significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+  hardware error produces a vmcore that still contains the faulting
+  address, the affected page state, and the originating MCE/GHES
+  record — context that is typically lost by the time a delayed crash
+  occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+  failure for failover, and prefer an immediate panic over silent data
+  corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+  tools such as ``mce-inject`` or error-injection debugfs interfaces,
+  where panicking on the unrecoverable path makes regressions
+  immediately visible instead of surfacing as later, unrelated
+  failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.53.0-Meta


^ permalink raw reply related

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Steven Rostedt @ 2026-05-13 15:41 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
	linux-arch, linux-mm, linux-trace-kernel, kernel-team,
	Paul E. McKenney
In-Reply-To: <5d7ea75ffe74a785e6b234ada9f23c6373d4b4c1.1777999826.git.d@ilvokhin.com>

On Tue,  5 May 2026 17:09:34 +0000
Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> Use the arch-overridable queued_spin_release(), introduced in the
> previous commit, to ensure the tracepoint works correctly across all

Remove the ", introduced in the previous commit," That's useless in git
change logs.

> architectures, including those with custom unlock implementations (e.g.
> x86 paravirt).
> 
> When the tracepoint is disabled, the only addition to the hot path is a
> single NOP instruction (the static branch). When enabled, the contention
> check, trace call, and unlock are combined in an out-of-line function to
> minimize hot path impact, avoiding the compiler needing to preserve the
> lock pointer in a callee-saved register across the trace call.
> 
> Binary size impact (x86_64, defconfig):
>   uninlined unlock (common case): +680 bytes  (+0.00%)
>   inlined unlock (worst case):    +83659 bytes (+0.21%)
> 
> The inlined unlock case could not be achieved through Kconfig options on
> x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
> x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
> inline the unlock path and estimate the worst case binary size increase.
> 
> In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
> opted against binary size optimization, so the inlined worst case is
> unlikely to be a concern.
> 
> Architectures with fully custom qspinlock implementations (e.g.
> PowerPC) are not covered by this change.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/asm-generic/qspinlock.h | 18 ++++++++++++++++++
>  kernel/locking/qspinlock.c      |  8 ++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index df76f34645a0..915a4c2777f6 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -41,6 +41,7 @@
>  
>  #include <asm-generic/qspinlock_types.h>
>  #include <linux/atomic.h>
> +#include <linux/tracepoint-defs.h>
>  
>  #ifndef queued_spin_is_locked
>  /**
> @@ -129,12 +130,29 @@ static __always_inline void queued_spin_release(struct qspinlock *lock)
>  }
>  #endif
>  
> +DECLARE_TRACEPOINT(contended_release);
> +
> +extern void queued_spin_release_traced(struct qspinlock *lock);
> +
>  /**
>   * queued_spin_unlock - unlock a queued spinlock
>   * @lock : Pointer to queued spinlock structure
> + *
> + * Generic tracing wrapper around the arch-overridable
> + * queued_spin_release().
>   */
>  static __always_inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> +	/*
> +	 * Trace and release are combined in queued_spin_release_traced() so
> +	 * the compiler does not need to preserve the lock pointer across the
> +	 * function call, avoiding callee-saved register save/restore on the
> +	 * hot path.
> +	 */
> +	if (tracepoint_enabled(contended_release)) {
> +		queued_spin_release_traced(lock);
> +		return;

Get rid of the "return;". What does it save you? It just makes it that you
need to duplicate the code. Even though it's a one liner, it can cause bugs
in the future if this changes. You could call the function:

  do_trace_queued_spin_release_traced(lock);


> +	}
>  	queued_spin_release(lock);
>  }
>  
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index af8d122bb649..649fdca69288 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -104,6 +104,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>  #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
>  #endif
>  
> +void __lockfunc queued_spin_release_traced(struct qspinlock *lock)
> +{
> +	if (queued_spin_is_contended(lock))
> +		trace_call__contended_release(lock);
> +	queued_spin_release(lock);

And then remove the duplicate call of "queued_spin_release()" here.

-- Steve

> +}
> +EXPORT_SYMBOL(queued_spin_release_traced);
> +
>  #endif /* _GEN_PV_LOCK_SLOWPATH */
>  
>  /**


^ permalink raw reply

* Re: [PATCH v6 6/7] locking: Factor out __queued_read_unlock()/__queued_write_unlock()
From: Steven Rostedt @ 2026-05-13 15:41 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
	linux-arch, linux-mm, linux-trace-kernel, kernel-team,
	Paul E. McKenney
In-Reply-To: <8e88613c73f0603c4440ba3a62eb604a5dddc57b.1777999826.git.d@ilvokhin.com>

On Tue,  5 May 2026 17:09:35 +0000
Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> This is a preparatory refactoring for the next commit, which adds

Same thing about using "next commit" in change logs.

-- Steve

> contended_release tracepoint instrumentation and needs to call the
> unlock from both traced and non-traced paths.
> 
> No functional change.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/asm-generic/qrwlock.h | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox